Research Data Repositories

Researchers today depend heavily on many kinds and sources of data. Research projects usually start from one or more sources of input data, and some generate datasets that are contributed as open data, available to the public once the papers are published. In some cases, these open data exist in multiple versions as the data is updated or new data is added. Customisable access control may be required if the research data is part of the peer-review process along with the research paper. It is therefore critical, when choosing a data repository, to evaluate it against our use case, data management workflow and requirements.

While searching for a data repository solution for our projects, we found that information about research data repositories and open data repositories is scattered across the internet. We decided to compile a list of repositories relevant to our field and compare them.

In general, there are four categories of research data repository options:

  1. data repositories associated with academic institutes,
  2. generic open data repositories,
  3. open data registries backed by private companies, and
  4. self-hosted repositories.

Each of these categories has a long list of options, so we selected a few major ones in each to compare and contrast. The comparison is based on the following aspects, which are important to us:

  • supported file format
  • file size limitation
  • total storage space limitation
  • version control
  • access control
  • data license requirements
  • cost

Academic Data Repositories

These repositories are usually set up within an academic institute to host the research publications and/or research data of its research labs. Although such systems are set up within the institute, some, like Dataverse at Harvard University and ICPSR at the University of Michigan, also accept publications and data contributed by researchers from other academic institutes.

Here we compare ScholarBank@NUS, Dataverse@Harvard and ICPSR@UMich.

ScholarBank@NUS

  • Academic institute: National University of Singapore (NUS)
  • Country: Singapore
  • Disciplinary focus: Multidisciplinary
  • Supported file formats:
      - Text, page description, data and Microsoft Office: .csv, .pdf, .txt, .ps, .rtf, .xml, .html, .htm, .css, .doc, .docx, .ppt, .pptx, .xls, .xlsx, .latex, .tex, .zip, .tab, .por, .odt
      - Image: .tif, .jpg, .jpg2, .gif, .png, .bmp, .psd, .ai, .eps
      - Video: .mp1, .mp2, .mp4, .mov, .avi, .wmv, .flv
      - Audio: .mp3, .m4a, .mp4, .aif, .aiff, .wav, .wma, .au, .snd, .flac
      - Geospatial (ESRI Shapefile): .dbf, .prj, .shp, .shx, .sbx, .sbn, .xml, .inf
  • File size limitation: Up to 1GB per file. For bigger files, contact scholarbank@nus.edu.sg.
  • Total storage limitation: Unknown
  • Version control: Versioning is done through file names managed by researchers.
  • User access control: Three levels of access restriction, at either individual item level or collection level:
      - Restricted: the full text is accessible only to NUS staff and students, while the metadata is publicly accessible.
      - Embargoed: the full text is closed for a period determined by the depositor, but the metadata record is publicly accessible.
      - Closed: only the metadata record is publicly accessible.
  • Data licensing: Must comply with the various NUS Intellectual Property and Copyright Policies and the NUS Data and Research Data Policies. Supports Creative Commons licenses.
  • Exclusivity: Unknown
  • Cost: Free
  • Data portal: https://scholarbank.nus.edu.sg/cris/explore/publications
  • Data submission information: http://lib.nus.edu.sg/sb/policy-and-guidelines.html#

Dataverse@Harvard

  • Academic institute: Harvard University, Massachusetts
  • Country: United States
  • Disciplinary focus: Multidisciplinary
  • Supported file formats: All file formats accepted (tabular, non-tabular, and compressed as a zip file bundle, with a file hierarchy feature to preserve directory structure).
  • File size limitation: Files uploaded through the browser-based upload function cannot exceed 2.5GB. However, Harvard Dataverse is willing to work with Harvard researchers who have larger files.
  • Total storage limitation: 1TB, but Harvard Dataverse will work with Harvard researchers who have larger datasets (>1 TB).
  • Version control: Yes, done through the “Replace File” process.
  • User access control: Option to share draft, unpublished, and published (public) datasets. For draft and unpublished datasets, a variety of access tiers can be assigned to different registered users.
  • Data licensing: Creative Commons licenses recommended.
  • Exclusivity: Unknown
  • Cost: Free
  • Data portal: https://dataverse.harvard.edu/dataverse/harvard
  • Data submission information: https://datamanagement.hms.harvard.edu/share/data-repositories/harvard-dataverse

ICPSR@UMich

  • Academic institute: University of Michigan, Michigan
  • Country: United States
  • Disciplinary focus: Social Sciences
  • Supported file formats: Generally all file formats are supported, as long as full metadata is attached to the data.
  • File size limitation: Unknown
  • Total storage limitation: Unknown
  • Version control: Managed by ICPSR
  • User access control: Managed by ICPSR
  • Data licensing: Supports several approaches to managing these interests, including tailoring copyright and patent licenses, such as through Creative Commons licenses, and putting an embargo period or delayed dissemination on distribution.
  • Exclusivity: Unknown
  • Cost: Typically, funding provided by ICPSR’s membership and various external sponsors is used to cover the curation services for data deposited at ICPSR. For services beyond what ICPSR agrees to provide, costs depend upon the data (e.g., the number of variables, completeness of variable-level and study-level documentation, disclosure risk). Sometimes data users may be charged a small fee for accessing a restricted-use dataset. This varies by archive. Contact ICPSR User Support for questions regarding this matter.
  • Additional notes: ICPSR has approximately 780 member institutions. Most research-intensive academic institutions are members, as are other universities and colleges that seek to support faculty and students in their research training and pursuits. ICPSR has international memberships in over 40 countries. National University of Singapore and Nanyang Technological University are members of ICPSR. ICPSR offers a comprehensive data management plan. Version control and access control are defined in the data management contract signed between ICPSR and data contributors.
  • Data portal: https://www.icpsr.umich.edu/web/pages/ICPSR/index.html
  • Data submission information: https://www.icpsr.umich.edu/web/pages/deposit/index.html

General Research Data Repositories

Other than the academic research repositories, there are other types of general research data repositories. These are generally partnered with publishers, associated with government sectors, or backed by non-profit organisations.

Compared with the repositories backed by academic institutes, most of these repositories offer free data deposition and maintenance services, while some take donations or offer memberships/subscriptions for a higher storage allowance. Below is a comparison of Figshare, Dryad, Zenodo, Open Science Framework (OSF), PANGAEA and Mendeley Data.

Figshare

  • Type: Private Knowledge Portal
  • Association: General data repository, with strong affiliation with publishers
  • Country: England & Wales
  • Disciplinary focus: Multidisciplinary
  • Supported file formats: Supports file formats
  • File size limitation: Upload files up to 5 GB
  • Total storage limitation: Unlimited public space, up to 20 GB of free private space
  • Version control: DOIs are minted on the uploaded data, and DOIs are versioned. If files or metadata are updated, the DOI always references the latest copy, and previous versions can still be looked up.
  • User access control: Access restriction controlled by IP, by group or to administrators of the institution portal.
  • Data licensing: Supports a variety of customizable licenses, including the Creative Commons suite of CC-BY, CC0, CC-BY-SA, CC-BY-NA, and more.
  • Exclusivity: Unknown
  • Cost: Free
  • Additional notes: Figshare offers fully functional data repository and institutional repository platforms for research data management at academic institutes. Figshare also collaborates with various publishers and offers seamless integrations.
  • Data portal: https://figshare.com/browse
  • Data submission information: https://figshare.com/features

Dryad

  • Type: Non Profit Knowledge Portal
  • Association: General data repository, in partnership with Zenodo
  • Country: United States
  • Disciplinary focus: Domains covered in the OECD Fields of Science and Technology classification
  • Supported file formats: Accepts file types in a “community-accepted” format, compressed or uncompressed.
  • File size limitation: There is a limit of 300GB per data publication uploaded through the web interface. Larger submissions can be accepted, but the submitter needs to contact Dryad for assistance.
  • Total storage limitation: No known limit
  • Version control: Data is versioned by using the “update” link. All versions of a dataset remain accessible, but the dataset DOI always resolves to the newest version.
  • User access control: A “Private for Peer Review” workflow is in place. When a dataset is flagged as private, a private, randomized URL is generated that allows a double-blind download of the dataset. This link can be used at the journal office during the review period, or shared with collaborators so they can access the data files while the dataset is not yet published.
  • Data licensing: Requires licensing terms that are compatible with the Creative Commons Zero (CC0) waiver.
  • Exclusivity: No
  • Cost: Dryad’s base Data Publishing Charge (DPC) is US$120 per data submission. It is waived if the submitter is based at a member institution (determined by login credentials), if an associated journal or publisher has an agreement with Dryad to sponsor the DPC, or if the submitter is based in a fee-waiver country. Data users can access the data hosted at Dryad at no cost.
  • Additional notes: Dryad formed a partnership with Zenodo, a multidisciplinary repository based at CERN, in 2019. This partnership leverages each organization’s strengths: data curation at Dryad and software publication at Zenodo. Dryad stores a copy of all datasets in Zenodo for enhanced preservation services.
  • Data portal: https://datadryad.org/search
  • Data submission information: https://datadryad.org/stash/faq

Zenodo

  • Type: Non Profit Knowledge Portal
  • Association: General repository
  • Country: Switzerland
  • Disciplinary focus: Multidisciplinary
  • Supported file formats: All file formats accepted
  • File size limitation: The total file size limit per record is 50GB. Higher quotas can be requested and are granted on a case-by-case basis.
  • Total storage limitation: Currently accepts up to 50GB per dataset (you can have multiple datasets). There is no size limit on communities.
  • Version control: Yes
  • User access control: Users can choose to deposit files under open, embargoed, restricted, or closed access. For embargoed files, the user can choose the length of the embargo period, and the content becomes publicly available automatically at the end of that period. Users may also deposit restricted files and grant access to specific individuals.
  • Data licensing: Data are accepted under a variety of licenses in order to be inclusive, but Zenodo strongly advocates the most open licenses in terms of visibility and credit, and offers additional services and upload quotas on such data to encourage their use, in line with the publications policy of the OpenAIRE initiative.
  • Exclusivity: Unknown
  • Cost: Free
  • Data portal: https://zenodo.org/
  • Data submission information: https://help.zenodo.org/, https://about.zenodo.org/policies/

Open Science Framework (OSF)

  • Type: Non Profit Knowledge Portal
  • Association: General repository by the Center for Open Science (COS)
  • Country: United States. Choices available for data location (for projects after ):
      - United States
      - Canada - Montréal
      - Germany - Frankfurt
      - Australia - Sydney
  • Disciplinary focus: Multidisciplinary
  • Supported file formats: All file formats accepted
  • File size limitation: 5GB per file upload for native OSF Storage. OSF imposes no limit on the amount of storage used across add-ons connected to a given project.
  • Total storage limitation: 5 GB for private projects, 50 GB for public projects. OSF supports unlimited add-on storage providers.
  • Version control: OSF has built-in version control for all files stored in users' projects, can render hundreds of different file types, and allows users to edit plain text files (including R and Python scripts) directly in the browser.
  • User access control: Projects, or individual components of projects, can be kept private so that only the project collaborators have access to them. Collaborators can be assigned one of three levels of permissions:
      - Read
      - Read+Write
      - Administrator (Admin)
  • Data licensing: Supports a long list of licenses; the user chooses one from a dropdown list when setting up a project.
  • Exclusivity: Unknown
  • Cost: Free
  • Data portal: https://osf.io/
  • Data submission information: https://help.osf.io/hc/en-us

PANGAEA

  • Type: Non Profit / Open Access
  • Association: NA
  • Country: Germany
  • Disciplinary focus: Geospatial
  • Supported file formats: Supports data tables as Excel or tab-delimited text files; specific formats (e.g. shape, netCDF, segy …) may be added as a zip archive. Datasets are downloadable as tab-delimited text/ASCII files with a header for meta information (the file ending is .tab, but the files can be opened in any text editor or in Excel just like .txt files). Large files may be available as binary objects (e.g. seismic data, models) or in other formats that follow ISO standards (e.g. images, films).
  • File size limitation: Unknown
  • Total storage limitation: Unknown
  • Version control: As long as a dataset has the status “in review” or “registration in progress”, changes to the data are possible. Once the dataset is published and the DOI is registered, changes are no longer possible. If errors are found in a published dataset, the submitter opens a new data submission with the corrected data and states that it is a new version. PANGAEA then puts access restrictions on the erroneous dataset and links it to the new version. Normally, a short comment is added to the old version explaining what was wrong.
  • User access control: Unknown
  • Data licensing: By default, data are made available under a Creative Commons license. CC-BY is the most commonly used license, but there are others to choose from.
  • Exclusivity: Unknown
  • Cost: The basic operation is covered by public funding, but in order to ensure high quality in processing and archiving new data, PANGAEA receives additional funds. If data are submitted as part of a project for which publication funding is available, PANGAEA would appreciate a financial contribution of 500 € (net) per data submission (e.g. as part of the costs for Open Access publications at the DFG). Other forms of funded collaboration can be negotiated. Contact PANGAEA for further information and invoicing.
  • Data portal: https://www.pangaea.de/
  • Data submission information: https://wiki.pangaea.de/wiki/FAQ

Mendeley Data

  • Type: Media Publishing
  • Association: Elsevier Press
  • Country: Netherlands
  • Disciplinary focus: Multidisciplinary
  • Supported file formats: Generally all file formats supported
  • File size limitation: Up to 10GB
  • Total storage limitation: Generally up to a maximum of 10GB per dataset. With an institutional subscription, datasets of up to 100GB may be allowed, depending on the institution's storage agreement.
  • Version control: Yes
  • User access control: Each draft dataset has a share link that can be copied and sent to collaborators, who can then access the dataset metadata and files prior to publishing. When publishing a dataset, a user may choose to defer the date at which the data becomes available (for example, so that it is available at the same time as an associated article).
  • Data licensing: Supports a long list of licenses, with the default being Creative Commons Zero (CC0).
  • Exclusivity: Unknown
  • Cost: Free
  • Data portal: https://data.mendeley.com/
  • Data submission information: https://data.mendeley.com/faq
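
To make the evaluation against a use case more concrete, the size limits quoted above can be checked programmatically. The following is only an illustrative sketch: the names `Repository`, `fits` and `candidates` are hypothetical, the limits are copied from the comparison above (so they are a snapshot that may be out of date), and `None` is assumed to mean "unknown or no stated limit".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Repository:
    """Size limits as quoted in the comparison above (None = unknown or no stated limit)."""
    name: str
    max_file_gb: Optional[float]     # largest single file accepted
    max_dataset_gb: Optional[float]  # largest dataset / record / project accepted


# Values copied from the comparison; treat them as a snapshot, not as authoritative.
REPOSITORIES = [
    Repository("Figshare", max_file_gb=5, max_dataset_gb=None),
    Repository("Dryad", max_file_gb=None, max_dataset_gb=300),
    Repository("Zenodo", max_file_gb=None, max_dataset_gb=50),
    Repository("OSF (public project)", max_file_gb=5, max_dataset_gb=50),
    Repository("Mendeley Data", max_file_gb=10, max_dataset_gb=10),
]


def fits(repo: Repository, largest_file_gb: float, dataset_gb: float) -> bool:
    """Return True if the dataset stays within every limit the repository states."""
    file_ok = repo.max_file_gb is None or largest_file_gb <= repo.max_file_gb
    dataset_ok = repo.max_dataset_gb is None or dataset_gb <= repo.max_dataset_gb
    return file_ok and dataset_ok


def candidates(largest_file_gb: float, dataset_gb: float) -> List[str]:
    """Names of repositories whose quoted limits accommodate the dataset."""
    return [r.name for r in REPOSITORIES if fits(r, largest_file_gb, dataset_gb)]


if __name__ == "__main__":
    # Hypothetical dataset: 40 GB in total, largest single file 8 GB.
    print(candidates(largest_file_gb=8, dataset_gb=40))  # -> ['Dryad', 'Zenodo']
```

Size is only one of the aspects listed earlier; access control, licensing and cost still need to be checked by hand against each repository's current terms.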

Open Data Repositories backed by private companies

Many private companies have also shown support for the Open Data Initiative and offer hosting and registry services for various open data. The terms and conditions for these data repositories are usually not publicly accessible, and need to be negotiated with the company that backs each of them.

Here we look at the open data solutions offered by Amazon Web Services, Google Cloud and Microsoft.

Open Data Registry at AWS

  • Type: Private
  • Company: Amazon Web Services
  • Country: United States
  • Disciplinary focus: Multidisciplinary
  • Data portal: https://registry.opendata.aws/
  • Additional information: https://aws.amazon.com/opendata/open-data-sponsorship-program/terms/

Google Cloud Public Datasets

  • Type: Private
  • Company: Google Cloud Platform
  • Country: United States
  • Disciplinary focus: Multidisciplinary
  • Data portal: https://console.cloud.google.com/marketplace/browse;page=1?filter=solution-type:dataset&_ga=2.140140403.1155565552.1586157164-763308365.1585890675
  • Additional information: https://cloud.google.com/marketplace/docs/partners/datasets

Microsoft Open Data

  • Type: Private
  • Company: Microsoft
  • Country: United States
  • Disciplinary focus: Multidisciplinary
  • Data portal: https://msropendata.com/
  • Additional information: https://msropendata.com/faq, https://msropendata.com/about

Self-Hosting

Lastly, there is always the option to self-host the research data. This option may offer more flexibility and freedom, and makes it easy to update the dataset as we wish, but it requires developing and maintaining a data access portal and a web server, in addition to managing the research data itself. Depending on how simple the data access portal is, advanced features such as download statistics, version control and user access control could take considerable effort to implement.
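
As a rough illustration of the effort involved, even basic version control on a self-hosted setup needs some way to tell dataset versions apart. The sketch below assumes that datasets are plain files in a directory and that comparing SHA-256 checksums is an acceptable way to detect changes. The helper names `sha256_of`, `build_manifest` and `changed_files` are hypothetical and not part of any existing tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large data files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> Dict[str, str]:
    """Map every file under `root` (as a relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Sorted paths that were added, removed or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Storing one manifest per release (for example as a JSON file next to the data) would let a portal list what changed between versions. Download statistics and user access control would each need comparable additional work.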

Summary

Although this list does not cover every research data repository available, we hope it provides enough viable options for our future research and for others who are looking for open data repository solutions. We also welcome suggestions of repositories that we may have overlooked.

For those who are interested, the comparison matrix can be found in a Google Sheet here.

Yoong Shin Chow
Research Assistant
