About the Dataset Catalog
The Dataset Catalog is a freely available catalog of biomedical datasets held in various repositories. Adhering to the Findable, Accessible, Interoperable, and Reusable (FAIR) data management principles, the Dataset Catalog allows users to search for specific biomedical datasets and to navigate among datasets through linked descriptive data.
Descriptive dataset metadata from the various repositories are converted to the NLM Dataset Metadata Model (DATMM) linked data standard and added to the Dataset Catalog. By harmonizing and standardizing the structure of descriptive data, the Dataset Catalog facilitates discovery and reuse of biomedical datasets and will eventually make it easier to find datasets and connect them to related objects on the Semantic Web.
The Dataset Catalog is designed to:
- Allow researchers to search, discover, and retrieve biomedical datasets from many repositories.
- Standardize metadata for biomedical datasets into a linked data model that will facilitate the creation and discovery of related information.
- Push linkable metadata about biomedical datasets to the Semantic Web to facilitate data discovery in a broader environment and create relationships that might otherwise not be apparent.
Governance
Dataset Catalog Content and Inclusion Policy
The Dataset Catalog is a freely available catalog of biomedical datasets from various repositories managed and hosted by government agencies, academic institutions, scholarly societies, and non-governmental organizations. It is a web-based resource that provides information to help users discover and access biomedical datasets of interest at their host repository. The Dataset Catalog does not include any datasets; it is a collection of bibliographic metadata that describes datasets available from host dataset repositories. It provides links to datasets at the host repository, where the datasets themselves can be directly accessed.
The Dataset Catalog provides a standardized description of biomedical dataset information from relevant dataset repositories. Some datasets may be available from more than one repository.
The Dataset Catalog is not a dataset repository and does not hold datasets that are available in host repositories. Aligned with NLM Collection Development Guidelines, information regarding datasets that are removed from host repositories will continue to be discoverable in the Dataset Catalog and will contain a message stating: “Dataset is no longer available. For more information contact the host repository directly.”
Biomedical data repositories are reviewed by NLM for potential and continued inclusion in the Dataset Catalog using the criteria specified below.
Inclusion Criteria
- Repository content aligns with NLM Collection Development Guidelines and the scope of the NLM Collection
- For beta Dataset Catalog development, included repositories were:
  - NLM managed and maintained;
  - part of the Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintained list of NIH-supported data repositories;
  - part of the NIH Generalist Repository Ecosystem Initiative (GREI); and/or
  - an academic research data repository that adheres to NIH desirable repository characteristics.
- Repository has a publicly available and understandable holdings policy providing sufficient information about how its dataset collection is managed to determine its appropriateness for inclusion in the Dataset Catalog
- Repository content passes the Dataset Catalog technical review, which includes:
  - Data and datasets can be freely and easily accessed by satisfying at least one of the following:
    - There are no limitations on access to and re-use of data and datasets
    - No registration is required, or free registration is available, for downloading data
  - Provides a programmatic access point for data download, such as an application programming interface (API), and includes documentation about its usage
  - Provides the ability to find and select a subset of biomedical datasets from a broad spectrum of datasets
  - Metadata is sufficiently descriptive for inclusion and can be mapped to the NLM Dataset Metadata Model (DATMM)
  - Has a responsive point of contact
De-selection Criteria
- Repository no longer wants to be included in the Dataset Catalog
- Content is no longer within the scope of the NLM Collection
- Data model is no longer compatible or cannot be converted to DATMM format
NLM will regularly review included repositories to ensure consistency with policies and best practices. If a repository fails to meet compliance criteria, the Dataset Catalog team will work with the repository's managing organization to address gaps in compliance; otherwise, the repository may be deselected.
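The technical review listed under Inclusion Criteria combines an "at least one of" access condition with further requirements. One way to read that logic is sketched below. The review itself is carried out by NLM, not by code; the `RepositoryProfile` class, its field names, and the `passes_technical_review` function are hypothetical, and the sketch assumes that only the two access statements (no limitations, or no/free registration) are alternatives while the remaining items must all hold.

```python
from dataclasses import dataclass


@dataclass
class RepositoryProfile:
    """Hypothetical summary of a repository; field names are illustrative."""
    unrestricted_access: bool
    free_or_no_registration: bool
    documented_programmatic_access: bool
    supports_subset_selection: bool
    metadata_maps_to_datmm: bool
    responsive_contact: bool


def passes_technical_review(repo: RepositoryProfile) -> bool:
    # Access condition: satisfying either alternative is enough.
    freely_accessible = repo.unrestricted_access or repo.free_or_no_registration
    # The remaining items are assumed here to be required together.
    return (
        freely_accessible
        and repo.documented_programmatic_access
        and repo.supports_subset_selection
        and repo.metadata_maps_to_datmm
        and repo.responsive_contact
    )


# Free registration only, everything else satisfied.
open_repo = RepositoryProfile(False, True, True, True, True, True)
# Fully open, but no responsive point of contact.
no_contact = RepositoryProfile(True, True, True, True, True, False)

print(passes_technical_review(open_repo))   # True
print(passes_technical_review(no_contact))  # False
```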
Content in the Dataset Catalog may be collected from repositories managed by government agencies, academic institutions, scholarly societies, and non-governmental organizations. NLM considers the standards specified under Dataset Catalog Content and Inclusion Policy and Dataset Repository Site Inclusion Criteria when evaluating a repository for inclusion. NLM does not review, evaluate, or judge the quality of individual datasets. The host repository's managing organization is responsible for maintaining the currency of the scientific record within its repository. Questions regarding the datasets or the data they contain should be addressed to the host repository, consistent with its policies.
The Dataset Catalog is being released as a beta version so that user feedback can inform further development. For the beta version, this means:
- Limited corpus of datasets: The number of repositories available to search is currently limited to four (as shown at https://www.datasetcatalog.nlm.nih.gov/). However, pending user feedback and beta test results, more repositories may be incorporated into the Dataset Catalog.
- Refresh cycle being developed and implemented: Currently, data are retrieved from each repository and mapped from the repository’s transport schema (i.e., the schema used in the delivery method, such as an API or an FTP server) to the DATMM schema. Pending user feedback and beta test results, an update cycle for each repository will be established and published.
- Export function of citations: The beta version of the Dataset Catalog only provides dataset citation information in the search results. Pending user feedback and beta test results, citations may be made available for download and re-use through popular citation management software, similar to the “Send To” function available through NLM literature services.
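The mapping step mentioned under the refresh cycle (from a repository's transport schema to the DATMM schema) can be pictured with a small sketch. Everything in it is illustrative: the field names on both sides, the `FIELD_MAP` table, and the `to_catalog_record` helper are hypothetical and are not taken from DATMM or from any repository's API.

```python
from typing import Any

# Hypothetical mapping from one repository's transport-schema keys to
# catalog-side keys. Neither side reflects real DATMM property names.
FIELD_MAP: dict[str, str] = {
    "titleText": "name",
    "summary": "description",
    "doi": "identifier",
    "landingPage": "url",
}


def to_catalog_record(source: dict[str, Any], repository: str) -> dict[str, Any]:
    """Rename known source fields and record which repository they came from.

    Source fields without a mapping are dropped; mapped fields missing from
    the source are left out of the result rather than filled with placeholders.
    """
    record: dict[str, Any] = {"sourceRepository": repository}
    for source_key, catalog_key in FIELD_MAP.items():
        if source_key in source:
            record[catalog_key] = source[source_key]
    return record


example = {
    "titleText": "Example dataset",
    "doi": "10.0000/example",
    "internalId": 42,  # not in FIELD_MAP, so it is not carried over
}
mapped = to_catalog_record(example, repository="example-repo")
print(mapped)
```

Keeping the mapping in a per-repository table, rather than in code branches, is one possible design: adding a repository then means adding a table, which fits the idea of each repository having its own transport schema.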