Unknown Data – Mining and Consolidating Research Dataset Metadata on the Web
Research data is essential for science, but many datasets are hidden on websites and in small repositories or are difficult to find due to insufficient metadata. Only a fraction of researchers proactively publish dataset metadata in public portals, and curating that metadata is costly.
As part of the “Unknown Data” project, an infrastructure has been created to simplify the reanalysis of research data and the replication of research results. In addition, the origin of data has become more traceable, and datasets that previously could not be found in public collections have been made visible.
The objectives described above were achieved through various procedures and approaches:
- Using citations from scientific papers and websites to find metadata on datasets
- Discovering datasets and their context by crawling relevant websites
- Consolidating metadata by linking it with information from domain-specific databases
- Ensuring metadata quality by establishing a discipline-specific curation process
- Ensuring the long-term availability of original sources by archiving relevant websites
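The first three approaches can be illustrated with a minimal sketch. It assumes, purely for illustration, that datasets are referenced by DOI in crawled page text and that a domain-specific database is available as a simple lookup table. The regular expression, the function names `extract_dois` and `consolidate`, and the record layout are hypothetical and are not taken from the project's software.

```python
import re
from collections import defaultdict

# Simplified DOI pattern (an assumption for illustration; real DOI syntax
# is more permissive than this).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text: str) -> list[str]:
    """Return DOIs found in text, lowercased and without trailing punctuation."""
    return [match.rstrip(".,;)").lower() for match in DOI_PATTERN.findall(text)]


def consolidate(mentions, registry):
    """Group dataset mentions by DOI and attach registry metadata if known.

    mentions: iterable of (source_url, page_text) pairs, e.g. from a crawl
    registry: dict mapping DOI -> metadata dict (stand-in for a domain database)
    """
    sources_by_doi = defaultdict(set)
    for source_url, page_text in mentions:
        for doi in extract_dois(page_text):
            sources_by_doi[doi].add(source_url)
    return {
        doi: {
            "metadata": registry.get(doi),  # None if the dataset is not yet catalogued
            "mentioned_in": sorted(sources),
        }
        for doi, sources in sources_by_doi.items()
    }
```

In this sketch, entries whose `metadata` is `None` correspond to datasets that are mentioned on the web but missing from the database, and the length of `mentioned_in` gives a rough count of how often a dataset is referenced.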
Extracting metadata about research data from websites and publications is a novel approach that increases the visibility of “long tail” datasets while providing crucial insights into the actual use and impact of (known) research data. Long-tail datasets are datasets that can only be found using specific search terms.
Two disciplines, computer science and the social sciences, benefit directly from the project results through use case pilots. The DBLP bibliography and the GESIS portals are among the most respected and widely used metadata collections in their respective fields, and both are used by many other search services such as Google Dataset Search and CESSDA. Unknown Data has significantly improved the effectiveness and efficiency of researchers searching for data in two ways: it created, for the first time in computer science, a centralized and comprehensive collection of metadata about research data, and it fundamentally improved the quality and quantity of dataset metadata in the social sciences.
Dataset citations extracted from websites or publications make it possible to assess the impact of datasets, a crucial feature for evaluating their usefulness and reuse.
All collected metadata will be made permanently and publicly available as Linked Open Data and via REST APIs, making research data findable, accessible, interoperable and reusable for both researchers and machines, in line with the FAIR Data Principles. All software is made available as open source, and the methods developed can be adapted to other disciplines.
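As a rough illustration of what a machine-readable dataset record might look like, the following sketch serializes dataset metadata as JSON-LD using the schema.org `Dataset` type. The choice of schema.org, the function name `to_jsonld`, and the example values are assumptions made for illustration; the source does not state which vocabulary or API format the project uses.

```python
import json


def to_jsonld(name: str, url: str, description: str, identifier: str) -> dict:
    """Build a JSON-LD record for a dataset using schema.org terms.

    The vocabulary is an assumption for this sketch, not the project's
    confirmed data model.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": url,
        "description": description,
        "identifier": identifier,
    }


# Hypothetical example values
record = to_jsonld(
    name="Example Survey 2020",
    url="https://example.org/datasets/example-survey-2020",
    description="Illustrative record for a long-tail dataset.",
    identifier="10.1234/abc.5",
)
serialized = json.dumps(record, indent=2)
```

A record of this kind can be embedded in a web page or returned by an API, so that both search engines and scripts can read the same description.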
The project was funded by the German Research Foundation (DFG) from December 2021 to November 2024 and was developed in cooperation with the Internet Archive and the Consortium of European Social Science Data Archives (CESSDA).