In the SWITCH Innovation Lab ‘Linked Data Pipeline’, Laura Rettig of eXascale Infolab is building an interface between research data repositories and the Connectome prototype.
Laura Rettig: We’re testing how to consolidate metadata from research databases in a structured way. To do that, we’re building a pipeline that integrates unstructured and structured datasets from their sources into the Research Data Connectome and puts them into a structured form.
Laura Rettig: Linked data is connected data which, when depicted as a graph, shows which data is related and how. In the Connectome prototype, we’re only concentrating on the links between metadata to start with, that is, attributes such as the author’s name, professorship, methodology, title, research question and technical attributes. That alone is already very complex. In the prototype we aren’t yet dealing with the content of the datasets, e.g. interviews with individuals in a social science study, or city maps in a historical research paper.
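To illustrate the idea of linked metadata, here is a minimal sketch in Python. It is not the Connectome implementation: the triples, identifiers and the `related` helper are all hypothetical, and a real system would use a graph store rather than a list.

```python
from collections import defaultdict

# Hypothetical metadata triples (subject, predicate, object).
# All identifiers are illustrative and not taken from the Connectome.
TRIPLES = [
    ("dataset:A", "author", "person:1"),
    ("dataset:B", "author", "person:1"),
    ("dataset:A", "methodology", "method:survey"),
    ("dataset:C", "methodology", "method:survey"),
]

def related(triples, node):
    """Map each other subject to the predicates on which it shares a value with `node`."""
    subjects_by_value = defaultdict(set)
    for subject, predicate, obj in triples:
        subjects_by_value[(predicate, obj)].add(subject)

    links = defaultdict(set)
    for (predicate, _obj), subjects in subjects_by_value.items():
        if node in subjects:
            for other in subjects - {node}:
                links[other].add(predicate)
    return dict(links)

print(related(TRIPLES, "dataset:A"))
```

In this toy graph, dataset A is linked to dataset B through a shared author and to dataset C through a shared methodology, which is the "what is related and how" that the graph view makes visible.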
Laura Rettig: We – that is, all of the partners in the pilot phase of the Connectome – draw on existing standards. We use the general schema from schema.org, which is trying to broadly depict the world. We expand these predefined dataset classes from schema.org with additional attributes. So the schema for the prototype is a hybrid.
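A hybrid schema of this kind could look roughly like the following sketch, which builds a JSON-LD record in Python. `Dataset`, `Person`, `name` and `creator` are existing schema.org terms; the `ext` namespace URL, the extra attribute names and the `build_dataset_record` helper are assumptions made for illustration, not the prototype's actual schema.

```python
import json

def build_dataset_record(name, creator, extra):
    """Build a JSON-LD record: schema.org Dataset terms plus extension attributes."""
    record = {
        "@context": {
            "@vocab": "https://schema.org/",
            # Hypothetical namespace for the additional attributes.
            "ext": "https://example.org/connectome-ext#",
        },
        "@type": "Dataset",
        "name": name,
        "creator": {"@type": "Person", "name": creator},
    }
    # Attributes that schema.org does not predefine go under the extension prefix.
    for key, value in extra.items():
        record[f"ext:{key}"] = value
    return record

record = build_dataset_record(
    "Example survey 2020",
    "Jane Doe",
    {"methodology": "survey", "researchQuestion": "Illustrative research question"},
)
print(json.dumps(record, indent=2))
```

Keeping the extension attributes under their own prefix leaves the schema.org part of the record readable by generic tools, while the additional attributes remain clearly marked as project-specific.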
Basically, it’s a question of defining the maximum set of metadata, meaning everything that’s going to be represented in the schema. This also helps us determine which functions can be developed in the Connectome. These are decisions we make together and as broadly as possible, because no single person can bring in enough expertise and perspectives. In parallel with the Innovation Lab, interviews are held with researchers to find out what they need, and we develop use cases based on that.
Laura Rettig: We work very closely together. The research data comes from the repositories held by the partners FORS and DaSCH, so it’s from the social sciences and humanities. We also have very good infrastructure from the life sciences: the EPFL Blue Brain Nexus.
In the last few months, I’ve swapped ideas about attributes and data formats directly with FORS and DaSCH as well as researchers in their fields.
Laura Rettig: Generating metadata automatically from datasets will remain difficult, I think. And it’s also not unreasonable to expect researchers to provide the metadata for their research. We probably need more incentives and rules from research funding bodies, but also some rethinking in the research community.
It would also be interesting to be able to automatically analyse the data over the long term. Meta-analysis is particularly popular in the social sciences. It could be very useful for researchers if data could be automatically aggregated. That would be one possible direction for the near future.
Laura Rettig: No, at the moment we’re concentrating on the searchability of datasets. Specifically: a researcher may wonder, what existing research is there in my topic, in this area, or who’s working on something similar to me and with what results? The more attributes datasets have, the easier it is to find and reuse them. That’s why it’s also important to define quality metrics for the metadata.
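One simple quality metric for metadata is completeness: the share of expected attributes that are actually filled in. The sketch below is only an illustration of that idea. The field list and the `completeness` function are hypothetical, and the interview does not say which metrics the project uses.

```python
# Hypothetical list of attributes a dataset record is expected to carry.
EXPECTED_FIELDS = ["title", "author", "methodology", "research_question"]

def completeness(record, expected=EXPECTED_FIELDS):
    """Return the fraction of expected fields that hold a non-empty value."""
    filled = sum(1 for field in expected if str(record.get(field) or "").strip())
    return filled / len(expected)

example = {
    "title": "Example survey 2020",
    "author": "Jane Doe",
    "methodology": "",          # present but empty, so it does not count
    "research_question": None,  # missing value
}
print(completeness(example))
```

A score like this says nothing about whether the values are correct, only whether they exist, so in practice it would be one metric among several.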
Laura Rettig: We put semi-structured data – in this case data management plans (DMPs) submitted to funding bodies in PDF format – into a structured form; we built a pipeline, in other words. First, we defined what metadata the recipients consider important. This metadata was then extracted from the DMPs by text analysis, which allowed us to identify which topics are covered and which aren’t. This allows you to sift data management plans much more quickly and focus on checking for gaps.
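The gap check described here can be sketched as follows. This is a deliberately simplified stand-in: it uses plain keyword matching rather than the text analysis used in the project, it assumes the PDF text has already been extracted to a string, and the topics, keywords and function names are all hypothetical.

```python
# Hypothetical topics a reviewer expects a DMP to address, with trigger keywords.
TOPIC_KEYWORDS = {
    "storage": ["backup", "storage", "repository"],
    "licensing": ["licence", "license", "copyright"],
    "ethics": ["consent", "anonymis", "anonymiz"],
}

def topic_coverage(text, topics=TOPIC_KEYWORDS):
    """Return {topic: True/False} depending on whether any keyword occurs in the text."""
    lowered = text.lower()
    return {
        topic: any(keyword in lowered for keyword in keywords)
        for topic, keywords in topics.items()
    }

def missing_topics(text, topics=TOPIC_KEYWORDS):
    """List the topics for which no keyword was found, i.e. the gaps to check by hand."""
    return [topic for topic, covered in topic_coverage(text, topics).items() if not covered]

sample = "Data will be kept in a repository with daily backup. Participants gave consent."
print(missing_topics(sample))
```

The output is a list of topics that the plan does not appear to mention, which is where a reviewer would look first.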
Laura Rettig: Next, we’ll be integrating the data into the EPFL data management platform Blue Brain Nexus. We’ll add further datasets from OpenAIRE in addition to the data from FORS and DaSCH, for load testing as well. We want to investigate the scalability and performance. For example, how easy is it to find something when you have a lot of data with the same keyword? Does a keyword search even make sense in this case? Or in more general terms: how do you find relevant data?
Laura Rettig: I find exchanging ideas with the pilot partners and the many experts in different disciplines very rewarding. As a researcher, I also think it’s really important, and sensible, that in the future research data is linked centrally in one place rather than lying around on unknown hard disks that will be destroyed at some point. These are all resources lying fallow that we might be able to use to solve a lot of problems. Plus, a lot of research data was financed by public funds and shouldn’t be lost, because it’s part of the research heritage of future generations.