From the source into the Connectome

In the SWITCH Innovation Lab ‘Linked Data Pipeline’, Laura Rettig of eXascale Infolab is building an interface between research data repositories and the Connectome prototype.

Text: Cornelia Puhze, published on 02.07.2020

What is the problem the Innovation Lab is trying to solve?

Laura Rettig: We’re testing how to consolidate metadata from research databases in a structured way. To do that, we’re building a pipeline that takes structured and unstructured datasets from their sources, converts them into a structured form and integrates them into the Research Data Connectome.

What does linked data mean, and what role does this concept play in the Connectome?

Laura Rettig: Linked data is connected data which, when depicted as a graph, shows which data is related and how. In the Connectome prototype, we’re only concentrating on the links between metadata to start with, that is, attributes such as the name of the author, professorship, methodology, title, research question and technical attributes. That alone is already very complex. In the prototype we aren’t yet dealing with the content of the datasets, e.g. interviews with individuals in a social science study, or city maps in a historical research paper.
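As a rough illustration of what "links between metadata" can mean, the sketch below stores a few metadata statements as subject-predicate-object triples and looks up which datasets are related and through which attribute. All identifiers and values are invented for illustration, and the helper `related` is hypothetical; this is not the Connectome's actual data model.

```python
# Hypothetical metadata statements as (subject, predicate, object) triples.
# Every identifier and value here is invented for illustration.
triples = [
    ("dataset:1", "author", "person:A"),
    ("dataset:2", "author", "person:A"),
    ("dataset:2", "methodology", "method:survey"),
    ("dataset:3", "methodology", "method:survey"),
]


def related(node, triples):
    """Map each other subject to the (predicate, object) pairs it shares with `node`."""
    own = {(p, o) for s, p, o in triples if s == node}
    links = {}
    for s, p, o in triples:
        if s != node and (p, o) in own:
            links.setdefault(s, []).append((p, o))
    return links


if __name__ == "__main__":
    for other, shared in related("dataset:2", triples).items():
        print(other, "is linked via", shared)
```

Here the answer to "which data is related" is the set of keys, and the answer to "how" is the list of shared attribute values.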

Who determines what metadata is relevant and worth including in the Connectome prototype, and how?

Laura Rettig: We, meaning all of the partners in the pilot phase of the Connectome, draw on existing standards. We use the general schema from schema.org, which aims to describe the world in broad terms. We expand the predefined dataset classes from schema.org with additional attributes. So the schema for the prototype is a hybrid.

Basically, it’s a question of defining the maximum set of metadata, meaning everything that’s going to be represented in the schema. This also helps us determine which functions can be developed in the Connectome. These are decisions we make together and as generally as possible, because no one person can bring in enough expertise and perspectives. In parallel with the Innovation Lab, interviews are held with researchers to find out what they need, and we develop use cases based on that.
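A hybrid schema of this kind could look roughly like the following JSON-LD-style record, written here as a Python dictionary. `Dataset`, `name`, `creator`, `Person` and `measurementTechnique` are existing schema.org terms; the `ex:` prefix, its namespace URL and the two extension attributes are assumptions made up for this sketch, as is the helper `split_terms`. The prototype's real attribute names are not given in the interview.

```python
import json

# schema.org terms: Dataset, name, creator, Person, measurementTechnique.
# The "ex" prefix and both "ex:" attributes are hypothetical extensions.
record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "ex": "https://example.org/connectome#",
    },
    "@type": "Dataset",
    "name": "Example survey dataset",
    "creator": {"@type": "Person", "name": "Jane Doe"},
    "measurementTechnique": "survey",
    "ex:researchQuestion": "Hypothetical research question",
    "ex:professorship": "Hypothetical chair",
}


def split_terms(record):
    """Separate standard vocabulary keys from the hypothetical extension keys."""
    core = sorted(k for k in record if not k.startswith(("@", "ex:")))
    extra = sorted(k for k in record if k.startswith("ex:"))
    return core, extra


if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print(split_terms(record))
```

Keeping the extensions under their own prefix makes it easy to see which part of the "maximum" metadata set comes from the shared standard and which part was added for the prototype.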

How does cooperation with the pilot partners work?

Laura Rettig: We work very closely together. The research data comes from the repositories held by the partners FORS and DaSCH, so it’s from the social sciences and humanities. We have a very good infrastructure from life sciences, the EPFL Blue Brain Nexus.

In the last few months, I’ve swapped ideas about attributes and data formats directly with FORS and DaSCH as well as researchers in their fields.

Is the plan to be able to automatically read in data from research projects in the future?

Laura Rettig: Generating metadata automatically from datasets will remain difficult, I think. And it’s also not unreasonable to expect researchers to provide the metadata for their research. We probably need more incentives and rules from research funding bodies, but also some rethinking in the research community.

It would also be interesting to be able to automatically analyse the data over the long term. Meta-analysis is particularly popular in the social sciences. It could be very useful for researchers if data could be automatically aggregated. That would be one possible direction for the near future.

Will the Connectome prototype be able to do that already?

Laura Rettig: No, at the moment we’re concentrating on the searchability of datasets. Specifically, a researcher may wonder: what existing research is there on my topic or in this area, or who’s working on something similar to me and with what results? The more attributes datasets have, the easier it is to find and reuse them. That’s why it’s also important to define quality metrics for the metadata.

We tested ways of implementing quality metrics in a two-day sprint at the Research Data Alliance (RDA) Hackathon.
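The interview does not say which quality metrics were tested. One simple candidate, shown here purely as an assumption, is completeness: the share of expected attributes that are actually filled in for a dataset. The field list and the function names are hypothetical.

```python
# Hypothetical list of attributes a record is expected to carry.
EXPECTED_FIELDS = ["title", "author", "methodology", "research_question"]


def is_filled(value):
    """Treat None, blank strings and empty collections as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def completeness(record, expected=EXPECTED_FIELDS):
    """Fraction of expected fields that are present and non-empty."""
    filled = sum(1 for field in expected if is_filled(record.get(field)))
    return filled / len(expected)


if __name__ == "__main__":
    sample = {"title": "T", "author": "A", "methodology": ""}
    print(completeness(sample))
```

A score like this says nothing about whether the values are correct, only whether they exist, so it would be one metric among several.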

What exactly did you develop at the RDA Hackathon?

Laura Rettig: We put semi-structured data – in this case data management plans (DMPs) submitted to funding bodies in PDF format – into a structured form; we built a pipeline, in other words. First, we defined what metadata the recipients consider important. This metadata was then extracted from the DMPs by text analysis, which allowed us to identify which topics are covered and which aren’t. This allows you to sift data management plans much more quickly and focus on checking for gaps.
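The steps described above (define the expected metadata, extract it from the text, report what is missing) can be sketched as follows. The interview only mentions "text analysis", so the substring keyword matching used here is a stand-in, not the method used at the hackathon. The topic names, keyword lists and function names are hypothetical, and the sketch assumes the PDF text has already been extracted into a string.

```python
# Hypothetical topics a funding body might expect a DMP to address,
# each with a few indicative keyword stems (assumptions for illustration).
TOPIC_KEYWORDS = {
    "storage": ["backup", "storage", "repository"],
    "licensing": ["licence", "license", "copyright"],
    "ethics": ["consent", "anonymis", "anonymiz", "ethic"],
}


def covered_topics(text, topics=TOPIC_KEYWORDS):
    """Return, per topic, whether any of its keywords occurs in the text."""
    lowered = text.lower()
    return {
        topic: any(keyword in lowered for keyword in keywords)
        for topic, keywords in topics.items()
    }


def gaps(text, topics=TOPIC_KEYWORDS):
    """List the topics for which no keyword was found."""
    return [topic for topic, found in covered_topics(text, topics).items() if not found]


if __name__ == "__main__":
    dmp_text = (
        "Data will be kept in an institutional repository with daily backup. "
        "All participants give informed consent."
    )
    print(covered_topics(dmp_text))
    print(gaps(dmp_text))
```

A reviewer could then start with the plans that report gaps, which is the kind of faster sifting described in the answer.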

What are the next steps for finalising the Innovation Lab?

Laura Rettig: Next, we’ll be integrating the data into the EPFL data management platform Blue Brain Nexus. We’ll add further datasets from OpenAIRE in addition to the data from FORS and DaSCH, for load testing as well. We want to investigate the scalability and performance. For example, how easy is it to find something when you have a lot of data with the same keyword? Does a keyword search even make sense in this case? Or in more general terms: how do you find relevant data?

What do you value most about your work in the Innovation Lab?

Laura Rettig: I find exchanging ideas with the pilot partners and the many experts in different disciplines very rewarding. As a researcher, I also think it’s important and sensible for research data to be centrally linked in one place in the future, rather than lying around on unknown hard disks that will be destroyed at some point. These are all resources lying fallow that we might be able to use to solve a lot of problems. Plus, a lot of research data was financed by public funds and shouldn’t be lost, because it’s part of the research heritage of future generations.

Laura Rettig

Laura Rettig is a Ph.D. student at the eXascale Infolab, University of Fribourg, Switzerland under the supervision of Philippe Cudré-Mauroux. Her research areas of interest are big data infrastructures and computational linguistics, in particular for social, semantic, and scientific data. She holds a Master's degree in Computer Science from the University of Fribourg, with a thesis written on big data streaming using real-world telecommunication data at Swisscom.
