Love Data Week – Day 5 – Open Data in the Bioscience field – The data behind the music

We have arrived at the end of Love Data Week 2018. We will be back next year with more data stories to share.

We would like to thank those of you who followed us during this week, the researchers and colleagues from the library who contributed to making this week possible, and of course the international planning committee of Love Data Week for providing plenty of inspirational material. You still have time to share your data story ...

Research data at DTU:


“FAIR is wisdom. The FAIR data principles, which support the sharing of research data, are the pathway to multiplying the value of datasets by enabling multiple applications; this condenses the wisdom I have gained in my 34-year career as a data scientist.”

Anna Maria Sempreviva
Senior Researcher, DTU Wind Energy, and FAIR data ambassador

Stories about data:

 
Open Data in the Bioscience field
Many publishers have implemented policies on the availability of data, materials, and methods in order to improve the reproducibility of scientific results. See, for example, the policy of the Nature Publishing Group. These policies are pushing researchers to adapt and improve their workflows for collecting and documenting data in a way that facilitates compliance with these requirements.

We talked to Markus Herrgard, professor at DTU Biosustain, who has used Jupyter notebooks and GitHub to make the data underpinning his group's publications openly available.

If you want to see a nice example of how his group has used these tools, check the publication “Multi-scale exploration of the technical, economic, and environmental dimensions of bio-based chemical production”, where the software used to produce the metabolic models and the code for all the analyses and figures reported in the publication were published on GitHub.
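The repository itself is the authoritative reference, but to give a flavour of this kind of work, here is a minimal sketch using COBRApy, a widely used open-source Python package for constraint-based metabolic modelling. The package choice and the model file are our illustrative assumptions, not necessarily the group's exact toolchain:

```python
import cobra

# Load a genome-scale metabolic model from SBML (hypothetical file path)
model = cobra.io.read_sbml_model("models/e_coli_core.xml")

# Flux balance analysis: maximize the model's objective, typically growth
solution = model.optimize()
print(f"Predicted growth rate: {solution.objective_value:.3f} 1/h")
```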

We asked Markus about the benefits and the challenges of choosing an open data model in his research area.

What is your motivation for making the data underpinning your publications openly available?
Mainly two things. First, we try to make our research more useful for others. What is the point if nobody can ever use our results or try out the methods we develop? The second is that it is simply easier. Besides ensuring the reproducibility of our research, once we have decided to make our data openly available, it becomes very straightforward for us, as well as for other people, to reuse the data. We also use these documented data for teaching, for example.

Does it mean that it is a tradition in your group to make the data from publications openly available?
Yes, it is. We are trying to be stricter and provide all the data, including the raw data, the processed data, the scripts to process the data, the scripts to make the figures for the publication, everything. Overall, it is easier. For example, if new students join the project, they can reproduce all the results right away and understand how the analyses were done. They do not need to talk to the former PhD student or postdoc involved in the project.

We do this routinely. Most journals encourage making the data available. The big journals in our field, like Nature Biotechnology, have official guidelines that are very stringent. They do not mandate it, but they encourage you to share all the source code in an executable form. This is where we use Jupyter notebooks, which make it easier to understand the whole process because you can add text explaining each step, the code, figures, etc. - it is not just a piece of code.
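To give a flavour of what such an executable notebook might contain, here is a minimal, hypothetical cell that regenerates a figure from processed data. The file names and column names are made up for illustration, not taken from the group's repositories:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the processed dataset that is version-controlled next to the code
# (hypothetical path and columns)
growth = pd.read_csv("data/processed/growth_rates.csv")

# Regenerate the publication figure directly from the processed data,
# so the plot and the numbers behind it cannot drift apart
ax = growth.plot.bar(x="strain", y="growth_rate", legend=False)
ax.set_ylabel("Growth rate (1/h)")
plt.tight_layout()
plt.savefig("figures/growth_rates.png", dpi=300)
```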

Do you use Jupyter notebooks for the daily documentation of your work, or is it something you use specifically when you want to make the data publicly available?
Yes, I think most of the work is actually done through the notebooks. Of course, if you write code outside Python, such as C++, you document it differently. But we always encourage the use of notebooks, and we also use them in our teaching so students can understand all the methods we develop.

So far, we have talked about computational bioscience. Is it also easy to make data derived from experimental work openly available?
In principle, yes. The problem is if you want to share the very raw data; this is only possible in experimental areas with standard formats, like genomics and sequencing data. There, you find standard formats for sharing data that everybody agrees on. Proteomics is in more or less the same situation, but there it already starts to get more complicated. For other types of experiments, like measuring growth rates, there are no standard formats for sharing the data. That means you have to decide whether to share the images collected with your equipment, the plate reader measurements, etc. In those cases, we do not share the raw data, because it is a huge amount of data and there is no standard format for sharing it. But whenever standard formats exist and the data are relevant for reuse, we try to share them to the largest extent we can.
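This is easy to see with sequencing data: because formats like FASTQ are community standards, anyone can parse shared raw reads with off-the-shelf tools. A small sketch using Biopython, with a hypothetical file path:

```python
from Bio import SeqIO

# Iterate over raw reads in a FASTQ file, a community-standard format
for record in SeqIO.parse("reads/sample_01.fastq", "fastq"):
    # Each record carries the read ID, the sequence, and per-base quality scores
    print(record.id, len(record.seq),
          record.letter_annotations["phred_quality"][:5])
```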

Do you think that making data openly available increases the impact of your research?
Yes, definitely. We have a publication from 2014 that is a good example. We were trying to develop an algorithm for a given task, and the postdoc doing the research realized that our method was not really good compared to what had been reported in several publications. He decided to implement some of the methods described in previous publications, whose authors had not actually published the code behind their algorithms. He also tested whether these reported algorithms worked in predicting what was observed experimentally, and he demonstrated that none of the published methods really worked well when assessed objectively. We published the results and provided all the code for all the methods in a consistent manner, together with the code for the test and the experimental results. I think it is the most highly cited publication of our group since I joined DTU, because whenever somebody wants to publish a new algorithm for that particular task, the reviewers will ask them to run it through the test we developed.

That is an example where providing the data and code underpinning a publication made it possible for other people to verify the results. Instead of just being cited, this sort of assessment paper becomes something that people can use in their own research, and I think that is always very good.

At DTU Biosustain, research groups often collaborate with industry. How do you balance an open data model with collaborations with commercial partners?
It is very important to consider the publication strategy up front and make it part of the agreements with the commercial partners. Of course, we have projects where the data cannot be released, and in these cases we do not aim to publish the results. In fact, we protect these data very carefully and only a few people can access them. That is what we call commissioned research, and it is done very differently from a normal research project.

If we aspire to publish our results, it is important to make the agreements with the industrial partners up front, whatever the model is, whether it is an EU project or a project funded directly by a company. Most publishers and funders require making the data openly available, so there is little to negotiate if you want to publish your results or if PhD students and postdocs are involved in the project. Making data openly available is our preferred model. If a commercial partner does not agree, then we need to set up the project and agreements differently. But that is a discussion you need to have up front, and I have to say that most commercial partners are fine with it. They just want clarity in the agreements.

Markus Herrgard, Professor, DTU Biosustain, herrgard@biosustain.dtu.dk, ORCID: 0000-0003-2377-9929
 

Telling stories with data:

 
The data behind the music

To round off Love Data Week, we want to share a very inspirational story in which data has been used creatively, beyond research. We all agree on the power of data when used for research, and most will probably also agree that music is a very powerful form of art. How about combining them?

The programmer and visual artist Brian Foo has an amazing website called Data-Driven DJ where he explores how to make music based on data.

“My goal is to explore new experiences around data consumption beyond the written and visual forms by taking advantage of music's temporal nature and capacity to alter one's mood.” 

Brian Foo - Data-Driven DJ website

The video we want to share with you today is one example of Foo's work. The song is a musical sequence generated from the EEG brain-wave data of a patient with epilepsy, examining the periods before, during, and after a seizure. With this work, Brian Foo hopes to give the listener a more empathetic understanding of the brain's neural activity during a seizure.
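Foo documents his actual method on the project page; purely as a toy illustration of the general idea of sonifying a data series (not his algorithm), one could map values onto a musical scale like this:

```python
# Made-up, normalized EEG amplitudes in [0, 1); purely illustrative
eeg = [0.12, 0.35, 0.80, 0.95, 0.40, 0.10]

# A minor pentatonic scale as MIDI note numbers
scale = [60, 63, 65, 67, 70]

# Higher amplitudes map to higher notes
notes = [scale[min(int(v * len(scale)), len(scale) - 1)] for v in eeg]
print(notes)  # these note numbers can be fed to any MIDI synthesizer
```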

Video

You can find all of Foo's musical work on the website, where you will also find documentation of the creative process and the custom software he wrote, all openly available.