Love Data Week - Day 5 - Open Data in bioscience - the data behind the music

Friday 16 Feb 2018

by Paml

Share your data story

If you would like to share your story about research data, whether good or bad, scary or uplifting, we encourage you to tweet it using the hashtags #dtulibrary and #lovedata18.

You can also post it on our Facebook page, @DTUBibliotek. We look forward to hearing about your current research data, data you have published, or your favourite database…

This week is all about data!

We have now reached the end of Love Data Week 2018. We will be back to share more data stories next year.

Thank you to everyone who followed us this week, including the researchers and library colleagues who all helped mark the occasion. Thanks also to the international planning committee for Love Data Week, which provided tons of inspirational material. There is still time to share your data story ...

Research data at DTU:

“FAIR is wisdom. The FAIR data principles supporting the sharing of research data is the pathway to multiply the value of datasets by enabling multiple applications; it condenses the wisdom I gained in 34 years of data scientist career.”

Anna Maria Sempreviva
Senior Researcher, DTU Wind Energy, and FAIR data ambassador

Stories about data:

Open Data in bioscience
To support the reproducibility of scientific results, many publishers have implemented policies on the availability of data, materials and methods; see, for example, the policy of Nature Publishing Group. These policies compel researchers to adapt and improve their workflows for collecting and documenting data so that they meet these requirements.

We spoke with Markus Herrgard, professor at DTU Biosustain, who has used Jupyter Notebook and GitHub to make the data behind his group's publications openly available.

If you want to see a good example of how his group has used these tools, read the publication “Multi-scale exploration of the technical, economic, and environmental dimensions of bio-based chemical production”. The software behind the metabolic models and the code used for the analyses in the publication are published on GitHub.

We asked Markus to name the benefits and challenges of choosing an open data model in his field of research.

What is your motivation to publish the data underpinning your publication as openly available data?
Mainly two things. First, we try to make our research more useful for others. What is the point if nobody can ever use our results or try out the methods we develop? The second is that it is simply easier. In addition to ensuring the reproducibility of our research, once we have decided to make our data openly available, it becomes very straightforward for us as well as other people to reuse the data. We also use these documented data for teaching, for example.

Does it mean that it is a tradition in your group to make the data from publications openly available?
Yes, it is. We are trying to be stricter and provide all the data including the raw data, the processed data, the scripts to process the data, the scripts to make the figures of the publication, everything. It is overall easier. For example, if new students join the project, they can reproduce all the results right away and understand how the analyses were done. They will not need to talk to the former PhD student or postdoc involved in the project.

We do this routinely. Most of the journals encourage making the data available. The big journals in our field, like Nature Biotechnology, have official guidelines that are very stringent. They do not mandate it, but they encourage you to share all the source code in an executable form. This is where we use Jupyter Notebook, which makes it easier to understand the whole process because you can add text explaining each step alongside the code, figures, etc. It is not just a piece of code.

Do you use Jupyter notebook for the daily documentation of the work, or is it something you specifically use when you want to make the data publicly available?
Yes, I think most of the work is actually done through the notebooks. Of course, if you code outside Python, like C++ code, you document it differently. But we always encourage the use of the notebooks and we also use it in our teaching so students can understand all the methods we develop.

Until now, we have talked about work in computational bioscience. Is it also easy to make the data derived from the experimental work openly available?
In principle, yes. The problem is if you want to share the very raw data; this is possible only in the experimental areas where you have standard formats like genomics and sequencing data. There, you find standard formats to share data that everybody agrees with. In proteomics, it is more or less the same situation, but already it starts being more complicated. For other types of experiments, like measuring growth rate, there are no standard formats to share the data. It means that you have to decide if you will share the images collected with your equipment or the plate reader measurements, etc. In those cases, we do not share the raw data because it is a huge amount of data and there is no standard format to share it. But whenever standard formats for sharing exist, and the data are relevant for re-use, we try to share them to the largest extent that we can.

Do you think that making data openly available increases the impact of your research?
Yes, definitely. We have a publication from 2014, which is a good example. We were trying to develop an algorithm for a given task, and the postdoc who was doing the research realized that our method was not really a good one compared to what had been reported in several publications. He decided to code some of the methods described in previous publications, which actually did not publish the code behind their algorithms. He also tested whether these reported algorithms worked in predicting what was observed experimentally. And he demonstrated that none of the published methods really worked well when you assessed them objectively. We published the results and we provided all the code for all the methods in a consistent manner, the code for the test and the experimental results. I think the paper is the most highly cited publication of our group since I joined DTU because whenever somebody wants to publish a new algorithm for that particular task, the reviewers will ask them to run it through the test we developed.

That is an example where providing the data and the code underpinning a publication made it possible for other people to verify their results. Instead of just citing this publication, as a sort of assessment paper, it becomes something that people can use in their own research, and I think that is always very good.

At DTU Biosustain, the research groups often have collaborations with industry. How do you balance an open data model and the collaboration with the commercial partners?
It is very important to consider the publication strategy up front and make it part of the agreements with the commercial partners. Of course, we have projects where the data cannot be released and in these cases, we do not aim to publish the results. In fact, we protect these data very carefully and only a few people can access them. That is what we call commissioned research and it is done very differently from a normal research project.

If we aspire to publish our results, it is important to make the agreements up front with the industrial partners whatever the model is, if it is an EU project or directly funded by a company. Most of the publishers and funders require making the data openly available so there is little to negotiate if you want to publish your results or if PhD students and postdocs are involved in the project. Making data openly available is our preferred model. If the commercial partner does not agree, then we need to do the project and agreements differently. But that is a discussion that you need to have up front and I have to say that most of the commercial partners are fine with it. They just want to have clarity in the agreements.

Markus Herrgard, Professor, DTU Biosustain, herrgard@biosustain.dtu.dk, ORCID: 0000-0003-2377-9929

Stories using data:

The data behind the music
To round off Love Data Week, we would like to share a particularly inspiring story in which data is used in a very creative way that reaches far beyond research itself. We agree that data gives research its backing, its "power", and most people would probably also agree that music is a powerful art form. How about combining the two?

Programmer and visual artist Brian Foo has a fantastic website, Data-Driven DJ, where he explores how to make music out of data.

“My goal is to explore new experiences around data consumption beyond the written and visual forms by taking advantage of music's temporal nature and capacity to alter one's mood.”

Brian Foo - Data-Driven DJ website

The video we share here is an example of Foo's work. The piece of music was generated using brainwave data from EEG measurements of a patient with epilepsy. The melody follows the periods before, during and after a seizure. With this work, Brian Foo hopes to give the listener a more empathetic understanding of the brain's neural activity during a seizure.
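
For readers curious about what turning data into notes can look like in practice, here is a minimal Python sketch of one common sonification idea: scaling each data value onto a musical scale. It is not Brian Foo's method or software. The function names (`value_to_midi`, `sonify`), the choice of a pentatonic scale, the root note and the synthetic test signal are all assumptions made for illustration, and no real EEG data is used.

```python
import math

# Assumption: a major pentatonic scale, given as semitone offsets from the root.
PENTATONIC = [0, 2, 4, 7, 9]


def value_to_midi(value, lo, hi, root=48, octaves=3):
    """Map a value in [lo, hi] to a MIDI note number on a pentatonic scale."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp the value to the range, then normalise it to 0..1.
    t = (min(max(value, lo), hi) - lo) / (hi - lo)
    steps = len(PENTATONIC) * octaves
    # Choose one of `steps` scale degrees; the maximum maps to the last one.
    idx = min(int(t * steps), steps - 1)
    octave, degree = divmod(idx, len(PENTATONIC))
    return root + 12 * octave + PENTATONIC[degree]


def sonify(samples, root=48, octaves=3):
    """Turn a sequence of numbers into a list of MIDI note numbers."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        # A flat signal carries no contour, so play the root throughout.
        return [root for _ in samples]
    return [value_to_midi(s, lo, hi, root, octaves) for s in samples]


# Synthetic stand-in signal (hypothetical, not EEG): an oscillation
# whose amplitude grows over time.
signal = [math.sin(i / 3.0) * (1 + i / 10.0) for i in range(32)]
notes = sonify(signal)
print(notes)
```

The resulting note numbers could be written to a MIDI file or sent to a synthesizer with a library of your choice. Restricting the output to a pentatonic scale is a design choice that keeps arbitrary data from sounding dissonant; a real piece would also need decisions about rhythm, tempo and instrumentation.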

Find all of Foo's musical works on the website, where the creative process is also documented and where you can find his software. Everything is freely available.