Image Credit: Worldwide Universities Network

Love Data Week - Day 2 - Data Documentation - Big Data in Smart Cities

Welcome to the second day of Love Data Week 2018!

Research data at DTU:


"Responsible research data management is a core task to ensure integrity, transparency and honesty in research. Whether you are developing new research ideas or protecting your results against accusations of scientific misconduct, you need to be able to find and understand the underlying data. This is a big challenge if the data are not well documented and preserved."

Merian Haugwitz - Senior Officer on Research Integrity at DTU.
 

Stories about research data:

 
How a dinosaur can disappear in the basement

All too often, researchers are "sloppy" about documenting their data fully and correctly, mainly because it takes time. Researchers tend to rely on their own memory instead of writing a detailed description of the experiment, the data analysis or the data itself; they are convinced that they will remember all the relevant information.

It also happens that someone wants to build on other people's projects, e.g. those of researchers who have left the group, or simply wants to save time and avoid repeating failed experiments or analyses. But if the data are not well documented, they cannot be reused, and resources are wasted.

In extreme cases, missing data documentation can result in an entire dinosaur skeleton being lost in the basement of a museum! That is what happened to Dr. David Evans, Associate Curator of Vertebrate Palaeontology at the Royal Ontario Museum (ROM) in Canada, who was searching for a Barosaurus skeleton for an exhibition without knowing that it lay right under his feet!
Read his full story on the museum's website
 

We are data:

 
The CITIES project: managing Big Data in Smart Cities

DTU is taking part in a six-year project, CITIES (Center for IT-Intelligent Energy Systems in Cities), which is to contribute to Denmark's goal of becoming 100% fossil-free by 2050. The project uses Denmark as a "living lab", collecting consumption data from the various energy suppliers. Through "smart" use of these data, including modelling and optimised control, CITIES focuses on developing efficient and flexible energy systems.
Read more about the project on its website
 
We spoke with Xiufeng Liu, a researcher at DTU Management Engineering. He is responsible for the data management flow in the project. We asked about his goals as data manager and about the challenges of a project in which enormous amounts of data are collected every minute:

Tell me a bit about your research and your role in the CITIES project…
"My research focuses mainly on the analysis of smart data. In this project in particular, I am responsible for the data management, which involves setting up and administering an IT solution for data processing and analysis on our private cloud platform, and making it available to the researchers."

What kind of data is collected and what is it used for in practice?
"In the CITIES project, we have collected data including IoT (Internet of Things) data, energy consumption data, socio-economic data, and public registry data (BBR). The data are mainly used for research purposes, for example, data analysis and creation of models to optimize the energy consumption of buildings."

What is the main challenge with regard to data management in this project?
"There are several challenges. The first one is how to manage the data effectively and make efficient use of current data technologies. Today, there are many data technologies available, and most of them are open source. I have to select appropriate tools that can help me manage the data more efficiently. Personally, I prefer using notebook-based data technologies in my work, such as Apache Zeppelin and Jupyter, which can greatly improve work efficiency. I also have to make sure that the selected tools and platforms make it easy for other researchers to access and use the data.

In addition, this project collects different data from different sources. Often, these data have issues such as inconsistent formats and varying quality. I have to invest a lot of time in data cleansing before providing the data to the researchers. This is also my main task in this project. I set up an IT solution on our cloud platform to automate this process, but for specific data sets, I still have to spend time writing programs to cleanse the data.

The other challenge is that we are confronted with data privacy issues. Much of our data comes from energy companies, and the data contain sensitive information, such as energy consumption data and household data. The energy companies always want to anonymize the data as much as possible, for example by publishing only consumption data aggregated across users. But this, of course, will compromise data usability, as the data will contain less information to work with. Therefore, we have to find the balance between data usability (e.g. what are the research questions to answer) and the level of anonymization, which is not always an easy task."

Xiufeng Liu, DTU Management Engineering, xiuli@dtu.dk, ORCID: 0000-0001-5133-6688
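
The balance between anonymization and usability that Xiufeng Liu describes can be illustrated with a small sketch. The Python example below is not the CITIES pipeline: the household names, the readings and the function names (group_households, aggregate) are all invented for illustration. It assumes one simple approach, summing hourly consumption over groups of households, and shows what is kept (the total consumption) and what is lost (which household caused a given peak).

```python
# Hypothetical hourly consumption in kWh for five households.
# These numbers are made up for illustration; they are not CITIES data.
readings = {
    "house_a": [0.4, 0.5, 1.8, 0.6],
    "house_b": [0.3, 0.3, 0.4, 2.1],
    "house_c": [1.2, 0.2, 0.2, 0.3],
    "house_d": [0.5, 0.6, 0.5, 0.6],
    "house_e": [0.2, 1.9, 0.3, 0.2],
}


def group_households(ids, group_size):
    """Split household ids into groups of at least `group_size` members."""
    ids = sorted(ids)
    groups = [ids[i:i + group_size] for i in range(0, len(ids), group_size)]
    # A leftover group smaller than group_size would be easier to trace back
    # to individual households, so fold it into the previous group.
    if len(groups) > 1 and len(groups[-1]) < group_size:
        leftover = groups.pop()
        groups[-1].extend(leftover)
    return groups


def aggregate(readings, group_size):
    """Publish only per-group hourly totals, never per-household values."""
    published = []
    for group in group_households(readings, group_size):
        # zip(*...) lines up the same hour across all households in the group.
        hourly = [round(sum(hour), 2)
                  for hour in zip(*(readings[h] for h in group))]
        published.append({"households": len(group), "kwh": hourly})
    return published


if __name__ == "__main__":
    for size in (2, 5):
        print(f"group_size={size}")
        for row in aggregate(readings, size):
            print("  ", row)
```

A larger group_size makes it harder to single out one household, but it also flattens the individual peaks that a researcher modelling building behaviour may need. Choosing the value is the kind of judgment call the interview describes, and it depends on the research question.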