Love Data Week - Day 2 - Data documentation - Big Data for Smart Cities

Welcome to the second day of Love Data Week 2018!

Research data at DTU:


"Responsible research data management is a core task to ensure integrity, transparency and honesty in research. Whether you are developing new research ideas or protecting your results against accusations of scientific misconduct, you need to be able to find and understand the underlying data. This is a big challenge if the data are not well documented and preserved."

Merian Haugwitz - Senior Officer on Research Integrity at DTU.
 

Stories about data:

 
How to make a dinosaur disappear in the basement?

Researchers are often sloppy about documenting their data properly, mainly because it takes time. Scientists tend to rely on their memory instead of writing a clear description of an experiment, a data analysis or the data itself, convinced that they will remember all the relevant information.

It is also very often the case that, once a researcher leaves a group, somebody else would like to build upon previous projects, or perhaps save some time by not repeating failed experiments or analyses. However, if the data are not well documented, re-use is hindered and resources are wasted.

In an extreme case, a lack of data documentation can even result in the loss of a whole dinosaur skeleton in the basement of a museum! Read about the experience of Dr. David Evans, Associate Curator of Vertebrate Palaeontology at the Royal Ontario Museum (ROM) in Canada, who was looking for a Barosaurus skeleton for an exhibition without knowing that he had one right below his feet!
Read the whole story on the ROM website
 

We are data:  


CITIES project – managing Big Data for Smart Cities

DTU is part of the six-year project CITIES (Center for IT-Intelligent Energy Systems in Cities), which aims to contribute to Denmark’s goal of becoming a 100% fossil-free country by 2050. The project uses Denmark as a living lab, collecting data from different sources related to energy provision and consumption. Through the smart use of data, models and control optimization, CITIES focuses on producing efficient and flexible energy systems.
Read more about the project on the website
 
We talked to Xiufeng Liu, a researcher at DTU Management Engineering who is responsible for the data management workflows in the project, to hear about his tasks as a data manager and about the challenges of a project in which large amounts of data are collected every single minute:

Tell me a bit about your research and your role in the CITIES project…
"My research focuses mainly on the analysis of smart data. In this project in particular, I am responsible for the data management, which involves setting up and administering an IT solution for data processing and analysis on our private cloud platform, and making it available to the researchers."

What kind of data is collected and what is it used for in practice?
"In the CITIES project, we have collected data including IoT (Internet of Things) data, energy consumption data, socio-economic data, and public registry data (BBR). The data are mainly used for research purposes, for example, data analysis and creation of models to optimize the energy consumption of buildings."

What is the main challenge with regard to data management in this project?
"There are several challenges. The first one is how to manage the data effectively and make efficient use of current data technologies. Today, there are many data technologies available, and most of them are open source. I have to select appropriate tools that help me manage the data more efficiently. Personally, I prefer using notebook-based data technologies in my work, such as Apache Zeppelin and Jupyter, which can greatly improve work efficiency. I also have to make sure that the selected tools and platforms make it easy for other researchers to access and use the data.

In addition, this project collects different data from different sources. Often, these data have issues such as inconsistent formats and varying quality. I have to invest a lot of time in data cleansing before providing the data to the researchers. This is also my main task in this project. I have set up an IT solution on our cloud platform to automate this process, but for specific data sets I still have to spend time writing programs to cleanse the data.

The other challenge is that we are confronted with data privacy issues. Much of our data comes from energy companies, and the data contain sensitive information, such as energy consumption data and household data. The energy companies always want to anonymize the data as much as possible, for example by publishing only consumption data aggregated across users. But this, of course, compromises data usability, as the data will contain less information to work with. Therefore, we have to find the balance between data usability (e.g. which research questions need to be answered) and the level of anonymization, which is not always an easy task."

Xiufeng Liu, DTU Management Engineering, xiuli@dtu.dk, ORCID: 0000-0001-5133-6688
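
The trade-off between anonymization and usability described in the interview can be illustrated with a small Python sketch. It is not taken from the CITIES project: the readings, the district names, the function `aggregate_by_district` and the `min_households` threshold are all hypothetical, chosen only to show how publishing aggregated values hides individual households while also reducing the information left to work with.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (household_id, district, kWh).
# All values are invented for illustration; each household appears once.
READINGS = [
    ("h1", "north", 1.2),
    ("h2", "north", 0.8),
    ("h3", "north", 1.0),
    ("h4", "south", 2.5),
    ("h5", "south", 1.5),
    ("h6", "east", 3.0),
]


def aggregate_by_district(readings, min_households=2):
    """Return the mean kWh per district.

    Districts with fewer than `min_households` households are dropped,
    because a published average over very few households could still
    reveal an individual household's consumption.
    """
    groups = defaultdict(list)
    for _household_id, district, kwh in readings:
        # The household identifier is deliberately not kept in the output.
        groups[district].append(kwh)

    result = {}
    for district, values in groups.items():
        if len(values) >= min_households:
            result[district] = round(mean(values), 2)
    return result


if __name__ == "__main__":
    # A stricter threshold protects households better but publishes less.
    for threshold in (1, 2, 3):
        print(threshold, aggregate_by_district(READINGS, threshold))
```

With a threshold of 2, the "east" district, which has a single household, disappears from the output; with a threshold of 3, only "north" remains. Raising the threshold strengthens anonymization but leaves fewer districts to analyse, which is the balance the interview refers to.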