
Good News – How data science helps us to better understand the Coronavirus pandemic

Data science can help to bring light into the darkness of complex processes. | Photo: Andreas Töpfer
The pandemic caused by the coronavirus, which began in late 2019, still affects the lives of billions of people around the globe half a year later. In this situation, real fears of infection play just as large a role as deliberately disseminated misinformation. Publicly available data are therefore important for obtaining independent information and for fighting the virus. A group of researchers at the Hasso Plattner Institute (HPI) of the University of Potsdam is helping by providing findings and IT tools for graphical evaluation free of charge.

Who would have thought a few months ago that our everyday life would be turned upside down by spring 2020? Social distancing rules in everyday life, wearing face masks while shopping, emergency on-site operations at universities – all of this has been triggered by the COVID-19 pandemic. Despite the restrictions, researchers at the HPI are working on collecting up-to-date data on the worldwide spread of the coronavirus and making it available to the public.

For many Germans, the restrictions due to the pandemic came abruptly and became tangible once kindergartens and schools were closed and contact restrictions came into force in March 2020. Until then, news coverage had primarily been limited to a distant epidemic in China. Even the prestigious Robert Koch Institute had only occasionally updated the global number of cases on its website. At the beginning of 2020, people were still amazed at the unabated zeal of the Chinese, who practically built entire emergency hospitals within only a few days.

The key to success: access to latest data

The HPI has already gained experience in researching epidemics. For example, HPI researchers worked with scientists from around the world to contain the 2014 Ebola epidemic in West Africa. At that time, contact tracing proved to be a very important measure. Contacts of infected persons were isolated during the incubation period and asked daily for disease-specific symptoms. The risk of infecting other people could only be reduced by rigorously identifying contact persons and isolating them.

Contact tracing is also a key to success during the current COVID-19 pandemic. With the first cases in Germany, infected people were traced and interviewed to identify their contact persons from the preceding days. The more consistently the tracing was carried out, the clearer it became that it requires a large number of staff to be effective. The Ebola epidemic in 2014 had already shown that there can quickly be a shortage of qualified personnel for contact tracing. Therefore, the HPI developed an app for contact tracing together with an international team of researchers and tested it on-site in Nigeria. With this app and a brief introduction, even non-healthcare personnel were able to support the contact tracing. Recently, the HPI was also involved in the development of the so-called CovApp, which supports the recording of relevant symptoms in suspected cases in Germany. Particularly in times of scarce resources in the healthcare system, such digital solutions demonstrate how these resources can be used more effectively, enabling medical professionals to focus on emergencies.

In addition to data from contact tracing, another important data source is treatment data from hospitals. It contains, for example, details about the number of new, recovered, and deceased cases. However, these figures are collected locally and are only available in distributed, heterogeneous clinical IT systems. So far, no central register exists that automatically records the data without delay. Data that are collected nationwide, however, form the basis of many important decisions. For example, epidemiologists use current data on infected people in each region to assess the spread and recommend appropriate actions. The latest pandemic data can also provide information on the effectiveness of large-scale measures, such as the closing of restaurants.

Automatic integration of latest data

The HPI recognized the seriousness of the situation early on and, as early as January 2020, started to identify available international data sources with case numbers for SARS-CoV-2. Since the center of the epidemic was still in China at the time, the researchers focused on Chinese Internet sources. In the next step, they established an in-memory database to store the number of cases worldwide. Thanks to the in-memory technology researched at the HPI, flexible real-time analyses of big data are now possible. In the database, the currently reported numbers of infected, recovered, and deceased cases per country or region are stored together with corresponding timestamps.
So-called crawlers are used to automatically collect the latest data. These computer programs periodically check websites for the number of cases and automatically import them into the database once updated data are identified. In this way, the researchers were able to create a complete longitudinal database of the global number of cases, which now includes approximately 20,000 entries for almost 600 regions and countries around the globe.
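
The import step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HPI implementation: SQLite's in-memory mode stands in for the in-memory database, the table and function names (`case_reports`, `create_database`, `import_if_updated`) are hypothetical, and the step of actually fetching numbers from a website is left out.

```python
import sqlite3


def create_database():
    """Open an in-memory SQLite database with one table of case reports."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE case_reports (
            region      TEXT NOT NULL,
            reported_at TEXT NOT NULL,   -- ISO 8601 timestamp
            infected    INTEGER NOT NULL,
            recovered   INTEGER NOT NULL,
            deceased    INTEGER NOT NULL,
            PRIMARY KEY (region, reported_at)
        )
        """
    )
    return conn


def import_if_updated(conn, region, reported_at, infected, recovered, deceased):
    """Store a report only if its numbers differ from the latest stored one.

    Returns True if a row was written, False if nothing had changed.
    """
    latest = conn.execute(
        "SELECT infected, recovered, deceased FROM case_reports "
        "WHERE region = ? ORDER BY reported_at DESC LIMIT 1",
        (region,),
    ).fetchone()
    if latest == (infected, recovered, deceased):
        return False  # the website still shows the numbers already stored
    conn.execute(
        "INSERT OR REPLACE INTO case_reports VALUES (?, ?, ?, ?, ?)",
        (region, reported_at, infected, recovered, deceased),
    )
    conn.commit()
    return True
```

A crawler would call `import_if_updated` once per region on every check. Because unchanged numbers are skipped, the table grows only when a source publishes new figures, which is what keeps the history per region compact and longitudinal.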

Visualization is key to data interpretation

Evaluating the data makes it possible to comment on the current situation and to analyze data retrospectively, e.g. to identify trends in individual countries or regions. Software systems are used that support the exploration of big data with interactive visualizations. Fig. 1 shows an example that compares the case numbers from April 20 and May 20, 2020 using a pie chart per country. You can see how much the number of cases increased within just a month, especially in North America but also in South America, in parts of Europe, and in Russia. The numbers there far exceed those in the country of origin, China.
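
A comparison of two reporting dates like the one described above comes down to simple arithmetic per country. The following minimal sketch assumes each snapshot is a plain dictionary mapping a region name to its case count; the function name `compare_snapshots` is hypothetical, and no real case figures are used.

```python
def compare_snapshots(earlier, later):
    """Return absolute increase and growth factor per region.

    Both arguments map region names to cumulative case counts.
    Regions missing from the earlier snapshot are skipped.
    """
    result = {}
    for region, later_cases in later.items():
        earlier_cases = earlier.get(region)
        if earlier_cases is None:
            continue  # no earlier value to compare against
        increase = later_cases - earlier_cases
        # A growth factor is undefined when the earlier count was zero.
        factor = later_cases / earlier_cases if earlier_cases > 0 else None
        result[region] = (increase, factor)
    return result
```

For two invented regions, `compare_snapshots({"A": 100}, {"A": 250})` yields an increase of 150 cases and a growth factor of 2.5 for region A.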

The African continent supposedly has low case numbers. But is this really true? Researchers encounter another challenge here. They can access reported data from almost any country but have no influence on their quality. This refers not only to the correctness of the transmitted data but especially to the definitions and assumptions of local authorities. For example, which criteria are used to decide whether a suspected case is reported as infected or not? Especially at the beginning of the year, there was a lack of capacity for comprehensive testing. Instead of a PCR test for viral RNA, other indicators, such as CT images of the lungs, were used to diagnose a case. These different procedures led to different measurement errors in the reported numbers per country.

In African countries with a less well-positioned health care system, systematic testing of suspected COVID-19 cases is extremely difficult. The documentation of suspected cases and the collection of data from regional medical centers also pose logistical problems for local governments. Based on experience from previous epidemics, it can be assumed that the publicly reported figures represent only a small fraction of the reality. In addition, fortunately only a relatively small proportion of those infected develop serious symptoms that require hospitalization.

Quick prognoses thanks to artificial intelligence

For Germany, we know that many infected people only have slight or no symptoms and are, therefore, not registered even when they see a doctor. To take this error into account in national figures, residents of selected German Coronavirus hotspots were interviewed and tested by researchers. It is hoped that these regional studies will provide a more precise forecast of the real number of cases in Germany and a better understanding of the way the virus is transmitted. The database of the recorded COVID-19 data is also used at the HPI as the basis for forecasts. Methods of machine learning and artificial intelligence are used, for example, to forecast the number of cases in other countries or the effectiveness of measures based on the developments in China.
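
The article does not say which forecasting methods are used at the HPI. As a deliberately simple stand-in, the sketch below fits an exponential growth curve to a short series of case counts by least squares on the logarithmic scale and extrapolates it. The function names are hypothetical, and the approach assumes strictly positive counts and unchecked growth, which only holds in the early phase of an outbreak.

```python
import math


def fit_exponential(days, cases):
    """Fit cases ~ a * exp(b * day) by linear least squares on log(cases)."""
    logs = [math.log(c) for c in cases]  # requires every count to be > 0
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx                      # growth rate per day
    a = math.exp(mean_y - b * mean_x)  # fitted count at day 0
    return a, b


def forecast(a, b, day):
    """Extrapolate the fitted curve to the given day."""
    return a * math.exp(b * day)
```

The doubling time implied by such a fit is `math.log(2) / b` days. Any forecast of this kind inherits the measurement errors in the reported figures discussed above.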

The sooner you can access the latest nationwide data, the faster you can take appropriate measures to deal with the current and any future pandemic. Established clinical processes for systematic testing, a central register for recording suspected cases as well as suitable IT tools for interactive and flexible evaluation of the data provide the basis for enabling medical experts to react even more quickly to the next pandemic.

The Researcher

Dr.-Ing. Matthieu-P. Schapranow is leader of the working group “In-Memory Computing for Digital Health” and Scientific Manager Digital Health Innovations at the Hasso Plattner Institute (HPI). He performs voluntary work, for example, as a member of the Platform Learning Systems, the working group E-Health of the Federal Association for Information Technology, Telecommunications and New Media (BITKOM), and the Global Alliance for Genomics and Health.
Mail: schapranowhpide


This text was published in the university magazine Portal Wissen - Two 2020 „Health“.