Data is all the rage. Satellite images make the entire world available - in detail and around the clock. Humans, too, are being captured down to the last detail, from their genetic material to their heartbeat. Traffic flows, cell structures, internet flows - and that’s only the beginning. But the more data that is available, the more pressing becomes the question of how to arrange, analyze, and interpret them. Mathematical models provide a solution. They can structure large amounts of data and make them “readable”. Model and data, however, do not always come together easily. This is where the Collaborative Research Center (CRC) 1294 comes in. Its title says it all: “The seamless integration of data and models”. Matthias Zimmermann interviewed Prof. Sebastian Reich, speaker of the CRC, and his deputy Prof. Wilhelm Huisinga.

## Data assimilation bridges theory - that is, the mathematical model - and practice - the measured data that should go into this model. How does this work?

**Sebastian Reich:** Thanks to modern computing technology, mathematical models can be simulated. Incidentally, this also creates data, in a way a reflection of the phenomenon that one wants to model. On the other hand, there is experimental data that you obtain from measurements. The goal of data assimilation is to bring these two worlds together and to use the experimental data to calibrate, validate, and compare models, verify model approaches and so on. You want to combine the best of both worlds, and this happens algorithmically through the assimilation of data in models.

**Wilhelm Huisinga:** May I disagree right away? (Laughs) I would say this “best of both worlds” does not exist as such. It only becomes possible as a result of the combination. Big data is often said to generate knowledge through large data sets. Here we would say: Collecting a lot of data does not necessarily yield new findings. It also takes innovative methods to turn this data into knowledge. And that is done through combining experiment and model. In principle, models contain our idea of the processes underlying the measured data. Only by combining models and data are you able to generate knowledge.

**Reich:** Although this distinction is also quite subtle. Scientific models use principles for model design. But we also have projects where these principles are not necessarily known. Cognitive science, for example, is about discovering these principles in the first place - and for this, you also need data. You can suggest different models, but to decide which model is the more appropriate one, they have to be matched with data. This is done in data assimilation. It is the interface between statistics, which has historically dealt with how to model data, and applied mathematics, which is primarily devoted to the development of models and their analysis, but also to machine learning. The latter essentially also deals with the question of how models that fulfill certain tasks can be generated from data.

## How did the field of data assimilation evolve in the first place?

**Reich:** In close relation to meteorology. Although we have a relatively good understanding of how weather phenomena occur, meteorology is about making predictions. The statistician George Box put it nicely when he said, “Essentially, all models are wrong, but some are useful.” It means that you have to keep adapting models to reality. This also applies to weather forecasting models. In this context, data assimilation has made enormous progress. The availability of satellite data in the southern hemisphere, for example, has immensely improved the quality of forecasts. The CRC, however, deals not only with predictions. We also want to find models that can explain things.

**Huisinga:** An example from the CRC: A subproject explores the question of how to determine whether a person has previously seen a picture they are looking at. Here, the mechanism of information retrieval in the brain and its impact on eye movement is not important. In other areas, we also want to develop an understanding of why something is the way it is. To Box’s quote I would like to add: Actually, we should say that all models are approximations, i.e. methods of approximation. Some of them are useful, while others are completely futile so that you cannot do anything with them.

**Reich:** … “Wrong” is, of course, a bit exaggerated. The models have an approximation quality, but errors occur, that's the important point. And these errors accumulate, for example when it comes to weather forecasts, so that the forecast quality is very low after seven days. Only a constant adaptation of the model to the data allows for repeatedly good predictions. This is how we have to interpret George Box’s statement.

**Huisinga:** That’s exactly why data assimilation is such a nice combination. One has an explanatory or even predictive model that seems off after a few days, and so you force the model to constantly respond to reality - through data assimilation.

**Reich:** This is like a learning process. The model is constantly learning through these data.

## Are the “data assimilation algorithms” different from the mathematical models for simulating processes themselves?

**Huisinga:** I would say yes. In meteorology, to stick to the example, there are equations that describe the various processes and make a prediction for a specific time and place. When a measuring point is added, it needs to be correlated with the prediction. I would say that is a kind of statistical model which is based on the actual mechanistic model and assimilates data and models.

## How does this work mathematically?

**Reich:** You can see it as a kind of feedback effect. The original model has an output, for example a prediction, which is repeatedly compared to new observations, that is, to data. The model is adjusted based on this data. This feedback loop is relatively independent of the actual model. Therefore, Research Area A of the CRC deals with general algorithms of data assimilation as an independent mathematical problem. Research Area B deals with concrete applications, for which the algorithms have to be adapted: Which processes are behind all this? How can you achieve such feedback? What is the specific task: prediction, classification, or model verification? A good example, even if it is not yet relevant at the CRC, is autonomous driving. A self-driving car has many sensors that can discern the environment. The car has to react to their data. This requires a basic model defining what the vehicle does when and how. A feedback system takes into account the constantly incoming measured data. Similar questions will have to be addressed in personalized medicine.
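The feedback loop described here can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the CRC: the toy "reality" `true_step`, the deliberately imperfect `model_step`, the fixed `gain`, and the noise level. The correction used is simple nudging, that is, pulling each forecast part of the way toward the latest observation.

```python
import random


def true_step(x):
    # Hypothetical "reality": the state relaxes toward 20.
    return 0.95 * x + 1.0


def model_step(x):
    # Imperfect model with a slightly wrong coefficient: relaxes toward 10.
    return 0.90 * x + 1.0


def run(n_steps=200, gain=0.5, obs_noise=0.1, seed=0):
    rng = random.Random(seed)
    truth = free = assimilated = 5.0
    for _ in range(n_steps):
        truth = true_step(truth)
        # Model left alone: its error accumulates.
        free = model_step(free)
        # Model with feedback: forecast, observe, correct.
        forecast = model_step(assimilated)
        observation = truth + rng.gauss(0.0, obs_noise)
        assimilated = forecast + gain * (observation - forecast)
    return abs(free - truth), abs(assimilated - truth)


free_error, assimilated_error = run()
print(f"error without feedback: {free_error:.2f}")
print(f"error with feedback:    {assimilated_error:.2f}")
```

Note that the loop never looks inside `model_step`. It only needs a forecast and an observation, which is one way to read the remark that the feedback is relatively independent of the actual model.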

**Huisinga:** The great thing about mathematics is that it’s a kind of general language that can be used to describe phenomena - in an abstracting way. This shows that different application problems are based on the same mathematical questions or models. This is also the case in the CRC where the same mathematical process is important in two projects, even when one is on earthquake research and the other on the movement of amoebae. What’s special about this process is that the occurrence of one event feeds back on future events. A big earthquake entails many small earthquakes. Amoebae, in turn, move by creating coordinated protuberances of the membranes called pseudopodia. The probability of a pseudopodium forming is higher in the vicinity of existing ones. This is how the cell moves in one direction. This shows that earthquakes and amoebae are - mathematically - much closer than one might expect at first glance.
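The interview does not name this process. One common formalization of events that make further events more likely is the Hawkes process, which is assumed here purely for illustration. In this sketch the event rate is a constant baseline `mu` plus a contribution from every past event that decays exponentially. The function name and the parameters `mu`, `alpha`, and `beta` are hypothetical and not taken from the CRC projects.

```python
import math


def intensity(t, past_events, mu=0.2, alpha=0.5, beta=1.0):
    """Event rate at time t, given the times of earlier events (assumed Hawkes form)."""
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))  # each past event adds a decaying boost
        for t_i in past_events
        if t_i < t
    )
    return mu + excitation


baseline = intensity(5.0, [])           # no earlier events: only the baseline rate
shortly_after = intensity(1.0, [0.0])   # one time unit after an event at t = 0
long_after = intensity(10.0, [0.0])     # the boost has almost died away

print(baseline, shortly_after, long_after)
```

Read for earthquakes, the boost after an event corresponds to aftershocks. For amoebae, the analogous statement in the text is about space rather than time: a new pseudopodium is more likely near existing ones.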

**Reich:** The other important aspect is that only mathematical modeling and modern computing technology have enabled us to extensively analyze, simulate, and even predict complex processes. Modern computers enable us to determine planetary orbits thousands of years in advance, or to predict the probability of earthquakes.

## Is there anything like general algorithms for data assimilation?

**Reich:** Yes, there are certain basic principles. One of the “classics”, the Kálmán filter, dates back to the 1960s. Rudolf Kálmán was a Hungarian mathematician who developed such an algorithm for data assimilation of linear models, i.e. a limited class of models. This played an important role in the Apollo program, for example. But there are other methods and different statistical techniques. Our CRC also wants to consolidate these a bit, derive principles and, in particular, develop new algorithms.
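To give an idea of what such an algorithm looks like, here is a minimal sketch of a Kalman filter for the simplest possible case: a single number that evolves linearly and is observed directly with noise. The function name `kalman_step` and the parameters `a` (model coefficient), `q` (model noise variance), and `r` (observation noise variance) are illustrative choices. Real applications work with vectors and matrices instead of scalars.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.0, r=1.0):
    """One forecast-and-update cycle of a scalar Kalman filter.

    Assumed model:       x_next = a * x + noise with variance q
    Assumed observation: obs = x + noise with variance r
    """
    # Forecast: push the current estimate and its uncertainty through the model.
    mean_f = a * mean
    var_f = a * a * var + q
    # Update: weight forecast and observation by their uncertainties.
    gain = var_f / (var_f + r)
    mean_a = mean_f + gain * (obs - mean_f)
    var_a = (1.0 - gain) * var_f
    return mean_a, var_a


first = kalman_step(0.0, 1.0, 2.0)

mean, var = 0.0, 1.0
for obs in [2.0, 2.0, 2.0]:
    mean, var = kalman_step(mean, var, obs)

print(first)      # estimate after one observation
print(mean, var)  # estimate after three observations
```

With each observation the estimate moves toward the measured value and its variance shrinks. In contrast to a fixed correction strength, the gain here is computed from the current uncertainties of forecast and observation.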

**Huisinga:** Another goal of the CRC is to bring data assimilation from classical applications, such as geosciences and meteorology, to new areas, such as cognitive science or biology. One of our application examples is on how to use these data assimilation techniques profitably in pharmacology.

**Reich:** For a long time, data assimilation was primarily developed by its users. Over the past decade, however, there has been a growing interest in understanding these things mathematically and analyzing what’s actually going on. The CRC is expanding on these developments.

**Huisinga:** Ultimately, the users will also benefit from it. Especially in practice, you have to know when models are working and, even more importantly, when they are not, because the users will apply them to everything. Sometimes you may not realize that a surprising prediction is not a real phenomenon, but simply results from the fact that the math or the model does not hold up.

**Reich:** Meteorology, for example, is now reaching the limits of the algorithms that are used. As the models become ever more detailed and increasingly three-dimensional, new data assimilation algorithms are needed.

## Has the “star of data assimilation” risen with the age of big data?

**Reich:** Mathematically, they overlap in many ways. The difference, in my view, is just that we are quick to speak of “big data” when there are very large data sets. In other words, you have enormous amounts of data to find many parameters. For other applications, where there may not be as much data, more specific models need to be developed. That’s a question of balance. Many algorithms used in data assimilation are increasingly being used in machine learning and vice versa.

## For successful data assimilation, do you need a mathematician for the model, a scientist for the data collection, and a data scientist for its implementation?

**Reich:** It usually starts with a user saying, “I have a question...”. Often, the users are mathematically well-educated and already have a model that they work with but do not fully understand. Then there is a mathematician who tries to analyze and improve the model. Ideally, this leads to a dialog during which they jointly develop an algorithm that is also practicable, because the user needs a forecast at a specific time: the weather forecast by tonight, the earthquake warning before the quake.

**Huisinga:** At the CRC, the mathematicians are responsible for Research Area A and the users for Research Area B. The important thing is to bring them into conversation with one another.

**Reich:** Exactly. But looking at the B projects, it’s clear that mathematicians and users always work on them together. That is also the exciting thing about the projects: As a specialist, you first have to understand the language and problems of others. Only then will you be able to create something together.

## You have mentioned the two big areas of the CRC. Could you quickly explain what Research Area A does?

**Reich:** A lot of math. *(Laughs)* There are six subprojects ranging from statistical questions in connection with high-dimensional stochastic processes to statistical questions of inverse problems. For example, there are two projects that deal especially with the already mentioned aspect of learning, that is, with the continuous adaptation of models to data. Furthermore, Project A03 deals with the question of how to best make decisions, and Project A05 deals with point processes.

## What applications does Research Area B focus on?

**Huisinga:** As mentioned, we are studying how amoebae move in a biophysical project. This is about model development as well as data collection. A second focus is earthquake research, a well-established field of research in Potsdam. The goal is to model how earthquakes are distributed not only over time but also over space. We have two projects in the cognitive sciences, a field we want to open up to data assimilation. One project creates cognitive models of movement, while the other examines, as mentioned, whether you look at a picture differently if you have seen it before.

**Reich:** The last project is about space weather. This refers to the influence of solar activities on the radiation belt that surrounds the Earth and is of great importance to satellites. All in all, the researchers at the CRC merge basic mathematical research for data assimilation with disciplines in which the University of Potsdam has proven itself strong in research - above all, biology and the cognitive and geological sciences. The exciting thing about such a CRC, however, is of course also the interaction between projects. Some do this, others that - but how do they come together? You could, for example, model amoebic motion via stochastic partial differential equations, and then you suddenly have an interaction between a rather application-driven and a mathematically motivated project. In such a context, people who could in principle get in touch with one another, but otherwise never would, are much more likely to actually do so.

The **Collaborative Research Center (CRC) 1294** with the title **“Data Assimilation - The Seamless Integration of Data and Models”** focuses on the integration of large data sets into complex computational models. This is meant to facilitate a better understanding of underlying processes as well as more accurate predictions. In meteorology, hydrology, and the search for raw materials, data assimilation methods are already being used very successfully. In the future, new application areas in the fields of biology, medicine, cognitive and neurosciences are also expected to benefit. This urgently requires a theoretical foundation of the existing algorithms and development of novel ones for data assimilation.

The Collaborative Research Center 1294 comprises 11 scientific subprojects, a data infrastructure project, and a graduate college. In addition, it also includes a central administrative project. Of the 17 applicants of the CRC, two are from the Helmholtz Center Potsdam - German Research Center for Geosciences, one from the Weierstrass Institute Berlin, one from Humboldt Universität zu Berlin, two from the Technische Universität Berlin, and eleven from the Departments of Mathematics, Physics and Astronomy, Computational Science, and Psychology of the University of Potsdam.

## The Researchers

**Prof. Wilhelm Huisinga** studied mathematics in Berlin. Since 2010, he has been Professor of Mathematical Modelling and Systems Biology at the University of Potsdam and Deputy Speaker of CRC 1294.

Mail: huisinga@uni-potsdam.de

**Prof. Sebastian Reich** studied electrical engineering and mathematics at TU Dresden. Since 2004, he has been Professor of Numerical Mathematics at the University of Potsdam. He is Speaker of the CRC 1294.

Mail: sebastian.reich@uni-potsdam.de

*This text was published in the university magazine Portal Wissen - Two 2019 “Data”.*