Visualizing the spread of infectious diseases using public health data

Each of us creates data when we answer polls or questionnaires, obtain documents (e.g. a driver's licence or a passport), change employment, pay taxes, buy or rent a home, visit a doctor or interact with any public agency. All of these actions create an immense amount of data for the public sector, and this data is not always publicly accessible.

This data can be used for different purposes, for example providing information about the cost of living and crime rates, or studying public health. To justify the collection of public health data to the population, its uses and potential should be transparent and understandable to the public. Collected weather data feeds weather prediction applications, where a user can see weather maps and forecasts. Why shouldn't we have disease prediction maps and forecasts as well?

Providing information about the potential risks of certain diseases in an area, together with prediction maps or risk awareness maps and information about precautionary measures, would benefit the population: it would communicate risks, raise awareness and, in general, help keep the population safe.

How to get started?

Public health agencies are responsible for detecting, preventing and controlling infections in the population. In Germany, the Robert Koch Institute collects a wide range of attributes (such as location, age, gender, pathogen and further specifics) for laboratory-confirmed cases of approximately 80 infectious diseases through a mandatory surveillance system.

This data is publicly available, but to be useful to a broader public it should be processed and presented in an interpretable form, using data visualizations and interactive tools. One example is the vaccination maps of Germany. Visualizations like these summarize data and convey a message to laypeople as well as to policy makers. The same principle could be applied to communicating predictions of infectious diseases across regions.

Bayesian spatio-temporal interaction model (BSTIM)

As one example of communicating disease predictions in practice, we develop and present a Bayesian spatio-temporal interaction model (BSTIM), a probabilistic model that predicts aggregated case counts within counties and calendar weeks. To this end, publicly available health data as well as region-specific and demographic data are used. For an in-depth explanation of the model and its implementation, readers are referred to the original paper and the model implementation.
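To make "aggregated case counts within counties and calendar weeks" concrete, here is a minimal sketch of that aggregation step. It is not taken from the BSTIM implementation: the record layout (a county name plus a reporting date), the function name `weekly_counts` and the use of ISO calendar weeks are all assumptions made for illustration, and the example records are made up.

```python
from collections import Counter
from datetime import date


def weekly_counts(cases):
    """Count cases per (county, ISO year, ISO week).

    `cases` is an iterable of (county, reporting_date) pairs; this record
    layout is a hypothetical one chosen for the example.
    """
    counts = Counter()
    for county, reported_on in cases:
        # isocalendar() returns (ISO year, ISO week number, ISO weekday).
        iso_year, iso_week, _ = reported_on.isocalendar()
        counts[(county, iso_year, iso_week)] += 1
    return counts


# Made-up example records, not real surveillance data.
cases = [
    ("Dortmund", date(2016, 7, 25)),
    ("Dortmund", date(2016, 7, 31)),
    ("Dortmund", date(2016, 8, 1)),
    ("Leipzig", date(2016, 7, 27)),
]
counts = weekly_counts(cases)
```

Keying on the ISO year rather than the calendar year keeps weeks that straddle New Year together, which matters when weekly counts are compared across years.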

We evaluated the BSTIM on a one-week-ahead prediction task for two diseases (campylobacteriosis and rotaviral enteritis) across Germany and for Lyme borreliosis across the federal state of Bavaria.

BSTIM predicts how many people are expected to become infected in each county during the next week. In addition, it provides uncertainty estimates, which give a sense of how confident the model is in each prediction.
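One common way to turn a probabilistic prediction into a point estimate with uncertainty bands is to summarize samples from the predictive distribution by their quantiles. The sketch below assumes such samples are already available for a single county and week; the function name `summarize_predictions` and the sample values are hypothetical, and the code is independent of the actual BSTIM implementation. The 25%-75% and 5%-95% ranges match the shaded areas used in the figures of this article.

```python
import statistics


def summarize_predictions(samples):
    """Summarize predictive samples of one county-week case count.

    Returns the median together with the inner 25%-75% and 5%-95% ranges.
    """
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {
        "median": cuts[9],                # 50%
        "inner_50": (cuts[4], cuts[14]),  # 25% and 75%
        "inner_90": (cuts[0], cuts[18]),  # 5% and 95%
    }


# Made-up predictive samples for illustration only.
samples = [3, 5, 4, 6, 8, 5, 7, 4, 5, 6, 9, 5, 4, 6, 7, 5, 3, 6, 5, 10]
summary = summarize_predictions(samples)
```

A narrow 5%-95% range signals a confident prediction, while a wide one tells the reader that the expected case count is much less certain.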

Furthermore, it is possible to see how the model makes predictions, by visualizing model components and the learned disease dynamics. This transparency and interpretability of machine learning models increase scientific understanding and safety.

Visualizing model predictions

To visualize the performance, the reported data and the time series of predictions are plotted together with quantile ranges as a measure of uncertainty. This type of visualization is valuable both for the public and for domain experts.

For example, the one-week-ahead predictions are shown for two selected cities (Dortmund and Leipzig for campylobacteriosis and rotavirus, Nürnberg and München for borreliosis). Below the time series, a choropleth map of Germany (or the federal state of Bavaria in the case of borreliosis) shows the individual predictions for each county in one calendar week as an example.

curves.pdf

Predictions of case counts for various diseases by county. Rows A and B show reported infections (black dots) and the case counts predicted by BSTIM (orange line) and by the hhh4 reference model (blue line) for campylobacteriosis (column 1), rotavirus (column 2) and borreliosis (column 3), for two counties in Germany (campylobacteriosis and rotavirus) or Bavaria (borreliosis). The shaded areas show the inner 25%-75% and 5%-95% percentile ranges. Row C shows predictions of the respective disease for each county in Germany or the federal state of Bavaria in week 30 of 2016 (indicated by a vertical red line in rows A and B).

Visualizing the model components

Domain experts might have a different perspective on the model and need different outputs in addition to model predictions. Visualizing the learned model components makes it possible to inspect how diseases spread in time and space. Similarly, visualizing the learned trends and seasonality shows the temporal evolution of diseases over the years, together with the months of lower and higher activity.

This allows for validation of the whole model, specifically with regard to fairness. This feature of BSTIM is in line with the need for transparent and interpretable machine learning systems, all the more so if they are intended for public use.

Our visualizations of the learned components are shown in the figures below.

interaction_kernels.pdf

Learned interaction effect kernels. Kernels for campylobacteriosis are shown in 1A-C, for rotavirus in 2A-C and for borreliosis in 3A-C. Mean interaction kernels are shown in row A, while rows B and C show two random samples from the inferred posterior distribution over interaction kernels.

temporal_contribution.pdf

Learned temporal contributions. Periodic contributions over the course of three years (2013-2016) for all three diseases are shown in row A, trend contributions in row B and their combination in row C. Red lines show the mean of the exponentiated linear combination of the periodic features, the trend features or both, weighted by their respective parameters. Dashed lines show random samples thereof; the shaded region marks the 25%-75% quantile range.
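The "exponentiated linear combination" in this caption can be illustrated with a small sketch. The basis functions below (one yearly sine/cosine pair and a linear trend), the constant `WEEKS_PER_YEAR`, all function names and the weights are assumptions chosen for illustration; the features and learned parameters actually used by BSTIM are described in the original paper.

```python
import math

WEEKS_PER_YEAR = 52.1775  # assumed average number of weeks per year


def periodic_features(week):
    """Hypothetical seasonal features: one sine/cosine pair per year."""
    angle = 2 * math.pi * week / WEEKS_PER_YEAR
    return [math.sin(angle), math.cos(angle)]


def trend_features(week):
    """Hypothetical trend feature: years elapsed since week 0."""
    return [week / WEEKS_PER_YEAR]


def contribution(features, weights):
    """Exponentiated linear combination of features and weights."""
    return math.exp(sum(w * f for w, f in zip(weights, features)))


# Made-up weights; in a fitted model these would be learned parameters.
w_periodic = [0.4, -0.2]
w_trend = [0.1]

week = 30
periodic = contribution(periodic_features(week), w_periodic)
trend = contribution(trend_features(week), w_trend)
combined = contribution(
    periodic_features(week) + trend_features(week), w_periodic + w_trend
)
```

Because the exponential of a sum is the product of the exponentials, the combined contribution equals the periodic contribution multiplied by the trend contribution, so each factor can be read as a multiplicative effect on the expected case count.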

Conclusion

In this article, we showed one example of how public health data can be used. With appropriate tools and machine learning models, data that has already been collected has great potential to make the work of public health institutions more interpretable and transparent, and to provide beneficial information to the population.