Machine learning approaches are often successful in practice, be it in route planning, speech recognition or image processing.

In research, however, scientists at Princeton University see a crisis: the results of many studies are not reproducible.

In a recent preprint, they list 329 papers from various disciplines in which such problems are known: research in neuropsychiatry, genomics, IT security, toxicology and bioinformatics.

"Apparently all fields discover the errors independently of each other," says computer scientist Sayash Kapoor from Princeton.


They became aware of the problem when they took a closer look at approaches to predicting civil wars, says Kapoor - there have been a number of studies on this in recent years.

All of the studies that claimed to achieve better predictions than classical statistical approaches had significant reproducibility problems.

With these machine learning methods, a training data set, containing for example information on economic development or social conditions before civil wars, is used to automatically identify features with predictive value.

The trained algorithm is then applied to a test data set covering both civil war and peacetime periods in order to measure its prediction quality.

It is crucial that the training data does not already contain information from the test data, or other information that would inflate the prediction quality; in the civil war example, the two data sets should not include data from the same period.
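
As a purely illustrative Python sketch (not taken from any of the studies mentioned), the following snippet builds a small synthetic country-year panel and compares a random train-test split with a split along time; all column names (country, year, gdp_growth, conflict) and numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic country-year panel: each country has a slowly changing latent
# risk, so conflict labels from neighbouring years are strongly correlated.
rows = []
for country in range(50):
    risk = rng.standard_normal()
    for year in range(1980, 2020):
        risk = 0.95 * risk + 0.3 * rng.standard_normal()
        rows.append({
            "country": country,
            "year": year,
            "gdp_growth": -risk + rng.standard_normal(),
            "conflict": int(risk + 0.5 * rng.standard_normal() > 1.0),
        })
df = pd.DataFrame(rows)
features = ["country", "year", "gdp_growth"]

def evaluate(train, test):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(train[features], train["conflict"])
    scores = model.predict_proba(test[features])[:, 1]
    return roc_auc_score(test["conflict"], scores)

# Problematic: a random split puts years from the same country and period
# into both training and test data, so the model can exploit that overlap.
tr, te = train_test_split(df, test_size=0.3, random_state=0)
print("random split AUC:  ", round(evaluate(tr, te), 2))

# Safer: test only on years that lie after the entire training period.
print("temporal split AUC:", round(evaluate(df[df.year <= 2005], df[df.year > 2005]), 2))
```

Comparing the two numbers shows how much the evaluation setup alone can change the reported prediction quality.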

Almost all errors lead to overestimation of performance

Last week, Kapoor and his colleague Arvind Narayanan organized an online workshop on the lack of reproducibility, with more than 1,600 participants.

Kapoor says his favorite example is an algorithm designed to detect high blood pressure in hospital patients.

Information on the patients' medication was also available to it; in the end, the algorithm simply detected high blood pressure because the patients were taking antihypertensive drugs.

Researchers call this "data leakage": an algorithm is given information that artificially improves its predictive power.
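
The effect can be reproduced on synthetic data. In the following illustrative Python sketch (not the actual hospital study), the invented feature on_antihypertensive is little more than a re-encoding of the diagnosis, and adding it makes the measured accuracy jump.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5000

# Synthetic patients: the label is hypertension, the legitimate features
# (age, BMI) carry only a modest signal.
hypertension = rng.integers(0, 2, n)
age = 50 + 10 * rng.standard_normal(n) + 3 * hypertension
bmi = 26 + 4 * rng.standard_normal(n) + 1.5 * hypertension

# Leaky feature: whether the patient already receives antihypertensive
# drugs - in this toy example it is almost a re-encoding of the diagnosis.
on_antihypertensive = ((hypertension == 1) & (rng.random(n) < 0.9)).astype(int)

def accuracy(X):
    X_tr, X_te, y_tr, y_te = train_test_split(X, hypertension, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

honest = accuracy(np.column_stack([age, bmi]))
leaky = accuracy(np.column_stack([age, bmi, on_antihypertensive]))
print(f"without medication feature: {honest:.2f}, with it: {leaky:.2f}")
```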

While errors in applications that are already in widespread use can at least be noticed in everyday life, it is harder in research: results are normally reported in papers based on test data sets, and they are often far too good to be true.

"Almost all mistakes lead to the performance being overestimated," says Narayanan - there is a "rampant over-optimism" that may have something to do with the fact that commercial providers also make big promises.

The researchers see a considerable need for action and therefore presented approaches at the workshop for recognizing and avoiding such errors.

Is the artificial intelligence hype to blame?

Moritz Hardt also emphasizes that separating training and test data is essential. He has been a director at the Max Planck Institute for Intelligent Systems in Tübingen for almost a year and previously did research in the USA, in part with the Princeton team.

However, many problems have actually been known for a long time.

"What's new is that the hype surrounding artificial intelligence is trying to apply it to new areas of science," he says.

The performance of machine learning methods generally depends heavily on the data used: an algorithm that delivers good results on data from a Frankfurt clinic is not necessarily transferable to data from a Munich clinic if the data there is collected slightly differently.
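
A small, purely hypothetical Python sketch of this transfer problem: the two simulated "clinics" below differ only in a constant offset on one lab value, which is enough to unsettle a model whose decision threshold was fitted to the first site. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_clinic(n, lab_offset=0.0):
    # Synthetic patients: an underlying risk drives both a lab value and the
    # diagnosis; lab_offset mimics a slightly different measurement or
    # coding convention at another site.
    risk = rng.standard_normal(n)
    lab_value = risk + 0.5 * rng.standard_normal(n) + lab_offset
    label = (risk + 0.3 * rng.standard_normal(n) > 0).astype(int)
    return lab_value.reshape(-1, 1), label

X_fra, y_fra = make_clinic(4000)                  # training site ("Frankfurt")
X_muc, y_muc = make_clinic(4000, lab_offset=1.5)  # shifted data ("Munich")

model = LogisticRegression().fit(X_fra, y_fra)

# Good numbers on data from the training site say little about how the
# model behaves on data that was collected slightly differently elsewhere.
print("Frankfurt accuracy:", round(accuracy_score(y_fra, model.predict(X_fra)), 2))
print("Munich accuracy:   ", round(accuracy_score(y_muc, model.predict(X_muc)), 2))
```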

In addition, there are statistical problems, some of which were already being discussed in the 1990s, because the approaches are increasingly being used in situations for which they were not intended.

"There are many ways that machine learning can fail," says Hardt.

He does not want to talk about a general mood of crisis - there is a lot of optimism about commercial applications, but the research side is sometimes more pessimistic.

However, this could also enable progress: "Of course, it may be that we now have a better understanding of when machine learning cannot be used," says Hardt.

It is important to keep an eye on the political context: if algorithms are to make decisions that have serious consequences, questions about the validity of the results are particularly relevant.

The computer scientist Katharina Morik of TU Dortmund sees it similarly: non-specialists without the appropriate training sometimes use machine learning methods incorrectly.

"It is often not recognized that artificial intelligence and machine learning in particular requires thorough study," says the expert;

Software tools are easily available and easy to use.

"This may tempt scientists from other disciplines to dare to analyze data without any knowledge," says Morik.

"It needs a lot more professorships for machine learning so that enough people can be trained."