A cup half filled with coffee tilts, tilts further and further, and the milk foam reaches the rim of the cup.

"Stop," shouts Alexei Efros, computer science professor at the University of California at Berkeley and one of the world's leading minds in machine vision.

This afternoon he is directing the improvised scene in the garden of the New University in Heidelberg.

"Well, what happens next?" The cup remains at the ominous angle.

A machine must be able to answer this question, Efros explains to the group of doctoral candidates and students standing around.

For people it is clear: If you tip the cup further, the coffee lands on the grass.

Coffee is a liquid, liquids flow, and without support everything falls down.

That's common sense.

Such insights are still difficult for computers.

It is the major flaw in the machine learning success story.

Learning algorithms defeated the world's top Go player in 2016; they are used for speech recognition and translation; and recently AlphaFold predicted 200 million protein structures from amino acid sequences alone.

Machine learning and artificial intelligence (AI), the umbrella term for the field, are the unofficial topics of this year's Heidelberg Laureate Forum.

Initiated by the Klaus Tschira Foundation, the event is the mathematics and computer science counterpart to the Nobel Laureate Meetings in Lindau.

For a week, Fields Medalists and winners of the Abel Prize and the Turing Award give lectures to doctoral students from all over the world - and discuss, among other things, solutions to the common-sense problem.

Efficient pattern recognition

"Be careful about saying that deep learning can't do something," says Yann LeCun in a forum panel discussion on the future of machine learning.

Such claims, he says, have mostly turned out to be wrong.

The top AI researcher at Meta is considered one of the "godfathers" of the deep learning revolution. He was already doing research on neural networks in the 1980s, when many still considered the approach a mistake; he has since received the Turing Award, the highest distinction in computer science.

The central element of the software is an artificial neural network inspired by the human brain.

These neural networks can be efficiently trained with large amounts of data to recognize patterns in similar data.

This usually works well only when the network consists of many layers, i.e. when it is "deep".
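A minimal sketch, purely illustrative (the layer sizes and the fake input are invented, and PyTorch is assumed as the framework), of what "deep" means here: many trainable layers stacked on top of each other, each transforming the output of the previous one.

```python
# Illustrative sketch only: a "deep" network is a stack of many trainable layers.
import torch
import torch.nn as nn

deep_net = nn.Sequential(
    nn.Flatten(),                        # turn a 28x28 input image into a vector
    nn.Linear(28 * 28, 256), nn.ReLU(),  # layer 1
    nn.Linear(256, 256), nn.ReLU(),      # layer 2
    nn.Linear(256, 256), nn.ReLU(),      # layer 3 - "depth" means many such layers
    nn.Linear(256, 10),                  # 10 output classes
)

x = torch.randn(1, 1, 28, 28)            # one fake grayscale image
print(deep_net(x).shape)                 # torch.Size([1, 10]): one score per class
```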

At the end of the panel discussion, Joseph Sifakis, also a Turing Award winner, speaks up from the front row.

He's less optimistic.

Just recently, a car's image-recognition system mistook the moon for a traffic light.

Sifakis does research on autonomous driving, where such mix-ups are of course dangerous.

There is a fundamental problem, he says.

LeCun, on the podium, sees it differently; the discussion heats up, and the moderator asks that it be continued over the coffee break.

What is the difference between a cup and a pot?

A classic approach for AI was to provide knowledge in the form of statements: "Coffee is a liquid", for example.

The advantage is that a programmer can understand why the machine makes a certain decision and that one can give explicit instructions.

However, you quickly get into trouble this way: How many criteria do you have to specify in order to distinguish a cup from a pot?
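A toy sketch of that rule-based style (the attribute names are invented here for illustration and not taken from any real system) shows why the criteria multiply quickly:

```python
# Illustrative only: hand-written criteria in the style of classic symbolic AI.
def is_cup(obj: dict) -> bool:
    return (
        obj.get("has_handle", False)
        and obj.get("holds_liquid", False)
        and obj.get("volume_ml", 0) < 500
        # ... but a small pot can also have a handle, hold liquid and be small.
        # Every exception needs yet another hand-written rule.
    )

print(is_cup({"has_handle": True, "holds_liquid": True, "volume_ml": 250}))  # True
```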

With neural networks, on the other hand, no criteria are specified at all, only the input (an image) and the output options (traffic light or no traffic light).

The program is shown hundreds of thousands of pictures of traffic scenes, and whenever it correctly finds a traffic light, the feedback is: more of that.

But what it does with it internally is not transparent.
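A hedged sketch of this supervised setup (again assuming PyTorch; the tiny network and the random stand-in data are invented for illustration): every image carries a human-provided label, and training only nudges the network toward those labels, without any human-readable criteria.

```python
# Illustrative sketch of supervised training: images plus human-provided labels.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 64 * 64, 2),             # outputs: "traffic light" / "no traffic light"
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

images = torch.randn(8, 3, 64, 64)          # stand-in for labelled street scenes
labels = torch.randint(0, 2, (8,))          # 1 = a person marked a traffic light here

loss = loss_fn(classifier(images), labels)  # how wrong were the predictions?
loss.backward()                             # "more of that": adjust the internal weights
optimizer.step()                            # ... which remain opaque to the programmer
```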

Explainability and man-made criteria are sacrificed for the ability to learn from vast amounts of concrete data.

This is where Sifakis sees the problem.

As long as symbolic knowledge (a traffic light does not hang in the sky) cannot be combined with concrete knowledge (the similarity to images of traffic lights in the training data), one should not trust these machines to take part in driving.

He is not alone in this criticism: the AI researcher Gary Marcus, for example, advocates combining symbolically expressed rules with deep learning.

Yann LeCun does not believe in this idea: the symbolic, he argues, does not fit into the mathematical model of deep learning.

The question is to what extent real understanding can be implemented in a neural network at all.

He has the answer himself: "We know that it works," he says, and points to his head.

"We do it here."

Learning from video

The tipping coffee cup provides an example of how this could work.

A machine learning algorithm could be trained with a video of this process: just before the coffee overflows, the video stops and the machine has to predict what will happen next.

The next frame of the video then tells the machine whether its prediction was correct.

Alexei Efros, like LeCun, sees these training experiments as an opportunity to let machines learn physical processes from scratch.

In contrast to "supervised training", in which a person gives the correct solution beforehand and labels the pictures - moon, traffic light, cup, pot - this would then be "self-supervised training".

Ultimately, this is a continuation of the deep learning approach: let the machine figure out how to find the solution itself.

Efros thinks that's right: "I was always against the point of view that algorithms are everything and data is nothing," he says.

Such a preference only comes from the fact that algorithms are man-made.

"The opposite is true: data is everything."