Sora takes off: is artificial general intelligence coming?

A single text prompt can generate a coherent 60-second video

  ◎Our reporter Cui Shuang

  At the beginning of 2024, Sora burst onto the scene, dropping a bombshell on the AI world.

  Released by the American artificial intelligence company OpenAI, this text-to-video model needs only a text prompt to generate a high-definition video with multiple characters and specific types of motion, with the subject and background rendered largely accurately.

Compared with AI video generation applications such as Runway Gen-2 and Pika, whose videos stay coherent for only a few seconds, Sora can generate continuous, stable, high-quality videos of up to 60 seconds that follow the prompt text more completely, render details more accurately, and look more realistic.

  However, due to concerns about possible abuse, OpenAI stated that it currently has no plans to publicly release Sora.

Access to the model is limited to a small group of researchers and creative professionals so that OpenAI can gather feedback on how they use it.

  So far, 48 demonstration videos generated by Sora have been posted on the official website.

The clarity, realistic detail, and very high accuracy of these videos inevitably prompt the question: does this signal the arrival of artificial general intelligence (AGI), with intelligence at or beyond the human level?

Of great significance to AGI research

  After Sora's release, Zhou Hongyi, founder of 360 Group, gave his view: the emergence of Sora has brought the arrival of AGI forward.

AGI was originally expected to take ten years, but it may now take only two or three.

He believes that although Sora looks like just a text-to-video tool, it is actually a milestone in AI's ability to perceive and interact with the world, and will bring major progress to the entire industry.

  "The technical routes to achieve AGI are diverse, involving different research methods and application directions." Wang Jinqiao, deputy chief engineer of the Institute of Automation, Chinese Academy of Sciences and executive deputy director of the Zidong Taichu Large Model Center, told the Science and Technology Daily reporter that currently, academic and industrial There are three main AGI technology routes widely discussed in the industry.

The first is information intelligence, which is “big data + self-supervised learning + large computing power”.

This method relies on large amounts of data to train models through self-supervised learning algorithms, and requires huge computing power to handle complex tasks.

The second is game intelligence.

This technical route emphasizes training intelligent agents through reinforcement learning in human-computer interaction, so that they can learn and make decisions autonomously.

The third is brain-like intelligence.

This approach attempts to implement AGI by imitating the way the human brain operates.

  In Wang Jinqiao's view, judging from the demonstration videos on the official website, Sora has achieved breakthroughs at least in image quality, long video generation, multi-shot consistency, learning the laws of the world, and multimodal fusion.

  "Sora can cause such a sensation not just because the videos it generates are longer and higher-definition, but because it can simulate the movement and interaction of objects in the physical world to a certain extent." Wang Jinqiao said, "This This capability is of great significance to the research of AGI because it involves the machine's in-depth understanding and high degree of simulation of the real world, and these are the core challenges in realizing AGI."

  The reporter learned that, in order to simulate the physical world accurately, Sora was trained on extremely large-scale data and uses advanced algorithms such as diffusion models.

"For AGI, Sora allows everyone to see that the scale effect is not only true in text mode, but also in video mode." Zhou Xinyu, co-founder of Beijing Dark Side of the Moon Technology Co., Ltd. (Moonshot AI) believes, " A general physical world simulator can be built by extending the video generation model. This is a necessary process to achieve AGI."

Real AGI is still some way off

  Impressive as its progress is, Sora still has some technical shortcomings.

  Judging from the videos Sora has generated so far, it can make mistakes in certain details, such as confusing left and right.

At the same time, it cannot fully understand complex cause-and-effect relationships or maintain a highly consistent storyline over a long period of time.

These defects may produce logical errors in the generated video, or content that is at odds with common sense and reality.

  "The way Sora simulates the real physical world is by modeling given text, images, and reference videos, and then predicting the conditional probability distribution of the video data you want to generate. This is not essentially different from the principle of language models. It is also We are doing lossless compression." Zhou Xinyu said, "As long as the compression is good enough, a realistic enough physical world can be simulated."

  Wang Jinqiao emphasized that although Sora can understand surface movements and interactions through learning, it has not yet learned the essence of physical laws.

For example, it does not know how strong a wind must be to blow out a candle, and it does not understand the underlying reason why a glass breaks when it falls on the floor but not when it falls on a carpet.

This is also the most criticized aspect of Sora.

  "Judging from the few public information on Sora, it is still a data-driven fitting, which is to simulate the physical world that humans can see. But the real physical world contains far more than just human visual information." Beijing Zhongguancun Science and Technology Zhang Jie, technical vice president of Jin Technology Co., Ltd., believes that Sora's creativity comes from probability fitting under large amounts of data. It does not generate new knowledge and is still a long way from the goal of "deeply simulating the real physical world."

  Duan Weiwen, director and researcher of the Philosophy of Science and Technology Research Office of the Institute of Philosophy, Chinese Academy of Social Sciences, also expressed a cautious view.

"Sora's near-human expression is actually a synthetic intelligence based on existing data and corpus." He said, "It has found a feasible path to realize AGI, but there is still a long way to go before real AGI. distance, and the value for realizing AGI is relatively limited.”

  In fact, there is a long way to go to achieve the goal of AGI.

Wang Jinqiao talked about several major challenges.

The first is the data bottleneck.

Although pre-trained language models like GPT-4 have made progress in data annotation, data is still a key limiting factor in deep learning.

The second is the generalization bottleneck. Current AI systems often perform well on specific tasks, but have difficulty adapting effectively when faced with new tasks.

The third is the energy consumption bottleneck. As AI models become more complex, the computing resources and energy they require keep increasing.

This places higher demands on hardware equipment.

Media may be the first field of application

  The release of Sora has not only advanced the technology but also sparked discussion of AI governance and ethics.

  Duan Weiwen mentioned that OpenAI has taken relevant measures to prevent the release of inappropriate videos.

Wang Jinqiao further explained that Sora's built-in text prompt filter screens every prompt sent to the model and blocks requests for sensitive or inappropriate content such as violence, pornography, hate speech, and celebrity likenesses.

The video content filter inspects the generated video frames and blocks content that violates OpenAI's safety policies.
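
The two-stage screening described here can be pictured with a minimal sketch. It is purely illustrative and does not reflect OpenAI's actual implementation: the blocked-term list, the function names `prompt_allowed`, `frames_allowed`, and `generate_with_filters`, and the stand-in generator and frame classifier are all hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical blocklist; a real system would rely on trained classifiers.
BLOCKED_TERMS = {"violence", "pornography", "hate speech"}


def prompt_allowed(prompt: str) -> bool:
    """Stage 1: reject prompts that contain any blocked term."""
    text = prompt.lower()
    return not any(term in text for term in BLOCKED_TERMS)


def frames_allowed(frames: List[str], is_unsafe: Callable[[str], bool]) -> bool:
    """Stage 2: reject the video if any generated frame is flagged."""
    return not any(is_unsafe(frame) for frame in frames)


def generate_with_filters(
    prompt: str,
    generate: Callable[[str], List[str]],
    is_unsafe: Callable[[str], bool],
) -> Optional[List[str]]:
    """Return frames only if both the prompt and the output pass."""
    if not prompt_allowed(prompt):
        return None  # blocked before any generation happens
    frames = generate(prompt)
    if not frames_allowed(frames, is_unsafe):
        return None  # blocked after generation
    return frames


# Stand-ins for the video model and the frame classifier (assumptions).
def fake_generate(prompt: str) -> List[str]:
    return [f"{prompt} / frame {i}" for i in range(3)]


def fake_is_unsafe(frame: str) -> bool:
    return "unsafe" in frame
```

The point of checking twice is that a harmless-looking prompt can still yield a problematic frame, so the output check catches what the prompt check cannot.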

  In addition, the OpenAI team may regularly optimize and update Sora to improve its filtering mechanism and ensure that the model can better identify and process sensitive content.

At the same time, the team may monitor system usage to identify and resolve emerging issues in a timely manner.

  "Technically, Sora's way of avoiding extreme violence, pornography, celebrity portraits and other content mainly relies on the alignment ability of the model." Zhou Xinyu said, "This is not much different from the language model, and there are already many practical experience.”

  According to predictions from International Data Corporation, Sora will first be applied in industries such as short video, advertising, interactive entertainment, film and television production, and media.

Sora's many capabilities can help people in these fields create videos more efficiently, speed up production, and increase output.

This will help relevant industries reduce costs, improve efficiency, and further optimize user experience.