China News Weekly | By Hu Yong

  Published in China News Weekly, Issue 1131, March 11, 2024

  Recently, Sora, a text-to-video model developed by the American artificial intelligence research company OpenAI, was unveiled and attracted widespread attention around the world.

While people marvel at its powerful text-to-video capabilities, they also worry that the boundary between the real and the fake will become even harder to discern.

What is Sora? Is it a "magic brush" like that of the legendary painter Ma Liang, or a super monster?

  Sora’s technical achievements and limitations

  Sora is an advanced text-to-video model developed by OpenAI, and its capabilities and range of applications open new horizons for modern artificial intelligence.

The model is not limited to clips of a few seconds; it can produce videos up to a minute long, faithfully following user instructions while maintaining high visual quality.

For users, it feels like a dream has become a reality.

  Currently, Sora is in an exclusive beta phase and is available only to select red teamers (experts tasked with probing a plan, strategy, policy, or product from an adversarial perspective), visual artists, designers, and filmmakers.

This strategic move is meant to ensure that the technology not only meets but exceeds the highest standards of creativity and safety before its wide release.

Once Sora is opened to the public and used by more people, its impact will be felt on a global scale.

  Sora’s technical prowess demonstrates the great strides made in the field of artificial intelligence.

Sora represents a leap from static image generation to dynamic video creation, a complex process that involves not only visual rendering but also the understanding of motion and time progression.

This advancement marks a sea change in AI's ability to interpret and visualize narratives over time, making Sora not merely a tool for creating visuals but something closer to a storyteller.

  The shockwaves from this breakthrough are expected to reach every corner of video creation, and the technology is also likely to evolve from video toward 3D modeling.

Judging from the current demonstration, Sora can understand how the elements described in the prompt exist and operate in the physical world.

This enables the model to accurately represent the user's intended actions and behaviors in the video.

For example, it can realistically reproduce the scene of a person running or the movement of natural phenomena.

Additionally, it accurately renders multiple characters, varied types of motion, and subtle details of subject and background.

  At the same time as Sora's release, OpenAI published a corresponding technical report, "Video Generation Models as World Simulators."

"We found that video models, when trained at scale, exhibit a number of interesting emergent capabilities. These capabilities enable Sora to simulate certain aspects of people, animals, and environments in the physical world," the technical paper reads. Dr. Jim Fan, a senior researcher at NVIDIA, made deeper speculations about how Sora builds the world model internally.

"If you think Sora is a creative toy like DALL-E... you're wrong. Sora is a data-driven physics engine."

  That is to say, although Sora is currently presented as just a video generation model, computer scientists like Fan believe it is essentially a learnable simulator, or world model.

This suggests that artificial intelligence may be able to learn physical laws and phenomena from large volumes of real-world video, as well as from footage that models physical behavior (such as video rendered in the game engine Unreal Engine, though OpenAI does not explicitly say so).

  If so, text-to-3D is very likely to appear in the near future.

By then, not only video shot from multiple angles but also visual-effects production for virtual spaces (such as the metaverse) could be generated easily by artificial intelligence.

  Judging from the videos currently released by OpenAI, the production quality is quite high.

Many of the videos are cinematic; all are high-resolution and most look like they're real - unless you watch them in slow motion.

The camera pans and zooms, and the movement of characters and scenery stays consistent in 3D space. At first glance, you may not even realize you are watching synthetic footage.

  To achieve greater fidelity, Sora combines two different artificial intelligence approaches.

The first is a diffusion model, similar to the one used in image generators such as DALL-E.

This type of model learns to gradually transform random noise into a coherent image.

The second is a transformer architecture, used to analyze and assemble sequential data in context.

For example, large language models use transformer architectures to combine words into generally understandable sentences.

During video generation, OpenAI breaks video clips down into visual "spacetime patches" that Sora's transformer architecture can process.
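
  OpenAI has not disclosed Sora's internals, but the two ingredients described above can be illustrated with a toy sketch. The Python snippet below shows, first, one way a clip might be cut into flattened "spacetime patches" that a transformer can treat as a token sequence, and second, a single simplified denoising update of the kind a diffusion model iterates. The patch sizes, the DDIM-style update rule, and the stand-in noise predictor are illustrative assumptions, not Sora's actual design.

```python
import numpy as np

# Illustrative sketch only: Sora's real patch sizes, denoiser, and
# sampling schedule are not public; everything below is an assumption.

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Cut a video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch spans `pt` consecutive frames and a `ph` x `pw` pixel
    region, so the clip becomes a sequence of tokens, analogous to the
    word tokens a language model consumes.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)     # group the patch-grid axes first
    return v.reshape(-1, pt * ph * pw * C)   # (num_patches, patch_dim)

def ddim_step(x_t, predicted_noise, a_t, a_prev):
    """One deterministic DDIM-style denoising update.

    A diffusion model iterates steps like this, moving from pure noise
    toward coherent content: estimate the clean signal from the noise
    prediction, then re-noise it to the next (lower) noise level.
    """
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * predicted_noise) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * predicted_noise

# Toy usage: 16 frames of 64x64 RGB become 64 tokens of dimension 3072.
rng = np.random.default_rng(0)
video = rng.standard_normal((16, 64, 64, 3)).astype(np.float32)
tokens = to_spacetime_patches(video)
print(tokens.shape)  # (64, 3072)

# In a real system a transformer would predict the noise in each token;
# a random stand-in plays that role here purely to exercise the update.
noise_guess = rng.standard_normal(tokens.shape).astype(np.float32)
denoised = ddim_step(tokens, noise_guess, a_t=0.5, a_prev=0.7)
print(denoised.shape)  # same shape as the token sequence
```

  According to OpenAI's technical report, the real system works on compressed latent representations rather than raw pixels, but the framing of a video as a sequence of spacetime tokens is the same.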

  However, like any breakthrough technology, Sora has its limitations.

Despite the model's advanced capabilities, it is sometimes difficult to accurately simulate the physics of more complex scenes.

This can result in visual effects that are impressive but occasionally defy the laws of physics or fail to accurately represent cause-and-effect scenarios.

For example, the way characters interact with objects in a video may not be physically possible or consistent over time.

  Therefore, although Sora is said to be learning physics, it cannot yet build an accurate physical model.

OpenAI's own blog notes that the model has trouble simulating physics, understanding cause and effect, and getting simple details right.

For example, when asked to generate a video of a person biting a cookie, the model may leave no bite mark on the cookie; or it may show a man running the wrong way on a treadmill.

It can also be confused by spatial details in a prompt, such as a request to follow a specific camera trajectory.

  Sora is conquering territory on multiple fronts

  While not perfect, it's hard not to be impressed by the quality of Sora's early examples and how it might ultimately reshape the video, film, gaming, and other industries.

  On the video side, companies other than OpenAI, from giants like Google to startups like Runway, have launched text-to-video AI projects.

But what makes Sora unique, OpenAI says, is its stunning realism and its ability to generate longer clips than the short clips other models typically come up with.

  For example, one video clip released by OpenAI was generated from the prompt "animated scene of a short furry monster kneeling next to a red candle," along with detailed stage directions ("wide eyes and open mouth") and a description of the desired atmosphere.

As a result, Sora created a Pixar-esque creature that appears to have DNA from the monsters in Monsters, Inc.

When Monsters, Inc. came out in 2001, Pixar trumpeted how difficult it was to create the ultra-complex textures of the monsters' fur, because those textures changed when the creatures moved.

The wizards at Pixar spent months getting it just right.

And OpenAI’s new text-to-video machine seems to do this easily.

No hand-coding is involved; Sora learns 3D geometry and consistency purely from large amounts of observed data.

  While the scene is certainly impressive, the most astounding of Sora's abilities are those it was never explicitly trained for.

As mentioned earlier, Sora is powered by a version of the diffusion model used in OpenAI's DALL-E 3 image generator, together with the Transformer-based engine behind GPT-4. It not only produces videos that satisfy the prompt's requirements, but in doing so it also demonstrates an emerging grasp of film grammar, which translates into storytelling talent.

  For example, another video was created based on "a colorful papercraft world of a coral reef, filled with colorful fish and marine life."

Researchers found that Sora created narrative themes through camera angles and timing.

"There were actually multiple shot changes - these weren't stitched together but were generated in one go by the model," he said. "We didn't tell it to do it, it just did it automatically."

  One feature of Sora that the OpenAI team didn't demonstrate and likely won't release for quite some time is the ability to generate video from a single image or a series of frames.

That will enhance storytelling: creators will be able to sketch their ideas precisely and then bring them to life.

Judging by its storytelling so far, Sora shows an understanding of editing and pacing, and seems to possess rudimentary directing ability.

  However, it will be a long time before text-to-video threatens actual filmmaking, and it may even never happen.

You couldn't make a coherent movie by splicing together 120 one-minute Sora clips, because the model wouldn't respond to prompts in exactly the same way each time; continuity would be impossible.

But that length limit is no obstacle on TikTok, Reels and other social platforms, which Sora and similar programs could transform.

In the past, making a professional film required very expensive equipment; models like this will let ordinary people making videos for social media create very high-quality content.

  Given the pace of its progress, it’s not crazy to imagine that within a few months AI models will be able to create complex multi-scene, multi-character videos that are five to ten minutes long.

However, there is still a long way from isolated clips to a medium that works as a story, one that does not lose the viewer along the way.

Sora won't disrupt the film industry unless it becomes an open-source application that gives creators complete customization and control.

But it's clear that the technology can speed up the work of experienced filmmakers while completely replacing less experienced digital artists.

  Another often-mentioned industry that may be similarly disrupted is video games.

As the OpenAI report puts it, "Sora can control the player in Minecraft (a video game) with a basic policy while simultaneously rendering the world and its dynamics in high fidelity."

Clearly, this is just the beginning of its gaming potential.

Future video game consoles may use diffusion technology to generate interactive video streams in real time, rather than having artists render billions of polygons by hand.

  Some have speculated that Sora was trained on video game engines, specifically Epic Games' Unreal Engine 5.

While Sora almost certainly won't use a video game engine to create its immersive feel, video game worlds may be used to help train Sora's underlying model.

Some Sora demos do look very similar to existing video game worlds.

Game developers were already hit by waves of layoffs in 2023, and Sora could spell further trouble for them.

Of course, it can also significantly lower the barrier to entry.

  Overall, the core of Sora is a multi-faceted artificial intelligence system that can understand and perform tasks across different fields.

Unlike previous models that were specialized in specific tasks such as text generation, image recognition, or strategy games, Sora aims to bridge these capabilities to provide a more comprehensive approach.

This is achieved through cutting-edge techniques in machine learning, including deep learning, reinforcement learning and transfer learning, which enable Sora to leverage knowledge gained in one area to improve performance in another.

  One of the most striking aspects of Sora is its adaptability.

OpenAI emphasizes the importance of creating artificial intelligence systems that can learn from minimal input and easily adapt to new challenges.

Sora embodies this principle, demonstrating the ability to understand context, generate relevant responses and even learn from interactions.

This adaptability not only enhances Sora's performance across a variety of tasks, it also reduces the need for extensive retraining, making it a more efficient and cost-effective solution for artificial intelligence applications.

  2024: when it becomes impossible to distinguish artificial intelligence from reality

  However, no matter how amazing Sora is, almost no one outside the company has tried it, which is always a red flag.

  In a sense, OpenAI might as well be renamed CloseAI: although its products are powerful enough to upend our view of the world, no one tells us how they work inside.

People outside the company haven't had a chance to study or test Sora to see how it's built, and comparisons with previous products aren't possible.

We just know that, similar to large language models, the more computing power OpenAI injects into Sora, the higher the quality of its output.

  But where does its training data come from?

The company was vague.

A spokesperson said only that the model was trained on "licensed and publicly available content"; when asked about potential harms, the spokesperson said the company was still working to address "misinformation, hateful content and bias."

All of this, like the advent of ChatGPT, raises all-too-familiar but serious concerns about deepfakes, copyright infringement, artist livelihoods, and hidden bias.

  OpenAI said, "We draw inspiration from large language models to achieve general capabilities by training on Internet-scale data."

“Drawing inspiration” is the only evasive reference to the source of Sora’s training data.

In the paper, OpenAI further states that "training text-to-video generation systems requires a large number of videos and corresponding text descriptions."

The only place such vast quantities of visual data can be found is the Internet, which hints at where Sora comes from.

  Previously, OpenAI faced a lawsuit for using New York Times articles, without paying for them, to train GPT-2 and GPT-3.

Until now, the rationale for searching for training data from the entire Internet was that it was publicly available.

However, "publicly available" does not always equate to "public domain."

Are there artists, photographers, performers and filmmakers whose work was used to train Sora?

Did they consent to their creative work being used in this way?

  It looks like the new Sora is doing the same thing as the old GPT, but this time specifically for video.

As before, OpenAI is tight-lipped about the data it trains its models on.

  Shrouded in mystery, Sora may become an imagination engine, a film revolution, or a video machine.

But for now it's best to think of it as a provocation or an advertising campaign.

To a large extent, OpenAI is not releasing products, but creating myths.

For now, the public's spectating resembles paparazzi behavior: catching glimpses from the outside.

  So, while I'm very impressed with Sora, I'm not entirely convinced of the hype.

We need to wait until ordinary people have access to the tool, because public perception of Sora right now is very carefully curated.

OpenAI CEO Sam Altman and the company itself have shared only the best videos in their announcements.

They provide access to a small, carefully selected group of users.

Perhaps we should treat these as a big tech company's product demo; we do not know whether the videos will be that good once we actually have the tool in hand.

  In this case, we can't help but worry about the safety and ethical considerations in Sora's construction.

A persistent problem is disinformation, such as deepfakes.

Like other technologies in generative AI, there’s no reason to believe that text-to-video won’t continue to improve at a rapid pace, bringing us ever closer to an era where it’s hard to tell the difference between real and fake.

Imagine this technology combined with AI-driven voice cloning: it could open a whole new path for building deepfakes of people doing things they never did.

  Sora's videos still have some strange glitches when depicting complex scenes with lots of action, suggesting that deepfake videos of this type are still detectable.

In the long run, however, the genuine and the fabricated will inevitably become thoroughly intermingled.

If, in 2024, Sora's AI-generated videos make it nearly impossible for the world to tell AI from reality, then the Information Age will have ended and the Age of Disinformation officially begun.

  By 2030, most people will know that any video, any sound, or any statement can be faked using free artificial intelligence tools.

They generate untold amounts of fiction online every day, and their numbers will only increase in the years to come.

  We live in an age where the sum of human knowledge is almost entirely accessible from the gadgets in our pockets, but artificial intelligence threatens to poison that well.

This is nothing new - Sora isn't the first threat to the internet, and it won't be the last, but it may well be the most damaging yet.

  From a media literacy perspective, this will make validating any user-generated content extremely complex because now users can generate whatever content they want.

Since the world we live in is already post-truth, many people devote themselves to spinning false narratives into stories.

Faking images is harder than faking text, because it requires working knowledge of Photoshop or similar software, which is a barrier to entry.

Faking video is harder still.

Creating fake videos takes a lot of time, expertise and money.

But with Sora and similar apps, you now just type a prompt and get the result.

  How will this change journalism?

I believe Sora enables agenda-setters of all stripes to generate far more content than they have in the past.

And the explosion of AI-generated content from marketers and influencers could effectively crowd out legitimate news and media outlets.

  What is deplorable is that people are not only unaware of this alarming future but eagerly cheer the arrival of each new wave of artificial intelligence technology.

New technologies always have natural eyeball appeal, and the pursuit of traffic by media of all sizes is nothing new.

Amid the frenzy, however, few people scrutinize how reporting on artificial intelligence is framed.

Is anyone serious about clarifying how these technologies work?

Is there a convincing and robust response to some truly outrageous hype?

  What happens in the end?

What the public gets is a science-fiction version of the AI story, and it ends up excluded from important discussions around ethics, use, and the future of work.

All of this intensifies the Hollywoodization of the public's understanding of artificial intelligence.

  (The author is a professor at the School of Journalism and Communication, Peking University)

  "China News Weekly" Issue 9, 2024

 Statement: The use of articles from China News Weekly must be authorized in writing.