Google stated that it had seen unconfirmed reports about OpenAI using YouTube content without permission (Shutterstock)

OpenAI used more than a million hours of YouTube videos to train GPT-4, its latest and most advanced language model, sparking debate about the legal and ethical standards for using data to develop generative AI models, according to a report from the New York Times.

This discovery underscores the significant challenge that AI companies face in obtaining high-quality training data for their models, pushing them into controversial territory regarding copyright laws and fair use claims.

Matt Bryant, a Google spokesman, told The Verge that the company had seen unconfirmed reports of OpenAI's activity, adding that Google's terms of service prohibit unauthorized use or downloading of YouTube content.

Google itself also collects clips from YouTube, according to the report, and Bryant stated in this context that the company trained its models “on some YouTube content, in accordance with our agreements with content creators on the platform.”

The quest for large, diverse datasets to train these sophisticated models led OpenAI to pursue innovative methods of feeding its algorithms.

According to the report, the company developed its "Whisper" speech-recognition model to transcribe audio content, which let it convert huge amounts of YouTube material into text for training its foundation model, GPT-4.
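Whisper has since been released as open source, so the general shape of such a transcription pipeline is easy to illustrate. The sketch below uses the publicly available openai-whisper Python package; the directory layout, file names, and output format are illustrative assumptions, not details from the report.

```python
# Minimal sketch: turning a folder of audio files into a text corpus with
# the open-source openai-whisper package (pip install openai-whisper).
# Paths and the JSONL output format are illustrative assumptions only.
import json
from pathlib import Path

import whisper  # OpenAI's open-source speech-recognition model

# Load a pretrained checkpoint; "base" is small and fast, while larger
# checkpoints ("medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

def transcribe_directory(audio_dir: str, out_path: str) -> None:
    """Transcribe every .mp3 file in a directory into one JSONL corpus."""
    with open(out_path, "w", encoding="utf-8") as out:
        for audio_file in sorted(Path(audio_dir).glob("*.mp3")):
            result = model.transcribe(str(audio_file))
            record = {"source": audio_file.name, "text": result["text"]}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Hypothetical directory of downloaded audio tracks.
    transcribe_directory("audio_clips", "transcripts.jsonl")
```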

This behavior, driven by the need to maintain a competitive advantage and enhance model performance, raises questions about the legality and ethics of using copyrighted material without prior, explicit permission from the platform hosting the content.

The dilemma of obtaining good training data is not limited to OpenAI. It reflects a broader trend in the field, where AI developers' appetite for data is close to outstripping the supply of available resources.

This has led to consideration of alternative strategies, including training models on "synthetic" data produced by the models themselves, and so-called "curriculum learning," which involves feeding models high-quality data in a structured order in the hope that they can form more intelligent connections between concepts using much less information. Neither strategy has been proven yet, as another report, from the Wall Street Journal, noted.
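To make the "curriculum" idea concrete, the sketch below orders training examples from easy to hard before handing them to a training loop. The difficulty measure (word count) and the stub train_step function are illustrative assumptions; neither report describes a specific implementation.

```python
# Minimal sketch of curriculum learning: present training examples in
# order of increasing difficulty. The difficulty proxy (word count) and
# the stub train_step() are illustrative assumptions only.
from typing import Iterable, List

def difficulty(example: str) -> int:
    """A crude difficulty proxy: longer texts are treated as harder."""
    return len(example.split())

def curriculum(examples: List[str], num_stages: int = 3) -> Iterable[List[str]]:
    """Yield stages of examples, easiest stage first."""
    ordered = sorted(examples, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    for start in range(0, len(ordered), stage_size):
        yield ordered[start:start + stage_size]

def train_step(batch: List[str]) -> None:
    """Stand-in for a real optimization step on the batch."""
    print(f"training on {len(batch)} examples "
          f"(hardest: {max(difficulty(e) for e in batch)} words)")

if __name__ == "__main__":
    corpus = [
        "The cat sat.",
        "The cat sat on the mat near the door.",
        "Transformers process tokens in parallel using self-attention layers.",
        "Hello world.",
    ]
    for stage in curriculum(corpus):
        train_step(stage)
```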

For now, the remaining option for companies is to use whatever data they can find, with or without permission. Judging by the multiple lawsuits filed over the past year, that choice will only deepen the disputes among technology companies.

Source: New York Times