Investigation finds companies are training AI models with YouTube content without permission

Artificial intelligence models need as much useful data as possible to perform well, but some of the biggest AI developers are relying partly on transcribed YouTube videos, used without the creators’ permission and in violation of YouTube’s own rules, according to an investigation by Proof News and Wired.
The two outlets revealed that Apple, Nvidia, Anthropic, and other major AI firms have trained their models with a dataset called YouTube Subtitles incorporating transcripts from nearly 175,000 videos across 48,000 channels, all without the video creators knowing.
The YouTube Subtitles dataset comprises the text of video subtitles, often with translations into multiple languages. The dataset was built by EleutherAI, which described the dataset’s goal as lowering barriers to AI development for those outside big tech companies. It’s only one component of the much larger EleutherAI dataset called the Pile. Along with the YouTube transcripts, the Pile has Wikipedia articles, speeches from the European Parliament, and, according to the report, even emails from Enron.
Even so, the Pile has a lot of fans among the major tech companies. For instance, Apple used the Pile to train its OpenELM AI model, while a Salesforce AI model released two years ago was trained with the Pile and has since been downloaded more than 86,000 times.
The YouTube Subtitles dataset encompasses a range of popular channels across news, education, and entertainment. That includes content from major YouTube stars like MrBeast and Marques Brownlee. All of them have had their videos used to train AI models. Proof News set up a tool that searches the collection to see if any particular video or channel is in the mix. There are even a few TechRadar videos in the collection.
Secret Sharing
The YouTube Subtitles dataset seems to contradict YouTube’s terms of service, which explicitly forbid automated scraping of its videos and associated data. That’s exactly what the dataset relied on, however, with a script downloading subtitles through YouTube’s API. The investigation reported that the automated download selected the videos using nearly 500 search terms.
The discovery provoked a lot of surprise and anger from the YouTube creators Proof and Wired interviewed. The concerns about the unauthorized use of content are valid, and some of the creators were upset at the idea their work would be used without payment or permission in AI models. That’s especially true for those who found out the dataset includes transcripts of deleted videos, and in one case, the data comes from a creator who has since removed their entire online presence.
The report didn’t have any comment from EleutherAI. It did point out that the organization describes its mission as democratizing access to AI technologies by releasing trained models. That may conflict with the interests of content creators and platforms, if this dataset is anything to go by. Legal and regulatory battles over AI were already complex. This kind of revelation will likely make the ethical and legal landscape of AI development more treacherous. It’s easy to suggest a balance between innovation and ethical responsibility for AI, but producing it will be a lot harder.