Scale AI releases free lidar dataset to power self-driving car development

High-quality data is the fuel that powers AI algorithms. Without a continual flow of labeled data, bottlenecks can occur, and the algorithm will slowly get worse and add risk to the system.
It’s why labeled data is so critical for companies like Zoox, Cruise and Waymo, which use it to train machine learning models to develop and deploy autonomous vehicles. That need is what led to the creation of Scale AI, a startup that uses software and people to process and label image, lidar and map data for companies building machine learning algorithms. Companies working on autonomous vehicle technology make up a large swath of Scale’s customer base, although its platform is also used by Airbnb, Pinterest and OpenAI, among others.
The COVID-19 pandemic has slowed, or even halted, that flow of data as AV companies suspended testing on public roads — the means of collecting billions of images. Scale is hoping to turn the tap back on, and for free.
This week the company, in collaboration with lidar manufacturer Hesai, launched an open source dataset called PandaSet that can be used for training machine learning models for autonomous driving. The dataset, which is free and licensed for academic and commercial use, includes data collected using Hesai’s forward-facing PandarGT lidar with image-like resolution as well as its mechanical spinning lidar known as Pandar64. The data was collected while driving in urban areas of San Francisco and Silicon Valley before officials issued stay-at-home orders in the area, according to the company.
“AI and machine learning are incredible technologies with an incredible potential for impact, but also a huge pain in the ass,” Scale CEO and co-founder Alexandr Wang told TechCrunch in a recent interview. “Machine learning is definitely a garbage in, garbage out kind of framework — you really need high quality data to be able to power these algorithms. It’s why we built Scale and it’s also why we’re using this dataset today to help drive forward the industry with an open source perspective.”
The goal with this lidar dataset was to give free access to a dense and content-rich dataset, which Wang said was achieved by using two kinds of lidars in complex urban environments filled with cars, bikes, traffic lights and pedestrians.
“The Zoox and the Cruises of the world will often talk about how battle-tested their systems are in these dense urban environments,” Wang said. “We wanted to really expose that to the whole community.”
The dataset includes more than 48,000 camera images and 16,000 lidar sweeps across more than 100 scenes of 8 seconds each, according to the company. It also includes 28 annotation classes for each scene and 37 semantic segmentation labels for most scenes. Traditional cuboid labeling, those little boxes placed around a bike or car, for instance, can’t adequately identify all of the lidar data. So, Scale uses a point cloud segmentation tool to precisely annotate complex objects like rain.
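To make the difference between the two labeling styles concrete, here is a minimal toy sketch in Python. It is not Scale's tool or the PandaSet file format: the points, the box and the class names are all made up for illustration, and the box is axis-aligned for simplicity (real cuboid labels typically also carry a heading angle).

```python
from dataclasses import dataclass


@dataclass
class Cuboid:
    """Hypothetical axis-aligned box label (a simplification for this sketch)."""
    label: str
    min_corner: tuple
    max_corner: tuple

    def contains(self, point):
        # A point is inside when every coordinate lies within the box bounds.
        return all(
            lo <= p <= hi
            for lo, p, hi in zip(self.min_corner, point, self.max_corner)
        )


# Made-up lidar returns as (x, y, z) in meters.
points = [
    (1.0, 0.5, 0.2),   # on a car
    (1.2, 0.6, 0.8),   # on the same car
    (4.0, 2.0, 1.5),   # rain
    (0.5, 3.0, 2.5),   # rain
    (6.0, -1.0, 0.0),  # road surface
]

# Cuboid labeling: one box drawn around the car.
cuboids = [Cuboid("car", (0.8, 0.0, 0.0), (2.0, 1.0, 1.5))]


def labels_from_cuboids(points, cuboids):
    """Give each point the label of the first box containing it, else None."""
    out = []
    for p in points:
        label = None
        for c in cuboids:
            if c.contains(p):
                label = c.label
                break
        out.append(label)
    return out


cuboid_labels = labels_from_cuboids(points, cuboids)

# Point cloud segmentation: every point carries its own class, so diffuse
# things that no box fits well (rain, road surface) are still labeled.
segmentation_labels = ["car", "car", "rain", "rain", "road"]

unlabeled = sum(1 for label in cuboid_labels if label is None)
print(cuboid_labels)
print(f"{unlabeled} of {len(points)} points have no cuboid label")
```

In this toy case the box captures only the two car points, while the per-point labels cover all five.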
Open sourcing AV data isn’t entirely new. Last year, Aptiv and Scale released nuScenes, a large-scale dataset from an autonomous vehicle sensor suite. Argo AI, Cruise and Waymo are among a number of AV companies that have also released data to researchers. Argo AI released curated data along with high-definition maps, while Cruise shared a data visualization tool it created called Webviz that takes raw data collected from all the sensors on a robot and turns that binary code into visuals.
Scale’s efforts are a bit different. For instance, Wang said the license to use this dataset doesn’t have any restrictions.
“There’s a big need right now and a continual need for high quality labeled data,” Wang said. “That’s one of the biggest hurdles overcome when building self driving systems. We want to democratize access to this data, especially at a time when a lot of the self driving companies can’t collect it.”
That doesn’t mean Scale is going to suddenly give away all of its data. It is, after all, a for-profit enterprise. But it’s already considering collecting and open sourcing fresher data later this year.