How Spotify ran the largest Google Dataflow job ever for Wrapped 2019


In early December, Spotify launched its annual personalized Wrapped playlist with its users’ most-streamed sounds of 2019. That has become a bit of a tradition and isn’t necessarily anything new, but for 2019, it also gave users a look back at how they used Spotify over the last decade. Because this was quite a large job, Spotify gave us a bit of a look under the covers of how it generated these lists for its ever-growing number of free and paid subscribers.
It’s no secret that Spotify is a big Google Cloud Platform user. Back in 2016, the music streaming service publicly said that it was going to move to Google Cloud, after all, and in 2018, it disclosed that it would spend at least $450 million on its Google Cloud infrastructure in the following three years.
It was also back in 2018, for that year’s Wrapped, that Spotify ran the largest job ever run on Google Cloud Dataflow, a service the company had started experimenting with a few years earlier. “Back in 2015, we built and open-sourced a big data processing Scala API for Apache Beam and Google Cloud Dataflow called Scio,” Spotify’s VP of Engineering Tyson Singer told me. “We chose Dataflow over Dataproc because it scales with less operational overhead and Dataflow fit with our expected needs for streaming processing. Now we have a great open-source toolset designed and optimized for Dataflow, which in addition to being used by most internal teams, is also used outside of Spotify.”
For Wrapped 2019, which includes the annual and decadal lists, Spotify ran a job that was five times larger than in 2018 — but it did so at three-quarters of the cost. Singer attributes this to his team’s familiarity with the platform. “With this type of global scale, complexity is a natural consequence. By working closely with Google Cloud’s engineering teams and specialists and drawing learnings from previous years, we were able to run one of the most sophisticated Dataflow jobs ever written.”
Still, even with this expertise, the team couldn’t just iterate on the full data set as it figured out how to best analyze the data and use it to tell the most interesting stories to its users. “Our jobs to process this would be large and complex; we needed to decouple the complexity and processing in order to not overwhelm Google Cloud Dataflow,” Singer said. “This meant that we had to get more creative when it came to going from idea, to data analysis, to producing unique stories per user, and we would have to scale this in time and at or below cost. If we weren’t careful, we risked being wasteful with resources and slowing down downstream teams.”
To handle this workload, Spotify not only split its internal teams into three groups (data processing, client-facing and design, and backend systems), but also split the data processing jobs into smaller pieces. That marked a very different approach for the team. “Last year Spotify had one huge job that used a specific feature within Dataflow called ‘Shuffle.’ The idea here was that having a lot of data, we needed to sort through it, in order to understand who did what. While this is quite powerful, it can be costly if you have large amounts of data.”
This year, the company’s engineers minimized the use of Shuffle by using Google Cloud’s Bigtable as an intermediate storage layer. “Bigtable was used as a remediation tool between Dataflow jobs in order for them to process and store more data in a parallel way, rather than the need to always regroup the data,” said Singer. “By breaking down our Dataflow jobs into smaller components — and reusing core functionality — we were able to speed up our jobs and make them more resilient.”
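To make the idea concrete, here is a minimal sketch of how a keyed intermediate store can stand in for a regrouping step. It is not Spotify's code and calls neither the Dataflow nor the Bigtable API: `KeyedStore`, `aggregate_job` and `top_tracks_job` are hypothetical names, and an in-memory dictionary plays the role of a table keyed by user.

```python
from collections import defaultdict


class KeyedStore:
    """Hypothetical in-memory stand-in for a keyed table: row key -> column -> value."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def increment(self, row_key, column, amount):
        row = self.rows[row_key]
        row[column] = row.get(column, 0) + amount

    def read_row(self, row_key):
        return dict(self.rows[row_key])


def aggregate_job(events, store):
    """First job: add each (user, track) play to that user's row.

    The row key does the grouping, so no global group-by-user pass is needed.
    """
    for user_id, track_id in events:
        store.increment(user_id, track_id, 1)


def top_tracks_job(store, user_id, n):
    """Second job: read one user's row and rank tracks by play count."""
    counts = store.read_row(user_id)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [track for track, _ in ranked[:n]]


# Made-up sample events, purely for illustration.
events = [("u1", "a"), ("u2", "b"), ("u1", "a"), ("u1", "c"), ("u2", "b"), ("u2", "d")]
store = KeyedStore()
aggregate_job(events, store)
print(top_tracks_job(store, "u1", 2))
```

Because the first job only ever touches the row for the event it is handling, events can arrive in any order and from any number of workers, and the second job can run later as a separate, smaller step.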
Singer attributes at least a part of the cost savings to this technique of using Bigtable, but he also noted that the team decomposed the problem into data collection, aggregation and data transformation jobs, which it then split into multiple separate jobs. “This way, we were not only able to process more data in parallel, but be more selective about which jobs to rerun, keeping our costs down.”
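The benefit of being selective about reruns can be shown with a small sketch, again in plain Python rather than anything from Spotify's pipeline. The stage functions and the `run` helper are hypothetical; the assumption is simply that each stage saves its output so a later stage can be rerun without repeating the earlier ones.

```python
def collect(raw):
    # Keep only (user, track) pairs where both fields are present.
    return [(user, track) for user, track in raw if user and track]


def aggregate(events):
    # Count plays per user and track.
    counts = {}
    for user, track in events:
        per_user = counts.setdefault(user, {})
        per_user[track] = per_user.get(track, 0) + 1
    return counts


def transform(counts):
    # Pick each user's most-played track; ties go to the alphabetically first.
    return {user: max(sorted(tracks), key=tracks.get) for user, tracks in counts.items()}


STAGES = [("collect", collect), ("aggregate", aggregate), ("transform", transform)]


def run(raw, outputs, start_from="collect"):
    """Run the stages from `start_from` onward, reusing saved earlier outputs."""
    names = [name for name, _ in STAGES]
    start = names.index(start_from)
    executed = []
    for i, (name, stage) in enumerate(STAGES):
        if i < start:
            continue  # earlier output is already saved in `outputs`
        data = raw if i == 0 else outputs[names[i - 1]]
        outputs[name] = stage(data)
        executed.append(name)
    return executed


# Made-up sample data, purely for illustration.
raw = [("u1", "a"), ("u1", "b"), ("u1", "a"), ("", "x"), ("u2", "c")]
outputs = {}
first_run = run(raw, outputs)                          # runs all three stages
rerun = run(raw, outputs, start_from="transform")      # reruns only the last stage
```

In this toy version a change to the final step costs one stage instead of three, which is the kind of saving the decomposition is meant to allow.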
Many of the techniques the engineers on Singer’s teams developed are currently in use across Spotify. “The great thing about how Wrapped works is that we are able to build out more tools to understand a user, while building a great product for them,” he said. “Our specialized techniques and expertise of Scio, Dataflow and big data processing, in general, is widely used to power Spotify’s portfolio of products.”