Data quality: The unseen villain of machine learning


What are the main things a modern machine learning engineer does?
This seems like an easy question with a simple answer:
Build machine learning models and analyze data.
In reality, this answer is often not true.
Efficient use of data is essential in a successful modern business. However, transforming data into tangible business outcomes requires it to undergo a journey. It must be acquired, securely shared and analyzed in its own development lifecycle.
The explosion of cloud computing in the mid-to-late 2000s and enterprise adoption of machine learning a decade later effectively addressed the start and end of this journey. Unfortunately, businesses often encounter obstacles in the middle stage relating to data quality, which typically is not on the radar of most executives.
Solutions consultant at Ataccama.
How poor data quality affects businesses
Poor-quality, unusable data is a burden for those at the end of the data’s journey: the users who rely on it to build models and contribute to other profit-generating activities.
Too often, data scientists are the people hired to “build machine learning models and analyze data,” but bad data prevents them from doing anything of the sort. Organizations put so much effort and attention into getting access to this data, but nobody thinks to check if the data going “in” to the model is usable. If the input data is flawed, the output models and analyses will be too.
It is estimated that data scientists spend between 60 and 80 percent of their time ensuring data is cleansed, in order for their project outcomes to be reliable. This cleaning process can involve guessing the meaning of data and inferring gaps, and they may inadvertently discard potentially valuable data from their models. The outcome is frustrating and inefficient as this dirty data prevents data scientists from doing the valuable part of their job: solving business problems.
This massive, often invisible cost slows projects and reduces their outcomes.
The problem worsens when data cleanup tasks are performed in repetitive silos. Just because one person noticed and cleaned up a problem in one project doesn’t mean they’ve sorted the issue for all their colleagues and their respective projects.
Even if a data engineering team can undertake a mass cleanup, they may not be able to do so instantly and they may not fully understand the context of the task and why they’re doing it.
The impact of data quality on machine learning
Clean data is particularly important for machine learning projects. Whether the task is classification or regression, the learning supervised or unsupervised, or the model a deep neural network, its builders must constantly evaluate it against new data once it enters production.
A crucial part of the machine learning lifecycle is managing data drift to ensure the model remains effective and continues to provide business value. Data is an ever-changing landscape, after all. Source systems may be merged after an acquisition, new governance may come into play or the commercial landscape can change.
This means previous assumptions about the data may no longer hold true. While tools like Databricks/MLflow, AWS SageMaker or Azure ML Studio cover model promotion, testing and retraining effectively, they are less equipped to investigate what part of the data has changed and why, and then to rectify the issues, which can be tedious and time-consuming.
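As a rough illustration of "what part of the data has changed", the sketch below compares a baseline batch with a newer batch, column by column. It is a minimal example under stated assumptions, not a method from any of the tools named above: the function name `column_drift`, the row-as-dict layout and both thresholds are hypothetical choices, and real projects would typically use richer statistics.

```python
from statistics import mean, pstdev


def column_drift(baseline, current, z_threshold=3.0, null_threshold=0.05):
    """Compare two batches of rows (lists of dicts) column by column.

    Returns {column: [reasons]} for every column whose null rate or
    numeric mean moved by more than the given (hypothetical) thresholds.
    """
    if not baseline or not current:
        raise ValueError("both batches must contain at least one row")

    # Collect every column name seen in either batch.
    columns = set()
    for row in baseline + current:
        columns.update(row)

    report = {}
    for col in sorted(columns):
        base_vals = [row.get(col) for row in baseline]
        cur_vals = [row.get(col) for row in current]
        reasons = []

        # Check 1: has the share of missing values changed?
        base_null = sum(v is None for v in base_vals) / len(base_vals)
        cur_null = sum(v is None for v in cur_vals) / len(cur_vals)
        if abs(cur_null - base_null) > null_threshold:
            reasons.append(f"null rate {base_null:.2f} -> {cur_null:.2f}")

        # Check 2: has the mean moved, measured in baseline standard deviations?
        base_num = [v for v in base_vals if isinstance(v, (int, float))]
        cur_num = [v for v in cur_vals if isinstance(v, (int, float))]
        if base_num and cur_num:
            spread = pstdev(base_num) or 1.0  # avoid dividing by zero
            shift = abs(mean(cur_num) - mean(base_num)) / spread
            if shift > z_threshold:
                reasons.append(f"mean shifted by {shift:.1f} baseline std devs")

        if reasons:
            report[col] = reasons
    return report
```

A report like this only says where the data moved. Working out why it moved (a merged source system, a new governance rule) still needs the people who own the data.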
Being data-driven prevents these problems arising in machine learning projects, but it’s not just about the technical teams building pipelines and models; it requires the entire company to be aligned. In practice, this might mean data that requires a business workflow in which somebody approves it, or a front-office, non-technical stakeholder who contributes knowledge at the start of the data journey.
The roadblock to building ML models
The inclusion of business users as customers of their organization’s data is increasingly possible with AI. Natural language processing enables non-technical users to query data and extract insights contextually.
The expected growth rate of AI between 2023 and 2030 is 37 percent. Some 72 percent of executives see AI as the main business advantage, and AI is expected to generate 20 percent of EBIT for AI-mature companies in the future.
Data quality is the backbone of AI. It enhances the performance of algorithms and enables them to produce dependable forecasts, recommendations and classifications. For the 33 percent of companies reporting failed AI projects, the cause is poor data quality. In fact, organizations that pursue data quality are able to drive higher AI effectiveness all around.
But data quality isn’t just a box you can tick off. Organizations that make it an integral part of their operations reap tangible business outcomes, from producing more machine learning models per year to delivering more reliable, predictable results through trust in the model.
How to overcome data quality barriers
Data quality shouldn’t be a case of waiting for an issue to occur in production and then scrambling to fix it. Data should be constantly tested, wherever it lives, against an ever-expanding pool of known problems. All stakeholders should contribute and all data must have clear, well-defined data owners. So, when a data scientist is asked what they do, they can finally say: build machine learning models and analyze data.
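One way to picture an "ever-expanding pool of known problems" with clear owners is a shared registry of small checks that anyone can add to. The sketch below is only an illustration under assumptions: the `Rule` class, the `register` decorator, the two example rules and the owner names are all hypothetical, not part of any product mentioned in this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    owner: str                      # hypothetical: who is accountable for this data
    check: Callable[[dict], bool]   # returns True when the row passes


# The shared pool of known problems; it grows as stakeholders add rules.
RULES: list[Rule] = []


def register(name, owner):
    """Decorator that adds a check function to the shared pool."""
    def decorator(fn):
        RULES.append(Rule(name, owner, fn))
        return fn
    return decorator


@register("email_has_at_sign", owner="crm-team")
def email_has_at_sign(row):
    email = row.get("email")
    return isinstance(email, str) and "@" in email


@register("amount_non_negative", owner="finance-team")
def amount_non_negative(row):
    amount = row.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0


def run_rules(rows, rules=None):
    """Test every row against every rule.

    Returns a list of (row_index, rule_name, owner) tuples for each failure,
    so a problem can be routed to the person responsible for fixing it.
    """
    rules = RULES if rules is None else rules
    failures = []
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                failures.append((index, rule.name, rule.owner))
    return failures
```

Because the rules live in one place rather than inside individual projects, a problem found once can be tested for everywhere the data is used, instead of being cleaned up again in each silo.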
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro