Biased and hallucinatory AI models can produce inequitable results


“Code me a treasure-hunting game.” “Cover ‘Gangnam Style’ by Psy in the style of Adele.” “Create a photorealistic, close-up video of two pirate ships battling each other as they sail inside a cup of coffee.” Even that final prompt is no exaggeration – today’s best AI tools can create all these and more in minutes, making AI seem like a real-world type of modern-day magic.
We know, of course, that it isn’t magic. In fact, a huge amount of work, instruction and information goes into the models that power GenAI and produce its output. AI systems need to be trained to learn patterns from data: GPT-3, a predecessor of the models behind ChatGPT, was trained on roughly 45TB of Common Crawl text, the equivalent of around 45 million 100-page PDF documents. In the same way that we humans learn from experience, training helps AI models to better understand and process information. Only then can they make accurate predictions, perform important tasks and improve over time.
This means that the quality of information we input into our tools is crucial. So, how can we make sure we foster quality data to build practical, successful AI models? Let’s take a look.
COO of Northern Data Group.
The risks of poor data
Good quality data is accurate, relevant, complete, diverse and unbiased. It’s the backbone of effective decision-making, strong operational processes and, in this case, valuable AI outputs. Yet maintaining good quality data is tough. One survey by a data platform found that 91% of professionals say data quality has an impact on their organization, while only 23% characterize good data quality as part of their organizational ethos.
Poor data often contains limited and incomplete information that fails to accurately reflect the wider world. The resulting biases can affect how the data is collected, analyzed and interpreted, and lead to unfair or even discriminatory outcomes. When Amazon built an automated hiring tool in 2014 to help speed up its recruitment process, the software team fed it data about the company’s current pool of – overwhelmingly male – software engineers. The project was scrapped after just a year, when it became clear that the tool systematically discriminated against female applicants. Another example is Microsoft’s now-canceled Tay chatbot, which became notorious for making offensive remarks on social media after it learned from the toxic input users fed it.
Messy or biased data can have a similarly catastrophic effect on any model’s output. Feeding jumbled data or poor-quality synthetic data into an AI model and expecting it to offer up clear, actionable insights is futile: like microwaving a bowl of alphabet spaghetti and expecting it to come out spelling “The quick brown fox jumps over the lazy dog.” Data readiness, the state of preparedness and quality of data within an organization, is therefore a key hurdle to overcome.
Correctly feeding AI models
Research shows that when it comes to global companies’ AI strategies, just 13% are ranked as pacesetters in terms of data readiness. Meanwhile, 30% are classed as chasers, 40% as followers and a worryingly large 17% as laggards. These numbers must change if data is to power successful AI outcomes worldwide. To ensure good data readiness, we need to gather comprehensive and relevant data from reliable sources, clean it to remove errors and inconsistencies, accurately label it and standardize its formats and scales. Most importantly, we need to continuously check and update the data to maintain its quality.
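Those preparation steps can be illustrated with a minimal sketch in Python. Everything here – the field names, the required fields, the standardization rules – is a hypothetical example, not a prescription:

```python
# Sketch of basic data-readiness preparation: trim, standardize,
# and drop incomplete records. Fields and rules are illustrative only.

def clean_records(raw_records):
    """Return records that are complete, trimmed, and standardized."""
    cleaned = []
    for record in raw_records:
        # Trim stray whitespace from every string value
        record = {k: v.strip() if isinstance(v, str) else v
                  for k, v in record.items()}
        # Drop records missing required fields
        if not record.get("name") or record.get("age") is None:
            continue
        # Standardize formats (here: lowercase email addresses)
        if "email" in record:
            record["email"] = record["email"].lower()
        cleaned.append(record)
    return cleaned

raw = [
    {"name": "  Ada ", "age": 36, "email": "ADA@Example.com"},
    {"name": "", "age": 41},         # incomplete: dropped
    {"name": "Grace", "age": None},  # incomplete: dropped
]
print(clean_records(raw))
```

The “continuously check and update” part is the step this sketch leaves out: in practice, checks like these run on every new batch of data, not once.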
To begin, businesses must create a centralized data catalog, which incorporates data from various repositories and silos into one organized location. They should then classify and curate this data to make it easy to find and use, and to surface contextual business information. Next, engineers must implement a strong data governance framework that incorporates regular data quality assessments. Data scientists should continuously detect and correct inconsistencies, errors, and missing values within datasets.
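A regular data quality assessment can start as simply as profiling each field for completeness. Here is a hedged sketch of the idea, with an entirely hypothetical dataset and field names:

```python
# Sketch of a per-field data-quality report: for each field, compute
# the share of records that actually carry a value. Illustrative only.

def completeness_report(records):
    """Map each field name to its share of non-missing values."""
    fields = {f for r in records for f in r}
    total = len(records)
    report = {}
    for field in sorted(fields):
        present = sum(1 for r in records
                      if r.get(field) not in (None, ""))
        report[field] = present / total
    return report

dataset = [
    {"customer_id": 1, "country": "DE", "email": "a@example.com"},
    {"customer_id": 2, "country": "",   "email": "b@example.com"},
    {"customer_id": 3, "country": "FR", "email": None},
]
print(completeness_report(dataset))
```

A report like this, run on a schedule, turns “assess data quality regularly” from a policy statement into a number a governance team can track.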
Finally, data lineage tracking involves developing a clear understanding of the data’s origins, processing steps, and access points. This tracking ensures transparency and accountability in the case of a bad outcome. And it’s becoming particularly crucial amid growing concerns about privacy in AI.
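At its simplest, lineage tracking means a dataset carries a log of where it came from and what was done to it. A minimal sketch, assuming nothing beyond the standard library (the dataset, origin label, and transformation are all hypothetical):

```python
# Sketch of lightweight data lineage tracking: every transformation
# appends an entry recording its name and timestamp. Illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TrackedDataset:
    records: list
    origin: str                        # where the data came from
    lineage: list = field(default_factory=list)

    def apply(self, step_name, fn):
        """Run a transformation and record it in the lineage log."""
        self.records = fn(self.records)
        self.lineage.append({
            "step": step_name,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self

ds = TrackedDataset(records=[{"id": 1}, {"id": 1}, {"id": 2}],
                    origin="crm_export_2024")
ds.apply("deduplicate",
         lambda rs: [dict(t) for t in {tuple(r.items()) for r in rs}])
print(ds.origin, [e["step"] for e in ds.lineage])
```

When a bad outcome does occur, a log like this answers the first two audit questions – which source, which processing steps – without any forensic reconstruction.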
Making sure data is fair and secure
Today, personal AI queries are fast becoming the new confidential Google search. But there’s no way that users would trust them with private information if they knew it would be shared or sold. According to Cisco research, 60% of consumers are concerned about how organizations are using their personal data for AI, while almost two-thirds (65%) have already lost some trust in organizations as a result of their AI use. So, aside from regulatory concerns, we all have an ethical and reputational responsibility to ensure watertight data privacy when we’re building and leveraging AI technology.
Privacy means making sure that the everyday individuals interacting with AI-based tools and systems – from healthcare patients to online shoppers – have control over their personal data and can relax knowing that it’s being used responsibly. Here, businesses should operate under a ‘privacy by design’ concept, in which their technology collects only data that’s strictly necessary, stores it safely and is transparent about how it’s used.
A good option is to anonymize all the data you collect. That way, you can reuse it in further AI model training without compromising customer privacy. And, once you no longer require this data, you can delete it to remove the risk of any future breaches. This sounds simple, but it’s an oft-forgotten step that can save on significant stress, reputational damage – and even regulatory fines.
Keeping data sovereignty front of mind
Compliance with regulatory requirements is, of course, paramount for any organization. And data residency is a growing focus across the globe. In Europe, for example, GDPR restricts transfers of EU citizens’ personal data outside the European Economic Area unless specific safeguards are in place. In practice, that means you or your cloud partner need data centers within the region – if you transfer data elsewhere without those safeguards, you risk breaching the law. Data residency is already a priority for regulators and users alike, and it will only come into greater focus as more regulations are rolled out worldwide.
For businesses, compliance means either purchasing data storage facilities in specific sites outright or partnering with a specialist provider that offers data centers in strategic locations. Just ask the World Economic Forum, which says that “the backbone of Sovereign AI lies in robust digital infrastructure.” Simply put, data centers with high-performance computing capabilities, operating on policies that ensure data is stored and processed locally, are the foundation for the effective, compliant development and deployment of AI technologies worldwide. It isn’t quite magic – but the results can be just as impressive.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro