Biased and hallucinatory AI models can produce inequitable results


“Code me a treasure-hunting game.” “Cover ‘Gangnam Style’ by Psy in the style of Adele.” “Create a photorealistic, close-up video of two pirate ships battling each other as they sail inside a cup of coffee.” Even that final prompt is no exaggeration – today’s best AI tools can create all these and more in minutes, making AI seem like a real-world type of modern-day magic.
We know, of course, that it isn’t magic. In fact, a huge amount of work, instruction and information goes into the models that power GenAI and produce its output. AI systems need to be trained to learn patterns from data: GPT-3, a predecessor of the models behind ChatGPT, was trained on roughly 45TB of Common Crawl text, the equivalent of around 45 million 100-page PDF documents. In the same way that we humans learn from experience, training helps AI models to better understand and process information. Only then can they make accurate predictions, perform important tasks and improve over time.
This means that the quality of information we input into our tools is crucial. So, how can we make sure we foster quality data to build practical, successful AI models? Let’s take a look.
COO of Northern Data Group.
The risks of poor data
Good quality data is accurate, relevant, complete, diverse and unbiased. It’s the backbone of effective decision-making, strong operational processes and, in this case, valuable AI outputs. Yet maintaining good quality data is tough. One survey by a data platform found that 91% of professionals say data quality has an impact on their organization, while only 23% characterize good data quality as part of their organizational ethos.
Poor data often contains limited and incomplete information that fails to accurately reflect the wider world. The resulting biases can affect how the data is collected, analyzed and interpreted, and lead to unfair or even discriminatory outcomes. When Amazon built an automated hiring tool in 2014 to help speed up its recruitment process, the software team fed it data about the company’s current pool of – overwhelmingly male – software engineers. The project was scrapped after just a year, when it became clear that the tool systematically discriminated against female applicants. Another example is Microsoft’s now-canceled Tay chatbot, which became notorious for making offensive remarks on social media after it learned from the toxic input users fed it.
Messy or biased data can have a similarly catastrophic effect on any model’s output. Feeding jumbled data or poor-quality synthetic data into an AI model and expecting it to offer up clear, actionable insights is futile: like microwaving a bowl of alphabet spaghetti and expecting it to come out spelling “The quick brown fox jumps over the lazy dog.” Data readiness, the state of preparedness and quality of data within an organization, is therefore a key hurdle to overcome.
Correctly feeding AI models
Research shows that when it comes to global companies’ AI strategies, just 13% are ranked as pacesetters in terms of data readiness. Meanwhile, 30% are classed as chasers, 40% as followers and a worryingly large 17% as laggards. These numbers must change if data is to power successful AI outcomes worldwide. To ensure good data readiness, we need to gather comprehensive and relevant data from reliable sources, clean it to remove errors and inconsistencies, accurately label it and standardize its formats and scales. Most importantly, we need to continuously check and update the data to maintain its quality.
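Those preparation steps can be illustrated with a minimal sketch in Python. Everything here – the field names, the required fields, the standardization rules – is a hypothetical example, not a prescription:

```python
# Sketch of basic data-readiness preparation: trim, standardize,
# and drop incomplete records. Fields and rules are illustrative only.

def clean_records(raw_records):
    """Return records that are complete, trimmed, and standardized."""
    cleaned = []
    for record in raw_records:
        # Trim stray whitespace from every string value
        record = {k: v.strip() if isinstance(v, str) else v
                  for k, v in record.items()}
        # Drop records missing required fields
        if not record.get("name") or record.get("age") is None:
            continue
        # Standardize formats (here: lowercase email addresses)
        if "email" in record:
            record["email"] = record["email"].lower()
        cleaned.append(record)
    return cleaned

raw = [
    {"name": "  Ada ", "age": 36, "email": "ADA@Example.com"},
    {"name": "", "age": 41},         # incomplete: dropped
    {"name": "Grace", "age": None},  # incomplete: dropped
]
print(clean_records(raw))
```

The “continuously check and update” part is the step this sketch leaves out: in practice, checks like these run on every new batch of data, not once.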
To begin, businesses must create a centralized data catalog, which incorporates data from various repositories and silos into one organized location. They should then classify and curate this data to make it easy to find and use, and to surface contextual business information. Next, engineers must implement a strong data governance framework that incorporates regular data quality assessments. Data scientists should continuously detect and correct inconsistencies, errors, and missing values within datasets.
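A regular data quality assessment can start as simply as profiling each field for completeness. Here is a hedged sketch of the idea, with an entirely hypothetical dataset and field names:

```python
# Sketch of a per-field data-quality report: for each field, compute
# the share of records that actually carry a value. Illustrative only.

def completeness_report(records):
    """Map each field name to its share of non-missing values."""
    fields = {f for r in records for f in r}
    total = len(records)
    report = {}
    for field in sorted(fields):
        present = sum(1 for r in records
                      if r.get(field) not in (None, ""))
        report[field] = present / total
    return report

dataset = [
    {"customer_id": 1, "country": "DE", "email": "a@example.com"},
    {"customer_id": 2, "country": "",   "email": "b@example.com"},
    {"customer_id": 3, "country": "FR", "email": None},
]
print(completeness_report(dataset))
```

A report like this, run on a schedule, turns “assess data quality regularly” from a policy statement into a number a governance team can track.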
Finally, data lineage tracking involves developing a clear understanding of the data’s origins, processing steps, and access points. This tracking ensures transparency and accountability in the case of a bad outcome. And it’s becoming particularly crucial amid growing concerns about privacy in AI.
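At its simplest, lineage tracking means a dataset carries a log of where it came from and what was done to it. A minimal sketch, assuming nothing beyond the standard library (the dataset, origin label, and transformation are all hypothetical):

```python
# Sketch of lightweight data lineage tracking: every transformation
# appends an entry recording its name and timestamp. Illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TrackedDataset:
    records: list
    origin: str                        # where the data came from
    lineage: list = field(default_factory=list)

    def apply(self, step_name, fn):
        """Run a transformation and record it in the lineage log."""
        self.records = fn(self.records)
        self.lineage.append({
            "step": step_name,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self

ds = TrackedDataset(records=[{"id": 1}, {"id": 1}, {"id": 2}],
                    origin="crm_export_2024")
ds.apply("deduplicate",
         lambda rs: [dict(t) for t in {tuple(r.items()) for r in rs}])
print(ds.origin, [e["step"] for e in ds.lineage])
```

When a bad outcome does occur, a log like this answers the first two audit questions – which source, which processing steps – without any forensic reconstruction.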
Making sure data is fair and secure
Today, personal AI queries are fast becoming the new confidential Google search. But there’s no way that users would trust them with private information if they knew it would be shared or sold. According to Cisco research, 60% of consumers are concerned about how organizations are using their personal data for AI, while almost two-thirds (65%) have already lost some trust in organizations as a result of their AI use. So, aside from regulatory concerns, we all have an ethical and reputational responsibility to ensure watertight data privacy when we’re building and leveraging AI technology.
Privacy means making sure that the everyday individuals interacting with AI-based tools and systems – from healthcare patients to online shoppers – have control over their personal data and can relax knowing that it’s being used responsibly. Here, businesses should operate under a ‘privacy by design’ concept, in which their technology collects only data that’s strictly necessary, stores it safely and is transparent about how it’s used.
A good option is to anonymize all the data you collect. That way, you can reuse it in further AI model training without compromising customer privacy. And, once you no longer require this data, you can delete it to remove the risk of any future breaches. This sounds simple, but it’s an oft-forgotten step that can save on significant stress, reputational damage – and even regulatory fines.
Keeping data sovereignty front of mind
Compliance with regulatory requirements is, of course, paramount for any organization. And data residency is a growing focus across the globe. In Europe, for example, GDPR restricts transfers of EU citizens’ personal data outside the European Economic Area unless specific safeguards are in place. In practice, that means you or your cloud partner need data centers within the region – if you transfer data elsewhere without those safeguards, you risk breaching the law. Data residency is already a priority for regulators and users alike, and it will only come into greater focus as more regulations are rolled out worldwide.
For businesses, compliance means either purchasing data storage facilities in specific sites outright or partnering with a specialist provider that offers data centers in strategic locations. Just ask the World Economic Forum, which says that “the backbone of Sovereign AI lies in robust digital infrastructure.” Simply put, data centers with high-performance computing capabilities, operating on policies that ensure data is stored and processed locally, are the foundation for the effective, compliant development and deployment of AI technologies worldwide. It isn’t quite magic – but the results can be just as impressive.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro