Data dilemmas at the heart of GenAI


With its promise of delivering competitive advantage to organizations worldwide, generative AI (GenAI) is the topic on every business leader’s lips. What does it mean for their organization? What are the plans for its use? And how quickly can they be enacted?
To date, much of the data-specific conversation accompanying the exponential rise of this technology has focused on the logistics of collection. As such, it has been mainly concerned with questions of compute power, infrastructure, storage, skills and so on.
But GenAI’s move into the mainstream also raises a number of more fundamental questions around the ethics of data use – evolving the conversation from ‘how do we do this?’ to ‘should we?’
In this article, we’re going to examine three examples of emerging ethical dilemmas around data and GenAI, and consider their implications for companies as they map out their long-term AI approaches.
Chief Technology Officer at Zscaler.
Data dilemma 1: What data should you be using? i.e. the public vs. private debate
For all its promise, GenAI is only as good as the data sources you give it – the temptation, therefore, is for companies to use as much data as they have access to. However, it’s not that simple: doing so raises issues around privacy, bias and inequality.
On the most basic level, you can split data into two general categories – public and private – with the former being far more subjective and susceptible to bias than the latter (one could be described as what you want the world to see, the other as factual). But while private data might be more valuable as a result, it is also more sensitive and confidential.
In theory, regulations like the EU AI Act should start to restrict the use of private data – and therefore take the decision out of companies’ hands – but in reality, some countries won’t distinguish between the two types. Because of this, overly tight regulations are likely to have limited effectiveness and to disadvantage those who follow them – potentially leading their GenAI models to deliver inferior or biased conclusions.
The area of intellectual property (IP) offers a similar regulatory picture – Western markets tend to stick to IP laws while Eastern markets often don’t, meaning the latter can innovate far quicker than their Western counterparts. And it is not just other companies that could take advantage of this inequality of data use – cyber criminals are not going to stick to ethical AI usage or observe privacy laws in their attacks, leaving those who do effectively battling with one arm tied behind their backs.
So what is the incentive to do so?
Data dilemma 2: How long should you be keeping your data? i.e. GDPR vs. GenAI
GenAI models are trained on data sets: the bigger the set, the better the model and the more accurate its conclusions. But these data sets also need to be stable – remove data and you are effectively removing learning material, which could change the conclusions the algorithm draws.
Unfortunately, deletion is exactly what GDPR specifies companies must do – keeping data only for as long as is necessary to process it. So what happens if GDPR tells you to delete older data? Or someone asks to be forgotten?
Apart from the financial and sustainability implications of having to retrain your GenAI model, in the example of a self-driving car, deleting data could have very real safety implications.
So how do you balance the two?
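To make the tension concrete, here is a minimal sketch of what honoring an erasure request against a training corpus might look like. All names (`TrainingStore`, `erase_subject`) are illustrative, not a real API – the point is that deleting a subject’s records leaves the model trained on data that no longer exists, forcing a retrain you must plan and budget for.

```python
# Minimal sketch of a "right to be forgotten" workflow for a training corpus.
# Class and method names are hypothetical, chosen for illustration only.

class TrainingStore:
    def __init__(self, records):
        # records: list of dicts, each tagged with the data subject it came from
        self.records = list(records)
        self.needs_retrain = False

    def erase_subject(self, subject_id):
        """Remove every record tied to subject_id; flag the model for retraining."""
        before = len(self.records)
        self.records = [r for r in self.records if r["subject_id"] != subject_id]
        removed = before - len(self.records)
        if removed:
            # The model's learning material has changed, so its conclusions
            # may shift - schedule (and budget for) a retraining run.
            self.needs_retrain = True
        return removed

store = TrainingStore([
    {"subject_id": "u1", "text": "trip log A"},
    {"subject_id": "u2", "text": "trip log B"},
    {"subject_id": "u1", "text": "trip log C"},
])
removed = store.erase_subject("u1")
```

In a real system the retraining flag would feed a scheduled pipeline rather than an in-memory attribute, but the trade-off is the same: every erasure chips away at the stability the model depends on.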
Data dilemma 3: How do you train GenAI to avoid the use of confidential data? i.e. Security vs. categorization
By law companies must secure their data – or face significant fines for failing to do so. However, in order to secure their data they first need to categorize or classify it – to know what they are working with and how to treat it as a result.
So far, so simple. But given the huge volumes of data companies now create on a daily basis, more and more are turning to GenAI to accelerate the categorization process – and this is where the difficulty sets in. Confidential data should be given the highest possible security classification, and as a result kept well clear of any GenAI engines.
But how can you train AI to classify confidential data and therefore avoid it, without showing it confidential data examples? With recent Zscaler research showing that only 46% of surveyed organizations globally have classified their data according to criticality, this is still a pressing issue for the majority.
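One way round this chicken-and-egg problem is to generate synthetic examples that share the *shape* of confidential data without containing any real values, so the classifier never sees genuine confidential material. A minimal sketch, with all names and patterns purely illustrative:

```python
import random
import string

def fake_employee_id():
    # Synthetic stand-in with the same shape as a real ID, but no real value.
    return "".join(random.choices(string.digits, k=9))

def make_synthetic_examples(n):
    """Build labelled training rows without touching real confidential data."""
    rows = []
    for _ in range(n):
        rows.append((f"Employee ID {fake_employee_id()} on file", "confidential"))
        rows.append(("Quarterly newsletter draft", "public"))
    return rows

examples = make_synthetic_examples(100)
```

The labelled rows would then feed whatever classification model you use; the key property is that an audit of the training set finds only fabricated values.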
Approaching GenAI with these dilemmas in mind
It is a lot to consider – and these are just three of the many questions companies face when determining their GenAI approach. So, is there an argument for simply sitting back and waiting for others to set the rules? Or, worse, for ignoring them in order to move more quickly with GenAI implementations?
In answering this I believe we have a lot to learn from the way in which companies have evolved their approach to their carbon footprint. While there is growing legislation around this, it has taken many years to reach this point – and I’d imagine the same will be true for GenAI.
In the case of carbon footprints, companies have ended up determining and governing their own approach – but based largely on pressure from customers. In much the same way that customers have started altering their buying habits to reflect a brand’s ‘green credentials’, we can expect them to penalize companies for unethical use of AI.
Given this, how should companies start taking charge of their GenAI approach?
1. Tempting as it might be, keep public and private data strictly separate and protect your use of private data as much as possible. Competitively this might be to your detriment, but ethically it is far too dangerous not to.
2. Extend this separation of data types to your AI engines – consider private AI for private data sources internally and do not expose private data to public AI engines.
3. Bear bias in mind – restrict AIs that draw conclusions from biased public information without verifying it, and validate your own results.
4. Existing regulations must take priority – ensure GDPR rules and “right to be forgotten” practices are observed. This will mean considering how often to reapply your AI processing engine and factoring this into plans and budgets.
5. Consider the use of a pre-trained AI model or synthetic data sets to both stabilize your model and avoid the question of confidential classification training.
6. Protect your private data sources at all costs – don’t let human task simplification (such as data categorization) be the unwitting pathway to AI data leaks. Sometimes the answer isn’t GenAI.
7. Extend your private data protection to employees – establish guidelines for GenAI, including training around which data is permitted to be uploaded to the tools and safe usage.
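The guideline on employee guidelines can be backed by tooling: a simple pre-upload check that blocks text containing obvious confidential markers before it reaches a public GenAI tool. This is a minimal sketch – the patterns are illustrative examples, and a real deployment would rely on a proper DLP engine rather than hand-written regexes.

```python
import re

# Illustrative patterns only; a production system would use a dedicated
# data loss prevention (DLP) engine with far richer detection.
CONFIDENTIAL_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "internal_marker": re.compile(r"\bCONFIDENTIAL\b", re.IGNORECASE),
}

def safe_to_upload(text):
    """Return (ok, findings): block the upload if any pattern matches."""
    findings = [name for name, pat in CONFIDENTIAL_PATTERNS.items()
                if pat.search(text)]
    return (not findings, findings)

ok, findings = safe_to_upload("Q3 numbers, contact jane@example.com")
```

Wired into a browser plugin or proxy, a check like this turns a written policy into an enforced one – point 6’s warning that “sometimes the answer isn’t GenAI” applies here too, since this gatekeeping needs no model at all.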
The need to act now
The pressure is on organizations – or, more accurately, their IT and security departments – to lock in their approaches as soon as possible so they can leverage GenAI to their advantage.
Indeed, our research shows 95% of organizations are already using GenAI tools in some guise – and that is despite security concerns like those mentioned above – and 51% expect their use of GenAI to increase significantly between now and Christmas.
But they need to find ways of doing so without falling foul of the dilemmas introduced above. To hark back to our carbon footprint comparison, you don’t have to have all the answers in place to start making moves – but you do need to show you are at least trying to do the right thing from the outset and beyond.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro