AI Expert Delves Into the Ethics of Sourcing the Data That 'Feeds' LLMs
(By Columnist Jamie Munro, AI & Robotics Expert)
The last few years have seen Artificial Intelligence break increasingly into mainstream use. Its usefulness isn't in question. The questions, at the user level, revolve around how to employ it to greatest effect (including competitively); and, at the level of AI model training, how to develop models that produce balanced, properly informed outputs from ethically sourced inputs.
Executives are looking to drive efficiency improvements. Democratic governments are starting to use AI to predict election outcomes, while their counterparts in authoritarian regimes are using it to monitor citizen activity and crack down on dissent. Three in five doctors in the U.S. now report using Artificial Intelligence as part of their practice. Internal research at my firm (willowlearn.com) indicates that between 40 and 60 percent of teachers in the UK now use AI in some part of their role. The most eager early adopters of all are university students, with the latest numbers from the UK's Higher Education Policy Institute showing 88 percent of university students admitting to using AI in their assignments.
You’re An AI User Whether You Know It or Not
You've almost certainly started to hear the names of various AI products coming up in conversations - names like ChatGPT, Gemini, Claude and Perplexity.
Even if you haven't deliberately used one of these products, you've almost certainly interacted with a product or service that employs AI in some way. If you've used Google over the last few months, you will have noticed the "AI Overview" section at the top of the results (congratulations – you are now an AI user).
So now that you are an AI user, you might be asking: what exactly is AI? To give a simple answer, “AIs” – in layman’s terms – are computer programs (“Large Language Models”, or “LLMs”) that you can interact with using natural human language, much like texting with a (very knowledgeable) friend. When you send a message to an LLM, it responds with its own message. What it's really doing is using some very complicated mathematics to predict, based on your input, what it is that you want to see, and then generating it for you. It's a simplification, but you can think of it as a beefed-up version of the predictive text system on your phone.
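For the technically curious, here is a minimal sketch of that "predictive text" idea in Python. It is a toy, not how any real product works – actual LLMs use enormous neural networks rather than simple word counts, and the training text here is invented – but the core task is the same: learn from examples of human language, then predict what should come next.

```python
from collections import defaultdict

# Toy "training data" standing in for the vast text corpora real LLMs learn from.
corpus = (
    "to change a tyre loosen the nuts then jack up the car "
    "to change a tyre jack up the car and remove the wheel "
    "to change a lightbulb turn off the power first"
).split()

# "Training": count which word tends to follow each word in the corpus.
follower_counts = defaultdict(lambda: defaultdict(int))
for word, next_word in zip(corpus, corpus[1:]):
    follower_counts[word][next_word] += 1

def predict_next(word):
    """Predict the next word: the most frequent follower seen in training."""
    followers = follower_counts[word]
    return max(followers, key=followers.get) if followers else None

# "Generation": repeatedly predict the next word from the previous one.
text = ["to"]
for _ in range(8):
    next_word = predict_next(text[-1])
    if next_word is None:
        break
    text.append(next_word)
print(" ".join(text))  # e.g. "to change a tyre loosen the car to change"
```

Real models perform the same two steps – learn statistical patterns from text, then generate likely continuations – just at an almost unimaginably larger scale.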
So How Are These Computer Programs Able to Predict What You Want to See?
This is where data comes in, and lots of it. AI models need to be shown millions – and increasingly trillions – of examples of human language in a process known as "training". That leads to another question: where does all this data come from?
The answer to that question is a sensitive one.
Firstly, companies developing LLMs for the marketplace need vast amounts of data to remain competitive. Many have been accused of employing unscrupulous methods to obtain it.
The issue of copyright around AI training materials is still an open question, and probably one that can only be settled with new legislation. It emerged earlier this year that Meta (Facebook’s parent company) had used a large number of pirated books as part of its training data. That is just one of many copyright cases against AI giants currently making their way through the U.S. courts.
Copyright Dilemmas: Is It Fair to Content Creators & Other Humans?
Proponents of greater copyright protections would argue that AI fundamentally breaks the current monetisation systems for content creators.
Under the current system, content creators publish their work on platforms like Google and YouTube. Users search for topics they are interested in, the platform shows them adverts (which is how it makes money), and the user clicks on an item that interests them. The creator of that content then has the chance to monetise their readers or viewers.
As users move away from search engines and towards Artificial Intelligence, AI services simply provide an authoritative answer to the user’s query without linking to any external resource. Previously, a user searching for “how to change a tyre” might find their answer on a mechanic’s website and go on to purchase a service from that mechanic’s business. In the future, they will increasingly ask an AI how to change the tyre, and the AI will simply tell them – even if the AI originally learned how to do it from the mechanic's website.
Is that fair to the original content creator (the mechanic, in this case)?
Proponents of reduced copyright protections would argue that the above argument would never be applied to humans. If I spent a year reading books about a particular topic and became an expert, and then wrote my own book about the topic, nobody would argue that I was merely regurgitating the books I had read (unless I was found to have plagiarised large chunks without attribution). The other argument is economic: if we start enforcing copyright protections for the creators of AI training material, the cost of these already expensive AI models will climb even higher – and countries that don't care about copyright protections (such as China) will gain a massive advantage.
Concentration of Control & Potential Misuse of Power
The second major data-related problem is one of control and power.
Everything an AI "knows" comes from its training data, so whoever decides what is included in the training data has immense power.
In a world where everyone gets their information from AI, the creators of AI will have massive influence over public opinion, much like the broadcasters and newspaper owners in the pre-social media world. Currently this power is concentrated in the hands of a few tech giants like Google, Meta and OpenAI.
Efforts to regulate these giants, and AI more broadly, could end up concentrating that power in even fewer hands.
AI Bias in Elections? And Why Do They All Sound the Same?
There have already been numerous accusations of bias against the major AI models – during the 2024 U.S. elections, for example.
Many users note that the major AI models, even ones from different developers, end up producing rather similar "facts". AI models are only as good as their training data, and any biases in that data will inevitably be reflected in the model. The similarity between model outputs is easily explained when you consider that all the AI companies are drawing on broadly the same training data – the contents of the internet.
The training data supplied to an AI model acts as a kind of voting mechanism: the more often a particular idea or concept appears in the training data, the more likely the model is to reproduce it. This means that AI models are far more likely to adopt mainstream viewpoints, while less popular, controversial or dissenting views are much less likely to be represented.
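To make that voting effect concrete, here is another minimal Python sketch. The viewpoints and their frequencies are invented purely for illustration; the point is that a view's share of the training data translates more or less directly into its share of the model's output.

```python
import random
from collections import Counter

# Hypothetical viewpoints and how often each appears in the training data;
# the figures are invented purely for illustration.
training_snippets = ["mainstream view"] * 9 + ["dissenting view"] * 1

# "Training" here is simply counting how often each view occurs...
counts = Counter(training_snippets)
total = sum(counts.values())
for view, n in counts.items():
    print(f"{view}: {n}/{total} of training data "
          f"-> ~{n / total:.0%} chance of being generated")

# ...and generation samples views in proportion to those counts,
# so the dissenting view rarely surfaces in the model's answers.
answers = random.choices(list(counts), weights=list(counts.values()), k=10)
print(answers)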
Are we heading towards a future where ordinary people only have access to the “official line”? And do we want the likes of Google, Meta and OpenAI deciding what that official line should be?
It’s too late to go back to a world without AI – that ship has already sailed. Countries, companies and individuals who fail to adopt AI will be left behind and out-competed by those who are already on board.
Humanity must bravely face the future and embrace the massive opportunities presented by Artificial Intelligence. But as we move towards that future, we cannot shy away from the issues introduced by AI. They must be tackled thoughtfully and they must be tackled head-on.
Artificial Intelligence is ultimately a technology built by humans, for humans, to make life better for us all. But whether the reality turns out for better or for worse will depend wholly on whether we address these fundamental and critical issues.
See Jamie Munro's full bio here.
