Where does Artificial Intelligence machinery learn from? Is it the same as what I knew as ‘machine learning’? In the late 1990s there was a team of engineers in an open-plan area outside my office at the London Stock Exchange, all engaged in teaching machines ‘how to learn’ from their own outputs and a range of new inputs to achieve specific goals.
I hadn’t thought about this ‘what’s the difference’ question before. Google, which sells cloud services, explains more (linked, if it interests you).
Here is a bit of news that caught my eye and triggered the thoughts below. I would be interested to know what you think.
In a nutshell – what Reddit will now do.
Reddit, a US public conversation platform, has decided that it will start charging for access to its data, via its API, when that data is used by large language models (LLMs) to ‘teach themselves’ how humans talk to each other. This is how the LLMs built by OpenAI and others are learning to sound like a human being when answering the questions asked of them.
What is an API?
It stands for ‘Application Programming Interface’. In the context of APIs, the word ‘Application’ refers to any computer program with a distinct function. Here, ‘Interface’ is, if you like, a contract of service between two computer programs. This contract defines how the two communicate with each other using requests and responses.
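To make the ‘contract’ idea concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the names Request, Response and handle_request, the ‘/comments’ path and the ‘demo-key’ are all hypothetical, and this is not Reddit’s actual API. It only shows the shape of the exchange: one program sends a request in an agreed format, and the other sends back a response in an agreed format.

```python
from dataclasses import dataclass

# Hypothetical store of user-written comments held by a platform.
COMMENTS = {
    "cats": ["My cat ignores me.", "Mine too!"],
    "tea": ["Milk goes in last."],
}

# Hypothetical set of keys belonging to paying customers.
PAID_KEYS = {"demo-key"}


@dataclass
class Request:
    path: str     # what is being asked for, e.g. "/comments"
    topic: str    # which conversation the caller wants
    api_key: str  # shows the caller is allowed (has paid) to ask


@dataclass
class Response:
    status: int  # 200 = OK, 401 = not authorised, 404 = not found
    body: list   # the comments returned, empty on failure


def handle_request(req: Request) -> Response:
    """The platform's side of the contract: one request in, one response out."""
    if req.api_key not in PAID_KEYS:
        return Response(status=401, body=[])
    if req.path != "/comments" or req.topic not in COMMENTS:
        return Response(status=404, body=[])
    return Response(status=200, body=COMMENTS[req.topic])


# The caller's side of the contract: build a request, read the response.
reply = handle_request(Request(path="/comments", topic="cats", api_key="demo-key"))
print(reply.status, reply.body)
```

The point of the sketch is that the caller never touches the platform’s store of comments directly. It can only ask in the agreed way, and the platform decides, for instance on the basis of a paid-for key, whether to answer.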
[An LLM – large language model – is a way to enable a computer to process natural language, allowing the machine to understand what is being said or typed, and to respond in a way that a human being might have done. It learns by doing, and also by analysing and classifying a massive amount of ordinary people’s communication. As new input is provided, it is factored into what the machine already knows of what used to be called ‘use of English’ – its unique attribute is its speed of assimilation.]
Reddit content, it turns out, has been one of the chief sources of the massive data sets that LLMs need, to improve what they learn about how people talk to each other.
So where does this input data come from?
It is what we write. All of that content is stuff that Reddit’s users have created, not Reddit. But once it is up on the platform, that content is being called “data” – and now it can be bought and sold.
What I already knew
So far, I thought that our responses help the platforms indirectly.
In other words, we respond to content, and that indicates our preferences – this enables platforms to sort us into audiences.
This is the OCEAN analysis – they categorise what we say into our dominant traits: Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism. These five traits are supposed to account for differences in both personality and decision making – crucially, buying decisions.
Advertisers can then select those ‘audiences’ – and the platform sells them access to these segments of potential buyers. In other words what we say on platforms shows the platform a way to assess our susceptibility to particular kinds of advertising.
But then, what is new?
What Reddit has just done – saying that these Large Language Models can now access its data, for a fee – shows that there is value in the actual words we write.
This now is a direct sale of the content we create by responding to questions and commenting on stuff on any platform.
What is it being ‘bought’ for? The machines, it turns out, need to read our spontaneously written conversations, to learn how to interpret what we will ask the AI in applications like this ChatGPT that we keep hearing about. They need us, much more than we need them.
It proves that what we produce is valuable not only in terms of our susceptibility to advertising but also as content – teaching material.
In summary:
Our time and energy produce the content on Reddit. Because it is real human conversation, it has exchange value to a machine trying to learn the language as it is used, and Reddit (and, it seems, most other platforms) is, or soon will be, pocketing the money for giving these machines access to our words.
This is being called ‘hidden labour’ – someone else profiting from the words we wrote down.
Thieving, more like – others are saying.
On reflection
The early web forums before Reddit (I was around then, and am thinking about those used by my friend Phil Harper, an astrophysicist, in the late 1980s) made me think that our posts are our Intellectual Property. So, when you hit ‘post’, you grant the platform a licence to show it to others and to edit it. They did have Terms of Service – you could read them for yourself somewhere in the web archives.
I was newly at work when GDPR came out. In the electricity supply industry, we took a look at our rules and concluded that EU users did have rights that the Americans didn’t – but on the whole, it was easier for us to treat everyone as if they had the same GDPR rights – that kept our consciences clear.
But largely, posting an opinion on the internet doesn’t make it public domain, far from it. It makes it viewable by the public, that’s all.
If you have a big Shell sign on a billboard, we can all see it, but Shell still owns their logo, right? And if we steal it, or use their wording on our own advertisement, we are still infringing copyright.
Does this extend to comments we make on a public forum?
I looked at this post, which says:
‘In the United Kingdom, the ‘Sweat of the brow’ doctrine is followed, which states that copyright protection is given to the work of an author in exchange for the labour, skill, or judgment put into it. The court ruled in the landmark judgment of University of London Press v. University Tutorial Press, that the expression of thought in writing or print need not be in a novel form but must originate from the author and his labour and skill must be put in’.
Labour and skill. Yup.
At first glance, it looks like straightforward stealing to me. I am open to being told otherwise.
UK Government policy
If you want to know what the UK Government is doing towards becoming a key centre for AI development, I am happy to say that it is proceeding cautiously.
Here is the debate in the UK parliament, if you want to get a feel for the tone of these considerations.
Personally, I felt slightly reassured after I read this.