The rise of AI has created new challenges for most industries. In the ever-present pursuit of cost cutting, the use of pre-trained AI models is rapidly reshaping how companies operate.
As Mercedes-Benz’s CEO Ola Källenius explained to investors in April: “We are on a journey to also become a software company. We will put supercomputer-like performance into every single Mercedes.” (Forbes, May 31, 2023). This means involving cloud computing and AI throughout a product’s lifecycle.
To achieve this, the AI model needs to be trained using neural networks, a type of machine learning algorithm. The content used to train these algorithms is frequently publicly available internet material, which according to OpenAI is ‘fair use, as supported by long-standing and widely accepted precedents.’ But what does ‘fair use’ really mean?
In the United States, the ‘fair use’ doctrine in intellectual property law allows creators to build upon copyrighted material, provided, among other factors, that the new work does not compete in the same market as the copyrighted content.
The EU went further: the AI Act directly states in Recital 60k:
‘Providers of such models draw up and make publicly available a sufficiently detailed summary of the content used for training the general purpose model.’
This means that the EU requires AI developers to disclose information about the content used to train their models.
AI regurgitates data
In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft, accusing them of copyright infringement. The lawsuit alleges that ChatGPT and Microsoft Bing can produce content that bears a striking resemblance to Times articles.
The Times lawsuit is without precedent and demonstrates that copyrighted data is being used to train AI models, which can in turn cause the AI to regurgitate it in its outputs. That data includes famous artworks, pop-culture works, and, of course, lines and lines of code. For fear of unwittingly sharing confidential company information with generative AI, or of losing fragments of source code, many leading tech companies have banned ChatGPT from their employees’ computers.
The list of companies includes Apple, Samsung, Verizon, and Citigroup among many others.
The Apple case is particularly interesting because the ban was imposed on the same day the ChatGPT app launched on iOS.
This move suggests a significant risk that fragments of code you have posted anywhere on the internet, or other intellectual property such as designs and technical descriptions, might resurface as AI output and be used to develop a more competitive product. If tech giants are refraining from adopting generative AI inside their networked infrastructure, why shouldn’t we?
Referring to the Times lawsuit, OpenAI claimed the fault lay in the way the model was instructed to produce a certain output, not in the model itself:
‘Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.’ – OpenAI
It is interesting to note that OpenAI admits ChatGPT can regurgitate content bearing a striking resemblance to copyrighted material, even if the examples might be ‘cherry-picked’. The fact remains.
The above quote also underscores the importance of securing copyrighted material. Keeping it out of the public domain seems as important as having appropriate legislative tools to penalize unauthorized use. However, the legislative process inevitably lags behind the technology.
Does this mean the code or PCB blueprint you posted on the internet might be used to train an AI model, or be reproduced by one?
When a leading AI developer admits that regurgitating data is not typical behavior, it implies that the possibility of this event calls for a novel approach. Standard security measures used to keep prototype data and other confidential information secure are therefore not enough. Designers and programmers should thus refrain from exposing their copyrighted data to AI. However, this seems radical and increasingly impractical, at least until safeguards against instructing AI models to regurgitate data become more reliable.
Finally, governments and the international community should continue working on proper legislation and attempt to regulate the AI industry. The AI revolution is impossible to stop, but it should at least be kept under constant surveillance.
Sources:
https://www.nytimes.com/2023/12/30/business/media/copyright-law-ai-media.html
https://openai.com/index/openai-and-journalism/