
Websites With Fact-Checked Content Should Block Scraping and Build a European LLM

AI companies are negotiating license agreements with websites containing human-made, fact-checked content, but those websites should not fall for the temptation of receiving peanuts for their content. The coming AI-based infrastructure should not be run by American or Chinese big tech companies.

The company behind ChatGPT, OpenAI, has bought access to the journalistic texts of the Associated Press (AP) dating back to 1985. The two companies have signed a license agreement that allows OpenAI, which is backed by Microsoft, to train its artificial intelligence on the news agency’s texts. This is the first agreement of its kind since a number of AI companies began training their chatbots on data from the internet last year without asking permission from the content producers.

ChatGPT and other chatbots need lots of text such as books, academic and news articles, and social media updates, and such content is valuable because it is human-generated, current, and fact-checked. That’s why AI companies are trying to make agreements with news media, so that they can’t be blamed for stealing their content in the future. Many outlets besides AP are in negotiations and are considering following suit. The question is whether it is a good idea to enter into license agreements with AI companies. These days, media outlets are already working on license agreements with search engines and social media platforms that use parts of their content – years after they started distributing that content for free.

Should they now do the same with AI?

No, I don’t think so, and I’m not alone. Many smart people point out that we should not repeat the failure of letting a few large companies control our digital infrastructures in the form of digital conversation spaces, as it has meant that they have swallowed up our time and personal data and have allowed the spread of hate and misinformation. At the same time, license agreements will probably yield peanuts, as the current agreements indicate.

The new OpenAI-AP agreement is, of course, secret, just like the tech companies’ black boxes.

Instead of making deals with the AI companies, everybody with fact-checked content should block AI companies from scraping their content – just like The New York Times has done. Then they should sue them for having stolen their content – some authors, photographers, illustrators, and others are already taking legal action against AI companies. And this time, we should use and control our content (= gold) ourselves and ensure that the services do not discriminate or create inequality, that they comply with our legislation and human rights, and that they do not abuse our attention or make us vulnerable in terms of security. A number of professors recommend that we – in Denmark – develop a public alternative to commercial language models, even though OpenAI and Google are already steps ahead, it will be expensive, and there will be many challenges. Because even if regulation is on the way, it will come late and will probably be insufficient, they write.

A Collaborative EU/Nordic Effort

However, it doesn’t make sense to do this in Denmark alone. We simply don’t have enough data to compete with the big players, even though we have a lot in, for example, the Royal Library and Infomedia, which is owned by the two big media houses.

But it makes sense for us in Denmark to sort out and make available all the data we have that can be used to train multilingual models. It is very important right now to make the right decisions before some media, for example, choose to go it alone and sell their content to tech giants, as AP did. One model is that everyone donates the data they have to a joint project in Denmark, but this is probably unrealistic when large American companies want to buy access.

Therefore, foundations, the state, and the EU should support such an effort, so that there is also a business model for the media.

In Sweden, they are building their own language model, partly with the help of money from a large foundation. In Germany, they are doing the same thing, funded by the German Ministry of Economic Affairs and under the auspices of the European project Gaia-X, so other European companies will also have the opportunity to make use of these tools. In Norway, the biggest media group, Schibsted, has been involved in building a Norwegian language model for a couple of years.

We really should get together in the Nordics and the EU, where we share values, and do this on our own this time. We already know that the US-based ChatGPT promotes American values, according to a study from the University of Copenhagen.

Danish media, libraries, and all others with fact-checked content should join forces to prepare all their data to become part of a Nordic and European language model. And it can’t happen fast enough.

Part of this text was first published in the Danish daily Politiken.

Photo: Allison Saeng, Unsplash.com