If data is “The New Oil,” then what is AI?

Back when the term ‘big data’ was very new and extremely promising, we enthusiastically told each other, “data is the new oil!” The idea was that the raw material ‘data’ would lead to a huge boom in new and useful applications. That comparison might actually be more accurate than we thought.

In my favorite version of that metaphor, oil maps onto fuel: data powers AI applications the way fuel powers cars. But oil is also the basis for another innovation that might be even more ubiquitous in our daily lives: plastic.

Data is everywhere, and the amount keeps growing. Oil is everywhere too; or rather, its products are: CO₂ and plastic. We’ve already discussed carbon (the effect of AI on emissions), so let’s focus on plastic here.

Plastic and AI slop

Plastic has brought us a lot (food safety, for example). But when plastic breaks down, you get microplastics. Microplastics are everywhere, including in places where we absolutely don’t want them. They have negative effects on our health, even though we don’t yet know exactly how severe those effects are. Even worse, these microplastics can’t be removed.

Microplastics can be compared to low-quality text, typically the kind produced by AI: AI slop.

I even encounter slop in the kitchen

(A quick aside. When I’m cooking and wonder how long green beans need to boil, I sometimes end up on sites like this. It takes quite a bit of scrolling to get to the information I was looking for. The AI summaries that many people seem to hate are not a bad thing here!)

Health

What does AI slop do to our own mental health? There’s already talk of brain rot (more on that in another blog post). Apart from that, the sheer volume of text that can now be generated is becoming a problem of its own for companies using AI.

AI slop is annoying for us, but it also harms the ‘health’ of AI itself. It turns out that if you train new AI models on texts generated by AI, a kind of inbreeding occurs. The technical term for this is model collapse.

Here’s how it works. When an AI model is created, it “learns” from the training data. The model tries to model the reality that is represented in that training data (hence the word ‘model’!). In doing so, you always lose some information: simplifications occur because all the data is lumped together statistically. In ‘real’ data, you sometimes find rare, exceptional, but still valid information that the model doesn’t reproduce – because it’s so rare and exceptional, of course. But what isn’t reproduced also doesn’t appear in the ‘new’ data that the AI model generates. Subsequent AI models trained on that data won’t pick it up either. It’s like copying a drawing on a copier, then copying the copy, and again, and again… until you’re left with a gray caricature of the original.
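To make that concrete, here’s a toy sketch of the effect (my own illustration with made-up numbers, not an experiment from the model-collapse research): each ‘generation’ fits a simple frequency model to the previous generation’s output and then samples new ‘training data’ from that model. Once a rare category happens to draw zero samples, the fitted model assigns it zero probability, and no later generation can ever bring it back.

```python
import numpy as np

# Toy illustration of model collapse: every generation "trains" on the
# previous generation's generated data by estimating category frequencies,
# then samples the next generation's data from that estimate. A rare
# category that draws zero samples vanishes from the model for good.
rng = np.random.default_rng(0)

probs = np.array([0.91] + [0.01] * 9)  # "real" world: 1 common + 9 rare categories
n_samples = 300                        # training-set size per generation

for generation in range(20):
    samples = rng.choice(len(probs), size=n_samples, p=probs)
    counts = np.bincount(samples, minlength=len(probs))
    probs = counts / counts.sum()      # the next "model": observed frequencies
    print(f"generation {generation}: {np.count_nonzero(probs)} of 10 categories survive")
```

On most runs the number of surviving categories drops within a handful of generations, and by construction it can never go back up: the copy-of-a-copy effect in miniature.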

It turns out that some people are now collecting and storing “data from before 2022.” This “pre-AI data” is like the low-background steel collected for certain applications, as Ars Technica insightfully noted. That steel was produced before the first nuclear tests and is ‘clean’ because there was no radioactive fallout yet to contaminate it.

Just as with microplastics, it’s not easy to remove slop from data once it’s in there: there are still no reliable ways to recognize AI-generated text.

Science itself isn’t immune either.

(It might be even worse. For many people, ‘science’ is synonymous with ‘truth.’ That’s a misconception: science tries to find the best possible explanation based on current understanding, but any explanation can always be replaced by a better one. There are now scientific publications presenting results generated by AI, and other scientists use these publications as a basis for further work! The number of papers showing traces of AI keeps growing; survey papers in particular are susceptible to this. These publications are, in principle, peer-reviewed, but with such volumes you have to wonder whether that is still done properly. Will AI slop become a problem in science too?)

Money and geopolitics

Oil has a significant economic impact, and here too the analogy with data holds: it takes a lot of capital to actually cash in on the economic benefits of oil. You need drilling platforms and refineries. It’s the same with data: only when you have a huge amount of it can you train a somewhat reliable language model. It’s the “winner takes all” dynamic we know from the dotcom era.

One last similarity between oil and data concerns colonialism. To some extent (not everywhere), oil was extracted from other countries, usually in a way that benefited those countries little. “Data colonialism” is a term you see more often these days: the tendency of powerful companies (Big Tech) to grab people’s data for their own gain. Almost all language models are trained on data scraped from websites without the rights holders’ permission. It’s strikingly similar to 17th-century colonialism: go out and take what you can. Data, like oil, is deeply intertwined with geopolitics.

So what?

In short, the comparison of data to oil is not a bad one: the AI data centers of Big Tech can be compared to the refineries of Big Oil.

We rarely realize how huge the impact of oil on our daily lives is. That impact is positive in many ways, although the dark side is plainly visible. With AI, we are still in a position to steer its development in such a way that we minimize the dark side and focus on the positive impacts.
