Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.

“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong”, he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.
The researchers began by using an LLM to create Wikipedia-like entries, then trained new iterations of the model on text produced by its predecessor. As the AI-generated information — known as synthetic data — polluted the training set, the model’s outputs became gibberish. The ninth iteration of the model completed a Wikipedia-style article about English church towers with a treatise on the many colours of jackrabbit tails (see ‘AI gibberish’).
More subtly, the study, published in Nature[1] on 24 July, showed that even before complete collapse, learning from AI-derived texts caused models to forget the information mentioned least frequently in their data sets as their outputs became more homogeneous.

This is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK.
“This is a fantastic paper,” says Julia Kempe, a computer scientist at New York University in New York City. Until now, many technology firms have improved their models by feeding them larger and larger amounts of data. But as human-produced content runs out, they are hoping to use synthetic data to keep improving. The study — a version of which first appeared on the arXiv preprint server in May 2023 — has spurred the AI community to try to find solutions to the problem, she says. “It’s been a call to arms.”
"이 논문은 훌륭하다," 뉴욕 대학의 컴퓨터 과학자인 줄리아 켐프가 말했다. 지금까지 많은 기술 기업들은 모델을 계속해서 더 많은 데이터로 훈련시켜 왔다. 그러나 인간이 생산하는 콘텐츠가 고갈되면서, 그들은 계속 발전하기 위해 합성 데이터를 사용하길 희망하고 있다. 이 연구는 2023년 5월에 arXiv 사전 인쇄 서버에 처음 등장한 버전의 것이며, 이에 따라 AI 커뮤니티는 문제에 대한 해결책을 찾으려고 노력하고 있다고 그녀는 말했다. "이것은 전쟁을 선포한 것이었다."
You are what you eat
Language models work by building up associations between tokens — words or word parts — in huge swathes of text, often scraped from the Internet. They generate text by spitting out the statistically most probable next word, based on these learned patterns.
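As a rough illustration of that mechanism (a toy sketch only: real LLMs use neural networks over token embeddings rather than raw word counts, and the corpus and greedy decoding below are invented for the example), a next-word generator can be built from simple bigram statistics:

```python
from collections import Counter, defaultdict

# Invented miniature corpus; real models are trained on huge web-scale text.
corpus = (
    "the church tower stands over the village and the church bells ring "
    "over the village green while the tower clock marks the hour"
).split()

# Learn which word tends to follow which: a bigram count table.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, length=10):
    """Greedily emit the statistically most probable next word at each step."""
    words = [start]
    for _ in range(length):
        followers = bigrams.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))
```

Run this way, the greedy "most probable next word" rule quickly falls into repetitive loops, which is one reason real systems typically sample from the learned distribution rather than always taking the single top word.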
To demonstrate model collapse, the researchers took a pre-trained LLM and fine-tuned it on a data set based on Wikipedia entries. They then asked the resulting model to generate its own Wikipedia-style articles. To train the next generation of the model, they started with the same pre-trained LLM, but fine-tuned it on the articles created by its predecessor. They judged the performance of each model by giving it an opening paragraph and asking it to predict the next few sentences, then comparing the output to that of the model trained on real data. The team expected to see errors crop up but were surprised to see “things go wrong very quickly”, says Shumaylov.
Collapse happens because each model necessarily samples only from the data it is trained on. This means that words that were infrequent in the original data are less likely to be reproduced, and the probability of common ones being regurgitated is boosted. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumailov.
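That feedback loop can be mimicked without any neural network at all. In the toy simulation below (illustrative assumptions throughout: each "model" is just a categorical word distribution re-estimated from a finite sample of its predecessor's output, over a made-up vocabulary of common and rare words), any rare word that fails to appear in one generation's sample gets probability zero and can never return:

```python
import random
from collections import Counter

random.seed(0)

# Invented starting vocabulary: a few common words carry 90% of the
# probability mass, while 100 rare words share the remaining 10%.
true_dist = {f"common_{i}": 0.09 for i in range(10)}
true_dist.update({f"rare_{i}": 0.001 for i in range(100)})

def sample(dist, n):
    """Draw n words from a categorical distribution."""
    words, weights = zip(*dist.items())
    return random.choices(words, weights=weights, k=n)

def fit(samples):
    """'Train' the next model: just the empirical distribution of the sample."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

dist, n = true_dist, 2000
for gen in range(1, 11):
    data = sample(dist, n)   # this generation's training set, produced by the last model
    dist = fit(data)         # the new model knows only what it happened to see
    rare_left = sum(1 for w in dist if w.startswith("rare_"))
    print(f"generation {gen}: {rare_left}/100 rare words still represented")
```

Because a word absent from one generation's sample can never be generated again, the count of surviving rare words only shrinks, which is the homogenization the study describes in miniature.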

The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality[2].
Synthetic data problems
Model collapse does not mean that LLMs will stop working, but the cost of making them will increase, says Shumailov.
As synthetic data build up across the web, the scaling laws stating that models should get better the more data they train on are likely to break, because training data will lose the richness and variety that come with human-generated content, says Kempe.
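For context, the data side of these scaling laws is usually summarized as a power-law fit of roughly the form below (a standard shape from the empirical scaling-law literature, not an equation from the new paper), in which test loss falls predictably as the number of training tokens D grows:

L(D) \approx L_{\infty} + \left(\frac{D_c}{D}\right)^{\alpha_D}

Here L_{\infty} is an irreducible loss floor and D_c and \alpha_D are constants fitted on human-written data; Kempe's point is that such fits implicitly assume the extra tokens are as rich as the originals, so a pool diluted with synthetic text no longer buys the predicted improvement.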

How much synthetic data is used in training matters. When Shumailov and his team fine-tuned each model on 10% real data, alongside synthetic data, collapse occurred more slowly. And model collapse has not yet been seen in the ‘wild’, says Matthias Gerstgrasser, an AI researcher at Stanford University in California. A study by Gerstgrasser’s team found that when synthetic data didn’t replace real data, but instead accumulated alongside them, catastrophic model collapse was unlikely[3]. It is unclear what happens when a model trains on data produced by a different AI, rather than its own.
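The distinction between replacing and accumulating data can be made concrete with the same kind of stand-in model as in the earlier sketch. This standalone toy (again purely illustrative: a categorical word distribution in place of an LLM, with invented vocabulary and sample sizes, not a reproduction of either study's experiments) keeps a fixed human-generated sample in the training pool under the 'accumulate' regime and discards it under the 'replace' regime:

```python
import random
from collections import Counter

random.seed(1)

# Invented "true" vocabulary: common words plus a long tail of rare ones.
REAL = {f"common_{i}": 0.09 for i in range(10)}
REAL.update({f"rare_{i}": 0.001 for i in range(100)})

def sample(dist, n):
    words, weights = zip(*dist.items())
    return random.choices(words, weights=weights, k=n)

def fit(samples):
    counts = Counter(samples)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def run(accumulate, generations=10, n=2000):
    real_data = sample(REAL, n)          # fixed human-generated corpus
    pool = list(real_data)
    dist = fit(pool)
    for _ in range(generations):
        synthetic = sample(dist, n)      # the current model's output
        pool = pool + synthetic if accumulate else synthetic
        dist = fit(pool)
    return sum(1 for w in dist if w.startswith("rare_"))

print("replace real data:   ", run(accumulate=False), "rare words left")
print("accumulate with real:", run(accumulate=True), "rare words left")
```

Because the real sample never leaves the accumulating pool, the rare words it contains are never lost, whereas the replace-only pool sheds them generation by generation.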
Developers might need to find ways, such as watermarking, to keep AI-generated data separate from real data, which would require unprecedented coordination by big-tech firms, says Shumailov. And society might need to find incentives for human creators to keep producing content. Filtering is likely to become important, too — for example, humans could curate AI-generated text before it goes back into the data pool, says Kempe. “Our work[4] shows that if you can prune it properly, the phenomenon can be partly or maybe fully avoided,” she says.