Source: https://mp.weixin.qq.com/s/pWir8xgtp0oL8mp8Hc4t_Q

Wang Xiaochuan on OpenAI o1: Finding a Path from Fast Thinking to Slow Thinking

Zhang Xiaojun, Tencent Technology (Original) · September 25, 2024, 00:01


Chief Writer: Zhang Xiaojun
Editor: Shi Ding
Produced by: Tencent News "Qianwang"
In September 2024, OpenAI's long-trailed "Strawberry" project finally launched. It reset the naming convention: instead of continuing the GPT line, the model was named o1. The industry sees o1 as a major shift, or major upgrade, of the AGI paradigm.
After the scaling law of language-model pre-training, long treated as an almost physical regularity, hit a bottleneck, several star Silicon Valley companies, including OpenAI, bet their resources on a new path: reinforcement learning. The release of o1 pushed reinforcement-learning-based post-training to the center of attention.
Wang Xiaochuan, founder and CEO of Baichuan Intelligence, began talking about reinforcement learning in public speeches early on. He has said that large models represent fast thinking, which he calls "learning"; reinforcement learning is slow thinking, which he calls "thinking." The two systems, learning and thinking, will eventually converge.
Wang Xiaochuan sat down with us immediately after o1's release. On o1 and reinforcement learning, he offered several core views:
1. o1 does not represent a paradigm turn but a paradigm upgrade; OpenAI has found a path from fast thinking to slow thinking.
2. o1 deserves attention on two points: the move from a language-centric model toward the chain of thought (CoT, "Chain of Thought"), with far greater emphasis on CoT; and splitting the thinking process and the final answer into two stages, which increases generalization.
3. Besides mathematics and code, the AI doctor is a field that reinforcement learning can improve.
4. He also made a prediction: code will become the next core capability of large models. Large models will solve more problems, even their own thinking processes, by writing code; in the coming years the field will move from the reinforcement-learning paradigm to a new paradigm of solving problems by writing code.
The following is an excerpt from the interview with Wang Xiaochuan. (For readability, the author has lightly edited the text.)
o1 Has Found a Path from Fast Thinking to Slow Thinking
Tencent News "Qianwang": What facts do you know about how OpenAI came to do reinforcement learning?
Wang Xiaochuan: When Sam Altman was ousted in the boardroom struggle, my read was that it was not some soap-opera affair.
The board consists of people who are both smart and upright; they would not do something foolish. Some people are either stupid or malicious, but these people are neither. Behind the dispute there must have been something the rest of us had not seen.
A week earlier, I had heard word from one of their core people at the time: Noam Brown, a reinforcement learning star previously at DeepMind, had joined OpenAI and reportedly told friends they had made some breakthrough. A week later, the Sam Altman incident happened.
These were two separate events. First, the boardroom struggle: there must have been internal doubts about some technology the outside world had not seen. The complaint was that Sam Altman was fairly aggressive, underweighted safety, and put safety behind technological breakthroughs. Some breakthrough technology might be unsafe, but he seemed not to care and wanted to push the breakthrough forward as fast as possible. It just so happened that Noam said they had a technical breakthrough, and Noam represents reinforcement learning.
Putting the two together, we inferred at the end of last year that OpenAI had made some breakthroughs in reinforcement learning.
Tencent News "Qianwang": Early this year, when OpenAI released Sora, your technical colleagues wanted to follow it, and you shut that down. What about o1 this time?
Wang Xiaochuan: The core of this technical paradigm is the language model, then the move toward reinforcement learning; these are the two stages of raising intelligence.
Sora is neither language learning nor reinforcement learning, so it is not on the path of raising intelligence; it is a separate, independent product. Building a Sora therefore does not technically represent a gain in intelligence, and the scenario is not what Baichuan pursues either. At the time, I felt that colleagues who wanted to follow it had "neither thought the technology through nor thought the scenario through."
When Baichuan entered the field in April last year, we were already emphasizing reinforcement learning, and stressing that multimodality is not the direction of intelligence gains.
We say large models are "reading ten thousand books" and reinforcement learning is "traveling ten thousand miles." A large model on its own is "learning without thinking leads to confusion": it reads a great many books, but its mind stays muddled. Standalone reinforcement learning has one canonical work, AlphaGo. I believe AlphaGo was the enlightenment moment for artificial intelligence, a classic representative of reinforcement learning, especially self-play. DeepMind kept walking that road until it turned out to be "thinking without learning is perilous": stuck inside one problem with no way out.
So both technologies have their own limitations.
In the history of AI, DeepMind has many achievements, such as AlphaGo and AlphaZero, which needed hardly any data at all. Unfortunately for them, it was OpenAI that pushed general intelligence a step forward, starting from language. One is the method of learning, the other the method of thinking. Sooner or later these two technologies (learning and thinking) would be joined together.
Tencent News "Qianwang": Some commenters say that "compared with GPT-4o, the o1 model takes one step forward and two steps back." What do you think?
Wang Xiaochuan: I would not call it one step forward and two steps back, nor a turn. It is a paradigm upgrade.
Fast thinking is part of the process by which slow thinking is born. You must first have fast thinking before you can have slow thinking; it is not a turn. The question is how to use the fast thinking of large models and then teach them slow thinking. That is an advancement.
I use the DIKW model: Data to Information to Knowledge and finally to Wisdom, four steps. Search sat at the Information layer, helping you obtain information. With LLMs we reached the Knowledge layer: knowledge, communication, fast thinking. Now, with slow thinking, it has evolved from Knowledge to an embryonic form of Wisdom; it truly begins to be intelligent.
So it is a paradigm upgrade that takes the original model as one of its components. It is no longer reinforcement learning serving the large model; the large model is now one of its components. That is a big leap.
In summary: neither a turn nor a step backward, but the discovery of a path toward slow thinking.
Tencent News "Qianwang": Why do we need an AI that can think slowly? What can it help us solve?
Wang Xiaochuan: Intelligence is inherently a process of thinking. End-to-end autonomous driving also needs to think one step, two steps, three steps ahead inside that end-to-end process. It is like proving a geometry theorem: to solve it you must have a line of reasoning. Whenever you have a thinking process, that is slow thinking. So intelligence itself requires multi-step thinking.
Most moderately complex problems, whether code, data, logic, or the everyday things we want to solve, have to be worked out step by step, rather than producing a fast-thinking answer off the top of one's head.
Apart from literary writing, where fast thinking can produce a poem in one breath, most of the time multiple steps, and hence slow thinking, are needed.
Tencent News "Qianwang": o1 hides its thinking process, and people who try to extract o1's chain of thought have even been warned their accounts would be banned. Why does OpenAI do this?
Wang Xiaochuan: Previously, with large models, every company distilled from OpenAI's data and could quickly close in on it. OpenAI is, after all, a commercial company, not a public-interest one. Once the chain of thought is public, others would not only imitate its logic but, even more easily, go after its data: cracking not just its algorithm but its data pipeline.
That would let others progress very quickly. It also shows that the exclusivity of this technology is itself limited.
So the blackout is a competitive strategy.
From Language-Centric to Chain of Thought;
Two-Stage Operation Increases Generalization:
These Two Sentences Capture the Essence of Reinforcement Learning
Tencent News "Qianwang": How should we view o1? Is it a transitional product form?
Wang Xiaochuan: o1 is somewhat like the release of GPT-3 back then: still some distance from the breakthroughs of 3.5 and 4, but GPT-3's release already stunned the industry.
Tencent News "Qianwang": A few days ago I spoke with a former OpenAI researcher (Wu Yi, founder of Biansai Technology and assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences), whose research field is reinforcement learning. He said we are now effectively moving from stage one to stage two: the gold left to mine in pre-training was dwindling, and everyone realized that post-training built on reinforcement learning is the second big gold mine, adding a few more rungs to the ladder toward AGI.
Wang Xiaochuan: My understanding is exactly the same.
Tencent News "Qianwang": In your view, what are the key points to pay attention to in OpenAI's o1?
Wang Xiaochuan: First, it stays language-centered, what you might call a language axis.
Previously, many people more or less assumed multimodality was intelligence. But look at OpenAI: multimodality has not helped much. Language is still the core, and it has gone a step further, from language-centric to CoT ("Chain of Thought"), with much heavier emphasis on the chain of thought. When language carries the thinking in the middle, it becomes multi-step thinking.
Second, it splits the thinking process and the delivery of the answer into two steps, which allows the thinking process itself to generalize better.
For example, in solving math problems, learning one line of attack may let you solve many problems. So the goal is not that one particular problem comes out right, but that the solving process is right. Once it is split into two stages, the CoT can generalize: from solving one math problem to solving many, and even to lifting shared capabilities in other domains.
So the core things to watch are the language-centric CoT, and the two-stage operation that increases generalization. Those two sentences carry a lot of information; they already capture the essence of this reinforcement learning.
Tencent News "Qianwang": Can you introduce the concept of "reinforcement learning"?
Wang Xiaochuan: The difference between reinforcement learning and the earlier supervised learning is this: in supervised learning you must show the model the solution process, and it copies the template; in reinforcement learning you do not show it the process, you only judge whether what it did was right or wrong.
It is like teaching a child. If you spell out steps one, two, three, the child may learn quickly but without "knowing the why." But if you simply say "right" when they get it right and "wrong" when they get it wrong, the child has to work out the method on their own. That is the essential difference between reinforcement learning and supervised learning.
Why do large models put such emphasis on reinforcement learning? A large model is essentially trained on the best language data in the world; we call it "a compression process." Compression yields an intelligence that stays "within the distribution" of the original data; its thinking ability will not exceed that original data.
But we know that intelligence, strictly speaking, means thinking outside the original frame. Mathematically this is called "out of distribution," while large models are "in distribution"; intelligence means exploring what was previously unknown. So at that point you need to create an environment, so that through interaction the environment's feedback supplies content beyond the original language data and raises your problem-solving ability, your intellect.
Moving from "in distribution" to "out of distribution" is a necessary passage for intelligence. So reinforcement learning becomes a necessity.
Tencent News "Qianwang": What key technical principles are involved, and is it hard to replicate?
Wang Xiaochuan: There are many data and engineering problems to solve. Replication itself becomes very easy if you distill from it; but to replicate it properly, your compute and the experts you need to label such a system still pose plenty of challenges.
It will be somewhat harder than replicating GPT-4.
Tencent News "Qianwang": So it still needs experts, needs human work?
Wang Xiaochuan: I think so. It also needs people to teach it.
Tencent News "Qianwang": Can self-play RL (self-play reinforcement learning) reduce the human involvement?
Wang Xiaochuan: Definitely. There is a saying in computer science: solving a problem is harder than verifying one. Finding the answer to a problem is harder than judging whether an answer is correct.
In a maze, finding the way out is hard, but verifying whether a given path is valid, whether it gets through, whether it hits a wall, is easy. Likewise with geometry proofs: producing the proof is hard, but once you have the solution process, having someone else check it for bugs is easy.
We are very willing to use reinforcement learning precisely because of this: I may not know how to solve the problem, but I can verify whether your solution is correct. In that setting, the whole system's capability can rise a great deal, and the difficulty of labeling data drops; at the same labeling difficulty, it can solve harder problems. That is the core logic here.
Tencent News "Qianwang": Can reinforcement learning achieve generalization? Can it lift general intelligence?
Wang Xiaochuan: Reinforcement learning's generalization used to be poor. AlphaGo back then did not generalize well.
This time, I think OpenAI has done reinforcement learning quite well, based on two things. First, it confined itself to mathematics and code, and achieved big enough breakthroughs within those narrow domains. That also tells you those two fields have good enough data to verify against: Is the math problem solved correctly? Does a program compile? Does it run and match the result you wanted? So in settings that demand no generalization and have absolute answers, it performs especially well.
Second, its generalization comes from the earlier split into two stages, separating the CoT from the execution that follows. It is like how, after training on code, the whole system's logical ability improved; as we discussed before, the logic gains in GPT-3.5, once its two versions came together, came from learning code. It is the same now: generalization in other scenarios comes from a better grasp of the CoT for math and code, and that CoT can generalize to other kinds of thinking.
Tencent News "Qianwang": A common criticism of GPT-4 was that its math ability was relatively weak; o1 has become a specialist in math and programming. Will more models focused on specific domains appear in the future?
Wang Xiaochuan: I do not see it as a narrow specialist. It is now a model that is "still good at the humanities while suddenly becoming especially strong in STEM."
At least the roadmap OpenAI represents is the general-purpose road, gradually expanding such domains. That does not mean OpenAI's own closed data loop can reach omniscience. When the model is used in individual fields, domain data will play a very important role.
Tencent News "Qianwang": How much compute and data does it take to build an o1? Any estimate?
Wang Xiaochuan: Probably about the same as building a GPT-4.
Tencent News "Qianwang": What would o1 + GPT-4o produce?
Wang Xiaochuan: No merge is needed. It is called o1 now; the versioning has been reset.
Merging is not hard in itself; even if one cannot contain the other, two separate calls work fine.
Tencent News "Qianwang": o1 is only the first step of the new paradigm. How will it evolve from here?
Wang Xiaochuan: Its compute will keep growing, training efficiency will improve, and there is a great deal still to mine in how to use it well on domain data.
A few things may happen next. First, better generalization within domains: finding the paradigm that builds domain knowledge into it is a breakthrough waiting to happen.
Second, further out, I can make a prediction: code will play a far more important role.
Until now, code has served to improve logical ability or to assist programmers in writing code. I believe code will become the next core capability of large models.
That is, large models will solve more problems by writing code, including their own thinking processes, moving from the reinforcement-learning paradigm to a new paradigm of solving problems by writing code. This will be realized within the next few years.
Stepping Out of the Big Players' Firing Range:
At Least One of the Large-Model "Six Little Dragons" Will Survive
Tencent News "Qianwang": How is Baichuan walking this reinforcement-learning road?
Wang Xiaochuan: Baichuan has always taken reinforcement learning seriously; we set up such a team last year. OpenAI is ahead of us, that has to be admitted.
For the Baichuan3 release we ran an experiment: training poetry with reinforcement learning. Reinforcement learning depends on a gold standard; you train where right and wrong can be judged absolutely, which is why it usually means STEM tasks, math and code being doable. In the humanities there is no right-or-wrong criterion, and it is hard for a machine to grade whether writing is good. So we asked whether the humanities could also have a Reward Model, and we thought of Tang shi and Song ci poetry.
Song ci in particular is hard for people to write: it has many requirements on character counts, tonal patterns (ping and ze), rhyme, and parallel couplets. But requirements are, in effect, rules. When we trained the model, we did not have the machine imitate human-written poems; rather, after the machine wrote a poem, we used a program to judge whether it met the character-count, tonal, rhyme, and parallelism requirements. We ran this experiment during pre-training and got decent results, which shows we had prior accumulation and thinking in this area.
Going further, beyond math and code, we think the doctor is a very good domain for reinforcement learning to improve. Medicine has standard answers to many questions. Given a patient's overall symptoms, what disease does he have? What tests and examinations should be ordered, what drugs prescribed? These questions have answers.
If you model the doctor's CoT and then verify the answers against it, the model's skill rises sharply. A doctor's judgment does not come from merely reading medical-school textbooks. Over a career a doctor may see tens of thousands of patients in clinical practice and improve from that. Doctors improve through interaction with patients, and much of that data is recorded.
So reinforcement learning is an especially good method for medicine; it can greatly improve the feasibility and quality of healthcare.
Tencent News "Qianwang": Why did you pick poetry, a humanities domain, for the experiment back then, rather than STEM domains like math or programming?
Wang Xiaochuan: It was easy to get started.
Any breakthrough comes with challenges. Our model was already good at the humanities; its humanities weak spot was poetry. So using poetry for validation was easier to get an experiment running than doing math and code at the time.
Tencent News "Qianwang": How was the Reward Model designed?
Wang Xiaochuan: First we had a program that could adjudicate, for a given ci poem, things like its character counts. There are roughly a hundred-odd cipai (tune patterns), and we had data on the format of each. The tones, first through fourth, as well as rhythm and rhyme, can all be checked by a program. We had already written a Reward Model then: first a rule-based judgment, then generalize it into a model. That roadmap is quite close to o1's approach.
But ours was not as complete. What makes o1 particularly good is the CoT process; at the time ours had no CoT.
Tencent News "Qianwang": Now that you have seen o1, which technical paths can you reproduce to improve your own approach?
Wang Xiaochuan: We now put more emphasis on CoT. Originally there was no CoT step in the middle; we went straight from input to answer.
With CoT: first, in our medical work we look for the doctor's path of reasoning, which improves the model's ability faster, a CoT process rather than a purely end-to-end one; second, with CoT, generalization improves greatly: as long as the line of reasoning is right, the answer is right.
Tencent News "Qianwang": After more than a year of reinforcement learning, have you accumulated more know-how about it?
Wang Xiaochuan: Part of reinforcement learning is learning new things from the environment; part of it, I have found, activates abilities the model already had. In the poetry work, we taught it character counts, tones, and rhyme, and the large model produced parallel couplets on its own, before we had ever taught it parallelism.
That shows there are latent memories and abilities that can be activated. So within reinforcement, on one hand it is a paradigm aimed at the future; on the other, its logic is not quite the same as classical reinforcement learning.
Tencent News "Qianwang": But AI has cooled somewhat in recent months. Can o1 restore people's confidence in AI?
Wang Xiaochuan: I do not pay much attention to the external climate. I have indeed heard things are cooling, that people feel lost, that technical breakthroughs have slowed, or that application scenarios have not been found.
For Baichuan, though, the application scenario was clear from the start: build advisers in knowledge domains, above all build doctors. The scenario is clear and has moved closer to a result; we are not out discovering new continents.
Tencent News "Qianwang": Have domestic companies reached the GPT-4 level now?
Wang Xiaochuan: They are getting close.
Tencent News "Qianwang": How would the time to replicate o1 compare with GPT-4?
Wang Xiaochuan: It will be faster than building GPT-4. Hard as it is, with so many open-source projects emerging at home and in the US, and with both big companies and startups in the field, the abundance of capital and the concentration of talent far exceed the market's talent and funding reserves right after GPT-3.5 or GPT-4 first shipped.
Within a month or two, models that come close to theirs will start to appear. It will be fast.
Tencent News "Qianwang": Do you mean domestically or abroad?
Wang Xiaochuan: Both are possible. GPT-4 took, say, 18 months; reaching o1's level might take nine. A first rough version may appear in one to two months. Reaching the same height will take real effort.
Tencent News "Qianwang": What do you want to know about o1 but do not?
Wang Xiaochuan: Quite a lot, for example how much compute it used and how many domain experts it had.
Tencent News "Qianwang": What is the visible ceiling of o1?
Wang Xiaochuan: I think that within the next two to three years this paradigm will run to its results, the same as GPT-3.5 to 4.
Beyond that, code may play a more important role: machines writing code themselves, running it, generating a neural network, and even fusing that neural network back into the model.
I believe new paradigms will still emerge.
But once that step is done, I think AGI will be close.
Tencent News "Qianwang": What do you plan to do next?
Wang Xiaochuan: On one hand, follow up where the US leads; on the other, stay committed to breaking through in medical scenarios.
Tencent News "Qianwang": You said last year was about catching the train of the era, a state of rushing. What about this year?
Wang Xiaochuan: Last year we did not dare speak loudly about medicine. I said "medicine is the jewel in the crown of large models," and people did not quite see the scenario's feasibility. They would ask about the business model and about ethics.
Last year we built just one wheel, getting the model into the game quickly. Starting this year, we run truly dual-wheel drive: "super model" plus "super application." And it is a "rising tide lifts the boat" application, not merely a "laying eggs along the way" model.
What does a "rising tide lifts the boat" application mean? The bigger the model, the better I can do in this domain, rather than the model growing to a stage where it no longer has anything to do with my domain. "Laying eggs along the way" means I lay an egg and leave it there; however good the model gets, you just lay another new egg. In that case, the more eggs you have, the more they drag you down.
So: build an advertising model and park it there, then build a customer-service model and park it there; that is not a rising tide, that is being drowned as the model grows. Medicine is the opposite: the bigger the model, the better this business's odds of survival. That is the rising tide.
Tencent News "Qianwang": In other words, assume the model becomes especially capable, then ask which scenarios it applies to.
Wang Xiaochuan: Right. But I can also enter the scenario while the model is still ordinary; the bigger the model grows, the more the scenario benefits. You look for a scenario like that.
Tencent News "Qianwang": So once you are in, you wait.
Wang Xiaochuan: Of course we still have to work hard.
But waiting is also right: the better the model, the more the scenario benefits.
Tencent News "Qianwang": Of the two legs, model and application, which are you more satisfied with now?
Wang Xiaochuan: Both are at the starting state.
The two are also related going forward: the clearer your scenario, the more refined your requirements on the model.
Tencent News "Qianwang": What is the final form we will see in the medical scenario? It does not seem like it will be a Super App; it is quite hard to imagine.
Wang Xiaochuan: The old world was the App, the logic of PMF (product-market fit): I discover demand, satisfy demand, create demand. Last year I proposed TPF (technology-product fit): we move from demand-driven to supply-driven. Supply-driven means the demand already exists in reality; only the supply is insufficient, so if I build it, the market is there. I put more weight on the match between technology and product.
A big part of the logic of large models is "making people," making digital employees, because a large model has language, thinks, communicates, and has learned the knowledge and experience humanity left behind. So the logic is not that of making calculators or making cars; it is making people. We have made making doctors the focal breakthrough.
Seen as a product form, you are building a usable doctor. It starts from general practice and pediatrics, moves toward specialist doctors, and in the end toward a mathematical model of life. That is the next stage, from intelligence models to life models, the long-term goal. Within the machine-intelligence model, it is like an intelligent person; it simply is a doctor.
Tencent News "Qianwang": What will humans' interface with it be?
Wang Xiaochuan: Interaction through natural language.
It might be an app, or a terminal device in a hospital, but in the end the interaction runs on language. Language, or vision, just as with a person.
Tencent News "Qianwang": When will Baichuan show everyone a big breakthrough on the product side?
Wang Xiaochuan: Within this year. Starting this year you may get to touch a bit of it, something that talks with people.
Tencent News "Qianwang": You said last time you would make three kinds of people. Besides the doctor, what do you think now about the other two?
Wang Xiaochuan: We will also build the more general adviser.
Entertainment we have deprioritized. The goal of entertainment is to build virtual worlds, and the time is not ripe. So we can wait for now and first build the general adviser and the doctor.
The entertainment we have in mind is not a thing that chats with you, but something that creates a world, a narrative story. There is not yet enough data or resources to train that.
Tencent News "Qianwang": The chatbot market is now a red ocean. What will the endgame look like?
Wang Xiaochuan: We do not even know whether it is a market, never mind calling it a red ocean.
Tencent News "Qianwang": How many of the "Six Little Dragons" of large-model startups can survive?
Wang Xiaochuan: At least one.
Tencent News "Qianwang": How do you see the competition between ByteDance and the large-model startups?
Wang Xiaochuan: ByteDance runs saturation attacks. Within an established consensus, ByteDance will move very fast. But there must be insight above theirs, things they cannot see, or things their organization cannot execute; only there do startups have a chance to survive.
Step outside the big players' firing range; inside that range there is no good way to live.





