Generative AI’s Act o1
The Agentic Reasoning Era Begins
Two years into the Generative AI revolution, research is progressing the field from “thinking fast”—rapid-fire pre-trained responses—to “thinking slow”—reasoning at inference time. This evolution is unlocking a new cohort of agentic applications.
On the second anniversary of our essay “Generative AI: A Creative New World,” the AI ecosystem looks very different, and we have some predictions for what’s on the horizon.
The foundation layer of the Generative AI market is stabilizing in an equilibrium with a key set of scaled players and alliances, including Microsoft/OpenAI, AWS/Anthropic, Meta and Google/DeepMind. Only scaled players with economic engines and access to vast sums of capital remain in play. While the fight is far from over (and keeps escalating in a game-theoretic fashion), the market structure itself is solidifying, and it’s clear that we will have increasingly cheap and plentiful next-token predictions.
As the LLM market structure stabilizes, the next frontier is now emerging. The focus is shifting to the development and scaling of the reasoning layer, where “System 2” thinking takes precedence. Inspired by models like AlphaGo, this layer aims to endow AI systems with deliberate reasoning, problem-solving and cognitive operations at inference time that go beyond rapid pattern matching. And new cognitive architectures and user interfaces are shaping how these reasoning capabilities are delivered to and interact with users.
What does all of this mean for founders in the AI market? What does this mean for incumbent software companies? And where do we, as investors, see the most promising layer for returns in the Generative AI stack?
In our latest essay on the state of the Generative AI market, we’ll explore how the consolidation of the foundational LLM layer has set the stage for the race to scale these higher-order reasoning and agentic capabilities, and discuss a new generation of “killer apps” with novel cognitive architectures and user interfaces.
Strawberry Fields Forever
The most important model update of 2024 goes to OpenAI with o1, formerly known as Q* and also known as Strawberry. This is not just a reassertion of OpenAI’s rightful place atop the model quality leaderboards, but also a notable improvement on the status quo architecture. More specifically, this is the first example of a model with true general reasoning capabilities, which they’ve achieved with inference-time compute.
What does that mean? Pre-trained models do next-token prediction on an enormous amount of data. They rely on “training-time compute.” An emergent property of scale is basic reasoning, but this reasoning is very limited. What if you could teach a model to reason more directly? This is essentially what’s happening with Strawberry. When we say “inference-time compute,” what we mean is asking the model to stop and think before giving you a response, which requires more compute at inference time (hence “inference-time compute”). The “stop and think” part is reasoning.
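To make “stop and think” concrete, here is a minimal sketch of one common inference-time compute pattern: sample several reasoning paths, then let a verifier pick the best one. The `generate_candidate` and `score_candidate` functions are hypothetical stand-ins (OpenAI has not published o1’s actual mechanism); the point is only that extra samples at inference time buy better answers.

```python
import random

def generate_candidate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled chain of thought + answer."""
    rng = random.Random(seed)
    return f"candidate {rng.randint(0, 99)} for: {prompt}"

def score_candidate(candidate: str) -> float:
    """Hypothetical stand-in for a verifier/reward model."""
    return random.random()

def answer(prompt: str, n_samples: int) -> str:
    # More inference-time compute = more sampled reasoning paths considered.
    candidates = [generate_candidate(prompt, seed=i) for i in range(n_samples)]
    # "Stop and think": score every path and return the best one, not the first.
    return max(candidates, key=score_candidate)

print(answer("Prove there are infinitely many primes.", n_samples=16))
```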
AlphaGo x LLMs
So what is the model doing when it stops and thinks?
Let’s first take a quick detour to March 2016 in Seoul. One of the most seminal moments in deep learning history took place here: AlphaGo’s match against legendary Go master Lee Sedol. This wasn’t just any AI-vs-human match—it was the moment the world saw AI do more than just mimic patterns. It was thinking.
What made AlphaGo different from previous gameplay AI systems, like Deep Blue? Like LLMs, AlphaGo was first pre-trained to mimic human experts from a database of roughly 30 million moves from previous games and more from self-play. But rather than provide a knee-jerk response that comes out of the pre-trained model, AlphaGo takes the time to stop and think. At inference time, the model runs a search or simulation across a wide range of potential future scenarios, scores those scenarios, and then responds with the scenario (or answer) that has the highest expected value. The more time AlphaGo is given, the better it performs. With zero inference-time compute, the model can’t beat the best human players. But as the inference time scales, AlphaGo gets better and better—until it surpasses the very best humans.
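In heavily simplified form, the search AlphaGo runs at inference time looks like Monte Carlo rollouts: simulate many games from each candidate move, score the outcomes, and play the move with the highest expected value. Everything below (the toy moves, the win probabilities, the rollout budget) is illustrative, not DeepMind’s implementation.

```python
import random

MOVES = ["A", "B", "C"]  # toy stand-ins for legal board positions

def simulate_to_end(move: str) -> int:
    """Stand-in rollout: play the game out, return 1 for a win, 0 for a loss."""
    return int(random.random() < {"A": 0.4, "B": 0.6, "C": 0.5}[move])

def choose_move(rollout_budget: int) -> str:
    # A bigger budget = more inference-time compute = better value estimates,
    # which is why AlphaGo keeps improving as its think-time scales.
    wins = {m: 0 for m in MOVES}
    plays = {m: 0 for m in MOVES}
    for _ in range(rollout_budget):
        move = random.choice(MOVES)
        plays[move] += 1
        wins[move] += simulate_to_end(move)
    return max(MOVES, key=lambda m: wins[m] / max(plays[m], 1))

print(choose_move(rollout_budget=10_000))  # with a large budget, reliably "B"
```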
Let’s bring it back to the LLM world. What’s hard about replicating AlphaGo here is constructing the value function, or the function by which the responses are scored. If you’re playing Go, it’s more straightforward: you can simulate the game all the way to the end, see who wins, and then calculate an expected value of the next move. If you’re coding, it’s somewhat straightforward: you can test the code and see if it works. But how do you score the first draft of an essay? Or a travel itinerary? Or a summary of key terms in a long document? This is what makes reasoning hard with current methods, and it’s why Strawberry is comparatively strong on domains proximate to logic (e.g. coding, math, the sciences) and not as strong in domains that are more open-ended and unstructured (e.g. writing).
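The asymmetry is easy to see in code. For programming there is a natural value function (run the tests); for an essay draft there is nothing comparable to call. A toy example, with a made-up candidate solution:

```python
def candidate_sort(xs: list) -> list:
    """A model-generated coding answer we want to score."""
    return sorted(xs)

def code_value_function(fn) -> float:
    """Score a coding answer objectively by executing it against test cases."""
    tests = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5], [5, 5])]
    passed = sum(fn(list(inp)) == expected for inp, expected in tests)
    return passed / len(tests)

print(code_value_function(candidate_sort))  # 1.0: cheap, automatic, objective

# There is no analogous essay_value_function(draft) to call: the reward is
# subjective, which is why open-ended domains are harder to reason over.
```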
While the actual implementation of Strawberry is a closely guarded secret, the key ideas involve reinforcement learning around the chains of thought generated by the model. Auditing the model’s chains of thought suggests that something fundamental and exciting is happening that actually resembles how humans think and reason. For example, o1 is showing the ability to backtrack when it gets stuck as an emergent property of scaling inference time. It is also showing the ability to think about problems the way a human would (e.g. visualize the points on a sphere to solve a geometry problem) and to think about problems in new ways (e.g. solving problems in programming competitions in a way that humans would not).
And there is no shortage of new ideas that research teams are working on to push inference-time compute forward (e.g. new ways of calculating the reward function, new ways of closing the generator/verifier gap) as they try to improve the model’s reasoning capabilities. In other words, deep reinforcement learning is cool again, and it’s enabling an entire new reasoning layer.
System 1 vs System 2 Thinking
This leap from pre-trained instinctual responses (“System 1”) to deeper, deliberate reasoning (“System 2”) is the next frontier for AI. It’s not enough for models to simply know things—they need to pause, evaluate and reason through decisions in real time.
Think of pre-training as the System 1 layer. Whether a model is pre-trained on millions of moves in Go (AlphaGo) or petabytes of internet-scale text (LLMs), its job is to mimic patterns—whether that’s human gameplay or language. But mimicry, as powerful as it is, isn’t true reasoning. It can’t properly think its way through complex novel situations, especially those out of sample.
This is where System 2 thinking comes in, and it’s the focus of the latest wave of AI research. When a model “stops to think,” it isn’t just generating learned patterns or spitting out predictions based on past data. It’s generating a range of possibilities, considering potential outcomes and making a decision based on reasoning.
For many tasks, System 1 is more than enough. As Noam Brown pointed out on our latest episode of Training Data, thinking for longer about what the capital of Bhutan is doesn’t help—you either know it or you don’t. Quick, pattern-based recall works perfectly here.
But when we look at more complex problems—like breakthroughs in mathematics or biology—quick, instinctive responses don’t cut it. These advances required deep thinking, creative problem-solving and—most importantly—time. The same is true for AI. To tackle the most challenging, meaningful problems, AI will need to evolve beyond quick in-sample responses and take its time to come up with the kind of thoughtful reasoning that defines human progress.
A New Scaling Law: The Inference Race is On
The most important insight from the o1 paper is that there’s a new scaling law in town.
Pre-training LLMs follows a well-understood scaling law: the more compute and data you spend on pre-training the model, the better it performs.
The o1 paper has opened up an entire new plane for scaling compute: the more inference-time (or “test-time”) compute you give the model, the better it reasons.
Source: OpenAI o1 technical report
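A rough way to write the two regimes down (illustrative functional forms with unspecified constants; the o1 report only shows the qualitative trend that accuracy climbs roughly log-linearly with test-time compute):

```latex
% pre-training scaling (power law in training compute):
L(C_{\text{train}}) \;\approx\; a \, C_{\text{train}}^{-\alpha}

% reasoning scaling (log-linear in test-time compute):
\mathrm{Acc}(C_{\text{test}}) \;\approx\; b + c \,\log C_{\text{test}}
```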
What happens when the model can think for hours? Days? Decades? Will we solve the Riemann Hypothesis? Will we answer Asimov’s last question?
This shift will move us from a world of massive pre-training clusters toward inference clouds—environments that can scale compute dynamically based on the complexity of the task.
One Model to Rule Them All?
What happens as OpenAI, Anthropic, Google and Meta scale their reasoning layers and develop more and more powerful reasoning machines? Will we have one model to rule them all?
One hypothesis at the outset of the Generative AI market was that a single model company would become so powerful and all-encompassing that it would subsume all other applications. This prediction has been wrong so far in two ways.
First, there is plenty of competition at the model layer, with constant leapfrogging for SOTA capabilities. It’s possible that someone figures out continuous self-improvement with broad-domain self-play and achieves takeoff, but at the moment we have seen no evidence of this. Quite to the contrary, the model layer is a knife-fight, with price per token for GPT-4 coming down 98% since the last dev day.
Second, the models have largely failed to make it into the application layer as breakout products, with the notable exception of ChatGPT. The real world is messy. Great researchers don’t have the desire to understand the nitty-gritty end-to-end workflows of every possible function in every possible vertical. It is both appealing and economically rational for them to stop at the API, and let the developer universe worry about the messiness of the real world. Good news for the application layer.
The Messy Real World: Custom Cognitive Architectures
The way you plan and prosecute actions to reach your goals as a scientist is vastly different from how you would work as a software engineer. Moreover, it even differs for software engineers at different companies.
As the research labs further push the boundaries on horizontal general-purpose reasoning, we still need application or domain-specific reasoning to deliver useful AI agents. The messy real world requires significant domain and application-specific reasoning that cannot efficiently be encoded in a general model.
Enter cognitive architectures, or how your system thinks: the flow of code and model interactions that takes user input and performs actions or generates a response.
For example, in the case of Factory, each of their “droid” products has a custom cognitive architecture that mimics the way that a human thinks to solve a specific task, like reviewing pull requests or writing and executing a migration plan to update a service from one backend to another. The Factory droid will break down all of the dependencies, propose the relevant code changes, add unit tests and pull in a human to review. Then, after approval, it runs the changes across all of the files in a dev environment and merges the code if all the tests pass. Just like how a human might do it—in a set of discrete tasks rather than one generalized, black-box answer.
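A minimal sketch of what such a cognitive architecture looks like in code: an explicit pipeline of discrete, checkable steps with a human gate, rather than one call to a model. Every function here is a hypothetical placeholder, not Factory’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    passed: bool

# Hypothetical stand-ins for each stage of a droid-style workflow.
def break_down_dependencies(task: str) -> list[str]:
    return [f"dependency of: {task}"]

def propose_code_changes(deps: list[str]) -> list[str]:
    return [f"edit for {d}" for d in deps]

def add_unit_tests(changes: list[str]) -> list[str]:
    return [f"test for {c}" for c in changes]

def request_human_review(changes: list[str]) -> bool:
    return True  # pull in a human before anything runs

def run_in_dev_environment(changes: list[str]) -> list[TestResult]:
    return [TestResult(passed=True) for _ in changes]

def run_migration_droid(task: str) -> None:
    """Discrete tasks with checkpoints, not one generalized black-box answer."""
    deps = break_down_dependencies(task)         # 1. plan the work
    changes = propose_code_changes(deps)         # 2. draft the edits
    changes += add_unit_tests(changes)           # 3. guard them with tests
    if not request_human_review(changes):        # 4. human-in-the-loop gate
        return
    results = run_in_dev_environment(changes)    # 5. execute in a sandbox
    if all(r.passed for r in results):           # 6. merge only if all green
        print(f"merged {len(changes)} changes")

run_migration_droid("update the service to the new backend")
```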
What’s Happening with Apps?
Imagine you want to start a business in AI. What layer of the stack do you target? Do you want to compete on infra? Good luck beating NVIDIA and the hyperscalers. Do you want to compete on the model? Good luck beating OpenAI and Mark Zuckerberg. Do you want to compete on apps? Good luck beating corporate IT and global systems integrators. Oh. Wait. That actually sounds pretty doable!
Foundation models are magic, but they’re also messy. Mainstream enterprises can’t deal with black boxes, hallucinations and clumsy workflows. Consumers stare at a blank prompt and don’t know what to ask. These are opportunities in the application layer.
Two years ago, many application layer companies were derided as “just a wrapper on top of GPT-3.” Today those wrappers turn out to be one of the only sound methods to build enduring value. What began as “wrappers” have evolved into “cognitive architectures.”
Application layer AI companies are not just UIs on top of a foundation model. Far from it. They have sophisticated cognitive architectures that typically include multiple foundation models with some sort of routing mechanism on top, vector and/or graph databases for RAG, guardrails to ensure compliance, and application logic that mimics the way a human might think about reasoning through a workflow.
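As a concrete (and deliberately simplified) sketch of that shape, assuming hypothetical components rather than any particular vendor’s stack:

```python
def route(query: str) -> str:
    """Routing mechanism: send each query to the best-suited foundation model."""
    return "code-model" if "code" in query.lower() else "general-model"

def retrieve_context(query: str) -> list[str]:
    """Stand-in for RAG over a vector and/or graph database."""
    return [f"doc snippet relevant to: {query}"]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a foundation-model API call."""
    return f"[{model}] draft answer grounded in: {prompt[:48]}..."

def passes_guardrails(answer: str) -> bool:
    """Compliance check before anything reaches the user."""
    return "forbidden" not in answer

def handle(query: str) -> str:
    model = route(query)                          # routing across models
    context = " ".join(retrieve_context(query))   # RAG
    draft = call_model(model, f"{query} | {context}")
    if not passes_guardrails(draft):              # guardrails
        return "Escalating to a human."
    return draft                                  # application logic end to end

print(handle("Summarize the refund policy"))
```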
Service-as-a-Software
The cloud transition was software-as-a-service. Software companies became cloud service providers. This was a $350B opportunity.
Thanks to agentic reasoning, the AI transition is service-as-a-software. Software companies turn labor into software. That means the addressable market is not the software market, but the services market measured in the trillions of dollars.
What does it mean to sell work? Sierra is a good example. B2C companies put Sierra on their website to talk with customers. The job-to-be-done is to resolve a customer issue. Sierra gets paid per resolution. There is no such thing as “a seat”. You have a job to be done. Sierra does it. They get paid accordingly.
This is the true north for many AI companies. Sierra benefits from having a graceful failure mode (escalation to a human agent). Not all companies are so lucky. An emerging pattern is to deploy as a copilot first (human-in-the-loop) and use those reps to earn the opportunity to deploy as an autopilot (no human in the loop). GitHub Copilot is a good example of this.
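One illustrative way to picture the copilot-to-autopilot progression: the same agent runs in both modes, and the human gate is lifted only once measured accuracy on reviewed suggestions clears a bar. The threshold and the accuracy tracking below are invented for the sketch.

```python
def agent_suggestion(ticket: str) -> str:
    """Stand-in for the agent's proposed resolution."""
    return f"proposed resolution for: {ticket}"

def handle_ticket(ticket: str, accuracy_so_far: float,
                  autopilot_bar: float = 0.95) -> str:
    suggestion = agent_suggestion(ticket)
    if accuracy_so_far < autopilot_bar:
        # Copilot mode: human-in-the-loop reviews every suggestion,
        # and those reps build the track record.
        return f"COPILOT (human reviews first): {suggestion}"
    # Autopilot mode: the earned right to act with no human in the loop.
    return f"AUTOPILOT (executed directly): {suggestion}"

print(handle_ticket("billing question", accuracy_so_far=0.90))
print(handle_ticket("billing question", accuracy_so_far=0.97))
```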
A New Cohort of Agentic Applications
With Generative AI’s budding reasoning capabilities, a new class of agentic applications is starting to emerge.
What shape do these application layer companies take? Interestingly, these companies look different than their cloud predecessors:
- Cloud companies targeted the software profit pool. AI companies target the services profit pool.
- Cloud companies sold software ($ / seat). AI companies sell work ($ / outcome).
- Cloud companies liked to go bottoms-up, with frictionless distribution. AI companies are increasingly going top-down, with high-touch, high-trust delivery models.
We are seeing a new cohort of these agentic applications emerge across all sectors of the knowledge economy. Here are some examples.
- Harvey: AI lawyer
- Glean: AI work assistant
- Factory: AI software engineer
- Abridge: AI medical scribe
- XBOW: AI pentester
- Sierra: AI customer support agent
By bringing the marginal cost of delivering these services down—in line with the plummeting cost of inference—these agentic applications are expanding and creating new markets.
Take XBOW, for example. XBOW is building an AI “pentester.” A “pentest” or penetration test is a simulated cyberattack on a computer system that companies perform in order to evaluate their own security systems. Before Generative AI, companies hired pentesters only in limited circumstances (e.g. when required for compliance), because human pentesting is expensive: it’s a manual task performed by a highly skilled human. However, XBOW is now demonstrating automated pentests built on the latest reasoning LLMs that match the performance of the most highly skilled human pentesters. This multiplies the pentesting market and opens up the possibility of continuous pentesting for companies of all shapes and sizes.
What does this mean for the SaaS universe?
Earlier this year we met with our Limited Partners. Their top question was “will the AI transition destroy your existing cloud companies?”
We began with a strong default of “no.” The classic battle between startups and incumbents is a horse race between startups building distribution and incumbents building product. Can the young companies with cool products get to a bunch of customers before the incumbents who own the customers come up with cool products? Given that so much of the magic in AI is coming from the foundation models, our default assumption has been no—the incumbents will do just fine, because those foundation models are just as accessible to them as they are to the startup universe, and they have the preexisting advantages of data and distribution. The primary opportunity for startups is not to replace incumbent software companies—it’s to go after automatable pools of work.
That being said, we are no longer so sure. See above re: cognitive architectures. There’s an enormous amount of engineering required to turn the raw capabilities of a model into a compelling, reliable, end-to-end business solution. What if we’re just dramatically underestimating what it means to be “AI native”?
Twenty years ago the on-prem software companies scoffed at the idea of SaaS. “What’s the big deal? We can run our own servers and deliver this stuff over the internet too!” Sure, conceptually it was simple. But what followed was a wholesale reinvention of the business. EPD went from waterfalls and PRDs to agile development and AB testing. GTM went from top-down enterprise sales and steak dinners to bottoms-up PLG and product analytics. Business models went from high ASPs and maintenance streams to high NDRs and usage-based pricing. Very few on-prem companies made the transition.
What if AI is an analogous shift? Could the opportunity for AI be both selling work and replacing software?
With Day.ai, we have seen a glimpse of the future. Day is an AI native CRM. Systems integrators make billions of dollars configuring Salesforce to meet your needs. With nothing but access to your email and calendar and answers to a one-page questionnaire, Day automatically generates a CRM that is perfectly tailored to your business. It doesn’t have all the bells and whistles (yet), but the magic of an auto-generated CRM that remains fresh with zero human input is already causing people to switch.
The Investment Universe
Where are we spending our cycles as investors? Where is funding being deployed? Here’s our quick take.
Infrastructure
This is the domain of hyperscalers. It’s being driven by game-theoretic behavior, not microeconomics. Terrible place for venture capitalists to be.
Models
This is the domain of hyperscalers and financial investors. Hyperscalers are trading balance sheets for income statements, investing money that’s just going to round-trip back to their cloud businesses in the form of compute revenue. Financial investors are skewed by the “wowed by science” bias. These models are super cool and these teams are incredibly impressive. Microeconomics be damned!
Developer tools and infrastructure software
Less interesting for strategics and more interesting for venture capitalists. ~15 companies with $1Bn+ of revenue were created at this layer during the cloud transition, and we suspect the same could be true with AI.
Apps
The most interesting layer for venture capital. ~20 application layer companies with $1Bn+ in revenue were created during the cloud transition, another ~20 were created during the mobile transition, and we suspect the same will be true here.
Closing Thoughts
In Generative AI’s next act, we expect to see the impact of reasoning R&D ripple into the application layer. These ripples are fast and deep. Most of the cognitive architectures to date incorporate clever “unhobbling” techniques; now that these capabilities are becoming baked deeper into the models themselves, we expect that agentic applications will become much more sophisticated and robust, quickly.
Back in the research lab, reasoning and inference-time compute will continue to be a strong theme for the foreseeable future. Now that we have a new scaling law, the next race is on. But for any given domain, it is still hard to gather real-world data and encode domain and application-specific cognitive architectures. This is again where last-mile app providers may have the upper hand in solving the diverse set of problems in the messy real world.
Thinking ahead, multi-agent systems, like Factory’s droids, may begin to proliferate as ways of modeling reasoning and social learning processes. Once one agent can do the work, we can have teams of agents accomplishing so much more.
What we’re all eagerly awaiting is Generative AI’s Move 37, that moment when—like in AlphaGo’s second game against Lee Sedol—a general AI system surprises us with something superhuman, something that feels like independent thought. This does not mean that the AI “wakes up” (AlphaGo did not) but that we have simulated processes of perception, reasoning and action that the AI can explore in truly novel and useful ways. This may in fact be AGI, and if so it will not be a singular occurrence; it will merely be the next phase of technology.