After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but certain applications might deviate. This is what the overall architecture looks like.

Overview of a genai platform


This is a pretty complex system. This post will start from the simplest architecture and progressively add more components. In its simplest form, your application receives a query and sends it to the model. The model generates a response, which is returned to the user. There are no guardrails, no augmented context, and no optimization. The Model API box refers to both third-party APIs (e.g., OpenAI, Google, Anthropic) and self-hosted APIs.

Overview of a genai platform


From this, you can add more components as needs arise. The order discussed in this post is common, though you don’t need to follow the exact same order. A component can be skipped if your system works well without it. Evaluation is necessary at every step of the development process.

  1. Enhance context input into a model by giving the model access to external data sources and tools for information gathering.
  2. Put in guardrails to protect your system and your users.
  3. Add a model router and gateway to support complex pipelines and add more security.
  4. Optimize for latency and costs with cache.
  5. Add complex logic and write actions to maximize your system’s capabilities.

Observability, which allows you to gain visibility into your system for monitoring and debugging, and orchestration, which involves chaining all the components together, are two essential components of the platform. We will discuss them at the end of this post.

» What this post is not «

This post focuses on the overall architecture for deploying AI applications. It discusses what components are needed and considerations when building these components. It’s not about how to build AI applications and, therefore, does NOT discuss model evaluation, application evaluation, prompt engineering, finetuning, data annotation guidelines, or chunking strategies for RAGs. All these topics are covered in my upcoming book AI Engineering.


Table of contents
Step 1. Enhance Context
….RAGs
….RAGs with tabular data
….Agentic RAGs
….Query rewriting
Step 2. Put in Guardrails
….Input guardrails
……..Leaking private information to external APIs
……..Model jailbreaking
….Output guardrails
……..Output quality measurement
……..Failure management
….Guardrail tradeoffs
Step 3. Add Model Router and Gateway
….Router
….Gateway
Step 4. Reduce Latency with Cache
….Prompt cache
….Exact cache
….Semantic cache
Step 5. Add complex logic and write actions
….Complex logic
….Write actions
Observability
….Metrics
….Logs
….Traces
AI Pipeline Orchestration
Conclusion
References and Acknowledgments



Step 1. Enhance Context

The initial expansion of a platform usually involves adding mechanisms to allow the system to augment each query with the necessary information. Gathering the relevant information is called context construction.

Many queries require context to answer. The more relevant information there is in the context, the less the model has to rely on its internal knowledge, which can be unreliable due to its training data and training methodology. Studies have shown that having access to relevant information in the context can help the model generate more detailed responses while reducing hallucinations (Lewis et al., 2020).

For example, given the query “Will Acme’s fancy-printer-A300 print 100pps?”, the model will be able to respond better if it’s given the specifications of fancy-printer-A300. (Thanks Chetan Tekur for the example.)

Context construction for foundation models is equivalent to feature engineering for classical ML models. They serve the same purpose: giving the model the necessary information to process an input.

In-context learning, learning from the context, is a form of continual learning. It enables a model to incorporate new information continually to make decisions, preventing it from becoming outdated. For example, a model trained on last week’s data won’t be able to answer questions about this week unless the new information is included in its context. By updating a model’s context with the latest information, e.g. fancy-printer-A300’s latest specifications, the model remains up-to-date and can respond to queries beyond its cut-off date.

RAGs

The most well-known pattern for context construction is RAG, Retrieval-Augmented Generation. RAG consists of two components: a generator (e.g. a language model) and a retriever, which retrieves relevant information from external sources.

Overview of a genai platform


Retrieval isn’t unique to RAGs. It’s the backbone of search engines, recommender systems, log analytics, etc. Many retrieval algorithms developed for traditional retrieval systems can be used for RAGs.

External memory sources typically contain unstructured data, such as memos, contracts, news updates, etc. They can be collectively called documents. A document can be 10 tokens or 1 million tokens. Naively retrieving whole documents can cause your context to be arbitrarily long. RAG typically requires documents to be split into manageable chunks, which can be determined from the model’s maximum context length and your application’s latency requirements. To learn more about chunking and the optimal chunk size, see Pinecone, Langchain, Llamaindex, and Greg Kamradt’s tutorials.

Once data from external memory sources has been loaded and chunked, retrieval is performed using two main approaches.

  1. Term-based retrieval
    This can be as simple as keyword search. For example, given the query “transformer”, fetch all documents containing this keyword. More sophisticated approaches include the BM25 ranking algorithm (which builds on TF-IDF) and search engines such as Elasticsearch (which relies on an inverted index).


    Term-based retrieval is usually used for text data, but it also works for images and videos that have text metadata such as titles, tags, captions, comments, etc.


  2. Embedding-based retrieval (also known as vector search)

    You convert chunks of data into embedding vectors using an embedding model such as BERT, sentence-transformers, or a proprietary embedding model provided by OpenAI or Google. Given a query, the data whose vectors are closest to the query embedding, as determined by the vector search algorithm, is retrieved.


    Vector search is usually framed as nearest-neighbor search and is typically served by approximate nearest neighbor (ANN) libraries such as FAISS (Facebook AI Similarity Search), Google’s ScaNN, Spotify’s ANNOY, and hnswlib (an implementation of Hierarchical Navigable Small World graphs).

    The ANN-benchmarks website compares different ANN algorithms on multiple datasets using four main metrics, taking into account the tradeoffs between indexing and querying.

    • Recall: the fraction of the true nearest neighbors that the algorithm finds.
    • Queries per second (QPS): the number of queries the algorithm can handle per second. This is crucial for high-traffic applications.
    • Build time: the time required to build the index. This metric is especially important if you need to update your index frequently (e.g. because your data changes).
    • Index size: the size of the index created by the algorithm, which is crucial for assessing its scalability and storage requirements.


    This works not just with text documents, but also with images, videos, audio, and code. Many teams even try to summarize SQL tables and dataframes and then use these summaries to generate embeddings for retrieval.
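To make the embedding-based approach concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search with cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; in practice an embedding model produces the vectors and an ANN library approximates this search at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Exact nearest-neighbor search: score every chunk, keep the k best."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical embeddings; a real embedding model would produce these.
chunk_vecs = {
    "printer_specs": [0.9, 0.1, 0.0],
    "refund_policy": [0.0, 0.2, 0.9],
    "printer_manual": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
results = top_k(query_vec, chunk_vecs, k=2)
```

Scoring every chunk is fine for small collections; the ANN libraries mentioned above trade a little recall for much faster queries on large ones.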

Term-based retrieval is much faster and cheaper than embedding-based retrieval. It can work well out of the box, making it an attractive option to start. Both BM25 and Elasticsearch are widely used in the industry and serve as formidable baselines for more complex retrieval systems. Embedding-based retrieval, while computationally expensive, can be significantly improved over time to outperform term-based retrieval.
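As a rough illustration of why term-based retrieval is cheap, the sketch below scores documents with the standard Okapi BM25 formula using only word counts. The whitespace tokenizer, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, not recommendations from this post.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "the transformer architecture uses attention",
    "an electrical transformer steps voltage up or down",
    "the weather is cold today",
]
scores = bm25_scores("transformer attention", docs)
```

No model inference is involved, which is what keeps this baseline fast and inexpensive.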

A production retrieval system typically combines several approaches. Combining term-based retrieval and embedding-based retrieval is called hybrid search.

One common pattern is sequential. First, a cheap, less precise retriever, such as a term-based system, fetches candidates. Then, a more precise but more expensive mechanism, such as k-nearest neighbors, finds the best of these candidates. The second step is also called reranking.

For example, given the term “transformer”, you can fetch all documents that contain the word transformer, regardless of whether they are about the electric device, the neural architecture, or the movie. Then you use vector search to find among these documents those that are actually related to your transformer query.
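The sequential pattern can be sketched as a two-stage function: a keyword filter fetches candidates, then cosine similarity reranks them. The documents and two-dimensional vectors below are hypothetical stand-ins for real chunks and embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(keyword, query_vec, docs, k=1):
    # Stage 1: cheap, less precise term-based candidate fetch.
    candidates = [d for d in docs if keyword in d["text"].lower()]
    # Stage 2: more expensive reranking of the candidates by vector similarity.
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:k]]


# Hypothetical documents; "vec" stands in for a real embedding.
docs = [
    {"id": "power", "text": "A transformer steps voltage up or down.", "vec": [0.9, 0.1]},
    {"id": "ml", "text": "The Transformer relies on self-attention.", "vec": [0.1, 0.9]},
    {"id": "movie", "text": "The Transformers movie features robots.", "vec": [0.5, 0.5]},
    {"id": "other", "text": "Attention is a scarce resource.", "vec": [0.2, 0.8]},
]
# The query vector is assumed to point toward the neural-architecture sense.
best = hybrid_search("transformer", [0.0, 1.0], docs, k=1)
```

Note that the "other" document is never reranked even though its vector is close to the query, because it fails the keyword stage. That is the precision and recall tradeoff of filtering first.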

Context reranking differs from traditional search reranking in that the exact position of items is less critical. In search, the rank (e.g., first or fifth) is crucial. In context reranking, the order of documents still matters because it affects how well a model can process them. Models might better understand documents at the beginning and end of the context, as suggested by the paper Lost in the middle (Liu et al., 2023). However, as long as a document is included, the impact of its order is less significant than in search ranking.

Another pattern is ensemble. Remember that a retriever works by ranking documents by their relevance scores to the query. You use multiple retrievers to fetch candidates at the same time, then combine these different rankings together to generate a final ranking.
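One common way to combine rankings is reciprocal rank fusion (RRF). This post doesn't prescribe a specific combination method, so treat the sketch below as one option among several; the document IDs are made up and `k=60` is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of two retrievers run on the same query.
term_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([term_ranking, embedding_ranking])
```

RRF uses only rank positions, so it doesn't require the retrievers' relevance scores to be on comparable scales.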

RAGs with tabular data

External data sources can also be structured, such as dataframes or SQL tables. Retrieving data from an SQL table is significantly different from retrieving data from unstructured documents. Given a query, the system works as follows.

  1. Text-to-SQL: Based on the user query and the table schemas, determine what SQL query is needed.
  2. SQL execution: Execute the SQL query.
  3. Generation: Generate a response based on the SQL result and the original user query.
Overview of a genai platform


For the text-to-SQL step, if there are many available tables whose schemas can’t all fit into the model context, you might need an intermediate step to predict what tables to use for each query. Text-to-SQL can be done by the same model used to generate the final response or one of many specialized text-to-SQL models.
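The three steps can be sketched end to end with Python’s built-in `sqlite3` module. The functions `fake_text_to_sql` and `fake_generate` are hypothetical stand-ins for model calls, and the table and its rows are invented for the example.

```python
import sqlite3


def fake_text_to_sql(user_query, schema):
    """Stand-in for a text-to-SQL model call (hypothetical); returns a canned query."""
    return "SELECT MAX(purchase_date) FROM purchases WHERE customer = 'Emily Doe'"


def fake_generate(user_query, sql_result):
    """Stand-in for the final generation call (hypothetical)."""
    return f"Answer to '{user_query}' based on SQL result: {sql_result}"


def answer_with_table(conn, user_query):
    # Step 1, text-to-SQL: give the model the table schemas and the user query.
    schema_rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    schema = "\n".join(row[0] for row in schema_rows)
    sql = fake_text_to_sql(user_query, schema)
    # Step 2, SQL execution.
    rows = conn.execute(sql).fetchall()
    # Step 3, generation from the SQL result and the original query.
    return fake_generate(user_query, rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("John Doe", "Fruity Fedora", "2030-01-03"), ("Emily Doe", "Scarf", "2029-12-20")],
)
response = answer_with_table(conn, "When did Emily Doe last buy something?")
```

In a real system, model-generated SQL should run with restricted permissions, since it is untrusted input to your database.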

Agentic RAGs

An important source of data is the Internet. A web search tool like Google or Bing API can give the model access to a rich, up-to-date resource to gather relevant information for each query. For example, given the query “Who won Oscar this year?”, the system searches for information about the latest Oscars and uses this information to generate the final response to the user.

Term-based retrieval, embedding-based retrieval, SQL execution, and web search are actions that a model can take to augment its context. You can think of each action as a function the model can call. A workflow that can incorporate external actions is also called agentic. The architecture then looks like this.
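Treating each action as a callable function can be sketched with a small registry. The action names, the `read_only` flag, and the stubbed return values below are illustrative assumptions, not the API of any particular framework.

```python
ACTIONS = {}


def action(name, read_only=True):
    """Decorator that registers a function as an action the model may call."""
    def register(func):
        ACTIONS[name] = {"func": func, "read_only": read_only}
        return func
    return register


@action("term_search")
def term_search(query):
    # Stub: a real implementation would query a term-based index.
    return [f"doc matching '{query}'"]


@action("web_search")
def web_search(query):
    # Stub: a real implementation would call a web search API.
    return [f"web result for '{query}'"]


def call_action(name, **kwargs):
    """Dispatch an action requested by the model, rejecting unknown names."""
    if name not in ACTIONS:
        raise ValueError(f"Unknown action: {name}")
    return ACTIONS[name]["func"](**kwargs)


result = call_action("web_search", query="Who won Oscar this year?")
```

Keeping a flag such as `read_only` on each registered action makes it easier to apply stricter checks to write actions later.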

Overview of a genai platform


» Action vs. tool «

A tool allows one or more actions. For example, a people search tool might allow two actions: search by name and search by email. However, the difference is minimal, so many people use action and tool interchangeably.

» Read-only actions vs. write actions «

Actions that retrieve information from external sources but don’t change their states are read-only actions. Giving a model write actions, e.g. updating the values in a table, enables the model to perform more tasks but also poses more risks, which will be discussed later.

Query rewriting

Often, a user query needs to be rewritten to increase the likelihood of fetching the right information. Consider the following conversation.

User: When was the last time John Doe bought something from us?
AI: John last bought a Fruity Fedora hat from us two weeks ago, on January 3, 2030.
User: How about Emily Doe?

The last question, “How about Emily Doe?”, is ambiguous. If you use this query verbatim to retrieve documents, you’ll likely get irrelevant results. You need to rewrite this query to reflect what the user is actually asking. The new query should make sense on its own. The last question should be rewritten to “When was the last time Emily Doe bought something from us?”

Query rewriting is typically done using other AI models, using a prompt similar to “Given the following conversation, rewrite the last user input to reflect what the user is actually asking.”
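A minimal sketch of assembling such a rewriting prompt from the conversation history follows. The prompt layout (speaker labels and the trailing "Rewritten query:" cue) is an assumption; the model call itself is omitted.

```python
REWRITE_INSTRUCTION = (
    "Given the following conversation, rewrite the last user input "
    "to reflect what the user is actually asking."
)


def build_rewrite_prompt(turns):
    """Format (speaker, text) turns into a prompt for the rewriting model."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return REWRITE_INSTRUCTION + "\n\n" + "\n".join(lines) + "\n\nRewritten query:"


turns = [
    ("User", "When was the last time John Doe bought something from us?"),
    ("AI", "John last bought a Fruity Fedora hat from us two weeks ago, on January 3, 2030."),
    ("User", "How about Emily Doe?"),
]
prompt = build_rewrite_prompt(turns)
```

The rewritten query returned by the model, rather than the raw last message, is what you would then pass to the retriever.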

Overview of a genai platform


Query rewriting can get complicated, especially if you need to do identity resolution or incorporate other knowledge. If the user asks “How about his wife?”, you will first need to query your database to find out who his wife is. If you don’t have this information, the rewriting model should acknowledge that this query isn’t solvable instead of hallucinating a name, leading to a wrong answer.

Step 2. Put in Guardrails

Guardrails help reduce AI risks and protect not just your users but also you, the developers. They should be placed wherever there is potential for failure. This post discusses two types of guardrails: input guardrails and output guardrails.

Input guardrails

Input guardrails are typically protection against two types of risks: leaking private information to external APIs, and executing bad prompts that compromise your system (model jailbreaking).

Leaking private information to external APIs

This risk is specific to using external model APIs when you need to send your data outside your organization. For example, an employee might copy the company’s secret or a user’s private information into a prompt and send it to wherever the model is hosted.


One of the most notable early incidents was when Samsung employees put Samsung’s proprietary information into ChatGPT, accidentally leaking the company’s secrets. It’s unclear how Samsung discovered this leak and how the leaked information was used against Samsung. However, the incident was serious enough for Samsung to ban ChatGPT in May 2023.


There’s no airtight way to eliminate potential leaks when using third-party APIs. However, you can mitigate them with guardrails. You can use one of the many available tools that automatically detect sensitive data. What sensitive data to detect is specified by you. Common sensitive data classes are:

  • Personal information (ID numbers, phone numbers, bank accounts).
  • Human faces.
  • Specific keywords and phrases associated with the company’s intellectual properties or privileged information.

Many sensitive data detection tools use AI to identify potentially sensitive information, such as determining if a string resembles a valid home address. If a query is found to contain sensitive information, you have two options: block the entire query or remove the sensitive information from it. For instance, you can mask a user’s phone number with the placeholder [PHONE NUMBER]. If the generated response contains this placeholder, use a PII reversible dictionary that maps this placeholder to the original information so that you can unmask it, as shown below.
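The masking and unmasking flow can be sketched with a regular expression and a dictionary. The phone pattern below only covers one simple format and is an assumption for illustration; real detectors handle far more cases. Placeholders are numbered here (e.g. `[PHONE NUMBER 1]`) so that several numbers in one query can be told apart.

```python
import re

# Simplified pattern for numbers like 555-123-4567 (an assumption, not exhaustive).
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_phone_numbers(text):
    """Replace phone numbers with placeholders and return the reversible mapping."""
    mapping = {}

    def replace(match):
        placeholder = f"[PHONE NUMBER {len(mapping) + 1}]"
        mapping[placeholder] = match.group(0)
        return placeholder

    return PHONE_PATTERN.sub(replace, text), mapping


def unmask(text, mapping):
    """Restore the original values wherever a placeholder appears in the response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


masked, mapping = mask_phone_numbers("Call me back at 555-123-4567 tomorrow.")
# Pretend this came back from the model, which only ever saw the placeholder.
model_response = "Sure, we will call [PHONE NUMBER 1] tomorrow."
restored = unmask(model_response, mapping)
```

The mapping stays inside your system, so the original value never reaches the external API.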

Overview of a genai platform


Model jailbreaking

It’s become an online sport to try to jailbreak AI models, getting them to say or do bad things. While some might find it amusing to get ChatGPT to make controversial statements, it’s much less fun if your customer support chatbot, branded with your name and logo, does the same thing. This can be especially dangerous for AI systems that have access to tools. Imagine if a user finds a way to get your system to execute an SQL query that corrupts your data.


To combat this, you should first put guardrails on your system so that no harmful actions can be automatically executed. For example, no SQL queries that can insert, delete, or update data can be executed without human approval. The downside of this added security is that it can slow down your system.
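One way to sketch this guardrail is to hold any SQL statement containing a write keyword until a human approves it. The keyword list and the `execute_guarded` interface are assumptions for illustration. Keyword matching is deliberately coarse and may flag harmless queries; database-level read-only permissions are a stronger complement.

```python
import re

# Statements containing these keywords are treated as potential writes (assumed list).
WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|replace)\b",
    re.IGNORECASE,
)


def requires_human_approval(sql):
    """Flag any statement that might modify data."""
    return bool(WRITE_KEYWORDS.search(sql))


def execute_guarded(sql, run_query, approved=False):
    """Run read-only SQL directly; hold anything else until a human approves it."""
    if requires_human_approval(sql) and not approved:
        return {"status": "pending_approval", "sql": sql}
    return {"status": "executed", "result": run_query(sql)}


# run_query is whatever function actually talks to your database (stubbed here).
pending = execute_guarded("DELETE FROM users", run_query=lambda sql: "ran")
executed = execute_guarded("SELECT * FROM users", run_query=lambda sql: "ran")
```

Queued statements wait for a reviewer, which is where the slowdown mentioned above comes from.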

To prevent your application from making outrageous statements it shouldn’t be making, you can define out-of-scope topics for your application. For example, if your application is a customer support chatbot, it shouldn’t answer political or social questions. A simple way to do so is to filter out inputs that contain predefined phrases typically associated with controversial topics, such as “immigration” or “antivax”. More sophisticated algorithms use AI to classify whether an input is about one of the pre-defined restricted topics.
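The simple phrase-based version of this filter might look like the sketch below. The phrase set reuses the two examples from the text; a real list would be longer and maintained by your team.

```python
# Example phrases taken from the text above; extend for your own application.
RESTRICTED_PHRASES = {"immigration", "antivax"}


def is_out_of_scope(user_input, restricted=RESTRICTED_PHRASES):
    """Return True if the input contains any predefined restricted phrase."""
    text = user_input.lower()
    return any(phrase in text for phrase in restricted)


blocked = is_out_of_scope("What do you think about immigration policy?")
allowed = is_out_of_scope("How do I reset my printer?")
```

Substring matching is easy to evade and can over-block, which is why the more sophisticated approach uses a classifier.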


If harmful prompts are rare in your system, you can use an anomaly detection algorithm to identify unusual prompts.

Output guardrails

AI models are probabilistic, making their outputs unreliable. You can put in guardrails to significantly improve your application’s reliability. Output guardrails have two main functionalities:

  1. Evaluate the quality of each generation.
  2. Specify the policy to deal with different failure modes.

Output quality measurement

To catch outputs that fail to meet your standards, you need to understand what failures look like. Here are examples of failure modes and how to catch them.

  1. Empty responses.

  2. Malformatted responses that don’t follow the expected output format. For example, the application expects JSON but the generated response is missing a closing bracket. There are validators for certain formats, such as regex, JSON, and Python code validators. There are also tools for constrained sampling such as guidance, outlines, and instructor.

  3. Toxic responses, such as those that are racist or sexist. These responses can be caught using one of many toxicity detection tools.

  4. Factually inconsistent responses hallucinated by the model. Hallucination detection is an active area of research with solutions such as SelfCheckGPT (Manakul et al., 2023) and SAFE, Search Engine Factuality Evaluator (Wei et al., 2024). You can mitigate hallucinations by providing models with sufficient context and prompting techniques such as chain-of-thought. Hallucination detection and mitigation are discussed further in my upcoming book AI Engineering.

  5. Responses that contain sensitive information. This can happen in two scenarios.
    1. Your model was trained on sensitive data and regurgitates it back.
    2. Your system retrieves sensitive information from your internal database to enrich its context, and then it passes this sensitive information on to the response.

    This failure mode can be prevented by not training your model on sensitive data and not allowing it to retrieve sensitive data in the first place. Sensitive data in outputs can be detected using the same tools used for input guardrails.

  6. Brand-risk responses, such as responses that mischaracterize your company or your competitors. An example is when Grok, a model trained by X, generated a response suggesting that Grok was trained by OpenAI, causing the Internet to suspect X of stealing OpenAI’s data. This failure mode can be mitigated with keyword monitoring. Once you’ve identified outputs concerning your brands and competitors, you can either block these outputs, pass them on to human reviewers, or use other models to detect the sentiment of these outputs to ensure that only the right sentiments are returned.

  7. Generally bad responses. For example, if you ask the model to write an essay and that essay is just bad, or if you ask the model for a low-calorie cake recipe and the generated recipe contains an excessive amount of sugar. It’s become a popular practice to use AI judges to evaluate the quality of models’ responses. These AI judges can be general-purpose models (think ChatGPT, Claude) or specialized scorers trained to output a concrete score for a response given a query.
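The two cheapest checks in the list above, empty responses and malformatted JSON, can be sketched with the standard library alone. The failure labels are made up for this example; toxicity, hallucination, and quality checks would need dedicated tools or models.

```python
import json


def check_output(response, expect_json=False):
    """Return a list of failure labels; an empty list means no check failed."""
    failures = []
    if not response or not response.strip():
        failures.append("empty")
        return failures
    if expect_json:
        try:
            json.loads(response)
        except json.JSONDecodeError:
            # Covers cases such as a missing closing bracket.
            failures.append("malformatted_json")
    return failures


empty_check = check_output("   ")
broken_check = check_output('{"answer": 42', expect_json=True)
valid_check = check_output('{"answer": 42}', expect_json=True)
```

Returning labels rather than a boolean lets the failure policy react differently to each failure mode.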

Failure management

AI models are probabilistic, which means that if you try a query again, you might get a different response. Many failures can be mitigated using a basic retry logic. For example, if the response is empty, try again X times or until you get a non-empty response. Similarly, if the response is malformatted, try again until the model generates a correctly formatted response.
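A basic retry loop might look like the following sketch. Here `call_model` and `is_valid` are placeholders for your own model client and output check, and the loop is capped at `max_attempts` (an assumed default of 3) so it cannot run forever.

```python
def generate_with_retry(call_model, query, is_valid, max_attempts=3):
    """Call the model until the response passes validation or attempts run out.

    Returns (response, attempts); response is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        response = call_model(query)
        if is_valid(response):
            return response, attempt
    return None, max_attempts


# Fake model for illustration: two empty responses, then a usable one.
canned_responses = iter(["", "", "Here is your answer."])
result, attempts = generate_with_retry(
    call_model=lambda query: next(canned_responses),
    query="Will fancy-printer-A300 print 100pps?",
    is_valid=lambda response: bool(response.strip()),
)
```

When every attempt fails, the caller gets `None` and can fall back to another policy, such as a stock message or a human operator.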

This retry policy, however, can incur extra latency and cost. One retry means 2x the number of API calls. If the retry is carried out after failure, the latency experienced by the user will double. To reduce latency, you can make calls in parallel. For example, for each query, instead of waiting for the first query to fail before retrying, you send this query to the model twice at the same time, get back two responses, and pick the better one. This increases the number of redundant API calls but keeps latency manageable.
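The parallel variant can be sketched with `concurrent.futures` from the standard library. `call_model` and `score` are placeholders for your model client and for whatever notion of "better" you use; response length is used below only to keep the example self-contained.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def generate_in_parallel(call_model, query, score, n=2):
    """Send the same query n times concurrently and keep the best-scoring response."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(call_model, query) for _ in range(n)]
        responses = [future.result() for future in futures]
    return max(responses, key=score)


# Fake model for illustration: one call returns nothing, the other a real answer.
canned = queue.Queue()
canned.put("")
canned.put("A complete answer.")

best = generate_in_parallel(
    call_model=lambda query: canned.get(),
    query="Will fancy-printer-A300 print 100pps?",
    score=len,
)
```

Latency is roughly that of the slowest of the parallel calls rather than the sum of sequential attempts, at the price of paying for every call.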

It’s also common to fall back on humans to handle tricky queries. For example, you can transfer a query to human operators if it contains specific key phrases. Some teams use a specialized model, potentially trained in-house, to decide when to transfer a conversation to humans. One team, for instance, transfers a conversation to human operators when their sentiment analysis model detects that the user is getting angry. Another team transfers a conversation after a certain number of turns to prevent users from getting stuck in an infinite loop.
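The two rule-based triggers, key phrases and a turn limit, can be sketched as a single check. The phrases and the limit of 10 turns are hypothetical values; a sentiment model could be added as a third condition.

```python
def should_transfer_to_human(
    message,
    turn_count,
    key_phrases=("speak to a human", "cancel my account"),
    max_turns=10,
):
    """Decide whether to hand the conversation over to a human operator."""
    text = message.lower()
    if any(phrase in text for phrase in key_phrases):
        return True
    # Avoid trapping users in an endless loop with the bot.
    return turn_count >= max_turns


by_phrase = should_transfer_to_human("I want to speak to a human!", turn_count=2)
by_turns = should_transfer_to_human("Still not working", turn_count=10)
stay = should_transfer_to_human("How do I change the toner?", turn_count=1)
```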

Guardrail tradeoffs

Reliability vs. latency tradeoff: While acknowledging the importance of guardrails, some teams told me that latency is more important. They decided not to implement guardrails because they can significantly increase their application’s latency. However, these teams are in the minority. Most teams find that the increased risks are more costly than the added latency.

Output guardrails might not work well in the stream completion mode. By default, the whole response is generated before being shown to the user, which can take a long time. In the stream completion mode, new tokens are streamed to the user as they are generated, reducing the time the user has to wait to see the response. The downside is that it’s hard to evaluate partial responses, so unsafe responses might be streamed to users before the system guardrails can determine that they should be blocked.

Self-hosted vs. third-party API tradeoff: Self-hosting your models means that you don’t have to send your data to a third party, reducing the need for input guardrails. However, it also means that you must implement all the necessary guardrails yourself, rather than relying on the guardrails provided by third-party services.

Our platform now looks like this. Guardrails can be independent tools or parts of model gateways, as discussed later. Scorers, if used, are grouped under model APIs since scorers are typically AI models, too. Models used for scoring are typically smaller and faster than models used for generation.

Overview of a genai platform


Step 3. Add Model Router and Gateway

As applications grow in complexity and involve more models, two types of tools have emerged to help you work with multiple models: routers and gateways.

Router

An application can use different models to respond to different types of queries. Having different solutions for different queries has several benefits. First, this allows you to have specialized solutions, such as one model specialized in technical troubleshooting and another specialized in subscriptions. Specialized models can potentially perform better than a general-purpose model. Second, this can help you save costs. Instead of routing all queries to an expensive model, you can route simpler queries to cheaper models.
一个应用程序可以利用多种模型来应对不同类型的查询。对于不同类型的查询采取不同的解决方案,有几点好处。首先,这使得你可以拥有专门的解决方案,比如一个模型专注于技术故障排除,另一个模型则专注于处理订阅问题。专门的模型可能比通用模型表现更出色。其次,这有助于节省成本。无需将所有查询都导向成本较高的模型,你可以将较为简单的查询导向成本较低的模型。

A router typically consists of an intent classifier that predicts what the user is trying to do. Based on the predicted intent, the query is routed to the appropriate solution. For example, for a customer support chatbot, if the intent is:
路由器通常包含一个意图分类器,用于预测用户的目的。根据预测的意图,查询会被导向至适当的解决方案。例如,对于客户支持聊天机器人,如果预测的意图是:

  • To reset a password –> route this user to the page about password resetting.
    若要重置密码,请将此用户引导至密码重置页面。
  • To correct a billing mistake –> route this user to a human operator.
    若要纠正计费错误,请将此用户转接到人工客服。
  • To troubleshoot a technical issue –> route this query to a model finetuned for troubleshooting.
    要解决技术问题,应将此问题转交给专门针对故障排除训练的模型处理。

An intent classifier can also help your system avoid out-of-scope conversations. For example, you can have an intent classifier that predicts whether a query is out of scope. If the query is deemed inappropriate (e.g., if the user asks who you would vote for in the upcoming election), the chatbot can politely decline to engage using one of the stock responses (“As a chatbot, I don’t have the ability to vote. If you have questions about our products, I’d be happy to help.”) without wasting an API call.
意图分类器能有效防止系统陷入无关对话。例如,通过意图分类器预测用户提问是否超出服务范围。如果问题不恰当(比如用户询问你会在即将举行的选举中投谁的票),聊天机器人会礼貌地拒绝回答,使用预设回复之一(“作为聊天机器人,我没有投票的权限。如果您有关于我们产品的问题,我很乐意为您解答。”),这样就不会浪费一次 API 调用。
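The routing behavior described above can be sketched as a lookup from predicted intent to handler. Everything here is a stand-in: `classify_intent` is a toy keyword matcher used in place of a real intent model, and the handler strings and the `/help/reset-password` path are made up for illustration.

```python
from typing import Callable, Dict

STOCK_RESPONSE = (
    "As a chatbot, I don't have the ability to vote. "
    "If you have questions about our products, I'd be happy to help."
)

def classify_intent(query: str) -> str:
    """Toy keyword classifier standing in for a real intent model."""
    q = query.lower()
    if "password" in q:
        return "reset_password"
    if "bill" in q or "charge" in q:
        return "billing_mistake"
    if "error" in q or "crash" in q:
        return "troubleshooting"
    return "out_of_scope"

# Each intent maps to a different solution: a page, a human, or a model.
ROUTES: Dict[str, Callable[[str], str]] = {
    "reset_password": lambda q: "See the password reset page: /help/reset-password",
    "billing_mistake": lambda q: "Transferring you to a human operator.",
    "troubleshooting": lambda q: f"[troubleshooting model] {q}",
}

def route(query: str) -> str:
    handler = ROUTES.get(classify_intent(query))
    if handler is None:
        # Out of scope: answer with a stock response, no model API call.
        return STOCK_RESPONSE
    return handler(query)
```

For example, `route("I forgot my password")` returns the password page pointer, while `route("Who would you vote for?")` returns the stock response.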

If your system has access to multiple actions, a router can involve a next-action predictor to help the system decide what action to take next. One valid action is to ask for clarification if the query is ambiguous. For example, in response to the query “Freezing,” the system might ask, “Do you want to freeze your account or are you talking about the weather?” or simply say, “I’m sorry. Can you elaborate?”
如果您的系统能访问多种行动,路由器可以运用下一个行动预测器,帮助系统决定下一步该采取什么行动。如果查询语句模糊不清,一个合理的行动是要求对方澄清。例如,对于“冻结”这个查询,系统可能会问:“您是想冻结账户,还是在谈论天气?”或者干脆说:“对不起,您能详细说明一下吗?”

Intent classifiers and next-action predictors can be general-purpose models or specialized classification models. Specialized classification models are typically much smaller and faster than general-purpose models, allowing your system to use multiple of them without incurring significant extra latency and cost.
意图分类器和下一步动作预测器可以是通用模型,也可以是专门的分类模型。专门的分类模型通常比通用模型更小、更快,这使得您的系统可以使用多个此类模型,而不会导致显著的额外延迟和成本增加。

When routing queries to models with varying context limits, the query’s context might need to be adjusted accordingly. Consider a query of 1,000 tokens that is slated for a model with a 4K context limit. The system then takes an action, e.g. web search, that brings back an 8,000-token context. You can either truncate the query’s context to fit the originally intended model or route the query to a model with a larger context limit.
在将查询路由到具有不同上下文限制的模型时,可能需要相应地调整查询的上下文。例如,一个计划用于 4K 上下文限制模型的 1000 标记查询,在系统执行网络搜索等操作后,可能带回 8000 标记的上下文。这时,你可以选择将查询的上下文截断以适应原计划的模型,或者将查询路由到具有更大上下文限制的模型,以更好地处理查询。
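The two options in this example (truncate, or move to a larger model) can be written as a small decision function. This is a simplified sketch: the model names and limits in `MODELS` are hypothetical, and it ignores the room a real system would reserve for output tokens.

```python
from typing import Tuple

# Hypothetical (model name, context limit in tokens), sorted by limit.
MODELS = [("small-model", 4_000), ("large-model", 32_000)]

def pick_model(
    query_tokens: int,
    context_tokens: int,
    preferred: str,
    allow_upgrade: bool = True,
) -> Tuple[str, int]:
    """Return (model to use, number of context tokens to keep)."""
    limits = dict(MODELS)
    total = query_tokens + context_tokens
    if total <= limits[preferred]:
        return preferred, context_tokens
    if allow_upgrade:
        # Route to the smallest model whose limit fits everything.
        for name, limit in MODELS:
            if total <= limit:
                return name, context_tokens
    # Otherwise truncate the context to fit the originally intended model.
    return preferred, max(limits[preferred] - query_tokens, 0)
```

With the numbers from the text, `pick_model(1000, 8000, "small-model")` selects the larger model, while passing `allow_upgrade=False` keeps the small model and trims the context to 3,000 tokens.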

Gateway 网关

A model gateway is an intermediate layer that allows your organization to interface with different models in a unified and secure manner. The most basic functionality of a model gateway is to enable developers to access different models – be it self-hosted models or models behind commercial APIs such as OpenAI or Google – the same way. A model gateway makes it easier to maintain your code. If a model API changes, you only need to update the model gateway instead of having to update all applications that use this model API.
模型网关作为中间层,使您的组织能够以统一且安全的方式与各种模型进行交互。模型网关最基本的功能是让开发人员能够以相同的方式访问不同的模型,无论是自托管模型还是通过 OpenAI 或 Google 等商业 API 提供的模型。使用模型网关,可以简化代码维护工作。当模型 API 发生变化时,您只需更新模型网关,而无需逐一更新所有使用该模型 API 的应用程序。

Overview of a genai platform


In its simplest form, a model gateway is a unified wrapper that looks like the following code example. This example is to give you an idea of how a model gateway might be implemented. It’s not meant to be functional as it doesn’t contain any error checking or optimization.
以最简形式而言,模型网关就像是一个统一的封装层,其外观类似于以下的代码示例。这个示例旨在让您对模型网关的实现方式有所了解。请注意,这个示例并不具备实际功能,因为它并未包含任何错误检查或优化措施。

import os

import google.generativeai as genai
import openai
from flask import Flask, jsonify, request

app = Flask(__name__)

def openai_model(input_data, model_name, max_tokens):
    # Legacy Completions interface (openai<1.0), kept from the original example.
    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.Completion.create(
        engine=model_name,
        prompt=input_data,
        max_tokens=max_tokens
    )
    return {"response": response.choices[0].text.strip()}

def gemini_model(input_data, model_name, max_tokens):
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name=model_name)
    # Gemini takes the output-token limit through generation_config.
    response = model.generate_content(
        input_data,
        generation_config={"max_output_tokens": max_tokens},
    )
    return {"response": response.text}

@app.route('/model', methods=['POST'])
def model_gateway():
    # Every client sends the same JSON payload, whichever provider serves it.
    data = request.get_json()
    model_type = data.get("model_type")
    model_name = data.get("model_name")
    input_data = data.get("input_data")
    max_tokens = data.get("max_tokens")

    if model_type == "openai":
        result = openai_model(input_data, model_name, max_tokens)
    elif model_type == "gemini":
        result = gemini_model(input_data, model_name, max_tokens)
    else:
        # Without this branch, an unknown model_type would leave result undefined.
        return jsonify({"error": f"unsupported model_type: {model_type}"}), 400
    return jsonify(result)

A model gateway also provides access control and cost management. Instead of giving everyone who wants access to the OpenAI API your organizational tokens, which can be easily leaked, you only give people access to the model gateway, creating a centralized and controlled point of access. The gateway can also implement fine-grained access controls, specifying which user or application should have access to which model. Moreover, the gateway can monitor and limit the usage of API calls, preventing abuse and managing costs effectively.
模型网关实现了访问控制和成本管理。与将组织的 OpenAI API 访问令牌随意分发给所有人(这很容易导致泄露)相比,你只需让人们访问模型网关,就能创建一个集中且受控的访问点。网关可以实施精细的访问控制,明确指定哪些用户或应用程序可以访问哪些模型。此外,网关能够监控并限制 API 调用的使用,有效防止滥用并管理成本。
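As a rough illustration of fine-grained access control plus usage limits, a gateway could keep a per-caller record of allowed models and a token budget, and consult it before forwarding any request. The class names, the budget unit (tokens per day), and the keys below are assumptions made for this sketch, not features of any particular gateway product.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

@dataclass
class Caller:
    allowed_models: Set[str]      # which models this user or app may call
    daily_token_budget: int       # hypothetical usage cap
    tokens_used: int = 0

class GatewayPolicy:
    def __init__(self, callers: Dict[str, Caller]):
        # Callers hold gateway keys only; provider API keys stay inside the gateway.
        self.callers = callers

    def authorize(self, gateway_key: str, model_name: str, tokens: int) -> Tuple[bool, str]:
        caller = self.callers.get(gateway_key)
        if caller is None:
            return False, "unknown gateway key"
        if model_name not in caller.allowed_models:
            return False, "model not allowed for this caller"
        if caller.tokens_used + tokens > caller.daily_token_budget:
            return False, "daily token budget exceeded"
        caller.tokens_used += tokens
        return True, "ok"
```

A request is forwarded to the provider only when `authorize` returns `True`; the reason string can be logged or returned to the caller.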

A model gateway can also be used to implement fallback policies to overcome rate limits or API failures (the latter is unfortunately common). When the primary API is unavailable, the gateway can route requests to alternative models, retry after a short wait, or handle failures in other graceful manners. This ensures that your application can operate smoothly without interruptions.
模型网关亦可实现回退策略,以应对速率限制或 API 故障(后者遗憾的是颇为常见)。当主 API 无法访问时,网关能将请求导向备选模型,短暂等待后重试,或以其他更为优雅的方式处理故障。这确保了应用程序运行流畅,避免了中断,从而提升了用户体验。
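A fallback policy of this kind can be sketched as a loop over providers with a few retries each. The function name, the retry count, and the linear wait are illustrative choices; each provider is assumed to be a plain callable that raises an exception on failure.

```python
import time
from typing import Callable, Sequence

def call_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],  # ordered: primary first
    retries_per_provider: int = 2,
    wait_seconds: float = 0.0,
) -> str:
    """Try each provider in order, retrying after a short wait before moving on."""
    last_error = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as err:  # e.g. rate limit or API outage
                last_error = err
                if attempt + 1 < retries_per_provider:
                    time.sleep(wait_seconds * (attempt + 1))
    raise RuntimeError("all providers failed") from last_error
```

In practice you would likely catch only the provider's rate-limit and availability errors rather than every exception, so that genuine bugs still surface.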

Since requests and responses are already flowing through the gateway, it’s a good place to implement other functionalities such as load balancing, logging, and analytics. Some gateway services even provide caching and guardrails.
既然请求与响应已经在网关中流动,这里很适合实现其他功能,比如负载均衡、日志记录和数据分析。有些网关服务甚至提供了缓存和安全防护。

Given that gateways are relatively straightforward to implement, there are many off-the-shelf gateways. Examples include Portkey’s gateway, MLflow AI Gateway, WealthSimple’s llm-gateway, TrueFoundry, Kong, and Cloudflare.
由于网关相对容易实现,市面上有许多现成的网关可供选择。例如,Portkey 的网关,MLflow AI 网关,WealthSimple 的 llm-gateway,TrueFoundry,Kong 以及 Cloudflare。

With the added gateway and routers, our platform is getting more exciting. Like scoring, routing is also in the model gateway. Like models used for scoring, models used for routing are typically smaller than models used for generation.
随着新增的网关和路由器,我们的平台正变得越来越令人兴奋。就像评分功能一样,路由功能也集成在模型网关中。用于路由的模型通常比用于生成的模型小,这与用于评分的模型类似。

Overview of a genai platform


Step 4. Reduce Latency with Cache
第 4 步。使用缓存减少延迟

When I shared this post with my friend Eugene Yan, he said that cache is perhaps the most underrated component of an AI platform. Caching can significantly reduce your application’s latency and cost.
当我与我的朋友尤金·燕分享这篇帖子时,他认为缓存或许是人工智能平台中最被低估的组件。使用缓存能显著减少应用程序的延迟并降低运行成本。

Cache techniques can also be used during training, but since this post is about deployment, I’ll focus on cache for inference. Some common inference caching techniques include prompt cache, exact cache, and semantic cache. Prompt caches are typically implemented by the inference APIs that you use. When evaluating an inference library, it’s helpful to understand what cache mechanism it supports.
缓存技术虽然同样适用于训练阶段,但鉴于本文主题为部署,我将重点讲述推理过程中的缓存应用。常见的推理缓存技术有提示缓存、精确缓存和语义缓存等。通常,提示缓存的实现会集成在你所使用的推理 API 中。在评估推理库时,了解其支持的缓存机制将大有裨益。

KV cache for the attention mechanism is out of scope for this discussion.
关于注意力机制的 KV 缓存,不在我们这次讨论的范围内。

Prompt cache 提示信息缓存区

Many prompts in an application have overlapping text segments. A prompt cache stores these overlapping segments for reuse, so you only need to process them once. A common overlapping segment is the system prompt, which all queries can share. Without a prompt cache, your model needs to process the system prompt with every query. With a prompt cache, it only needs to process the system prompt once, for the first query.
在应用程序中,许多提示包含重复的文本片段。提示缓存会存储这些重复的文本片段以便重复使用,这样就只需要处理一次。系统提示就是一个常见的重复文本片段,所有查询都可以共用它。如果没有提示缓存,模型在每次查询时都需要处理系统提示。而有了提示缓存,模型只需在第一次查询时处理系统提示一次即可。

For applications with long system prompts, prompt cache can significantly reduce both latency and cost. If your system prompt is 1,000 tokens and your application generates 1 million model API calls a day, a prompt cache will save you from processing approximately 1 billion repetitive input tokens a day! However, this isn’t entirely free. Like KV cache, a prompt cache can grow quite large, and managing it can require significant engineering effort.
对于拥有长系统提示的应用,使用提示缓存能显著降低延迟和成本。假设你的系统提示有 1000 个标记,而你的应用每天产生 100 万次模型 API 调用,那么提示缓存每天可以帮你节省大约 10 亿个重复输入标记的处理!然而,这并非完全无代价。与 KV 缓存类似,提示缓存的规模可能相当庞大,管理它需要投入相当的工程精力。

Prompt cache is also useful for queries that involve long documents. For example, if many of your user queries are related to the same long document (such as a book or a codebase), this long document can be cached for reuse across queries.
提示缓存对于处理长文档的查询特别有帮助。比如,如果用户的许多查询都围绕同一份长文档,如一本书或一个代码库,那么这份长文档就可以被缓存,以供后续查询重复使用,提高效率。

Since its introduction in November 2023 by Gim et al., prompt cache has already been incorporated into model APIs. Google announced that Gemini APIs will offer this functionality in June 2024 under the name context cache. Cached input tokens are given a 75% discount compared to regular input tokens, but you’ll have to pay extra for cache storage (as of writing, $1.00 / 1 million tokens per hour). Given the obvious benefits of prompt cache, I wouldn’t be surprised if it becomes as popular as KV cache.
自从 2023 年 11 月 Gim 等人首次提出以来,提示缓存技术已经被整合到模型 API 中。Google 宣布,Gemini API 将在 2024 年 6 月以“上下文缓存”的名义提供这一功能。与常规输入令牌相比,缓存输入令牌可享受 75%的折扣,但您需要为缓存存储支付额外费用(撰写本文时,每小时每 100 万令牌 1.00 美元)。鉴于提示缓存的明显优势,我毫不惊讶它会像 KV 缓存一样流行。
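To get a feel for the trade-off between the cached-token discount and the storage fee, here is a back-of-the-envelope calculation. The 75% discount and the $1.00 per 1 million tokens per hour storage price come from the text above, and the 1,000-token prompt and 1 million calls a day reuse the earlier example. The regular input price of $0.50 per 1 million tokens is purely a placeholder assumption, so substitute your provider's actual price. The calculation also ignores the first, uncached call.

```python
def daily_prompt_cost(
    prompt_tokens: int = 1_000,
    calls_per_day: int = 1_000_000,
    input_price_per_m: float = 0.50,        # ASSUMPTION: placeholder $ per 1M input tokens
    cached_discount: float = 0.75,          # from the text
    storage_price_per_m_hour: float = 1.00, # from the text
) -> dict:
    """Estimate the daily cost of the shared prompt with and without caching."""
    repeated_tokens = prompt_tokens * calls_per_day
    uncached = repeated_tokens / 1e6 * input_price_per_m
    cached = uncached * (1 - cached_discount)
    # Storage is billed on the cached prompt itself, for 24 hours.
    storage = prompt_tokens / 1e6 * storage_price_per_m_hour * 24
    return {
        "repeated_tokens": repeated_tokens,
        "uncached_cost": uncached,
        "cached_cost": cached,
        "storage_cost": storage,
    }
```

Under these assumptions, the 1 billion repeated tokens cost about $500 a day uncached versus about $125 cached, and storing a 1,000-token prompt for a day costs only a few cents. Storage starts to matter when the cached content is a long document that few queries reuse.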

While llama.cpp also has prompt cache, it seems to only cache whole prompts and work for queries in the same chat session. Its documentation is limited, but my guess from reading the code is that in a long conversation, it caches the previous messages and only processes the newest message.
尽管 llama.cpp 具备提示缓存功能,但似乎仅能缓存整个提示信息,且仅适用于同一聊天会话中的查询。其文档资料有限,但从我阅读代码的推测,在长时间的对话中,它会缓存之前的对话内容,仅对最新消息进行处理。

Exact cache 精确的缓存

If prompt cache and KV cache are unique to foundation models, exact cache is more general and straightforward. Your system stores processed items for reuse later when the exact items are requested. For example, if a user asks a model to summarize a product, the system checks the cache to see if a summary of this product is cached. If yes, fetch this summary. If not, summarize the product and cache the summary.
若基础模型特有的提示缓存和 KV 缓存相比,精确缓存显得更为通用且直接。你的系统会存储已处理的项目,以便在后续请求完全相同的项目时能够重用。例如,当用户要求模型对某个产品进行总结时,系统会先检查缓存中是否已有该产品的摘要。如果存在,直接调用这个摘要;如果不存在,则对产品进行总结,并将摘要存入缓存。

Exact cache is also used for embedding-based retrieval to avoid redundant vector search. If an incoming query is already in the vector search cache, fetch the cached search result. If not, perform a vector search for this query and cache the result.
精确缓存同样应用于基于嵌入的检索,以此避免重复的向量搜索。如果接收到的查询已经在向量搜索的缓存中,直接调用缓存的搜索结果。反之,则对查询进行向量搜索,并将结果缓存起来。

Cache is especially appealing for queries that require multiple steps (e.g. chain-of-thought) and/or time-consuming actions (e.g. retrieval, SQL execution, or web search).
对于那些需要多步操作(如链式思考)或耗时任务(如信息检索、执行 SQL 语句、网络搜索)的查询,缓存显得尤为吸引人。

An exact cache can be implemented using in-memory storage for fast retrieval. However, since in-memory storage is limited, a cache can also be implemented using databases like PostgreSQL, Redis, or tiered storage to balance speed and storage capacity. Having an eviction policy is crucial to manage the cache size and maintain performance. Common eviction policies include Least Recently Used (LRU), Least Frequently Used (LFU), and First In, First Out (FIFO).
可以利用内存存储实现精确的缓存,以达到快速检索的目的。然而,由于内存存储的容量有限,我们也可以采用 PostgreSQL、Redis 或分层存储等数据库来实现缓存,以此来平衡速度和存储容量。拥有淘汰策略对于管理缓存大小和维护性能至关重要。常见的淘汰策略包括最近最少使用(LRU)、最不经常使用(LFU)和先进先出(FIFO)。
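A minimal in-memory exact cache with LRU eviction might look like the following. It uses Python's `OrderedDict` to track recency; `ExactCache`, `summarize_product`, and the `summarize` callback are names invented for this sketch, with `summarize` standing in for the model call.

```python
from collections import OrderedDict
from typing import Callable, Optional

class ExactCache:
    """In-memory exact-match cache with Least Recently Used eviction."""

    def __init__(self, max_items: int = 1000):
        self.max_items = max_items
        self._store = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, value: str) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used item

def summarize_product(product_id: str, cache: ExactCache,
                      summarize: Callable[[str], str]) -> str:
    cached = cache.get(product_id)
    if cached is not None:
        return cached                    # cache hit: no model call
    summary = summarize(product_id)      # cache miss: call the model
    cache.put(product_id, summary)
    return summary
```

Swapping the `OrderedDict` for Redis or a database table changes the storage layer but not the get-or-compute pattern in `summarize_product`.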

How long to cache a query depends on how likely this query is to be called again. User-specific queries such as “What’s the status of my recent order” are less likely to be reused by other users, and therefore, shouldn’t be cached. Similarly, it makes less sense to cache time-sensitive queries such as “How’s the weather?” Some teams train a small classifier to predict whether a query should be cached.
查询的缓存时长取决于其再次被调用的可能性。例如,“我的最近订单状态如何?”这类针对特定用户的问题,不太可能被其他用户重复使用,因此不应被缓存。同样地,像“天气如何?”这种时效性问题,缓存也没有太大意义。一些团队会训练一个小型分类器来预测某个查询是否值得缓存。

Semantic cache 语义缓存库

Unlike exact cache, semantic cache doesn’t require the incoming query to be identical to any of the cached queries. Instead, it reuses cached results for queries that are semantically similar. Imagine one user asks “What’s the capital of Vietnam?” and the model generates the answer “Hanoi”. Later, another user asks “What’s the capital city of Vietnam?”, which is the same question but with the extra word “city”. The idea of semantic cache is that the system can reuse the answer “Hanoi” instead of computing the new query from scratch.
与精确缓存不同,语义缓存并不需要新查询与缓存中的任何查询完全一致。语义缓存允许系统为语义相似的查询重用已缓存的结果。例如,一个用户问“越南的首都是哪里?”模型回答“河内”。之后,另一个用户问“越南的首都是哪个城市?”虽然问题中多了一个词“城市”,但其实质相同。语义缓存的精髓在于,系统可以重用“河内”这个答案,而无需重新计算。

Semantic cache only works if you have a reliable way to determine if two queries are semantically similar. One common approach is embedding-based similarity, which works as follows:
语义缓存仅在您有可靠的方法判断两个查询在语义上是否相似时才生效。一种常见的做法是采用基于嵌入的相似性,其工作原理如下:

  1. For each query, generate its embedding using an embedding model.
    对于每个查询,利用嵌入模型来生成相应的向量表示。
  2. Use vector search to find the cached embedding closest to the current query embedding. Let’s say this similarity score is X.
    利用向量搜索技术,找到与当前查询嵌入最为接近的缓存嵌入。假设这个相似度得分为 X。
  3. If X is greater than or equal to the similarity threshold you set, the cached query is considered the same as the current query, and the cached results are returned. If not, process this current query and cache it together with its embedding and results.
    若 X 大于或等于您设定的相似度阈值,缓存的查询将被视为与当前查询相同,此时会返回缓存中的结果。反之,则需处理当前查询,并将其连同其嵌入信息和查询结果一并缓存。

This approach requires a vector database to store the embeddings of cached queries.
这种方法需要使用向量数据库来存储已缓存查询的嵌入式数据。
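The embedding-based lookup can be sketched end to end in a few lines. To keep the example self-contained, `embed` is a toy bag-of-words counter standing in for a real embedding model, and a linear scan over a list stands in for the vector database. A higher cosine score means more similar, so a score at or above the threshold counts as a match; the 0.8 threshold is an arbitrary illustrative value.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # (query embedding, cached response)

    def lookup(self, query: str) -> Optional[str]:
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # stand-in for a vector search
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def add(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```

After `add("What's the capital of Vietnam?", "Hanoi")`, a lookup for "What's the capital city of Vietnam?" scores about 0.91 and returns "Hanoi", while an unrelated query returns `None` and has to go to the model.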

Compared to other caching techniques, semantic cache’s value is more dubious because many of its components are prone to failure. Its success relies on high-quality embeddings, functional vector search, and a trustworthy similarity metric. Setting the right similarity threshold can also be tricky and require a lot of trial and error. If the system mistakes the incoming query as being similar to another query, the returned response, fetched from the cache, will be incorrect.
与其它缓存技术相比,语义缓存的价值显得更加存疑,因为其许多组成部分容易发生故障。它的成功取决于高质量的嵌入式编码、功能向量搜索以及可靠的相似性度量。设置正确的相似度阈值也颇具挑战,往往需要大量的尝试与调整。如果系统错误地将新查询识别为与另一查询相似,那么从缓存中返回的响应就会出错。

In addition, semantic cache can be time-consuming and compute-intensive, as it involves a vector search. The speed and cost of this vector search depend on the size of your database of cached embeddings.
此外,语义缓存可能非常耗时且计算密集,因为它涉及到向量搜索。向量搜索的速度和成本取决于你缓存嵌入数据库的大小。这可能会影响整体性能和效率。

Semantic cache might still be worth it if the cache hit rate is high, meaning that a good portion of queries can be effectively answered by leveraging the cached results. However, before incorporating the complexities of semantic cache, make sure to evaluate the efficiency, cost, and performance risks associated with it.
如果语义缓存的命中率很高,意味着可以通过利用缓存的结果有效地回答大部分查询,那么语义缓存可能仍然值得使用。但是,在引入语义缓存的复杂性之前,一定要评估其效率,成本和性能风险。

With the added cache systems, the platform looks as follows. KV cache and prompt cache are typically implemented by model API providers, so they aren’t shown in this image. If I must visualize them, I’d put them in the Model API box. There’s a new arrow to add generated responses to the cache.
在新增了缓存系统后,平台的架构如下所示。通常,KV 缓存和提示缓存是由模型 API 提供商实现的,所以在这张图中并未显示。如果我必须将它们可视化,我会把它们放在模型 API 的框框里。图中新增了一条箭头,表示将生成的响应加入缓存。

Overview of a genai platform


Step 5. Add Complex Logic and Write Actions
第 5 步。添加复杂逻辑和写入操作

The applications we’ve discussed so far have fairly simple flows. The outputs generated by foundation models are mostly returned to users (unless they don’t pass the guardrails). However, an application flow can be more complex with loops and conditional branching. A model’s outputs can also be used to invoke write actions, such as composing an email or placing an order.
我们到目前为止讨论的应用程序流程相对简单。基础模型产生的输出大多直接返回给用户(除非它们未能通过安全检查)。然而,应用程序的流程可能更为复杂,包含循环和条件分支。模型的输出还可以用于触发写入操作,比如撰写电子邮件或下订单。

Complex logic 复杂的逻辑处理

Outputs from a model can be conditionally passed onto another model or fed back to the same model as part of the input to the next step. This goes on until a model in the system decides that the task has been completed and that a final response should be returned to the user.
来自一个模型的输出可以依据条件传递给另一个模型,或者作为下一次输入循环反馈至同一模型。这一过程会持续进行,直到系统中的某个模型判定任务已完成,且应向用户返回最终结果。

This can happen when you give your system the ability to plan and decide what to do next. As an example, consider the query “Plan a weekend itinerary for Paris.” The model might first generate a list of potential activities: visiting the Eiffel Tower, having lunch at a café, touring the Louvre, etc. Each of these activities can then be fed back into the model to generate more detailed plans. For instance, “visiting the Eiffel Tower” could prompt the model to generate sub-tasks like checking the opening hours, buying tickets, and finding nearby restaurants. This iterative process continues until a comprehensive and detailed itinerary is created.
当你赋予系统规划和决定下一步做什么的能力时,这种情况就可能发生。例如,考虑一下这样的查询:“为巴黎的周末制定行程。”模型可能会首先生成一系列可能的活动:参观埃菲尔铁塔,去咖啡馆吃午餐,参观卢浮宫等。然后,这些活动可以反馈到模型中,以生成更详细的计划。例如,“参观埃菲尔铁塔”可能会促使模型生成子任务,如检查开放时间,购买门票,寻找附近的餐馆。这个迭代过程会一直持续,直到创建出一个全面而详细的行程。
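The feedback loop described here boils down to calling a model repeatedly with its own earlier outputs until it signals completion. In the sketch below, `step` is a hypothetical callable wrapping a model call that returns a dict with `done` and `output` keys, and `max_steps` is a safety cap added for illustration so that a confused model cannot loop forever.

```python
from typing import Callable, List

def run_until_done(
    query: str,
    step: Callable[[str, List[str]], dict],  # hypothetical wrapper around a model call
    max_steps: int = 5,
) -> str:
    """Feed earlier outputs back into the model until it reports the task is done."""
    history: List[str] = []
    for _ in range(max_steps):
        result = step(query, history)
        if result["done"]:
            return result["output"]
        history.append(result["output"])  # becomes part of the next step's input
    return history[-1] if history else ""

def fake_planner(query: str, history: List[str]) -> dict:
    """Scripted stand-in for a planning model, used only to exercise the loop."""
    if not history:
        return {"done": False, "output": "Activities: Eiffel Tower, Louvre"}
    if len(history) == 1:
        return {"done": False, "output": "Eiffel Tower: check hours, buy tickets"}
    return {"done": True, "output": "; ".join(history)}
```

Calling `run_until_done("Plan a weekend itinerary for Paris.", fake_planner)` runs three steps: a list of activities, a refinement of one activity, and a final combined answer.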

Our infrastructure now has an arrow pointing the generated response back to context construction, which in turn feeds back to models in the model gateway.
我们的基础设施现已具备一个反馈机制,将生成的响应重新导向上下文构建环节,进而将信息回传至模型网关中的各个模型。

Overview of a genai platform


Write actions 写入操作

Actions used for context construction are read-only actions. They allow a model to read from its data sources to gather context. But a system can also perform write actions, making changes to the data sources and the world. For example, if the model outputs: “send an email to X with the message Y”, the system will invoke the action send_email(recipient=X, message=Y).
用于构建上下文的动作是只读的,它们允许模型从数据源读取信息以收集上下文。但系统也能执行写入动作,对数据源和现实世界做出改变。例如,如果模型输出“向 X 发送内容为 Y 的邮件”,系统将会调用 send_email(recipient=X, message=Y) 这个动作。

Write actions make a system vastly more capable. They can enable you to automate the whole customer outreach workflow: researching potential customers, finding their contacts, drafting emails, sending first emails, reading responses, following up, extracting orders, updating your databases with new orders, etc.
写入操作极大地提升了系统的功能。它们能帮助你实现客户拓展全流程自动化,包括研究潜在客户、查找客户联系方式、撰写邮件、发送首封邮件、阅读并处理回复、跟进客户、提取订单以及用新订单更新数据库等。

However, the prospect of giving AI the ability to automatically alter our lives is frightening. Just as you shouldn’t give an intern the authority to delete your production database, you shouldn’t allow an unreliable AI to initiate bank transfers. Trust in the system’s capabilities and its security measures is crucial. You need to ensure that the system is protected from bad actors who might try to manipulate it into performing harmful actions.
然而,让 AI 拥有自动改变我们生活的能力,这种前景令人感到恐惧。正如你不会授权实习生删除生产数据库,你也不应该让不可靠的 AI 发起银行转账。对系统的能力和安全措施的信任至关重要。你需要确保系统受到保护,防止有恶意的人试图操纵系统执行有害行为。
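One way to limit the risk is to separate read-only actions from write actions in the dispatcher and require explicit approval before any write runs. The sketch below assumes the model's output has already been parsed into an action name and arguments; `send_email` and `search_docs` are stubs, and the `approve` callback (which could be a human confirmation or a policy check) is a hypothetical hook that denies by default.

```python
from typing import Callable, Dict, Tuple

def send_email(recipient: str, message: str) -> str:
    return f"email sent to {recipient}"   # stub: a real system would send mail here

def search_docs(query: str) -> str:
    return f"results for {query}"         # stub read-only action

# action name -> (function, is_write_action)
ACTIONS: Dict[str, Tuple[Callable[..., str], bool]] = {
    "send_email": (send_email, True),
    "search_docs": (search_docs, False),
}

def invoke(
    action_name: str,
    args: dict,
    approve: Callable[[str, dict], bool] = lambda name, args: False,
) -> str:
    """Run read-only actions directly; run write actions only when approved."""
    if action_name not in ACTIONS:
        raise ValueError(f"unknown action: {action_name}")
    fn, is_write = ACTIONS[action_name]
    if is_write and not approve(action_name, args):
        return "blocked: write action requires approval"
    return fn(**args)
```

Because only registered actions can be invoked and writes are denied unless approved, a manipulated model output cannot trigger arbitrary side effects on its own. This reduces, but does not eliminate, the risks discussed here.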

AI systems are vulnerable to cyber attacks like other software systems, but they also have another weakness: prompt injection. Prompt injection happens when an attacker manipulates input prompts into a model to get it to express undesirable behaviors. You can think of prompt injection as social engineering done on AI instead of humans.
AI 系统如同其他软件系统一样,面临着网络攻击的威胁,但它们还存在一个特殊的弱点:提示注入。当攻击者篡改输入提示,诱使 AI 模型产生不希望的行为时,就发生了提示注入。可以将提示注入理解为一种针对 AI 的社会工程手段,而非针对人类。

A scenario that many companies fear is that they give an AI system access to their internal databases, and attackers trick this system into revealing private information from these databases. If the system has write access to these databases, attackers can trick the system into corrupting the data.
许多企业担忧的一种情况是,他们允许 AI 系统访问内部数据库,而黑客可能利用这一点,诱骗系统泄露数据库中的敏感信息。如果系统对这些数据库有写入权限,黑客甚至可能诱骗系统篡改数据,造成破坏。

Any organization that wants to leverage AI needs to take safety and security seriously. However, these risks don’t mean that AI systems should never be given the ability to act in the real world. AI systems can fail, but humans can fail too. If we can get people to trust a machine to take us up into space, I hope that one day, security measures will be sufficient for us to trust autonomous AI systems.
任何想要利用人工智能的组织都必须高度重视安全和保障问题。但这并不意味着应该完全禁止 AI 系统在现实世界中行动。AI 系统确实可能出错,但人类同样会犯错。既然我们能让人们信任机器带我们进入太空,那么我希望未来某天,我们能够建立足够的保障措施,让人们信任自主的 AI 系统。

Overview of a genai platform


Observability 可观测性

While I have placed observability in its own section, it should be integrated into the platform from the beginning rather than added later as an afterthought. Observability is crucial for projects of all sizes, and its importance grows with the complexity of the system.
尽管我将可观察性单独列出,但它应从平台构建之初就融入其中,而非事后才作为补充。无论项目规模大小,可观察性都至关重要,且随着系统复杂度的提升,其重要性愈发凸显。

This section provides the least information compared to the others. It’s impossible to cover all the nuances of observability in a blog post. Therefore, I will only give a brief overview of the three pillars of monitoring: logs, traces, and metrics. I won’t go into specifics or cover user feedback, drift detection, and debugging.
这一节提供的信息量相对较少。在一篇博客文章中全面覆盖可观测性的所有细节是不可能的。因此,我只会简要介绍监控的三大支柱:日志、追踪和指标。我不会深入具体细节,也不会涉及用户反馈、漂移检测和调试这些内容。

Metrics 指标数据

When discussing monitoring, most people think of metrics. What metrics to track depends on what you want to track about your system, which is application-specific. However, in general, there are two types of metrics you want to track: model metrics and system metrics.
讨论监控时,多数人会想到指标。具体跟踪哪些指标,取决于你想监控的系统特性,这与具体应用相关。但通常,有两类指标需要关注:模型指标和系统指标。

System metrics tell you the state of your overall system. Common metrics are throughput, memory usage, hardware utilization, and service availability/uptime. System metrics are common to all software engineering applications. In this post, I’ll focus on model metrics.
系统指标能揭示整个系统的运行状况。常见的指标包括吞吐量、内存使用率、硬件利用率以及服务的可用性和运行时间。这些系统指标在所有软件工程应用中都是通用的。在本文中,我将着重探讨模型指标。

Model metrics assess your model’s performance, such as accuracy, toxicity, and hallucination rate. Different steps in an application pipeline also have their own metrics. For example, in a RAG application, the retrieval quality is often evaluated using context relevance and context precision. A vector database can be evaluated by how much storage it needs to index the data and how long it takes to query the data.
模型的指标用于评估模型的性能,比如准确性、毒性以及幻觉率。在应用程序的各个阶段,也有其特定的评估指标。例如,在 RAG 应用中,检索质量通常通过上下文的相关性和精确度来衡量。对于向量数据库,其评估标准可能包括存储数据所需的容量,以及查询数据所需的时间。

There are various ways a model’s output can fail. It’s crucial to identify these issues and develop metrics to monitor them. For example, you might want to track how often your model times out, returns empty responses or produces malformatted responses. If you’re worried about your model revealing sensitive information, find a way to track that too.
模型的输出可能会以多种方式出错。识别这些问题并开发相应的监控指标至关重要。例如,你可能需要追踪模型超时、返回空响应或产生格式错误响应的频率。如果你担心模型可能会泄露敏感信息,也应该找到方法来追踪这一点。

Length-related metrics such as query, context, and response length are helpful for understanding your model’s behaviors. Is one model more verbose than another? Are certain types of queries more likely to result in lengthy answers? They are especially useful for detecting changes in your application. If the average query length suddenly decreases, it could indicate an underlying issue that needs investigation.
像查询、上下文和响应长度这样的长度相关指标,对于理解你的模型行为非常有帮助。一个模型是否比另一个更话多?某些类型的提问是否更可能得到长篇大论的回答?它们对于发现你的应用变化尤其有用。如果平均提问长度突然变短,可能意味着存在需要调查的潜在问题。

Length-related metrics are also important for tracking latency and costs, as longer contexts and responses typically increase latency and incur higher costs.
与长度相关的指标对于监控延迟和成本同样重要,因为上下文和响应越长,通常会导致延迟增加,成本也更高。

Tracking latency is essential for understanding the user experience. Common latency metrics include:
跟踪延迟是理解用户体验的关键。常见的延迟指标有:

  • Time to First Token (TTFT): The time it takes for the first token to be generated.
    首个令牌生成时间(TTFT):这是生成第一个令牌所需的时间。
  • Time Between Tokens (TBT): The interval between each token generation.
    令牌间隔时间(TBT):每次生成令牌之间的间隔。
  • Tokens Per Second (TPS): The rate at which tokens are generated.
    每秒令牌数(TPS):这是令牌生成的速率。
  • Time Per Output Token (TPOT): The time it takes to generate each output token.
    输出令牌时间(TPOT):指生成每个输出令牌所需的时间。这可以理解为,处理每个输出单位所需的时间消耗。
  • Total Latency: The total time required to complete a response.
    总延迟:完成响应所需的总时间。
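These latency metrics can all be derived from the time a request was sent and the arrival time of each output token. Exact definitions vary between tools, so treat the formulas below as one reasonable convention rather than a standard: TPOT is taken as the average time per token after the first, and TPS as output tokens divided by total latency.

```python
from typing import Dict, List

def latency_metrics(request_time: float, token_times: List[float]) -> Dict[str, object]:
    """Compute latency metrics from a request timestamp and per-token timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were generated")
    n = len(token_times)
    ttft = token_times[0] - request_time       # Time to First Token
    total = token_times[-1] - request_time     # Total Latency
    # Time Between Tokens: one interval per consecutive pair of tokens.
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = (total - ttft) / (n - 1) if n > 1 else 0.0  # Time Per Output Token (average)
    tps = n / total if total > 0 else 0.0              # Tokens Per Second
    return {"ttft": ttft, "tbt": tbt, "tpot": tpot, "tps": tps, "total_latency": total}
```

For a request sent at t=0.0 whose four tokens arrive at 0.5, 0.6, 0.7, and 0.8 seconds, TTFT is 0.5 s, total latency is 0.8 s, and each interval between tokens is roughly 0.1 s.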

You’ll also want to track costs. Cost-related metrics include the number of queries and the volume of input and output tokens. If you use an API with rate limits, tracking the number of requests per second is important to ensure you stay within your allocated limits and avoid potential service interruptions.
你同样需要监控成本。成本相关的指标包括查询次数,以及输入和输出令牌的总量。如果你使用的是有速率限制的 API,追踪每秒的请求次数就显得尤为重要,这能确保你不会超出分配的使用限制,避免可能的服务中断。

When calculating metrics, you can choose between spot checks and exhaustive checks. Spot checks involve sampling a subset of data to quickly identify issues, while exhaustive checks evaluate every request for a comprehensive performance view. The choice depends on your system’s requirements and available resources, with a combination of both providing a balanced monitoring strategy.
在计算指标时,您可以选择抽样检查或全面检查。抽样检查通过选取数据子集,能快速识别问题;而全面检查则评估每个请求,提供全面的性能视图。选择哪种方式取决于您的系统需求和可用资源,两者结合使用可形成平衡的监控策略。

When computing metrics, ensure they can be broken down by relevant axes, such as users, releases, prompt/chain versions, prompt/chain types, and time. This granularity helps in understanding performance variations and identifying specific issues.
在计算指标时,确保能按相关维度细分,如用户、版本发布、提示/链的版本、提示/链的类型以及时间。这样的细分粒度有助于理解性能波动,并能定位到具体的问题。

Logs 日志记录

Since this blog post is getting long and I’ve written at length about logs in Designing Machine Learning Systems, I will be quick here. The philosophy for logging is simple: log everything. Log the system configurations. Log the query, the output, and the intermediate outputs. Log when a component starts, ends, when something crashes, etc. When recording a piece of log, make sure to give it tags and IDs that can help you know where in the system this log comes from.
由于这篇博客文章已经很长,而且我在《设计机器学习系统》中已经详细地写过日志,所以在这里我会简明扼要。日志的哲学很简单:记录一切。记录系统配置。记录查询、输出和中间输出。记录组件何时开始、结束,以及当某些东西崩溃等情况。在记录一条日志时,确保给它打上标签和 ID,以便你知道这条日志来自系统中的哪个位置。
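As a small illustration of tagging logs with IDs, each event can be emitted as one JSON line carrying a request ID and the name of the component that produced it. The field names (`request_id`, `component`, `event`) and the logger name are arbitrary choices for this sketch.

```python
import json
import logging
import time

logger = logging.getLogger("genai_platform")  # hypothetical logger name

def log_event(request_id: str, component: str, event: str, **fields) -> str:
    """Emit one structured log line tagged with where in the system it came from."""
    record = {
        "ts": time.time(),
        "request_id": request_id,  # ties together all logs for one user query
        "component": component,    # e.g. "router", "retriever", "model_gateway"
        "event": event,            # e.g. "start", "end", "error"
        **fields,                  # query, intermediate outputs, config, ...
    }
    line = json.dumps(record, ensure_ascii=False)
    logger.info(line)
    return line
```

Because every line shares the same `request_id`, filtering on it reconstructs everything that happened to a single query across components.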

Logging everything means that the amount of logs you have can grow very quickly. Many tools for automated log analysis and log anomaly detection are powered by AI.
记录所有内容意味着日志数量可能会迅速增加。许多用于自动化日志分析和检测日志异常的工具都是由人工智能驱动的。

While it’s impossible to manually process all logs, it’s useful to manually inspect your production data daily to get a sense of how users are using your application. Shankar et al. (2024) found that the developers’ perceptions of what constitutes good and bad outputs change as they interact with more data, allowing them to both rewrite their prompts to increase the chance of good responses and update their evaluation pipeline to catch bad responses.
尽管手动处理日志是不可能的,但每天手动检查生产数据,以了解用户如何使用你的应用程序是有益的。Shankar 等人(2024)的研究表明,随着开发者与更多数据的互动,他们对好与坏的输出定义也在不断变化。这使他们能够重写提示,提高获得良好反馈的几率,并更新评估流程,以识别不良反馈。

Traces 追踪记录

A trace is a detailed recording of a request’s execution path through various system components and services. In an AI application, tracing reveals the entire process from when a user sends a query to when the final response is returned, including the actions the system takes, the documents retrieved, and the final prompt sent to the model. It should also show how much time each step takes and its associated cost, if measurable. As an example, this is a visualization of a LangSmith trace.
跟踪是指详细记录请求在各种系统组件和服务中的执行路径。在 AI 应用中,跟踪能揭示从用户发送查询到返回最终响应的整个流程,包括系统执行的操作、检索的文档以及发送给模型的最终提示。它还应显示每个步骤所需的时间,以及如果可量化的话,其相关成本。例如,这是 Langsmith 跟踪的可视化展示。

Overview of a genai platform


Ideally, you should be able to trace each query’s transformation through the system step-by-step. If a query fails, you should be able to pinpoint the exact step where it went wrong: whether it was incorrectly processed, the retrieved context was irrelevant, or the model generated a wrong response.
理想情况下,你应该能够追踪每个查询在系统中的逐步转换。如果查询失败,你应该能够准确地定位到出问题的确切步骤:是处理有误,检索到的上下文不相关,还是模型生成了错误的回应。
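A bare-bones version of step-by-step tracing can be built with a context manager that records the name, duration, and status of each step. This is a hand-rolled sketch, not how any particular tracing product works; the `Trace` class and its span fields are invented for illustration.

```python
import time
from contextlib import contextmanager

class Trace:
    """Collects one span per pipeline step for a single request."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.spans = []

    @contextmanager
    def span(self, name: str, **attrs):
        start = time.perf_counter()
        record = {"name": name, "status": "ok", **attrs}
        try:
            yield record              # the step can attach more details to the record
        except Exception as err:
            record["status"] = f"error: {err}"  # pinpoints the step that failed
            raise
        finally:
            record["seconds"] = time.perf_counter() - start
            self.spans.append(record)
```

Wrapping each step, for example `with trace.span("retrieve", docs=3): ...`, leaves an ordered list in `trace.spans` showing how long every step took and which one, if any, raised an error.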

AI Pipeline Orchestration
人工智能管道编排

An AI application can get fairly complex, consisting of multiple models, retrieving data from many databases, and having access to a wide range of tools. An orchestrator helps you specify how these different components are combined (chained) together to create an end-to-end application flow.
人工智能应用可能相当复杂,涉及多个模型,从众多数据库中获取数据,并能访问各种工具。编排器能帮助你定义这些不同组件如何组合(串联)在一起,从而构建出端到端的应用流程。

At a high level, an orchestrator works in two steps: components definition and chaining (also known as pipelining).
从宏观层面来看,编排器的工作分为两个步骤:定义组件以及将它们串联起来(这一过程也常被称为管道化)。

  1. Components Definition 组件定义
    You need to tell the orchestrator what components your system uses, such as models (including models for generation, routing, and scoring), databases from which your system can retrieve data, and actions that your system can take. Direct integration with model gateways can help simplify model onboarding, and some orchestrator tools want to be gateways. Many orchestrators also support integration with tools for evaluation and monitoring.
    你需要告知编排器你的系统所使用的组件,比如各类模型(涵盖生成、路由和评分等功能)、可从中检索数据的数据库,以及系统可执行的操作。直接与模型网关集成有助于简化模型的导入流程,部分编排工具甚至希望充当网关角色。此外,许多编排器还支持与评估及监控工具的集成,以增强系统性能。

  2. Chaining (or pipelining)
    链式处理(或管道处理)

    You tell the orchestrator the sequence of steps your system takes from receiving the user query until completing the task. In short, chaining is just function composition. Here’s an example of what a pipeline looks like.
    你向编排器描述系统从接收用户查询到完成任务的整个流程,即一系列步骤的组合。简而言之,这就是函数的组合。下面是一个管道示例,你可以参考。

    1. Process the raw query.
      处理原始的查询内容。
    2. Retrieve the relevant data based on the processed query.
      根据处理后的查询,检索相关数据。
    3. The original query and the retrieved data are combined to create a prompt in the format expected by the model.
      将原始查询与检索到的数据结合,形成符合模型预期格式的提示信息。
    4. The model generates a response based on the prompt.
      该模型根据提示生成相应的回答。
    5. Evaluate the response. 评估回应的内容。
    6. If the response is considered good, return it to the user. If not, route the query to a human operator.
      若判定响应质量良好,应将其反馈给用户。反之,则需将问题转交至人工客服处理。

    The orchestrator is responsible for passing data between steps and can provide toolings that help ensure that the output from the current step is in the format expected by the next step.
    编排器担当在各步骤间传输数据的角色,并能提供工具确保当前步骤的输出,符合下一步骤所期望的格式,从而有效衔接各步骤。
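Since chaining is function composition, the six-step pipeline above can be sketched without any orchestration library. Each step takes and returns a state dict, which plays the role of the data passed between steps. The step implementations are deliberately naive placeholders (keyword retrieval, a non-empty check as the evaluator), and the model is assumed to be a plain callable stored in the state.

```python
from functools import reduce
from typing import Any, Callable, Dict

State = Dict[str, Any]

def chain(*steps: Callable) -> Callable:
    """Compose steps left to right: chain(f, g)(x) == g(f(x))."""
    return lambda state: reduce(lambda acc, step: step(acc), steps, state)

def process_query(state: State) -> State:
    return {**state, "processed": state["query"].strip().lower()}

def retrieve(state: State) -> State:
    words = state["processed"].split()
    hits = [doc for doc in state["docs"] if any(w in doc.lower() for w in words)]
    return {**state, "context": hits}

def build_prompt(state: State) -> State:
    context = "\n".join(state["context"])
    return {**state, "prompt": f"Context:\n{context}\n\nQuestion: {state['query']}"}

def generate(state: State) -> State:
    return {**state, "response": state["model"](state["prompt"])}

def evaluate(state: State) -> State:
    return {**state, "is_good": bool(state["response"].strip())}  # placeholder scorer

def respond_or_escalate(state: State) -> str:
    return state["response"] if state["is_good"] else "Routing to a human operator."

pipeline = chain(process_query, retrieve, build_prompt, generate, evaluate, respond_or_escalate)
```

Starting with plain functions like this makes it clear what an orchestrator would later take over: wiring the steps, validating the shape of the state between them, and adding branching or retries.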

When designing the pipeline for an application with strict latency requirements, try to do as much in parallel as possible. For example, if you have a routing component (deciding where to send a query) and a PII removal component, the two can run at the same time.
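
As a rough sketch of this idea, the snippet below runs a router and a PII scrubber concurrently with Python's `asyncio.gather`. Both components are hypothetical: `route` and `remove_pii` simulate network latency with `asyncio.sleep`, and the email regex is a deliberately simple placeholder for real PII detection. It also assumes the router is allowed to see the raw query; if it is not, the two steps have to run sequentially.

```python
import asyncio
import re
import time

# Simplistic email matcher, used only for illustration.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


async def route(query: str) -> str:
    """Hypothetical router: picks a destination for the query."""
    await asyncio.sleep(0.1)  # simulated latency of a routing model call
    return "billing" if "invoice" in query.lower() else "general"


async def remove_pii(query: str) -> str:
    """Hypothetical PII scrubber: masks email addresses only."""
    await asyncio.sleep(0.1)  # simulated latency of a PII detection call
    return EMAIL_PATTERN.sub("[EMAIL]", query)


async def preprocess(query: str) -> tuple[str, str]:
    # Both steps read only the raw query, so neither has to wait for the other.
    destination, clean_query = await asyncio.gather(route(query), remove_pii(query))
    return destination, clean_query


start = time.perf_counter()
destination, clean_query = asyncio.run(
    preprocess("Please resend my invoice to jane@example.com")
)
elapsed = time.perf_counter() - start
print(destination, "|", clean_query, f"| {elapsed:.2f}s")
```

Run sequentially, the two simulated calls would take about 0.2 seconds; run concurrently, the total is close to the slower of the two.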

There are many AI orchestration tools, including LangChain, LlamaIndex, Flowise, Langflow, and Haystack. Each tool has its own APIs so I won’t show the actual code here.

While it’s tempting to jump straight to an orchestration tool when starting a project, start building your application without one first. Any external tool brings added complexity. An orchestrator can abstract away critical details of how your system works, making it hard to understand and debug your system.

As you advance to the later stages of your application development process, you might decide that an orchestrator can make your life easier. Here are three aspects to keep in mind when evaluating orchestrators.

  1. Integration and extensibility

    Evaluate whether the orchestrator supports the components you’re already using or might adopt in the future. For example, if you want to use a Llama model, check if the orchestrator supports that. Given how many models, databases, and frameworks there are, it’s impossible for an orchestrator to support everything. Therefore, you’ll also need to consider an orchestrator’s extensibility. If it doesn’t support a specific component, how hard is it to add that support yourself?
  2. Support for complex pipelines

    As your applications grow in complexity, you might need to manage intricate pipelines involving multiple steps and conditional logic. An orchestrator that supports advanced features like branching, parallel processing, and error handling will help you manage these complexities efficiently.
  3. Ease of use, performance, and scalability

    Consider the user-friendliness of the orchestrator. Look for intuitive APIs, comprehensive documentation, and strong community support, as these can significantly reduce the learning curve for you and your team. Avoid orchestrators that initiate hidden API calls or introduce latency to your applications. Additionally, ensure that the orchestrator can scale effectively as the number of applications, developers, and traffic grows.
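
One way to check for hidden API calls is to wrap your model client in a counter and compare a direct call against the same query sent through the orchestrator. The sketch below shows the idea with made-up pieces: `CountingModel`, `fake_model`, and `orchestrated_app` are all hypothetical, and `orchestrated_app` merely imitates a tool that quietly rewrites the query before answering.

```python
import time
from typing import Callable


class CountingModel:
    """Wraps a model call so every invocation is counted and timed."""

    def __init__(self, model_fn: Callable[[str], str]) -> None:
        self._model_fn = model_fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        start = time.perf_counter()
        try:
            return self._model_fn(prompt)
        finally:
            self.seconds += time.perf_counter() - start


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return f"echo: {prompt}"


def direct_app(model: Callable[[str], str], query: str) -> str:
    """Baseline: one visible model call per query."""
    return model(query)


def orchestrated_app(model: Callable[[str], str], query: str) -> str:
    """Imitates an orchestrator that rewrites the query behind the scenes."""
    rewritten = model(f"Rewrite this query: {query}")  # the hidden call
    return model(rewritten)


def count_calls(app: Callable[[Callable[[str], str], str], str], query: str) -> int:
    """Run one query through an app and report how many model calls it made."""
    model = CountingModel(fake_model)
    app(model, query)
    return model.calls


print(count_calls(direct_app, "hello"))        # 1
print(count_calls(orchestrated_app, "hello"))  # 2
```

The same wrapper accumulates time spent inside model calls in `seconds`, so subtracting it from the end-to-end time gives a rough estimate of the overhead the orchestrator itself adds.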

Conclusion

This post started with a basic architecture and then gradually added components to address growing application complexity. Each addition brings its own set of benefits and challenges, requiring careful consideration and implementation.

While the separation of components is important to keep your system modular and maintainable, this separation is fluid. There are many overlaps between components. For example, a model gateway can share functionalities with guardrails. Cache can be implemented in different components, such as in vector search and inference services.

This post is much longer than I intended it to be, and yet there are many details I haven’t been able to explore further, especially around observability, context construction, complex logic, cache, and guardrails. I’ll dive deeper into all these components in my upcoming book AI Engineering.

This post also didn’t discuss how to serve models, assuming that most people will be using models provided by third-party APIs. AI Engineering will also have a chapter dedicated to inference and model optimization.

References and Acknowledgments

Special thanks to Luke Metz, Alex Li, Chetan Tekur, Kittipat “Bot” Kampa, Hien Luu, and Denys Linkov for feedback on the early versions of this post. Their insights greatly improved the content. Any remaining errors are my own.

I read many case studies shared by companies on how they adopted generative AI, and here are some of my favorites.