
The rise of the AI crawler


Real-world data from MERJ and Vercel shows distinct patterns from top AI crawlers.

AI crawlers have become a significant presence on the web. OpenAI's GPTBot generated 569 million requests across Vercel's network in the past month, while Anthropic's Claude followed with 370 million. For perspective, this combined volume represents about 20% of Googlebot's 4.5 billion requests during the same period.

After analyzing how Googlebot handles JavaScript rendering with MERJ, we turned our attention to these AI assistants. Our new data reveals how OpenAI’s ChatGPT, Anthropic’s Claude, and other AI tools crawl and process web content.

We uncovered clear patterns in how these crawlers handle JavaScript, prioritize content types, and navigate the web, which directly impact how AI tools understand and interact with modern web applications.

Data collection process

Our primary data comes from monitoring nextjs.org and the Vercel network for the past few months. To validate our findings across different technology stacks, we also analyzed two job board websites: Resume Library, built with Next.js, and CV Library, which uses a custom monolithic framework. This diverse dataset helps ensure our observations about crawler behavior are consistent across different web architectures.

For more details on how we collected this data, see our first article.

Note: Microsoft Copilot was excluded from this study as it lacks a unique user agent for tracking.

Scale and distribution

The volume of AI crawler traffic across Vercel's network is substantial. In the past month:

  • Googlebot: 4.5 billion fetches across Gemini and Search

  • GPTBot (ChatGPT): 569 million fetches

  • Claude: 370 million fetches

  • AppleBot: 314 million fetches

  • PerplexityBot: 24.4 million fetches

While AI crawlers haven't reached Googlebot's scale, they represent a significant portion of web crawler traffic. For context, GPTBot, Claude, AppleBot, and PerplexityBot combined account for nearly 1.3 billion fetches—a little over 28% of Googlebot's volume.

Geographic distribution

All AI crawlers we measured operate from U.S. data centers:

  • ChatGPT: Des Moines (Iowa), Phoenix (Arizona)

  • Claude: Columbus (Ohio)

In comparison, traditional search engines often distribute crawling across multiple regions. For example, Googlebot operates from seven different U.S. locations, including The Dalles (Oregon), Council Bluffs (Iowa), and Moncks Corner (South Carolina).

JavaScript rendering capabilities

Our analysis shows a clear divide in JavaScript rendering capabilities among AI crawlers. To validate our findings, we analyzed both Next.js applications and traditional web applications using different tech stacks.

The results consistently show that none of the major AI crawlers currently render JavaScript. This includes:

  • OpenAI (OAI-SearchBot, ChatGPT-User, GPTBot)

  • Anthropic (ClaudeBot)

  • Meta (Meta-ExternalAgent)

  • ByteDance (Bytespider)

  • Perplexity (PerplexityBot)

The results also show:

  • Google's Gemini leverages Googlebot's infrastructure, enabling full JavaScript rendering.

  • AppleBot renders JavaScript through a browser-based crawler, similar to Googlebot. It processes JavaScript, CSS, Ajax requests, and other resources needed for full-page rendering.

  • Common Crawl (CCBot), which is often used as a training dataset for Large Language Models (LLMs), does not render pages.

The data indicates that while ChatGPT and Claude crawlers do fetch JavaScript files (ChatGPT: 11.50%, Claude: 23.84% of requests), they don't execute them. They can't read client-side rendered content.

Note, however, that content included in the initial HTML response, like JSON data or delayed React Server Components, may still be indexed since AI models can interpret non-HTML content.
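
To make the distinction concrete, here is a minimal sketch of the two rendering paths in a Next.js App Router project (the file paths and API endpoint are hypothetical). A crawler that fetches but never executes JavaScript sees the first component's content in the initial HTML response; the second component's content only exists after client-side execution, so these crawlers never see it.

```tsx
// app/server-article/page.tsx (hypothetical path)
// Server component: data is fetched during rendering, so the text
// is present in the initial HTML that crawlers like GPTBot receive.
export default async function ServerArticle() {
  const res = await fetch("https://api.example.com/article"); // hypothetical API
  const article: { body: string } = await res.json();
  return <article>{article.body}</article>;
}
```

```tsx
// app/client-article/page.tsx (hypothetical path)
// Client component: data is fetched after hydration. Crawlers that
// don't execute JavaScript only ever see the empty <article> shell.
"use client";

import { useEffect, useState } from "react";

export default function ClientArticle() {
  const [body, setBody] = useState("");
  useEffect(() => {
    fetch("https://api.example.com/article") // hypothetical API
      .then((res) => res.json())
      .then((data: { body: string }) => setBody(data.body));
  }, []);
  return <article>{body}</article>;
}
```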

In contrast, Gemini's use of Google's infrastructure gives it the same rendering capabilities we documented in our Googlebot analysis, allowing it to process modern web applications fully.

Content type priorities

AI crawlers show distinct preferences in the types of content they fetch on nextjs.org. The most notable patterns:

  • ChatGPT prioritizes HTML content (57.70% of fetches)

  • Claude focuses heavily on images (35.17% of total fetches)

  • Both crawlers spend significant time on JavaScript files (ChatGPT: 11.50%, Claude: 23.84%) despite not executing them

For comparison, Googlebot's fetches (across Gemini and Search) are more evenly distributed:

  • 31.00% HTML content

  • 29.34% JSON data

  • 20.77% plain text

  • 15.25% JavaScript

These patterns suggest AI crawlers collect diverse content types—HTML, images, and even JavaScript files as text—likely to train their models on various forms of web content.

While traditional search engines like Google have optimized their crawling patterns specifically for search indexing, newer AI companies may still be refining their content prioritization strategies.

Crawling (in)efficiency

Our data shows significant inefficiencies in AI crawler behavior:

  • ChatGPT spends 34.82% of its fetches on 404 pages

  • Claude shows similar patterns with 34.16% of fetches hitting 404s

  • ChatGPT spends an additional 14.36% of fetches following redirects

Analysis of 404 errors reveals that, excluding robots.txt, these crawlers frequently attempt to fetch outdated assets from the /static/ folder. This suggests a need for improved URL selection and handling strategies to avoid unnecessary requests.

These high rates of 404s and redirects contrast sharply with Googlebot, which spends only 8.22% of fetches on 404s and 1.49% on redirects, suggesting Google has spent more time optimizing its crawler to target real resources.

Traffic correlation analysis

Our analysis of traffic patterns reveals interesting correlations between crawler behavior and site traffic. Based on data from nextjs.org:

  • Pages with higher organic traffic receive more frequent crawler visits

  • AI crawlers show less predictable patterns in their URL selection

  • High 404 rates suggest AI crawlers may need to improve their URL selection and validation processes, though the exact cause remains unclear

While traditional search engines have developed sophisticated prioritization algorithms, AI crawlers are seemingly still evolving their approach to web content discovery.

Our research with Vercel highlights that AI crawlers, while rapidly scaling, continue to face significant challenges in handling JavaScript and efficiently crawling content. As the adoption of AI-driven web experiences continues to gather pace, brands must ensure that critical information is server-side rendered and that their sites remain well-optimized to sustain visibility in an increasingly diverse search landscape.
Ryan Siddle, Managing Director of MERJ

Recommendations

For site owners who want to be crawled

  • Prioritize server-side rendering for critical content. ChatGPT and Claude don't execute JavaScript, so any important content should be server-rendered. This includes main content (articles, product information, documentation), meta information (titles, descriptions, categories), and navigation structures. SSR, ISR, and SSG keep your content accessible to all crawlers (see the first sketch after this list).

  • Client-side rendering still works for enhancement features. Feel free to use client-side rendering for non-essential dynamic elements like view counters, interactive UI enhancements, live chat widgets, and social media feeds.

  • Efficient URL management matters more than ever. The high 404 rates from AI crawlers highlight the importance of maintaining proper redirects, keeping sitemaps up to date, and using consistent URL patterns across your site (a redirect sketch follows this list).
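
As mentioned in the first item above, SSR, ISR, and SSG all place critical content in the initial HTML. A minimal ISR sketch, assuming a Next.js 14 App Router page (the route and data source are hypothetical):

```tsx
// app/docs/[slug]/page.tsx (hypothetical route)
// Statically generated and revalidated at most once per hour (ISR).
// The pre-rendered HTML is what non-executing crawlers receive.
export const revalidate = 3600;

export default async function DocsPage({
  params,
}: {
  params: { slug: string };
}) {
  const res = await fetch(`https://api.example.com/docs/${params.slug}`); // hypothetical API
  const doc: { title: string; body: string } = await res.json();
  return (
    <main>
      <h1>{doc.title}</h1>
      <article>{doc.body}</article>
    </main>
  );
}
```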
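
And for the URL-management point, stale asset paths (like the outdated /static/ URLs we saw crawlers request) can be redirected centrally rather than left to 404. A sketch using the redirects option in next.config.js (the path mapping is illustrative):

```js
// next.config.js
// Redirect legacy asset paths so crawlers requesting outdated URLs
// get a permanent redirect (308) instead of a 404.
module.exports = {
  async redirects() {
    return [
      {
        source: "/static/:path*", // illustrative legacy prefix
        destination: "/assets/:path*", // illustrative current prefix
        permanent: true, // tells crawlers to update the stored URL
      },
    ];
  },
};
```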

For site owners who don't want to be crawled

  • Use robots.txt to control crawler access. The robots.txt file is effective for all measured crawlers. Set specific rules for AI crawlers by specifying their user agent or product token to restrict access to sensitive or non-essential content (see the sketch after this list).

    To find the user agents to disallow, you’ll need to look in each company’s own documentation (for example, Applebot and OpenAI’s crawlers).

  • Block AI crawlers with Vercel's WAF. Our Block AI Bots Firewall Rule lets you block AI crawlers with one click. This rule automatically configures your firewall to deny their access.
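
To make the robots.txt item concrete: in a Next.js project, robots.txt can be generated from app/robots.ts. A minimal sketch that disallows the crawlers measured in this study (verify each product token against the vendor's current documentation before relying on it):

```ts
// app/robots.ts
// Generates /robots.txt. Disallows the AI crawlers measured in this
// study while leaving all other user agents unaffected.
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Product tokens below should be checked against vendor docs.
      { userAgent: "GPTBot", disallow: "/" },
      { userAgent: "ClaudeBot", disallow: "/" },
      { userAgent: "PerplexityBot", disallow: "/" },
      { userAgent: "Bytespider", disallow: "/" },
      { userAgent: "CCBot", disallow: "/" },
      { userAgent: "*", allow: "/" },
    ],
  };
}
```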

For AI users

  • JavaScript-rendered content may be missing. Since ChatGPT and Claude don't execute JavaScript, their responses about dynamic web applications may be incomplete or outdated.

  • Consider the source. High 404 rates (>34%) mean that when AI tools cite specific web pages, there's a significant chance those URLs are incorrect or inaccessible. For critical information, always verify sources directly rather than relying on AI-provided links.

  • Expect inconsistent freshness. While Gemini leverages Google's infrastructure for crawling, other AI assistants show less predictable patterns. Some may reference older cached data.

Interestingly, even when asking Claude or ChatGPT for fresh Next.js docs data, we often don't see immediate fetches in our server logs for nextjs.org. This suggests that AI models may rely on cached data or training data, even when they claim to have fetched the latest information.

Final thoughts

Our analysis reveals that AI crawlers have quickly become a significant presence on the web, with nearly 1 billion monthly requests across Vercel's network.

However, their behavior differs markedly from traditional search engines when it comes to rendering capabilities, content priorities, and efficiency. Following established web development best practices—particularly around content accessibility—remains crucial.
