⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots
Twitter | Discord | Blog | GitHub | Paper | Dataset | Kaggle Competition
📜 How It Works
- Blind Test: Ask any question to two anonymous AI chatbots (ChatGPT, Gemini, Claude, Llama, and more).
- Vote for the Best: Choose the best response. You can keep chatting until you find a winner.
- Play Fair: If a model reveals its identity, your vote won't count.
- NEW: Click the 🎨 Text-to-Image tab below to generate images with DALL-E 3, Flux, and more! Use the 🐙 RepoChat tab to chat with GitHub repos.
🏆 Chatbot Arena LLM Leaderboard
- Backed by over 1,000,000 community votes, our platform ranks the best LLMs and AI chatbots. Explore the top AI models on our LLM leaderboard!
👇 Chat now!
GPT-4o: The flagship model across audio, vision, and text by OpenAI | Gemini: Gemini by Google | Claude 3.5: Claude by Anthropic |
Nova: Nova by Amazon | Grok-2: Grok-2 by xAI | Llama 3.1: Open foundation and chat models by Meta |
Mistral: Mistral Large 2 | Yi-Large: State-of-the-art model by 01 AI | GLM-4: Next-Gen Foundation Model by Zhipu AI |
Molmo: Molmo by AI2 | GPT-4-Turbo: GPT-4-Turbo by OpenAI | Jamba 1.5: Jamba by AI21 Labs |
Gemma 2: Gemma 2 by Google | Claude: Claude by Anthropic | Nemotron-4 340B: Cutting-edge Open model by Nvidia |
Llama 3: Open foundation and chat models by Meta | Athene-70B: A large language model by NexusFlow | Qwen Max: The Frontier Qwen Model by Alibaba |
GPT-3.5: GPT-3.5-Turbo by OpenAI | Phi-3: Capable and cost-effective small language models (SLMs) by Microsoft | Reka Core: Frontier Multimodal Language Model by Reka |
Reka Flash: Multimodal model by Reka | Command-R-Plus: Command R+ by Cohere | Command R: Command R by Cohere |
Mixtral of experts: A Mixture-of-Experts model by Mistral AI | InternVL 2: Multimodal Model developed by OpenGVLab |
GitHub link/issue/PR | Type your query |
---|---|
https://github.com/meta-llama/llama-recipes | fun use cases in llama recipes |
https://github.com/scikit-learn/scikit-learn | show me how to run k-means clustering |
https://github.com/sindresorhus/awesome | Make me a project structure for my food blog React webapp based entirely on awesome libraries. |
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bug or issue to our Discord/arena-feedback.
Acknowledgment
We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Fireworks AI, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.
📜 How It Works
- Ask any question to two chosen models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can chat for multiple turns until you identify a winner.
Note: You can chat with only one image per conversation. Uploaded images must be smaller than 15 MB. Click the "Random Example" button to chat with a random image.
❗️ For research purposes, we log user prompts and images, and may release this data to the public in the future. Please do not upload any confidential or personal information.
Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper.
Chatbot Arena thrives on community engagement. Cast your vote to help improve AI evaluation!
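To make the Bradley-Terry ranking mentioned above more concrete, here is a minimal sketch in Python. It is not the platform's actual code: the function names `fit_bradley_terry` and `to_arena_scale`, the 400-point log10 scale, and the 1000-point offset are all assumptions chosen for illustration, ties are ignored, and no confidence intervals are computed.

```python
import math


def fit_bradley_terry(battles, max_iters=1000, tol=1e-12):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update
        p_i <- wins_i / sum_j [ n_ij / (p_i + p_j) ]
    and rescales so the geometric mean of the strengths is 1.
    Assumption: every model has at least one win; ties are not handled.
    """
    wins = {}
    pair_counts = {}
    for winner, loser in battles:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    models = sorted(wins)
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least one win in this sketch")

    strength = {m: 1.0 for m in models}
    for _ in range(max_iters):
        updated = {}
        for i in models:
            # n_ij / (p_i + p_j), summed over every opponent j of model i
            denom = sum(
                pair_counts.get(frozenset((i, j)), 0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom
        # Fix the overall scale: geometric mean of strengths equals 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break
    return strength


def to_arena_scale(strength, scale=400.0, offset=1000.0):
    """Map strengths to an Elo-like scale (scale and offset are assumed values)."""
    return {m: offset + scale * math.log10(p) for m, p in strength.items()}


# Hypothetical votes: "model-a" wins 3 of 4 battles against "model-b".
example_battles = [("model-a", "model-b")] * 3 + [("model-b", "model-a")]
example_ratings = to_arena_scale(fit_bradley_terry(example_battles))
print(example_ratings)
```

With these hypothetical votes the fitted strength ratio is 3 to 1, which corresponds to a gap of roughly 191 points on the assumed scale. Only score differences are meaningful; the offset merely fixes where the scale is centered.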
Total #models: 184. Total #votes: 2,465,686. Last updated: 2024-12-22.
The code to recreate the leaderboard tables and plots is available in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 184 (100%) #votes: 2,465,686 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
100 | 115 | | 1372 | +11/-11 | 117916 | Cognitive Computations | Falcon-180B TII License |
*Rank (UB): the model's ranking (upper bound), defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
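The Rank (UB) definition above can be sketched in a few lines of Python. The function name `rank_upper_bound`, the model names, and the interval values below are hypothetical and serve only to illustrate the rule "one plus the number of models whose lower bound exceeds the target's upper bound".

```python
def rank_upper_bound(intervals):
    """Compute Rank (UB) from 95% confidence intervals.

    intervals: dict mapping model name -> (lower_bound, upper_bound).
    Returns a dict mapping model name -> rank, where the rank is one plus
    the number of models that are statistically better than that model.
    """
    ranks = {}
    for target, (_, target_upper) in intervals.items():
        better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != target and other_lower > target_upper
        )
        ranks[target] = 1 + better
    return ranks


# Hypothetical score intervals, not taken from the leaderboard.
example_intervals = {
    "model-a": (1360, 1380),
    "model-b": (1355, 1375),
    "model-c": (1300, 1320),
    "model-d": (1250, 1270),
}
print(rank_upper_bound(example_intervals))
# {'model-a': 1, 'model-b': 1, 'model-c': 3, 'model-d': 4}
```

Because the intervals of `model-a` and `model-b` overlap, neither is statistically better than the other and both share rank 1, so the next model receives rank 3 rather than rank 2.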