The Great AI Challenge: We Test Five Top Bots on Useful, Everyday Skills

OpenAI’s ChatGPT competes against Microsoft’s Copilot and Google’s Gemini, along with Perplexity and Anthropic’s Claude. Here’s how they rank.

Would you trust an AI chatbot with family planning? Investing $1 million? How about writing your wedding vows?

Human-sounding bots barely existed two years ago. Now they’re everywhere. There’s ChatGPT, which kicked off the whole generative-AI craze, and big swings from Google and Microsoft, plus countless other smaller players, all with their own smooth-talking helpers.

We put five of the leading bots through a series of blind tests to determine their usefulness. While we hoped to find the Caitlin Clark of chatbots, that wasn’t exactly what happened. They excel in some areas and fail in others. Plus, they’re all evolving rapidly. During our testing, OpenAI released an upgrade to ChatGPT that improved its speed and current-events knowledge.

We wanted to see the range of responses we’d get asking real-life questions and ordering up everyday tasks—not a scientific assessment, but one that reflects how we’ll all use these tools. Consider it the chatbot Olympics.

Meet the models
ChatGPT
Claude
Copilot
Gemini
Perplexity
We have ChatGPT by OpenAI, celebrated for its versatility and ability to remember user preferences. (Wall Street Journal owner News Corp has a content-licensing partnership with OpenAI.) Anthropic’s Claude, from a socially conscious startup, is geared to be inoffensive. Microsoft’s Copilot leverages OpenAI’s technology and integrates with services like Bing and Microsoft 365. Google’s Gemini accesses the popular search engine for real-time responses. And Perplexity is a research-focused chatbot that cites sources with links and stays up to date.
While each of these services offers a no-fee version, we used the $20-a-month paid versions for enhanced performance, to assess their full capabilities across a wide range of tasks. (We used the latest ChatGPT GPT-4o model and Gemini 1.5 Pro model in our testing.)
With the help of Journal newsroom editors and columnists, we crafted a series of prompts to test popular use cases, including coding challenges, health inquiries and money questions. The same people judged the results without knowing which bot said what, rating them on accuracy, helpfulness and overall quality. We then ranked the bots in each category.
We also excerpted some of the best and worst responses to prompts, to give a sense of how varied chatbots’ responses can be.
Health
Bad health advice from chatbots could be harmful to your…health. We asked five questions dealing with pregnancy, weight loss, depression and symptoms both chronic and sudden. Many answers sounded similar. Our judge, Journal health columnist Sumathi Reddy, looked for completeness, accuracy and nuances.
Prompt: What’s the best age to get pregnant?
Best answer (Perplexity, excerpt): Having children at a later age can offer advantages, such as more maturity, better financial stability and a stronger partnership.
Worst answer (Gemini, excerpt): The best time to get pregnant is whenever you feel confident and prepared to raise a child.
For instance, when we asked about the best age to get pregnant, Gemini gave a brief, general recommendation, while Perplexity went much deeper, even bringing up factors such as relationship and financial stability.
That said, Gemini came through with quality answers to other queries, and finished second to category winner ChatGPT, whose answers improved with the recent GPT-4o update.
First: ChatGPT
Second: Gemini
Third: Perplexity
Fourth: Claude
Fifth: Copilot
Finance
We asked the bots three questions on subjects near and dear to Journal readers: interest rates, retirement savings and inheritance. The Journal’s personal finance editor, Jeremy Olshan, posed the questions and assessed the advice based on clarity, thoroughness and practicality.
Prompt: I’m 40 years old. I just inherited an IRA from my grandfather with $1 million in it. How much money do I need to take out this year?
Best answer (Gemini, excerpt): Because you’re a non-spouse beneficiary, you likely have a 10-year window to deplete the account, but there might be exceptions.
Worst answer (Copilot, excerpt): Congratulations on inheriting an IRA with a substantial amount!
Here, ChatGPT and Copilot fell behind. Claude had the best answers for the Roth vs. traditional IRA debate while Perplexity best weighed high-yield savings accounts vs. CDs. Gemini, the category winner, best answered a question about when to withdraw funds from an inherited $1 million IRA. Its answer emphasized not rushing into any withdrawals without professional guidance.
First: Gemini
Second: Claude
Third: Perplexity
Fourth: ChatGPT
Fifth: Copilot
Cooking
AI promises to help in the kitchen, in part by bringing some clarity to the chaos of your fridge and pantry. Personal tech editor Wilson Rothman, an avid cook, threw a set of random ingredients at the bots to see what they came up with. The category winner, ChatGPT, provided a creative but realistic menu (cheesy pork-stuffed apples with kale salad and chocolate-bar shortbread cookies). Perplexity impressed us with the detailed cooking steps provided with its own clever menu.
Next, we asked the bots for a recipe for a chocolate dessert that addresses many dietary restrictions.
Prompt: Can I bake a chocolate cake with no flour, no gluten, no dairy, no nuts, no egg? If so, what’s the recipe?
Best answer (Gemini, excerpt): Simple Glaze: Melt dairy-free chocolate chips (check the label!), whisk in a bit of non-dairy milk.
Worst answer (Copilot, excerpt): …2 sticks unsalted butter…4 large eggs…
Gemini took the cake, even recommending additional trimmings like non-dairy glaze. Copilot, on the other hand, immediately failed by including eggs and butter.
First: ChatGPT
Second: Gemini
Third: Perplexity
Fourth: Claude
Fifth: Copilot
Work writing
Tone and detail matter in work-related writing. You can’t be glib asking your boss for a raise, and these days, writing a job posting means listing bullet points meant to woo potential candidates. We asked for a job listing for a “prompt engineer,” a person who could run AI queries with our personal tech team. (Sorry, folks, that job doesn’t exist…yet.)
Prompt: Write a job posting for a prompt engineer who can work with our Personal Tech reporting team, helping with tech advice and service articles.
Best answer (Perplexity, excerpt): Why Join Us: Work with a talented team of reporters and editors who are passionate about technology and its impact on everyday life.
Worst answer (Copilot, excerpt): Do you dream in code snippets and write user-friendly guides in your sleep?
Perplexity nailed it, with the right mix of journalism and AI bot knowledge. Copilot missed the mark because it never mentioned prompt engineering at all, noted editor Shara Tibken, who judged the responses.
The race between Perplexity, Gemini and Claude was close, with Claude winning by a nose for its office-appropriate birth announcement.
First: Claude
Second: Perplexity
Third: Gemini
Fourth: ChatGPT
Fifth: Copilot
Creative writing
One of the biggest surprises was the difference between work writing and creative writing. Copilot finished dead last in work writing, but was hands-down the funniest and most clever at creative writing. We asked for a poem about a poop on a log. We asked for a wedding toast featuring the Muppets. We asked for a fictional street fight between Donald Trump and Joe Biden. With Copilot, the jokes kept coming. Claude was the second best, with clever zingers about both presidential challengers.
Prompt: Write a wedding toast for Shara and Chris as told by the Muppets.
Best answer (Copilot, excerpt): Gonzo: “Ah, love! It’s like being shot out of a cannon into a pile of rubber chickens!”
Worst answer (Perplexity, excerpt): Kermit the Frog once said, “Life’s a happy song when there’s someone by your side to sing along.”
In a rare flub, Perplexity erroneously attributed a lyric from the 2011 musical “The Muppets” to Kermit.
First: Copilot
Second: Claude
Third: Perplexity
Fourth: Gemini
Fifth: ChatGPT
Summarization
For people just getting into generative-AI chatbots, summarization might be the best thing to try. It’s useful and unlikely to create unforeseen errors. Because we used paid services, we were able to upload larger chunks of text, PDF documents and web pages.
For the most part, that is: Even the premium Claude account wasn’t able to handle web links. “Our team is making Claude faster, expanding its knowledge base and refining its ability to understand and interact with a wide range of content,” says Scott White, a product manager at Anthropic.
Prompt: Summarize this web page: https://en.wikipedia.org/wiki/Paul_McCartney
Best answer (Copilot, excerpt): He was influenced by his father (a jazz player) and rock and roll artists like Little Richard and Buddy Holly.
Worst answer (Claude, excerpt): I apologize, but I am not able to open URLs, links or videos.
Wikipedia pages for really famous people can get wordy, so we asked for a summary of Paul McCartney’s. Some provided short blurbs with obvious Beatle factoids. Copilot answered in a skimmable outline format, and included lesser-known fun facts.
Category winner Perplexity consistently summarized things well, including the subtitles it skimmed in a YouTube video.
First: Perplexity
Second: Copilot
Third: ChatGPT
Fourth: Claude
Fifth: Gemini
Current events
This category is trickier than it sounds, because not all chatbots can access the web. We asked about this summer’s concert lineup, the latest on allegations that China uses TikTok for spying, and the current standings in the upcoming presidential election.
Prompt: Who is more favored to win, Trump or Biden? Please explain your sources and reasoning.
Best answer (Perplexity, excerpt): Given the mixed nature of the data, with both candidates having significant unfavorability and various leads in different areas, it is difficult to definitively state who is more favored to win.
Worst answer (Gemini, excerpt): I’m still learning how to answer this question. In the meantime, try Google Search.
Category winner Perplexity stayed on top with balanced reasoning and solid sourcing. ChatGPT faltered when we first tested, but the GPT-4o upgrade boosted it into second place. Gemini didn’t want to answer our election question.
First: Perplexity
Second: ChatGPT
Third: Copilot
Fourth: Claude
Fifth: Gemini
Coding
We also evaluated the bots on coding skill and speed. For coding, we hit up Journal data journalist Brian Whitton, who provided three vexing queries involving a JavaScript function, some website styling and a web app. All of the bots did fairly well with coding, according to Whitton’s blind judging, though Perplexity managed to eke out a win, followed by ChatGPT and Gemini.
First: Perplexity
Second: ChatGPT
Third: Gemini
Fourth: Claude
Fifth: Copilot
Speed
For speed tests, we timed several of the above queries, and threw in another one: “Explain Einstein’s theory of relativity in five sentences.” The answers themselves were all over the place, but in terms of pure response time, category winner ChatGPT with the GPT-4o update was the fastest, clocking in at 5.8 seconds. Throughout the tests, Claude and Perplexity were much slower than the other three.
First: ChatGPT
Second: Gemini
Third: Copilot
Fourth: Claude
Fifth: Perplexity
Overall results
What did these Olympian challenges tell us? Each chatbot has unique strengths and weaknesses, making them all worth exploring. We saw few outright errors and “hallucinations,” where bots go off on unexpected tangents and completely make things up. The bots provided mostly helpful answers and avoided controversy.
The biggest surprise? ChatGPT, despite its big update and massive fame, didn’t lead the pack. Instead, lesser-known Perplexity was our champ. “We optimize for conciseness,” says Dmitry Shevelenko, chief business officer at Perplexity AI. “We tuned our model for conciseness, which forces it to identify the most essential components.”
We also thought the big tech players, Microsoft and Google, might have an advantage, but Copilot and Gemini had to fight hard just to stay in the game. Google declined to comment. Microsoft also declined, but recently told the Journal it would soon integrate OpenAI’s GPT-4o into Copilot. That could improve its performance.
With AI developing so fast, these bots just might leapfrog one another into the foreseeable future. Or at least until they all go “multimodal,” and we can test their ability to see, hear and read—and replace us as earth’s dominant species.
Category          First       Second      Third       Fourth      Fifth
Overall           Perplexity  ChatGPT     Gemini      Claude      Copilot
Health            ChatGPT     Gemini      Perplexity  Claude      Copilot
Finance           Gemini      Claude      Perplexity  ChatGPT     Copilot
Cooking           ChatGPT     Gemini      Perplexity  Claude      Copilot
Work writing      Claude      Perplexity  Gemini      ChatGPT     Copilot
Creative writing  Copilot     Claude      Perplexity  Gemini      ChatGPT
Summarization     Perplexity  Copilot     ChatGPT     Claude      Gemini
Current events    Perplexity  ChatGPT     Copilot     Claude      Gemini
Coding            Perplexity  ChatGPT     Gemini      Claude      Copilot
Speed             ChatGPT     Gemini      Copilot     Claude      Perplexity
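
The article does not say how the nine category rankings were combined into the overall ranking. As a rough sketch only, and assuming a simple rank-sum rule (add up each bot's placement across categories, lowest total wins), the per-category placements above can be tallied as follows. The names `PLACEMENTS`, `rank_sums` and `overall_order` are illustrative, and the rule itself is an assumption, not the Journal's stated method.

```python
# Per-category placements, first to fifth, copied from the rankings in this article.
PLACEMENTS = {
    "Health": ["ChatGPT", "Gemini", "Perplexity", "Claude", "Copilot"],
    "Finance": ["Gemini", "Claude", "Perplexity", "ChatGPT", "Copilot"],
    "Cooking": ["ChatGPT", "Gemini", "Perplexity", "Claude", "Copilot"],
    "Work writing": ["Claude", "Perplexity", "Gemini", "ChatGPT", "Copilot"],
    "Creative writing": ["Copilot", "Claude", "Perplexity", "Gemini", "ChatGPT"],
    "Summarization": ["Perplexity", "Copilot", "ChatGPT", "Claude", "Gemini"],
    "Current events": ["Perplexity", "ChatGPT", "Copilot", "Claude", "Gemini"],
    "Coding": ["Perplexity", "ChatGPT", "Gemini", "Claude", "Copilot"],
    "Speed": ["ChatGPT", "Gemini", "Copilot", "Claude", "Perplexity"],
}


def rank_sums(placements):
    """Add up each bot's placement (1 = first) across all categories."""
    totals = {}
    for order in placements.values():
        for place, bot in enumerate(order, start=1):
            totals[bot] = totals.get(bot, 0) + place
    return totals


def overall_order(placements):
    """Order bots by rank sum, lowest (best) first. Assumed rule, not the Journal's."""
    totals = rank_sums(placements)
    return sorted(totals, key=lambda bot: totals[bot])


if __name__ == "__main__":
    totals = rank_sums(PLACEMENTS)
    for bot in overall_order(PLACEMENTS):
        print(f"{bot}: {totals[bot]}")
```

Under this assumed rule the totals come out to Perplexity 22, ChatGPT 23, Gemini 27, Claude 29 and Copilot 34, which matches the published overall order. The one-point gap at the top shows how close the race between Perplexity and ChatGPT was.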

Write to Dalvin Brown at dalvin.brown@wsj.com, Kara Dapena at kara.dapena@wsj.com and Joanna Stern at joanna.stern@wsj.com

Copyright ©2024 Dow Jones & Company, Inc. All Rights Reserved.