by Stefano Benatti | Nov 5 2024
With the recent release of ChatGPT Search to all Plus users, one lingering question I had as a long-term Perplexity user was whether I would end up abandoning it in favor of ChatGPT. So I put both through a number of questions to compare their answers, but also to see where each of them fetches its data from and what other differences they have.
Since Perplexity Pro allows usage of both GPT-4o and Claude 3.5 Sonnet, I opted for Claude in this comparison. I prefer both its style and answers over those of GPT-4o. Specifically, I'm not fond of GPT-4o's tendency to use lists and bullet points for everything. Instead, I favor Claude's approach of using explanations and titles.
Both systems share a common approach: they search the web for results related to the query, retrieve relevant information, and then use an LLM (Large Language Model) to summarize and create an answer incorporating these findings. In layman's terms, they summarize top web search results with AI. This process is commonly referred to as RAG (Retrieval Augmented Generation). There also appears to be a healthy dose of caching to speed up response times.
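To make that concrete, here is a minimal sketch of the RAG pattern both products follow. The `search_web()` and `generate()` functions are hypothetical placeholders for a search API and an LLM call, not either product's actual internals:

```python
# Minimal RAG sketch. search_web() and generate() are hypothetical
# stand-ins for a search provider and an LLM completion call.

def search_web(query: str, max_results: int = 5) -> list[dict]:
    """Hypothetical search API returning {'url', 'snippet'} dicts."""
    raise NotImplementedError("plug in your search provider here")

def generate(prompt: str) -> str:
    """Hypothetical LLM call (e.g. GPT-4o or Claude 3.5 Sonnet)."""
    raise NotImplementedError("plug in your LLM provider here")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve: fetch the top web results for the query.
    results = search_web(question)

    # 2. Augment: pack the snippets into the prompt as numbered sources.
    sources = "\n".join(
        f"[{i + 1}] {r['url']}\n{r['snippet']}" for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using only the sources below, citing them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

    # 3. Generate: let the LLM summarize the retrieved material into an answer.
    return generate(prompt)
```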
Another feature both systems share is their agentic planning behavior, similar to o1-preview (if you're familiar with it). They analyze gathered information to determine when to stop searching and sometimes apply additional steps. For example, they might use a code tool behind the scenes for mathematical calculations (which text models typically struggle with) or fetch extra images and information to quote. Both may also utilize "View Plugins" to display interactive components such as maps, weather details, and calendar events.
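For illustration, here is a rough sketch of what such an agent loop could look like, reusing the hypothetical `search_web()` helper from the previous snippet. The `decide()` and `final_answer()` calls are assumptions standing in for the planning step and the answer-writing LLM, not a description of either product's real pipeline:

```python
import math

def decide(question: str, gathered: list[str]) -> tuple[str, str]:
    """Hypothetical planning call: returns an (action, argument) pair."""
    raise NotImplementedError("plug in your planning LLM here")

def final_answer(question: str, gathered: list[str]) -> str:
    """Hypothetical LLM call that writes the answer from gathered context."""
    raise NotImplementedError("plug in your LLM provider here")

def run_calculator(expression: str) -> str:
    """Code tool: evaluate arithmetic exactly instead of trusting the LLM."""
    return str(eval(expression, {"__builtins__": {}}, vars(math)))

def agent_answer(question: str, max_steps: int = 5) -> str:
    gathered: list[str] = []
    for _ in range(max_steps):
        # The planner inspects what has been gathered so far and picks the
        # next action: search again, run a calculation, or stop and answer.
        action, argument = decide(question, gathered)
        if action == "search":
            gathered += [r["snippet"] for r in search_web(argument)]
        elif action == "calculate":
            gathered.append(run_calculator(argument))
        else:  # the planner decided it has enough information to answer
            break
    return final_answer(question, gathered)
```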
A common misconception is that LLMs perform poorly in search tasks due to their bias towards trained data. However, this isn't necessarily the case for either ChatGPT or Perplexity. By using retrieval and agent workflows, these systems can incorporate additional data and behaviors on top of the base model. This approach helps reduce, though not entirely eliminate, the bias from their trained data.
Before going into specific results, it's important to understand that the answer the LLM comes up with can only be as good as the sources it fetches to compose it. Generally, all other things being equal, whichever system searches better produces the better output.
After looking into how many sources each system uses and how those sources rank, these are my initial findings:
Keep in mind that I only did a full analysis of the above for around 14 questions, which yields observations rather than correlations. I intend to build an automated script to analyze a larger number of source results, and I will publish a more in-depth article to confirm these hypotheses and link it here later.