by Stefano Benatti | Nov 5 2024
With the recent release of ChatGPT Search to all Plus users, one lingering question I had as a long-term Perplexity user was whether I would end up abandoning it in favor of ChatGPT. So I put both through a number of questions to compare their answers, but also to see where each of them fetches its data from and what other differences they have.
Since Perplexity Pro allows usage of both GPT-4o and Claude 3.5 Sonnet, I opted for Claude in this comparison. I prefer both its style and answers over those of GPT-4o. Specifically, I'm not fond of GPT-4o's tendency to use lists and bullet points for everything. Instead, I favor Claude's approach of using explanations and titles.
Both systems share a common approach: they search the web for results related to the query, retrieve relevant information, and then use an LLM (Large Language Model) to summarize and create an answer incorporating these findings. In layman's terms, they summarize top web search results with AI. This process is commonly referred to as RAG (Retrieval Augmented Generation). There also appears to be a healthy dose of caching to speed up response times.
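To make that concrete, here is a minimal sketch of the RAG pattern both products follow. The `search_web()` and `generate()` functions are hypothetical placeholders for a search API and an LLM call, not either product's actual internals:

```python
# Minimal RAG sketch. search_web() and generate() are hypothetical
# stand-ins for a search provider and an LLM completion call.

def search_web(query: str, max_results: int = 5) -> list[dict]:
    """Hypothetical search API returning {'url', 'snippet'} dicts."""
    raise NotImplementedError("plug in your search provider here")

def generate(prompt: str) -> str:
    """Hypothetical LLM call (e.g. GPT-4o or Claude 3.5 Sonnet)."""
    raise NotImplementedError("plug in your LLM provider here")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve: fetch the top web results for the query.
    results = search_web(question)

    # 2. Augment: pack the snippets into the prompt as numbered sources.
    sources = "\n".join(
        f"[{i + 1}] {r['url']}\n{r['snippet']}" for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using only the sources below, citing them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

    # 3. Generate: let the LLM summarize the retrieved material into an answer.
    return generate(prompt)
```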
Another feature both systems share is their agentic planning behavior, similar to o1-preview (if you're familiar with it). They analyze gathered information to determine when to stop searching and sometimes apply additional steps. For example, they might use a code tool behind the scenes for mathematical calculations (which text models typically struggle with) or fetch extra images and information to quote. Both may also utilize "View Plugins" to display interactive components such as maps, weather details, and calendar events.
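For illustration, here is a rough sketch of what such an agent loop could look like, reusing the hypothetical `search_web()` helper from the previous snippet. The `decide()` and `final_answer()` calls are assumptions standing in for the planning step and the answer-writing LLM, not a description of either product's real pipeline:

```python
import math

def decide(question: str, gathered: list[str]) -> tuple[str, str]:
    """Hypothetical planning call: returns an (action, argument) pair."""
    raise NotImplementedError("plug in your planning LLM here")

def final_answer(question: str, gathered: list[str]) -> str:
    """Hypothetical LLM call that writes the answer from gathered context."""
    raise NotImplementedError("plug in your LLM provider here")

def run_calculator(expression: str) -> str:
    """Code tool: evaluate arithmetic exactly instead of trusting the LLM."""
    return str(eval(expression, {"__builtins__": {}}, vars(math)))

def agent_answer(question: str, max_steps: int = 5) -> str:
    gathered: list[str] = []
    for _ in range(max_steps):
        # The planner inspects what has been gathered so far and picks the
        # next action: search again, run a calculation, or stop and answer.
        action, argument = decide(question, gathered)
        if action == "search":
            gathered += [r["snippet"] for r in search_web(argument)]
        elif action == "calculate":
            gathered.append(run_calculator(argument))
        else:  # the planner decided it has enough information to answer
            break
    return final_answer(question, gathered)
```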
A common misconception is that LLMs perform poorly in search tasks due to their bias towards trained data. However, this isn't necessarily the case for either ChatGPT or Perplexity. By using retrieval and agent workflows, these systems can incorporate additional data and behaviors on top of the base model. This approach helps reduce, though not entirely eliminate, the bias from their trained data.
Before going into specific results, it's important to understand that the answer the LLM comes up with can only be as good as the sources it fetches to compose it. Generally, all other things being equal, whichever system searches better produces the better output.
After looking into how many sources each system uses and how those sources rank, these are my initial findings:
Keep in mind that I only did a full analysis of the above for around 14 questions, which yields observations rather than correlations. I intend to build an automated script to analyze a larger number of source results, and I will publish a more in-depth article to confirm these hypotheses and link it here later.