We Tested Two Deep Research Tools. Only One Delivered
A side-by-side evaluation of ChatGPT’s and Grok’s Deep Research features - and why only one stood out.
I spend 20 hours a week researching AI productivity tools.
Partly because if I do it, my clients don’t have to. And partly because - let’s be honest - I’m obsessed. With LLMs. The tools. The wild pace of this space.
So when ChatGPT and Grok (from xAI) both rolled out new research capabilities, I decided to test them on a task that eats up way too much time:
Evaluating the landscape of AI tools built specifically for product managers
I pitted Grok against ChatGPT’s Deep Research mode — same prompt, same goal, same format request.
What I got was... enlightening.
Let’s dive in.
The Task
I asked both ChatGPT and Grok to act like a seasoned researcher evaluating the fast-evolving landscape of AI tools built for PMs.
The goal?
Research the growing field of synthetic users.
Here’s what I gave them:
Role:
You are the most experienced person when it comes to building great products. Act like an expert in market research. Your tone is warm, friendly, smart, and direct to the point.
Input:
Topic: synthetic users
Instructions:
Please provide a detailed, well-organized research report that includes insights, analysis, and specific examples under the following table of contents:
Table of Contents:
Executive summary (150–300 words)
Market Overview & Trends
Top 5 most popular AI tools
For each tool:
Key features and capabilities
Strengths
Weaknesses
Summarize key praise and pain points from real user reviews, community discussions, or case studies.
List all sources used
ChatGPT: The Strategic Co-Pilot
Overall, I liked ChatGPT’s response the best.
Like any LLM, it wasn’t perfect. But it gave me a quick head start — which, frankly, is the hardest part of research.
I picked a topic like synthetic users for a reason: to push the Deep Research agent to its limits. There’s not much information out there, and what does exist is scattered across blogs, academic papers, niche tools, and forums.
Even so, ChatGPT understood what I was asking for, surfaced new tools I wasn’t aware of (gold), and provided genuinely valuable information on the first try.
What impressed me:
✅ Suggested that building your own synthetic users (a conclusion I reached myself a few months ago) beats using pre-packaged personas in generic tools
✅ Asked clarifying questions before diving in, making sure it fully understood the scope
✅ Surfaced non-obvious tools I hadn’t even flagged (like Akspot and Personno.ai)
What could be better:
❌ Didn’t show a research plan upfront - I had to dig into the sources to understand what it was doing
❌ Cited no more sources than Grok - though the ones it did cite were higher in quality
❌ The formatting was hard to read. I got a dense wall of text. Sure, I could fix that with a prompt, but Grok delivers cleaner formatting out of the box
You can read the full research output here.
Grok: Great Format, Weak Substance
Grok’s research was, frankly, underwhelming.
It surfaced familiar tools like Synthetic Users, ChatGPT, and Delve.ai — but unlike ChatGPT, it didn’t help me uncover anything new. Even the analysis of strengths and weaknesses felt shallow.
Overall, the research lacked depth. That might be due to the sources it pulled from, or simply the agent’s limited ability to think critically about the prompt.
Frankly, the best part about Grok’s Deep Research? The formatting. And when the best thing about your research is the formatting... you know you’re in trouble.
What Grok nailed:
✅ The experience felt polished. Once in research mode, it stayed there and created a more seamless interaction.
✅ Grok shone at formatting: it used bullets, tables, and clear headings that made the output easy to scan and digest.
✅ It cited about the same number of sources as ChatGPT, though they were different — which likely contributed to the variation in depth and quality.
Where it struggled:
❌ Insights felt generic: it mentioned obvious players like Synthetic Users and ChatGPT but didn’t surface any new ones
❌ Provided surface-level summaries with minimal differentiation (e.g., “Tool uses AI for personas.” Okay... and?)
❌ It echoed what I could find in 10 minutes on Google. No true value-add.
You can read the full research output here.
The Verdict
The win goes to ChatGPT.
Grok’s lower-quality research made the output hard to use. And while both tools cited a similar number of sources, ChatGPT’s were more relevant and higher quality - which is probably why its final report was much stronger.
What really set ChatGPT apart was its human-like approach to research. It wove fluidly between sources, stitched ideas together, and showed a deeper understanding of the prompt.
Unless Grok’s core reasoning improves, I don’t see a scenario in which you would choose Grok’s Deep Research feature over ChatGPT’s.
How to Get the Most Out of Deep Research
No matter how powerful these LLMs get, good, well-structured prompts are the best way to get the most out of these black boxes.
You can’t ask simple questions and expect a PhD-level research report. These tools are smart, but they’re not mind readers.
Here are my tips for getting max leverage from Deep Research:
Be ruthless about your prompt: Use a structured approach (I rely on my “5 Keys” framework) to make your ask crystal clear. Better questions equal higher-quality output.
You still need to validate: Let AI do the heavy lifting, but you still need to review, cross-check, and spot the misses. Don’t take it all at face value.
Save what works: Start building a prompt library now. Your future self will thank you.
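If you want to make "save what works" concrete, even a tiny script gets you started. Here's a minimal sketch of a prompt library that mirrors the Role / Input / Instructions structure I used in this test - the function and variable names (`build_research_prompt`, `PROMPT_LIBRARY`) are my own illustration, not part of any tool's API:

```python
# Minimal prompt-library sketch. The structure (Role / Input /
# Instructions / Table of Contents) mirrors the prompt from this test;
# all names here are illustrative, not a real tool's API.

PROMPT_LIBRARY: dict[str, str] = {}

def build_research_prompt(role: str, topic: str, sections: list[str]) -> str:
    """Assemble a structured deep-research prompt from its parts."""
    toc = "\n".join(f"{i}. {s}" for i, s in enumerate(sections, start=1))
    return (
        f"Role:\n{role}\n\n"
        f"Input:\nTopic: {topic}\n\n"
        "Instructions:\nPlease provide a detailed, well-organized research "
        "report that includes insights, analysis, and specific examples "
        "under the following table of contents:\n\n"
        f"Table of Contents:\n{toc}\n"
    )

# "Save what works": stash a winning prompt under a memorable key
# so you can reuse or tweak it next time.
PROMPT_LIBRARY["synthetic-users-deep-research"] = build_research_prompt(
    role="Act like an expert in market research. Warm, friendly, direct.",
    topic="synthetic users",
    sections=[
        "Executive summary (150-300 words)",
        "Market Overview & Trends",
        "Top 5 most popular AI tools",
        "List all sources used",
    ],
)
```

The point isn't the code - it's the habit: a named, versioned prompt beats retyping from memory every time.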
Research is no longer a competitive advantage. AI can do in minutes what used to take me days. The edge now? Knowing which tool to use - and exactly how to use it.