Author: Yiying Zhang, Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik

Large Language Models (LLMs) increasingly answer queries by citing web sources. While web search mitigates hallucinations by grounding responses in external data, it introduces a new dependency: the quality of the sources themselves.

In high-stakes domains—such as financial analysis or medical inquiries—users rely on citations for verification. A search-augmented LLM is subject to the “Garbage In, Garbage Out” principle; if an AI synthesizes information from biased or outdated pages, the resulting answer remains flawed.

Existing benchmarks and evaluators like HotpotQA and RAGAS emphasize answer correctness or relevance ranking. They do not evaluate the credibility of the evidence itself. We introduce SourceBench, a framework for measuring the quality of web sources referenced in AI answers.

The SourceBench Framework

We constructed a dataset of 100 queries spanning informational, factual, argumentative, social, and shopping intents. To evaluate the retrieved sources, we designed an eight-metric framework covering two key dimensions: Content Quality and Meta-Attributes.

Content Quality
1. Relevance (CR): Does the source directly resolve the user need, or is it merely a keyword match?
2. Factual Accuracy (FA): Are claims verifiable and supported by citations? Does the source prioritize primary sources?
3. Objectivity (NE): Is the tone neutral and clinical, avoiding emotional manipulation?

Meta-Attributes
4. Freshness (FR): Is the content timely? Obsolete data (e.g., old code) is heavily penalized.
5. Author Accountability (AA): Is there a named author with verifiable credentials?
6. Ownership (OA): Is the entity behind the site transparent about its funding and location?
7. Domain Authority (DA): Is the domain a known institution (e.g., .gov, .edu) or a recognized brand?
8. Layout Clarity (LC): Is the page easy to consume? "SEO farms" saturated with ads are penalized.
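
To make the rubric concrete, here is a minimal sketch of how a per-source rating on these eight metrics might be represented and rolled up into the two dimension scores. The 1-5 scale and the equal-weight averaging of the five meta-attributes are assumptions for illustration; only the Content Metric definition (the average of Relevance, Factuality, and Objectivity) is stated in Table 1 below.

```python
from dataclasses import dataclass

@dataclass
class SourceScore:
    """Ratings for one cited source on the eight SourceBench metrics (1-5 scale assumed)."""
    cr: float  # Relevance
    fa: float  # Factual Accuracy
    ne: float  # Objectivity
    fr: float  # Freshness
    aa: float  # Author Accountability
    oa: float  # Ownership
    da: float  # Domain Authority
    lc: float  # Layout Clarity

    def content_metric(self) -> float:
        # As defined in Table 1: average of Relevance, Factuality, Objectivity.
        return (self.cr + self.fa + self.ne) / 3

    def meta_metric(self) -> float:
        # Assumed: unweighted average of the five meta-attributes.
        return (self.fr + self.aa + self.oa + self.da + self.lc) / 5

example = SourceScore(cr=5, fa=4, ne=4, fr=3, aa=5, oa=4, da=5, lc=3)
print(round(example.content_metric(), 2), round(example.meta_metric(), 2))  # 4.33 4.0
```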

Overall System Performance

We evaluated 3,996 cited sources across 12 systems, including search-equipped LLMs (e.g., GPT-5, Gemini-3-Pro, Grok-4.1), traditional SERP (Google), and AI search tools (e.g., Exa, Tavily, Gensee).

The full leaderboard is presented in the table below. GPT-5 leads the pack (89.1) by a substantial margin, driven in particular by its Meta Metric score (4.5), which suggests an internal filtering mechanism that rigorously prioritizes institutional authority. Gensee secures the #3 spot on the strength of its Content Relevance (4.3).

| Rank | System | Weighted Score | Content Metric | Meta Metric |
|------|--------|----------------|----------------|-------------|
| 1 | GPT-5 | 89.1 | 4.4 | 4.5 |
| 2 | Grok-4.1 | 83.4 | 4.2 | 4.1 |
| 3 | Gensee | 81.8 | 4.3 | 3.9 |
| 4 | GPT-4o | 81.5 | 4.1 | 4.0 |
| 5 | Claude 3.5 | 81.3 | 4.1 | 4.0 |
| 6 | Exa | 80.1 | 3.9 | 4.1 |
| 7 | Google | 79.9 | 4.0 | 4.0 |
| 8 | Gemini 3 Pro | 79.4 | 3.9 | 4.0 |
| 9 | Perplexity | 78.5 | 3.8 | 4.0 |
| 10 | Tavily | 78.3 | 3.8 | 3.9 |
Table 1: SourceBench Leaderboard. "Content Metric" averages Relevance, Factuality, and Objectivity.
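
The post does not spell out how the 0-100 Weighted Score combines the two dimensions. The sketch below is one plausible reconstruction, assuming equal weights and a rescale from the 5-point metric scale to 100; it lands close to, but not exactly on, the published numbers, so the real weighting likely differs slightly.

```python
def weighted_score(content_metric: float, meta_metric: float,
                   w_content: float = 0.5, w_meta: float = 0.5,
                   scale_max: float = 5.0) -> float:
    """Hypothetical aggregation: weighted mean of the two dimensions, rescaled to 0-100."""
    return (w_content * content_metric + w_meta * meta_metric) / scale_max * 100

# With equal weights this lands near the published values,
# e.g. GPT-5: weighted_score(4.4, 4.5) -> 89.0 (reported: 89.1).
```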

Key Insights

Insight 1: Architecture must explicitly weight credibility.

The next leap in AI-based search will come from architectures that explicitly weight source credibility alongside content quality. Our correlation analysis reveals that accountability metrics (Ownership, Author Accountability, Domain Authority) cluster together, forming a "Trust" dimension distinct from pure Content Relevance.
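
As an illustration, the snippet below reproduces this clustering observation directly from the correlation values reported in Figure 4 (later in this post). The 0.5 threshold is an arbitrary cut chosen for the example, not part of the benchmark.

```python
import pandas as pd

# Correlation matrix from Figure 4 (pairwise correlations of per-source metric scores).
metrics = ["CR", "FA", "NE", "AA", "FR", "OA", "DA", "LC"]
corr = pd.DataFrame([
    [1.00, .61, .31, .32, .02, .21, .19, .12],
    [ .61, 1.00, .67, .44, .07, .47, .48, .35],
    [ .31, .67, 1.00, .31, .02, .39, .44, .44],
    [ .32, .44, .31, 1.00, .05, .53, .48, .22],
    [ .02, .07, .02, .05, 1.00, .10, .05, -.03],
    [ .21, .47, .39, .53, .10, 1.00, .73, .36],
    [ .19, .48, .44, .48, .05, .73, 1.00, .39],
    [ .12, .35, .44, .22, -.03, .36, .39, 1.00],
], index=metrics, columns=metrics)

# Which metrics correlate with Ownership (OA) above a 0.5 threshold?
trust_cluster = corr["OA"][corr["OA"] >= 0.5].index.tolist()
print(trust_cluster)  # ['AA', 'OA', 'DA'] -- the accountability / "Trust" dimension
```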

Insight 2: The Inverse Law of AI Search and SERP.

There is a striking inverse relationship between a model's SourceBench score and its reliance on traditional Google Search results. Top-performing systems like GPT-5 overlap with Google only 16% of the time, functioning as "Discovery Engines" that find high-quality, buried evidence. Conversely, lower-scoring systems (e.g., Tavily) overlap 55% with Google, essentially acting as "Summarization Layers" over standard SERPs.

Figure 2: SourceBench Score (Green) vs. Google Overlap (Gray).
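
The post does not give the exact definition of "overlap"; a minimal sketch, assuming it is the fraction of a system's cited URLs that also appear in Google's top results for the same query, might look like this (normalize and google_overlap are hypothetical helpers, not part of SourceBench's released code):

```python
from urllib.parse import urlparse

def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial variants (www, trailing slash) still match."""
    parsed = urlparse(url)
    return parsed.netloc.removeprefix("www.") + parsed.path.rstrip("/")

def google_overlap(cited_urls: list[str], google_results: list[str]) -> float:
    """Fraction of a system's cited sources that also appear in Google's results
    for the same query (the exact metric behind Figure 2 may differ)."""
    if not cited_urls:
        return 0.0
    google_set = {normalize(u) for u in google_results}
    hits = sum(normalize(u) in google_set for u in cited_urls)
    return hits / len(cited_urls)
```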

Insight 3: Better Search > Better Reasoning.

Rather than relying on a model to "think" its way through noise, supplying superior, well-curated context allows simpler models to achieve better outcomes. In our controlled experiment with DeepSeek, a non-reasoning model ("Chat") with high-quality search tools outperformed a reasoning model with low-quality search tools.

Figure 3: DeepSeek experiment results.
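
A hypothetical harness for that 2x2 design (reasoning vs. non-reasoning model, high- vs. low-quality search) is sketched below; the stub functions, tool labels, and grading step are placeholders, not the actual experimental setup.

```python
from itertools import product

# Placeholder stubs -- a real run would call the DeepSeek API and the chosen
# search tools; names and signatures here are illustrative only.
def run_search(tool: str, query: str) -> list[str]: ...
def call_model(model: str, query: str, context: list[str]) -> str: ...
def grade(answer: str, query: str) -> float: ...

MODELS = ["deepseek-chat", "deepseek-reasoner"]               # non-reasoning vs. reasoning
SEARCH_TOOLS = ["high_quality_search", "low_quality_search"]  # hypothetical labels

def run_grid(queries: list[str]) -> dict[tuple[str, str], float]:
    """Average answer quality for every (model, search tool) pairing in the 2x2 grid."""
    results = {}
    for model, tool in product(MODELS, SEARCH_TOOLS):
        scores = []
        for q in queries:
            context = run_search(tool, q)
            answer = call_model(model, q, context)
            scores.append(grade(answer, q))
        results[(model, tool)] = sum(scores) / len(scores)
    return results
```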

Insight 4: Query intent dictates the difficulty landscape.

Performance variability across query types highlights the different "personalities" of search tasks. A system that excels at factual retrieval often fails at social listening or shopping (see the grouping sketch after this list):

- Shopping & Commercial: high Freshness and high Factuality, but the lowest Layout Clarity.
- Social & Community: high Freshness, but the lowest Objectivity score.
- Factual Check: high Domain Authority and high Accountability, but the lowest Freshness.
- Multi-hop Reasoning: the lowest Content Relevance (system failure).
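
A sketch of how such a per-intent profile can be computed, assuming a hypothetical per-source table with an intent column alongside the eight metric scores (column names follow the abbreviations used in Figure 4):

```python
import pandas as pd

METRICS = ["CR", "FA", "NE", "AA", "FR", "OA", "DA", "LC"]

def intent_profile(per_source: pd.DataFrame) -> pd.DataFrame:
    """Average each metric within a query-intent group and flag its weakest metric.

    `per_source` is assumed to hold one row per cited source, with an `intent`
    column plus the eight metric columns (1-5 scores)."""
    profile = per_source.groupby("intent")[METRICS].mean()
    profile["weakest_metric"] = profile[METRICS].idxmin(axis=1)
    return profile
```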

|    | CR   | FA   | NE   | AA   | FR   | OA   | DA   | LC   |
|----|------|------|------|------|------|------|------|------|
| CR | 1.00 | .61  | .31  | .32  | .02  | .21  | .19  | .12  |
| FA | .61  | 1.00 | .67  | .44  | .07  | .47  | .48  | .35  |
| NE | .31  | .67  | 1.00 | .31  | .02  | .39  | .44  | .44  |
| AA | .32  | .44  | .31  | 1.00 | .05  | .53  | .48  | .22  |
| FR | .02  | .07  | .02  | .05  | 1.00 | .10  | .05  | -.03 |
| OA | .21  | .47  | .39  | .53  | .10  | 1.00 | .73  | .36  |
| DA | .19  | .48  | .44  | .48  | .05  | .73  | 1.00 | .39  |
| LC | .12  | .35  | .44  | .22  | -.03 | .36  | .39  | 1.00 |
Figure 4: Full Correlation Matrix across the eight metrics.

Conclusion: From Retrieval to Judgment

As AI systems transition from passive tools to active agents, the “black box” of retrieval is no longer acceptable. SourceBench demonstrates that high-parameter reasoning cannot fix low-quality context.

The future isn’t just about smarter models; it’s about discerning models—ones that understand that a random forum post and a peer-reviewed study are not semantically equivalent, even if they share the same keywords. If we want AI to be a trusted arbiter of truth, we must teach it to judge its sources, not just summarize them.

