AI Research Assistants Tested: Elicit vs Consensus vs Semantic Scholar vs Perplexity vs scite

We tested the leading AI research tools on real academic tasks. Here's which ones actually help find papers, and which ones hallucinate sources.


If you’re a researcher, student, or anyone who needs to find and synthesize academic papers, you’ve probably noticed the explosion of AI tools promising to revolutionize literature reviews. But which ones actually work?

We tested the major players—Elicit, Consensus, Semantic Scholar, Perplexity Pro, and scite—on real research tasks. The results were mixed: some tools genuinely save hours of work, while others confidently cite papers that don’t exist.

The Tools We Tested

Elicit — AI workflow tool for systematic literature reviews. Searches papers, extracts data into tables, identifies themes. Claims 80% time savings for systematic reviews.

Consensus — AI search engine for peer-reviewed literature. Features a “Consensus Meter” showing scientific agreement levels. Searches 200M+ papers.

Semantic Scholar — Free AI-powered search from Allen Institute. Offers TLDR summaries, citation analysis, and research feeds. Indexes 200M+ papers.

Perplexity Pro — General AI search with academic mode. Can focus on scholarly sources, provides inline citations. $20/month.

scite — Smart citation analysis showing whether citations support or contradict claims. Indexes 1.6B+ citations across 280M+ sources.

How We Tested

We ran each tool through three tasks:

  1. Discovery: Find recent papers on a specific topic
  2. Synthesis: Summarize what the literature says about a claim
  3. Citation verification: Check if the sources actually exist

The test topic: “What does the research say about the effectiveness of retrieval-augmented generation (RAG) in reducing LLM hallucinations?”

Results by Tool

Elicit: Best for Systematic Reviews

Elicit excels at structured extraction. You get tables with study characteristics, methodologies, sample sizes, and key findings pulled automatically. For researchers doing formal literature reviews, this is genuinely useful.

Strengths:

  • Finds up to 1,000 relevant papers per search
  • Extracts data into structured tables
  • Every claim links to specific sentences in source papers
  • Supports systematic review workflows with screening and data extraction

Weaknesses:

  • Free tier is limited (basic features only)
  • Pro pricing jumps to $49/month for full features
  • Interface can feel overwhelming for simple queries
  • Better for comprehensive reviews than quick lookups

Citation accuracy: High. Every extracted claim links to specific passages.

Verdict: The best choice for formal academic work where you need structured data extraction. Overkill for casual research.

Consensus: Best for Quick Evidence Checks

Ask Consensus “Does RAG reduce hallucinations?” and you get a direct answer with a visual meter showing scientific agreement, plus supporting citations from peer-reviewed papers.

Strengths:

  • The Consensus Meter is genuinely useful for contested claims
  • Filters results to peer-reviewed sources only
  • Clean interface focused on answering research questions
  • Can chat with individual papers for deeper analysis

Weaknesses:

  • Mainly suited to medical and social policy questions
  • Coverage varies across disciplines
  • Less useful for very recent research (database lag)
  • Some questions don’t work well with the “consensus” framing

Citation accuracy: High. Sources are verifiable peer-reviewed papers.

Verdict: Excellent for getting quick, evidence-based answers to specific research questions. Not a replacement for full literature reviews.

Semantic Scholar: Best Free Option

Completely free, and it’s the backbone that powers several other tools (including Consensus). The TLDR summaries—20 words instead of 200—make skimming through results much faster.

Strengths:

  • Entirely free with no limits
  • TLDR summaries for ~60M papers in CS, bio, and medicine
  • Semantic Reader with AI-highlighted sections
  • Citation analysis with “highly influential” markers
  • Research feeds learn your interests

Weaknesses:

  • TLDRs only available for some fields
  • No synthesis or analysis—just search and discovery
  • Interface less polished than paid competitors
  • Requires more manual work to extract insights

Citation accuracy: Very high. It’s a database, not a generative tool, so papers are real.

Verdict: Should be in every researcher’s toolkit. Use it alongside a synthesis tool for best results.
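
For a sense of what working directly against the index looks like, here's a minimal sketch using the public Semantic Scholar Graph API (the endpoint and the tldr field come from its API reference; the query string is just our test topic):

```python
# Minimal sketch: search the free Semantic Scholar Graph API for our test
# topic and print the TLDR summaries described above. No API key is
# required for light use; heavier use needs one.
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "retrieval-augmented generation hallucination",
        "fields": "title,year,tldr",  # tldr is the machine-generated summary
        "limit": 5,
    },
    timeout=30,
)
resp.raise_for_status()

for paper in resp.json().get("data", []):
    tldr = (paper.get("tldr") or {}).get("text") or "(no TLDR for this paper)"
    print(f"{paper.get('year')}: {paper['title']}")
    print(f"   TLDR: {tldr}")
```

Pair this kind of raw search with one of the synthesis tools above and you cover both discovery and analysis.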

Perplexity Pro: Best for Quick Exploration

Perplexity’s academic mode targets scholarly sources and provides inline citations. Good for rapid exploration, but the hallucination risk is real.

Strengths:

  • Fast, conversational interface
  • Can specify citation formats (APA, MLA, etc.)
  • Deep Research mode for multi-step investigations
  • Integrates with general web search when needed

Weaknesses:

  • Produced hallucinated citations at a 37-45% rate in testing
  • Sometimes misinterprets sources or blends findings from different studies
  • General-purpose tool, not built specifically for research
  • $20/month for Pro features

Citation accuracy: Mixed. The majority of studies retrieved are real and peer-reviewed, but hallucinations happen. Always verify.

Verdict: Useful for quick exploration and hypothesis generation. Don’t cite papers without verifying they exist.

scite: Best for Citation Analysis

scite does something unique: it tells you whether citations support, contradict, or merely mention the work they cite. This is invaluable for understanding the actual scientific conversation around a claim.

Strengths:

  • 1.6B+ citations analyzed across 280M+ sources
  • Shows supporting vs. contrasting citations
  • Browser extension integrates with research platforms
  • Useful for checking if a paper’s findings have been replicated

Weaknesses:

  • Citation classification isn’t always accurate
  • Interface focused on citation analysis, not discovery
  • Requires knowing what papers to analyze
  • Paid tiers for full features

Citation accuracy: Very high for identifying real papers. Classification accuracy is good but imperfect.

Verdict: Essential for understanding whether research findings have been supported or challenged. Best used alongside discovery tools.

The Hallucination Problem

Here’s the uncomfortable truth: LLMs still hallucinate citations, and users have noticed. A Duke University study found that 94% of students believe AI accuracy varies significantly across subjects, and they’re right to be skeptical.

A 2024 University of Mississippi study found that 47% of AI-provided citations had an incorrect title, date, authors, or some combination of the three. These weren’t obviously wrong—they were plausible-sounding papers that simply didn’t exist.

The tools that avoid this problem are the ones that search databases rather than generate text: Semantic Scholar, Consensus (which uses Semantic Scholar’s database), and scite. They can only return papers that actually exist in their indexes.

Elicit takes a hybrid approach: AI-assisted extraction from real papers, with every claim linked to specific source passages. This keeps hallucinations in check.

Perplexity, as a general-purpose LLM with search, has the highest hallucination risk. It retrieves real papers most of the time, but verification is essential.
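
Verification can also be scripted. The sketch below, again against the public Semantic Scholar Graph API, checks whether a title an assistant hands you actually matches an indexed paper; the 0.9 fuzzy-match threshold is our own arbitrary choice, not anything these tools use internally:

```python
# Minimal sketch of the "verify before you cite" rule: look a title up in
# the Semantic Scholar index and require a close fuzzy match on the title.
from difflib import SequenceMatcher

import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def citation_exists(title: str, threshold: float = 0.9) -> bool:
    """Return True if a paper with a closely matching title is indexed."""
    resp = requests.get(
        SEARCH_URL,
        params={"query": title, "fields": "title", "limit": 5},
        timeout=30,
    )
    resp.raise_for_status()
    return any(
        SequenceMatcher(None, title.lower(), hit["title"].lower()).ratio() >= threshold
        for hit in resp.json().get("data", [])
    )

# A real paper should pass; a plausible-sounding fabrication should not.
print(citation_exists("Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"))
```

Title matching catches fully invented papers; it won't catch the subtler failure mode above, where a real paper is cited for a claim it doesn't make.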

Our Recommendations

For formal literature reviews: Start with Elicit. The structured extraction and systematic review support justify the price if you’re doing serious academic work.

For quick evidence checks: Consensus answers research questions directly and shows scientific agreement levels. Perfect for “what does the research say about X?”

For daily research: Semantic Scholar should be your default. It’s free, comprehensive, and the TLDR summaries save significant time.

For citation analysis: scite tells you whether a paper’s claims have been supported or challenged. Essential for understanding controversial findings.

For rapid exploration: Perplexity Pro is fast and conversational, but verify every citation before using it.

What We Actually Use

Most researchers end up combining a few specialized tools:

  • Discovery: Semantic Scholar or PapersFlow for finding papers
  • Analysis: Elicit or Consensus for synthesis
  • Citations: Zotero or Paperpile for management
  • Verification: scite for checking citation context

No single tool does everything well. The AI research assistant ecosystem is still maturing, and the best approach remains a combination of specialized tools plus human verification.

The Bottom Line

AI research assistants genuinely save time—the question is which ones you can trust. Database-backed tools like Semantic Scholar and Consensus have minimal hallucination risk because they search real papers rather than generating text. Synthesis tools like Elicit and Perplexity are more powerful but require verification.

Whatever you use, the rule remains: if you can’t verify a citation exists, don’t use it. The tools that make verification easy are the ones worth paying for.