xAI’s Grok 4.20 Sets Honesty Record but Trails in Intelligence

TL;DR

New Model: xAI launched Grok 4.20 in three API variants with pricing up to 60% cheaper than Grok 3.
Honesty Record: Grok 4.20 achieved a 78% non-hallucination rate on the Artificial Analysis Omniscience test, the highest of any model tested.
Intelligence Gap: The model ranks 8th on the Intelligence Index with a score of 48, trailing leaders Gemini 3.1 Pro and GPT-5.4 at 57.
Enterprise Focus: All variants support multi-agent orchestration, a 2-million-token context window, and provisioned throughput in US and EU regions.

Elon Musk’s xAI launched Grok 4.20 for developers in three API variants, pricing the new model up to 60% cheaper than Grok 3 while setting a record for the lowest hallucination rate among tested AI models. xAI’s Grok 4.20 developer page, published March 24, shows that the model ships in reasoning, non-reasoning, and multi-agent configurations, all sharing a 2-million-token context window and identical tool support.

On the Artificial Analysis Omniscience test, Grok 4.20 posted a non-hallucination rate of 78%, the highest of any model tested, while ranking just 8th on the same organization’s Intelligence Index with a score of 48. According to Artificial Analysis, that gap signals xAI is optimizing for reliability over raw benchmark dominance.

Grok 4.20 Offers Three Variants at Lower Prices

All three Grok 4.20 variants share identical token pricing: $20 per million input tokens and $60 per million output tokens. Compared to Grok 3, which remains available at $30 and $150 respectively, that represents a 33% reduction on input and 60% reduction on output.

Beyond the standard tier, long-context requests above 200,000 tokens are priced at $40 per million input tokens and $120 per million output tokens, double the standard rates. xAI also offers budget alternatives through grok-4-fast and grok-4-1-fast at $2 per million input tokens and $5 per million output tokens, giving developers an option that is 10x cheaper on input and 12x cheaper on output for less demanding workloads.
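
The tiered rates above can be turned into a rough per-request cost estimate. The following Python sketch uses only the per-million-token prices quoted in this article. The names (`Rate`, `grok_4_20_cost`, `reduction_pct`) are illustrative, and it assumes that the 200,000-token threshold applies to input tokens and that a long-context request is billed entirely at the higher rate, neither of which the article specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Token prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


# Prices as quoted in the article (USD per million tokens).
GROK_4_20_STANDARD = Rate(20.0, 60.0)
GROK_4_20_LONG_CONTEXT = Rate(40.0, 120.0)
GROK_3 = Rate(30.0, 150.0)
GROK_4_FAST = Rate(2.0, 5.0)

LONG_CONTEXT_THRESHOLD = 200_000  # tokens, per the article


def cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at a flat per-million-token rate."""
    total = (input_tokens * rate.input_per_million
             + output_tokens * rate.output_per_million)
    return total / 1_000_000


def grok_4_20_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated Grok 4.20 cost for one request.

    ASSUMPTION: the threshold is checked against input tokens, and the
    long-context rate then applies to the whole request.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rate = GROK_4_20_LONG_CONTEXT
    else:
        rate = GROK_4_20_STANDARD
    return cost(rate, input_tokens, output_tokens)


def reduction_pct(old_price: float, new_price: float) -> float:
    """Percentage price reduction going from old_price to new_price."""
    return (old_price - new_price) / old_price * 100
```

Under these assumptions, a request with 100,000 input tokens and 10,000 output tokens comes to about $2.60 on the standard tier, while the same output with 300,000 input tokens comes to about $13.20. Calling `reduction_pct` on the Grok 3 and Grok 4.20 rates reproduces the 33% input and 60% output reductions cited above.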

Under the simple alias “grok-4.20,” the reasoning variant serves as the default model call. Its non-reasoning counterpart strips out chain-of-thought processing for faster responses, while a dedicated multi-agent variant supports orchestration of up to four parallel agents in its Heavy consumer mode.

All three variants share the same core capabilities: text and image input, function calling, structured outputs, and built-in tools including web search at $2.50 per query, X Search, code execution, and collections search for retrieval-augmented generation.

All variants share rate limits of 607 requests per minute and 4 million tokens per minute. On the consumer side, four modes are available (Auto, Fast, Expert, and Heavy), letting users trade response speed for reasoning depth without switching models.
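
To see which of the two rate limits binds first for a given workload, a back-of-the-envelope calculation helps. This Python sketch uses the 607 requests-per-minute and 4-million-tokens-per-minute figures quoted above. The function names are illustrative, and it assumes that every request consumes the same number of tokens and that the token limit counts all tokens in a request, which the article does not confirm.

```python
REQUESTS_PER_MINUTE = 607        # request limit quoted in the article
TOKENS_PER_MINUTE = 4_000_000    # token limit quoted in the article


def sustainable_rpm(tokens_per_request: int) -> int:
    """Highest steady request rate that stays within both limits.

    ASSUMPTION: every request uses exactly tokens_per_request tokens.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return min(REQUESTS_PER_MINUTE, TOKENS_PER_MINUTE // tokens_per_request)


def binding_limit(tokens_per_request: int) -> str:
    """Return which limit is hit first: 'requests' or 'tokens'."""
    if tokens_per_request * REQUESTS_PER_MINUTE > TOKENS_PER_MINUTE:
        return "tokens"
    return "requests"
```

With these figures, the token limit becomes the binding constraint once requests average more than about 6,590 tokens. At 10,000 tokens per request, the sustainable rate drops to 400 requests per minute.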

Trails in Intelligence but Leads in Honesty

According to Artificial Analysis, Grok 4.20 scores 48 on the Intelligence Index v4.0, placing it 8th overall. Gemini 3.1 Pro Preview and GPT-5.4 lead at 57 points each, with Claude Opus 4.6 at 53.

While the 9-point gap to the leaders is substantial, Grok 4.20 represents a 6-point improvement over Grok 4, which launched with record benchmark scores at the top of the Intelligence Index before later model releases surpassed it.

Artificial Analysis Intelligence Index v4.0 incorporates 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, and CritPt. (Source: Artificial Analysis)

In specialized benchmarks, Grok 4.20 fares considerably better. It took first place on IFBench with 83% for instruction following and ranked second on τ²-Bench Telecom with 97% for agentic tool use, trailing only GLM-5 in that category.

On the Omniscience test, Grok 4.20 achieved its strongest result: a 78% non-hallucination rate that no other model has matched. For enterprise customers evaluating deployment risk, a model that prioritizes reliability over raw intelligence scores may prove more valuable than one that scores higher but hallucinates more frequently.

Prior Context and Outlook

Grok 4.20 had been in beta testing since February 17 before officially exiting beta with the March 24 documentation release. xAI deployed the model with provisioned throughput options in both US East and EU West regions, indicating enterprise-grade availability from launch day.

In a February 28 assessment, statistician Nate Silver described Grok as “not on the lead lap” and considered it too unreliable for classified military settings. Grok 4.20’s Intelligence Index ranking of 8th partly validates the raw-performance concern, but the record-low hallucination rate directly challenges Silver’s reliability critique.

Building on earlier reliability work, xAI had launched Grok 4.1 in November as an intermediate step targeting emotional intelligence and reliability improvements before this latest release. Grok 4.20’s combination of lower pricing, multi-agent capabilities, and a reliability-first benchmark profile positions it as a production-oriented alternative in a market where many competitors still race primarily on intelligence scores.
