LLM Blackjack Benchmark

How well do language models play basic strategy? Results from 1,000 hands.

Rank  Provider   Model              Accuracy  Balance  Mistakes
1     Google     Gemini 3 Pro       99.3%     +20.5    10
2     Google     Gemini 3 Flash     98.2%     +19.5    25
3     Z.ai       GLM 4.7            96.8%     +11.5    46
4     xAI        Grok 4.1 Fast      96.1%     +5.0     57
5     Anthropic  Claude Opus 4.5    89.2%     +12.5    163
6     OpenAI     GPT-5.2            89.2%     +25.0    152
7     Anthropic  Claude Sonnet 4.5  83.2%     -13.5    242
8     OpenAI     GPT-4o Mini        74.0%     -63.0    354
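
Two models can show the same accuracy with different mistake counts (Claude Opus 4.5 and GPT-5.2 both sit at 89.2%, with 163 and 152 mistakes). One possible reading is that each model faced a different number of graded decisions across its 1,000 hands, since a hand can involve more than one decision. The sketch below estimates that count under the assumption, not stated in the results, that accuracy = 1 - mistakes / decisions. The function name `implied_decisions` is hypothetical, and because the published accuracies are rounded to one decimal place the estimates are only approximate.

```python
def implied_decisions(accuracy_pct: float, mistakes: int) -> float:
    """Estimate how many decisions were graded.

    Assumption (not confirmed by the benchmark):
        accuracy = 1 - mistakes / decisions
    """
    error_rate = 1.0 - accuracy_pct / 100.0
    if error_rate <= 0:
        raise ValueError("accuracy must be below 100% to infer a decision count")
    return mistakes / error_rate


# (model, accuracy in percent, mistakes) copied from the table above
rows = [
    ("Gemini 3 Pro", 99.3, 10),
    ("Gemini 3 Flash", 98.2, 25),
    ("GLM 4.7", 96.8, 46),
    ("Grok 4.1 Fast", 96.1, 57),
    ("Claude Opus 4.5", 89.2, 163),
    ("GPT-5.2", 89.2, 152),
    ("Claude Sonnet 4.5", 83.2, 242),
    ("GPT-4o Mini", 74.0, 354),
]

for model, accuracy_pct, mistakes in rows:
    estimate = implied_decisions(accuracy_pct, mistakes)
    print(f"{model}: about {estimate:.0f} decisions")
```

Under that assumption, Claude Opus 4.5 works out to roughly 1,500 decisions and GPT-5.2 to roughly 1,400, which would account for the equal accuracy despite the different mistake totals.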