More than a century after the first IQ test was given, researchers still can’t agree on the best way to measure human intelligence. Experts debate whether cognitive abilities might be too complex and multifaceted to be captured by a single metric. How to weigh logic against, say, creativity? Problem-solving against critical thinking?
Now, the tech industry is joining the struggle; artificial intelligence, too, is proving difficult to evaluate.
Although neural networks like OpenAI’s GPT models are profoundly different from our brains, they are, for better or worse, improving so rapidly—and being created in such large numbers—that even their developers are having a hard time measuring their performance. There’s no universal benchmark for large language models (LLMs); most technical tests are highly specific and don’t reflect overall, real-world user experience, which now includes anything from coding and content creation to companionship (hi, Black Mirror).
Enter Chatbot Arena, a crowdsourced platform that started as a student side project by two Berkeley roommates, Anastasios Angelopoulos, Ph.D. ’24, and Wei-Lin Chiang, Ph.D. ’24. As the name suggests, Chatbot Arena pits bots like ChatGPT, Claude, and Gemini against each other in a blind taste test. Users give a prompt to two randomly chosen, anonymous chatbots and pick the better answer. Only after casting their vote do they learn the identities of the algorithmic opponents. Chatbot Arena then uses the Bradley-Terry statistical model to build a leaderboard from those duels.
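For the curious, here is a minimal sketch of how Bradley-Terry scores can be fit to pairwise votes like the Arena’s. The model names, vote counts, and the simple iterative update below are illustrative assumptions for demonstration, not LMArena’s actual pipeline.

```python
# Illustrative sketch (not LMArena's code): fitting Bradley-Terry strengths
# from pairwise "duel" outcomes with the classic minorization-maximization
# (MM) update. Model names and votes below are made up.
from collections import defaultdict

# Each vote records (winner, loser); ties are ignored in this simple sketch.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
    ("model_a", "model_b"),
]

models = sorted({m for pair in votes for m in pair})
wins = defaultdict(int)         # total wins per model
pair_counts = defaultdict(int)  # number of duels between each pair

for winner, loser in votes:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Start every model at strength 1.0, then iterate the MM update:
#   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
strength = {m: 1.0 for m in models}
for _ in range(100):
    new_strength = {}
    for i in models:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in models
            if j != i and pair_counts[frozenset((i, j))] > 0
        )
        new_strength[i] = wins[i] / denom if denom > 0 else strength[i]
    # Normalize so the average strength stays at 1.0 across iterations.
    total = sum(new_strength.values())
    strength = {m: s * len(models) / total for m, s in new_strength.items()}

# Higher strength means more often preferred; under Bradley-Terry,
# P(i beats j) = p_i / (p_i + p_j).
for m in sorted(models, key=strength.get, reverse=True):
    print(f"{m}: {strength[m]:.3f}")
```

The appeal of this kind of model is that it turns millions of messy, one-on-one preferences into a single ranking without ever asking voters to score a chatbot on an absolute scale.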

Originally built at the Berkeley Sky Computing Lab to test Chiang and his collaborators’ own open-source LLM, called Vicuna, Chatbot Arena grew much bigger than the students ever expected. Since its launch in 2023, the website has amassed over 2.8 million votes and become “the AI industry’s obsession,” as a Wall Street Journal article by reporter Miles Kruppa declared last December. It was something of an instant phenomenon: Within a week of launch, the site had already received 4,700 votes. The result, as Kruppa put it, is that “tech executives and engineers follow Chatbot Arena the way Wall Street traders watch the markets.”
No wonder. As companies race (and spend billions) to build the smartest models, even a sliver of perceived superiority can mean a big competitive edge. Investors want to back winners. Companies want bragging rights.
Google, OpenAI, and xAI have all collaborated with Chatbot Arena to test models ahead of public release.
The platform’s reach, Angelopoulos said, was possible only because of Berkeley’s Sky Lab—a “mecca” for building large-scale systems, where they were supported by such luminaries as their advisor, Berkeley Professor Ion Stoica.
While anyone can use the website—and the founders encourage as much—not all AIs make the cut. “People come here to use the best models, and so this isn’t really a place to test models that are not very good,” Angelopoulos said.
So far, their system seems to get it right. In January, when Chinese AI model DeepSeek-R1 stunned the industry by rising to the top of Apple’s App Store and rivaling big tech LLMs on a fraction of their budget, Arena users were hardly surprised; they had been watching DeepSeek models climb the rankings for months.
Chiang and Angelopoulos are now Berkeley postdocs, and in April they announced that their “scrappy academic project” had become a startup, LMArena. Within weeks, the company raised $100 million in seed funding from A-list investors.
AI evaluation remains a big challenge for the tech industry. “If you could easily quantify intelligence,” Angelopoulos noted, “it would be easy to make intelligence.” But for now, at the very least, we know which bots most people would rather talk to.