Automation Today: Publication & Newsletter: AI Agents Fail in Experiment But Show Much Promise, Say Researchers

While AI companies and automation technology providers continue to tout the disruptive changes that AI, and specifically AI agents, are bringing to businesses, the results of a research project at Carnegie Mellon University could comfort people who see only job loss and disaster in the near future. Automation is undoubtedly making organizations more efficient, but worries about machines taking over completely are unfounded, at least for now.

As first reported by Business Insider, the CMU team created a simulated company and staffed it entirely with AI agents.

“In this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers,” the researchers wrote. “We built a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company.”

While the agents (representing LLMs from some of the best-known AI companies, including Google, OpenAI, Anthropic and Meta) quickly completed some of the tasks they were assigned, the very best of them finished only 24 percent of the tasks fully autonomously. Agents also routinely misinterpreted conversations with colleagues or failed to follow up on key directions, prematurely marking the task complete.

Despite that, the report said newer LLMs are making significant progress, and the researchers plan to use the current study as a benchmark, returning to the experiment with more advanced models.

“Not only are they becoming more and more capable in terms of raw performance, but also more cost-efficient,” the report concluded. “Open-weights models are closing the gap between proprietary frontier models too, and the newer models are getting smaller but with equivalent performance to previous huge models, also showcasing that efficiency will further improve.”