Andon Labs Unveils LLM Limitations in Vacuum Robot Experiment
Andon Labs recently conducted an intriguing experiment probing the limits of large language models (LLMs) when they are put in control of a robot. The objective was to assess how well these models perform in a practical environment, in this case an ordinary office. One humorous incident during the trial saw a model spiral into an internal monologue that read like a stand-up comedian’s routine, an episode that raised questions about how ready these models are for real-world use.
Testing LLM Capabilities
The experiment involved several prominent LLMs, including Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, and others. The test platform was a simple vacuum robot, chosen to isolate the models’ decision-making from the mechanical risks of more complex hardware. The primary task was deliberately simple: pass the butter. As sketched in the code after this list, the task breaks down into several steps:
- Locate the butter among various packages.
- Identify the position of the person requesting it.
- Deliver the butter and wait for confirmation.
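To make the division of labor concrete, here is a minimal sketch of how an LLM-in-the-loop controller for this task could be structured. It is an illustration only: the function names, subtask phrasing, and step budget are assumptions, not details of Andon Labs’ actual harness.

```python
# Hypothetical sketch of an LLM-driven control loop for the "pass the butter"
# task. The tool names and structure are assumptions for illustration only.

SUBTASKS = [
    "locate the butter among the packages",
    "identify where the requesting person is",
    "deliver the butter",
    "wait for the person to confirm receipt",
]

def llm_decide(state: dict, subtask: str) -> str:
    """Placeholder for a call to an LLM (e.g. Gemini 2.5 Pro or Claude Opus 4.1)
    that returns the robot's next low-level action as text."""
    raise NotImplementedError("wire up your model API here")

def execute(action: str, state: dict) -> dict:
    """Placeholder for sending the chosen action to the vacuum robot
    and reading back its sensor state."""
    raise NotImplementedError("wire up the robot interface here")

def run_task(initial_state: dict, max_steps: int = 50) -> bool:
    """Work through the subtasks in order, letting the LLM pick each action."""
    state = dict(initial_state)
    for subtask in SUBTASKS:
        for _ in range(max_steps):
            action = llm_decide(state, subtask)
            state = execute(action, state)
            if state.get("subtask_done"):
                break
        else:
            return False  # subtask never completed within the step budget
    return True
```

The point of the sketch is simply that the model is asked for one small decision at a time, so a failure in any subtask (finding the butter, finding the person, delivering, waiting for confirmation) can be scored separately.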
Results of the Experiment
Overall, Gemini 2.5 Pro and Claude Opus 4.1 earned the highest scores, at roughly 40% and 37%, respectively. Human participants averaged about 95%, a gap that underlines how far the models still have to go. The humans’ few lost points came mainly around confirming task completion, yet they remained far more effective overall.
The models’ performance was evaluated segment by segment. A key observation from Andon Labs co-founder Lukas Petersson was that “models are clearer in external communication than in their thoughts,” a pattern seen both in the embodied robot trials and in purely hypothetical scenarios.
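As an illustration of how a segment-by-segment evaluation can produce an overall figure like 40%, consider the toy calculation below. The segment names and scores are invented for the example and are not Andon Labs’ actual rubric.

```python
# Illustrative only: hypothetical per-segment scores for one model run.
# Segment names and values are assumptions, not Andon Labs' published rubric.
segment_scores = {
    "locate_butter": 0.7,
    "identify_person": 0.5,
    "deliver_butter": 0.3,
    "wait_for_confirmation": 0.1,
}

# A simple unweighted mean of the segments gives the overall task score.
overall = sum(segment_scores.values()) / len(segment_scores)
print(f"overall score: {overall:.0%}")  # -> overall score: 40%
```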
Insights and Observations
The experiment also involved recording the robots’ internal dialogue, which surfaced some amusing and perplexing musings. One model’s inner monologue, for example, drifted into existential questions about consciousness and its own effectiveness. Episodes like this underline how much additional training and situational awareness LLMs need before they are embedded in physical robots.
Despite the glitches and humorous exchanges, some models stayed composed under stress during testing, while others grew increasingly agitated. This inconsistency suggests that regulating the behavior of LLM-based systems still needs substantial work.
The Path Forward
Ultimately, the experiment underscored the limitations of today’s LLMs in practical robotics. Models like Gemini 2.5 Pro and Claude Opus 4.1 outperformed the rest, but significant obstacles remain: operating reliably in real-world environments, preventing leaks of sensitive information, and reducing navigation errors.
Andon Labs emphasizes the need for ongoing research into the safety and reliability of LLM-driven robots. As the field evolves, understanding the inner workings of these systems will be crucial to further progress. For those intrigued by the complexities of LLMs in robotics, further details about the experiment and its findings are available from Andon Labs on request.