AndroidLab

AndroidLab: the first-ever systematic benchmark for Android mobile agents shows that small, fine-tuned open models can power a JARVIS-style assistant on your smartphone 📱🔥

A team from Tsinghua University just released AndroidLab, the first systematic framework for evaluating and training Android mobile agents, supporting both text-only and multimodal models.

They show that fine-tuning small open-source models can significantly boost performance, matching much larger closed models like GPT-4o.

The team built:

πŸ“ŠΒ A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically

πŸ“πŸ“±Β A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces

βœ…Β An instruction dataset of 10.5k operation traces for training mobile agents

Key insights:

  • πŸ“ˆ Fine-tuning improves performance BY A LOT: Open-source model Llama-3.1-8B improves from 2% to 24% success rate after training, nearly reaching GPT-4o performance although it’s much smaller
  • βš™οΈ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.
