H Company, a French AI startup, has released Holo1.5, a suite of open foundation vision models built to power computer-use (CU) agents. These agents interact with real user interfaces through screenshots and pointer/keyboard actions, and Holo1.5 is aimed squarely at the perception skills they need to navigate and understand complex digital environments.
The Holo1.5 family comprises three models at 3B, 7B, and 72B parameters, each reporting roughly a 10% accuracy improvement over its Holo1 predecessor of the same size. The 7B model is licensed under Apache-2.0, making it freely usable in production; the 3B and 72B checkpoints are currently released under research-only terms.
Holo1.5 is designed to excel at two core capabilities crucial for CU stacks: precise UI element localization and UI visual question answering (UI-VQA) for state understanding. UI element localization, or coordinate prediction, is the process by which an agent translates an intent into a pixel-level action: for instance, predicting the clickable coordinates of the ‘Open Spotify’ control on the current screen. Failures here cascade, since a single misplaced click can derail an entire workflow. To mitigate this, Holo1.5 is trained and evaluated on high-resolution screens (up to 3840×2160) across desktop (macOS, Ubuntu, Windows), web, and mobile interfaces, which helps it stay robust on dense professional UIs where small targets and ambiguous iconography drive up error rates.
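In practice, localization boils down to: screenshot plus instruction in, pixel coordinates out. The sketch below illustrates that interface with the Hugging Face transformers library, assuming a recent release whose processor chat template accepts images inline. The model ID (Hcompany/Holo1.5-7B), the prompt wording, and the plain "x, y" text output are assumptions rather than the official recipe, so check the model card for the exact prompt template and coordinate convention.

```python
# Minimal localization sketch -- not the official inference recipe.
import re
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Hcompany/Holo1.5-7B"  # assumed Hugging Face repository name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

def locate(screenshot: Image.Image, instruction: str) -> tuple[int, int]:
    """Ask the model for pixel coordinates of the UI element matching `instruction`."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": screenshot},
            {"type": "text", "text": f"Return the click coordinates for: {instruction}"},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    answer = processor.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Parse the first two integers from the text response (output format assumed).
    x, y = map(int, re.findall(r"\d+", answer)[:2])
    return x, y

print(locate(Image.open("desktop_3840x2160.png"), "Open Spotify"))
```

Because the model works directly on the native-resolution screenshot, no downscaling or tiling is assumed here; whatever preprocessing the official processor applies is left to the library.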
Unlike general vision-language models (VLMs) that optimize for broad grounding and captioning, Holo1.5 aligns its data and objectives with the specific requirements of CU agents. It undergoes large-scale supervised fine-tuning (SFT) on GUI tasks followed by reinforcement learning with human feedback (RLHF) to tighten coordinate accuracy and decision reliability. The models are delivered as perception components to be embedded in planners and executors, rather than as end-to-end agents.
Holo1.5’s performance on localization benchmarks is highly competitive. It reports state-of-the-art GUI grounding across ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown, and WebClick. For instance, the 7B model achieves an average score of 77.32 on ScreenSpot-v2, compared to 60.73 for Qwen2.5-VL-7B. On the more challenging ScreenSpot-Pro, Holo1.5-7B scores 57.94 versus 29.00 for Qwen2.5-VL-7B, showing significantly better target selection under realistic conditions. The 3B and 72B checkpoints exhibit similar relative gains.
Holo1.5’s strengths are not limited to localization. It also improves UI understanding (UI-VQA), a crucial ingredient of agent reliability. On benchmarks such as VisualWebBench, WebSRC, and ScreenQA, Holo1.5 yields consistent accuracy gains, with reported averages of roughly 88.17 for the 7B model and around 90.00 for the 72B model. This lets agents answer questions like “Which tab is active?” or “Is the user signed in?”, reducing ambiguity and enabling verification between actions.
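A UI-VQA call looks much like the localization call, just with a free-form question instead of a grounding prompt. The helper below reuses the processor and model loaded in the localization sketch above; the question wording and file names are illustrative.

```python
# UI-VQA sketch reusing `processor` and `model` from the localization example.
def ask(screenshot, question: str) -> str:
    """Ask a free-form question about the current screen state."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": screenshot},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return processor.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(ask(Image.open("dashboard.png"), "Is the user signed in?"))
```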
Against specialized and closed systems, Holo1.5 outperforms open baselines and shows advantages over competitive specialized systems and closed generalist models under the published evaluation setup. Practitioners should still replicate the evaluations with their own harness before drawing deployment-level conclusions, since protocols, prompts, and screen resolutions can all influence outcomes.
Integrating Holo1.5 into CU agents brings several benefits. First, higher click reliability at native resolution, as evidenced by the ScreenSpot-Pro results, means fewer misclicks in complex applications such as IDEs, design suites, and admin consoles. Second, stronger state tracking through higher UI-VQA accuracy improves detection of logged-in state, active tab, modal visibility, and success/failure cues. Finally, the licensing path is pragmatic: the Apache-2.0 7B model can go straight to production, while the research-only 72B checkpoint remains available for internal experiments or to gauge the headroom of larger models.
In a modern CU stack, Holo1.5 serves as the screen perception layer. It takes full-resolution screenshots (optionally with UI metadata) as input and outputs target coordinates with confidence and short textual answers about screen state. Downstream, action policies convert these predictions into click/keyboard events, while monitoring verifies post-conditions and triggers retries or fallbacks.
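Putting the pieces together, a minimal control loop might capture a screenshot, call the localization helper, issue the click, and then use the UI-VQA helper to verify the post-condition. The sketch below assumes the locate() and ask() helpers defined earlier (hypothetical wrappers, not an official Holo1.5 SDK) and uses pyautogui for the pointer event; the retry count, wait time, and post-condition check are illustrative.

```python
# Illustrative perception -> action -> verification loop around locate() and ask().
import time

import pyautogui           # issues real pointer events
from PIL import ImageGrab  # full-resolution screen capture

def run_step(instruction: str, check_question: str, expected: str, retries: int = 2) -> bool:
    """Locate a target, click it, then verify the resulting screen state."""
    for _ in range(retries + 1):
        screen = ImageGrab.grab()                       # capture the current screen
        x, y = locate(screen, instruction)              # perception: coordinate prediction
        pyautogui.click(x, y)                           # action policy: pointer event
        time.sleep(1.0)                                 # let the UI settle
        answer = ask(ImageGrab.grab(), check_question)  # verification: UI-VQA post-condition
        if expected.lower() in answer.lower():
            return True
    return False  # trigger a fallback or escalate to the planner

run_step("Open Spotify", "Which application window is in the foreground?", "Spotify")
```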
Holo1.5 bridges a practical gap in CU systems by pairing strong coordinate grounding with concise interface understanding. For those seeking a commercially usable base today, Holo1.5-7B (Apache-2.0) is an excellent starting point. Benchmark it on your screens and instrument your planner/safety layers around it.
To get started, check out the models and technical details on Hugging Face; the H Company GitHub page offers tutorials, code, and notebooks.
In conclusion, H Company’s Holo1.5 is a meaningful step forward in AI’s ability to understand and interact with complex digital environments. By offering an openly licensed, high-performing model tailored to the needs of CU agents, H Company is broadening access to advanced computer-use capabilities and paving the way for more reliable and efficient human-AI interaction.