New benchmark shows web agents still fail at half of common work tasks

Pulse

Reviewed by Helen Jones, 23 Jun. Last review: 23 Jun 2026.

Halluminate released WebBench, an open dataset of 2454 tasks across 452 live commercial websites. The benchmark tests browser agents on real pages rather than simulated environments. Agents scored above 70 percent on read-only tasks but only 46.6 percent on tasks that require writing or submitting forms. The strongest fully automated agent completed 66 percent of all tasks. The dataset and scoring code are available on GitHub under an MIT license.

Before this benchmark, teams could dismiss agent failures as edge cases or poor prompting. The new numbers come from 452 actual high-traffic sites and expose a structural gap: agents read pages reliably, but tasks that change data on the page succeed only 46.6 percent of the time, so they fail more often than not. This changes the risk calculation. A manager who builds team processes around the hope that agents will soon handle approvals, updates, or submissions is betting on a capability the current generation does not reliably deliver.

Analysis

Treat full automation as an experiment, not a production system. Before any agent touches a workflow, map every recurring workflow your team runs through a browser and mark the exact handoff points where a human must still review or execute the write step.
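
One way to record such a map is a simple inventory in which each step is tagged as read-only or data-changing, so the handoff points fall out automatically. The sketch below is illustrative only: the `Step` class, the `handoff_points` helper, and the expense-approval workflow are hypothetical names invented for this example, not part of WebBench or any agent product.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    changes_data: bool  # True for submits, updates, approvals (write steps)


def handoff_points(steps):
    """Return the names of steps a human should review or execute."""
    return [step.name for step in steps if step.changes_data]


# Hypothetical workflow, used only to illustrate the mapping exercise.
expense_approval = [
    Step("open expense report", changes_data=False),
    Step("read line items", changes_data=False),
    Step("approve report", changes_data=True),
    Step("submit to finance", changes_data=True),
]

for name in handoff_points(expense_approval):
    print(f"human handoff: {name}")
```

In this sketch, every step that changes data is treated as a handoff point. A team could loosen that rule for low-risk writes once it has measured how an agent performs on its own sites.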

Read full story on mediar.ai

Pulse published by Collab365 Spaces, reviewed by Helen Jones on 23 Jun 2026. Cite as "New benchmark shows web agents still fail at half of common work tasks", Collab365 Spaces. 2 sources referenced.

spaces.collab365.com/posts/new-benchmark-shows-web-agents-still-fail-at-half--gwVgvr