How Agents See the Web

AI agents fail on web pages because screenshots, DOM dumps, and selectors all break after a deploy. Here's what agents need to target elements reliably.

I keep watching demos where an AI agent navigates a browser, fills out a form, clicks a button, and everyone in the room acts like the problem is solved. The agent found the element. It performed the action. Ship it.

Then I watch the same agent run the same task on the same page after a deploy, and it clicks the wrong button. Or it clicks nothing. Or it clicks the right button but in the wrong state, because the page loaded differently this time and the agent had no way to know.

The demo was real. The reliability was not.

What agents actually see

When an AI agent looks at a webpage, it does not see what you see. You see a page. You see layout, hierarchy, relationships, context. You know that the orange button in the hero is the primary CTA because you understand the visual grammar of the page. You know that the dropdown is closed because you saw it close. You know that the element you care about is inside a modal because you opened the modal to get to it.

An agent sees one of three things, depending on how it was built: a screenshot, a DOM dump, or a selector. Each one carries real information. None of them carry enough.

The screenshot trap

The most common way to give an agent visual context is a screenshot. Take a capture of the page, pass it to a vision model, ask it to find the element you described.

This works surprisingly well on simple pages. A vision model can identify a prominent button in a clean hero section. It can find a search bar at the top of the page. It can locate a navigation menu. For the demos, this is enough.

The problems start when the page gets real.

A screenshot is a grid of pixels with no structure. The agent does not know that the blue rectangle at coordinates (340, 580) is a button. It infers that it is a button because it looks like one. But "looks like a button" is not the same as "is a button." A decorative element can look like a button. A disabled button can look like an active one. Two buttons can look identical yet be different elements with different behaviors: one submits a form, the other opens a modal.

Worse, vision models do not just miss things. They sometimes invent them. Anti-aliasing artifacts, compression noise, and overlapping elements can cause a model to "see" a button that does not exist. The model is not lying. It is pattern-matching on pixel data, and pixel data is noisy. A hallucinated element is indistinguishable from a real one in the model's output.

The vision model is doing pattern matching on pixels. It is not reading the DOM. It does not know the element's tag, its role, its event listeners, or its position in the page hierarchy. It does not know whether the element is inside an iframe, behind a modal overlay, or rendered by a client-side framework that has not finished hydrating.
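
To make that gap concrete, here is a minimal sketch (the markup is invented for illustration): two buttons that could render as identical pixels, but whose DOM attributes tell completely different stories. A screenshot carries none of these facts; a single parse of the markup carries all of them.

```python
from html.parser import HTMLParser

# Two buttons that could look pixel-identical in a screenshot,
# but carry very different facts in the DOM. (Illustrative markup.)
PAGE = """
<form action="/checkout">
  <button class="btn" type="submit">Continue</button>
</form>
<button class="btn" type="button" aria-haspopup="dialog" disabled>Continue</button>
"""

class ButtonFacts(HTMLParser):
    """Collect the DOM-level facts a screenshot cannot show."""
    def __init__(self):
        super().__init__()
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            a = dict(attrs)
            self.buttons.append({
                "type": a.get("type", "submit"),   # submits a form by default
                "disabled": "disabled" in a,
                "opens_dialog": a.get("aria-haspopup") == "dialog",
            })

parser = ButtonFacts()
parser.feed(PAGE)
for b in parser.buttons:
    print(b)
```

Both buttons say "Continue," and both would pattern-match as "a button" in a vision model's output; only the DOM records that the second one is disabled and opens a dialog instead of submitting.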

And the screenshot is one moment. If the agent needs to interact with an element that only appears after a hover, or after scrolling past the fold, or after a specific sequence of clicks, the screenshot does not contain it. The agent has to guess when to take the next screenshot, what action might reveal the target, and whether the thing it sees after the action is the thing it was looking for.

The screenshot is a photograph of a moment. The agent needs a map of a system.

The DOM dump problem

Some agent frameworks skip the visual approach and work directly with the DOM. They serialize the page's HTML, pass it to a language model, and ask it to find the right element by reading the markup.

This sounds more precise, and sometimes it is. The model can see element tags, attributes, class names, IDs. It can reason about hierarchy. It can identify a button by its tag name and its text content.

To their credit, modern agent frameworks are getting smarter about this. Tools like Playwright and Puppeteer can tap into the accessibility tree, which strips away presentational noise and surfaces semantic roles like button "Add to Cart" instead of raw div soup. This is a real improvement. But the accessibility tree is only as good as the page's accessibility markup, and most production pages have gaps: unlabeled elements, generic roles, and interactive components that the accessibility layer simply does not describe.
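
A toy version of that reduction shows where the gaps come from. This is a simplified sketch, not the real accessible-name computation (the ARIA spec's algorithm is far more involved), and the markup is invented: one labeled button, one icon button with no label, and one clickable div that never enters the tree at all.

```python
from html.parser import HTMLParser

# Simplified accessibility-tree-style reduction: keep only semantic
# interactive elements and compute a (very naive) accessible name.
# Illustrative markup; real name computation follows the ARIA spec.
PAGE = """
<div class="wrap"><div class="row">
  <button aria-label="Add to Cart"><svg/></button>
  <button><svg/></button>
  <div class="btn" onclick="buy()">Buy now</div>
</div></div>
"""

ROLES = {"button": "button", "a": "link", "input": "textbox"}

class A11ySketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes, self._open = [], None

    def handle_starttag(self, tag, attrs):
        if tag in ROLES:
            node = {"role": ROLES[tag], "name": dict(attrs).get("aria-label", "")}
            self.nodes.append(node)
            self._open = node
        # The clickable <div> never makes it in: no semantic tag, no role.

    def handle_data(self, data):
        if self._open is not None and not self._open["name"]:
            self._open["name"] = data.strip()

    def handle_endtag(self, tag):
        if tag in ROLES:
            self._open = None

tree = A11ySketch()
tree.feed(PAGE)
print(tree.nodes)
```

The output has one well-named button, one button whose name is empty (the unlabeled icon button), and no trace of the div that actually handles the purchase click. That is exactly the shape of the gaps: the tree is clean, but it is only as complete as the markup authors made it.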

But a real page's DOM is enormous. A typical e-commerce product page has thousands of nodes. A modern React application renders deeply nested component trees with generated class names, data attributes from a dozen libraries, and wrapper divs that exist only for layout. The signal-to-noise ratio is terrible.

When you hand a language model a serialized DOM with 4,000 nodes and ask it to find "the add to cart button," it is doing the same thing your teammate does when they get a vague Slack message: scanning, inferring, guessing. The model might find three elements that could be the add-to-cart button. It picks one. If it picks right, the demo works. If it picks wrong, the workflow breaks silently, because the wrong button still does something, just not the right something.
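
The ambiguity is easy to reproduce. Here is a sketch with an invented three-element slice of a product page: naive text matching, roughly what a model scanning serialized HTML is doing, surfaces three plausible candidates for "the add to cart button," and only one of them is right.

```python
import re

# A tiny slice of an (invented) product-page DOM. Three elements
# plausibly match "the add to cart button" by text alone.
DOM = """
<button class="c-x93f">Add to cart</button>
<a href="/help/cart">How to add to cart</a>
<button class="c-p01q" data-modal="upsell">Add to cart</button>
"""

# Naive text matching: every element whose text mentions "add to cart".
candidates = re.findall(r"<(\w+)[^>]*>([^<]*add to cart[^<]*)</", DOM, re.I)
for tag, text in candidates:
    print(tag, "->", text)
```

Clicking the help link or the upsell button "still does something," which is why the failure is silent: nothing errors, the workflow just quietly does the wrong thing.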

And the DOM dump captures structure but not state. It does not tell the agent what the page looked like before the current render. It does not capture scroll position, authentication status, or which elements have been interacted with. Two identical DOM snapshots can represent completely different user experiences depending on the state that produced them.

The selector gamble
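
The gamble is the one named at the top of this piece: selectors break after a deploy. A minimal sketch, assuming the selector was recorded against build-generated class names (the class names and the toy matcher below are invented for illustration):

```python
import re

def query_class(html: str, cls: str):
    """Toy stand-in for document.querySelector('.' + cls)."""
    return re.findall(rf'class="[^"]*\b{re.escape(cls)}\b[^"]*"', html)

# Build-generated class names: the selector is recorded against one
# deploy's output, then the next build regenerates the hash suffix.
before = '<button class="Btn_primary__x7f3a">Add to cart</button>'
after  = '<button class="Btn_primary__q29zz">Add to cart</button>'

selector = "Btn_primary__x7f3a"
print(len(query_class(before, selector)))  # finds the button before the deploy
print(len(query_class(after, selector)))   # finds nothing after the rebuild
```

Nothing about the page changed from the user's point of view; the button is still there, still labeled "Add to cart." Only the generated class name moved, and the selector went from exact to empty.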
