Your coding agent can't read your screenshot

Pasting an image into a prompt is the least reliable way to give a coding agent visual context. A markdown document with the image embedded is the format that survives.

This is part of the answers hub. If you've pasted a screenshot into Cursor, Claude Code, or another agent and it ignored the image or hallucinated what was on screen, the problem isn't your screenshot. It's the delivery format.

Why pasting an image into a prompt fails

A pasted image is ephemeral. Even agents that accept images tend to lose them across turns: after one file edit, or in the next session, the visual context is gone and the agent reasons from a fading memory of it. There's no artifact it can re-open. So it guesses — and a confident wrong guess is worse than no answer.

Text-only agents are worse still: the image is simply dropped, and the agent silently proceeds on the words alone.

What to send instead

Hand the agent a markdown document with the screenshot embedded and the finding written next to it. This works because:

It's a file, not a moment. The agent can re-read it on any turn. The context doesn't evaporate.
It's agent-readable. Markdown is the native input format for every coding agent. No preprocessing.
The words and the screen travel together. "This button is misaligned" sits right beside the image of the button. No reconstruction required.
Each finding is a section. The agent treats each H2 as a discrete task, so a multi-point review becomes a clean to-do list.

The shortest path

What is agent-readable feedback explains the format in depth, and the tool-specific guides for Cursor and Claude Code show exactly where the document goes. CobaltCapture produces this markdown for you: capture the screen, talk through the problem, get a URL your agent can read.

Frequently asked questions

Cursor and Claude Code accept images, so why doesn't pasting work?

They can sometimes read an image in the moment, but the context rarely persists. On the next turn, after a file edit, or in a fresh session, the image is gone and the agent is reasoning from memory. A screenshot in a file the agent can re-read survives the whole session.

What format actually works?

Markdown with the screenshot embedded as an image reference, and the finding written as text right next to it. The agent reads the markdown linearly, follows the image URL for visual context, and treats each section as a discrete task. That's what CobaltCapture produces.

Can't I just describe the screen in words?

You can, but you lose the exact state. 'The button here is misaligned' means nothing without the screen. Words plus the screen (visual feedback) is what removes the guessing.

Does the agent need to download the image?

For vision-capable agents, a public image URL in the markdown is enough, they fetch it. For text-only agents, the surrounding written description carries the finding even when the image isn't fetched. Either way, the markdown is the durable artifact.

Capture your first review.

About a minute from open tab to a shareable URL your agent can ingest.

Start capturing