I screenshot conversations all the time. “Let’s meet next week.” “Send me your resume.” “Remind me next time we catch up.” Each one a small intention to remember something. Each one a screenshot that sits in my camera roll while the thing it was supposed to remind me about quietly passes.

So I built an OpenClaw skill to fix that. Called it Scrask. I feed it screenshots over Telegram most of the time, but the skill itself lives inside OpenClaw and works wherever OpenClaw does. v1 went up in February, a few weeks after OpenClaw launched and gave me the right primitives to do it cleanly. I expected a weekend toy. About a thousand people downloaded it, which was surprising enough that I came back to give it a proper rewrite.

This post walks through what the skill does, what the original version looked like, and the three changes I’d most want a returning user to know about.

What the skill does

You give OpenClaw a screenshot — for me, that’s almost always forwarded to Telegram, but it works in any OpenClaw interface. A Python parser reads the image, pulls out the event or task — title, date, time, location, participants — and the calendar entry shows up on the other end. If something is ambiguous, the skill asks one targeted question; if not, it doesn’t.

The whole point is to skip the “now type all this in” step that sits between a plan being made in a group chat and a commitment landing on a calendar.

What v3 actually did

The original was a simple Python script. It took a screenshot, sent it to Gemini vision, parsed an event or task out of the response, and wrote directly to Google Calendar or Google Tasks via the Google API. Configuration was a Gemini API key plus a Google service-account JSON. The Telegram reply formatting lived inside the parser.

It worked. It was also brittle in three specific ways, and those three are what v4 addresses.

v4.1 — Per-field confidence and targeted clarifications

The parser used to return a single confidence score per item. So when a screenshot was ambiguous, the only move was to ask “is this right?” — which makes you re-read everything to figure out what it was actually unsure about.

v4.1 splits confidence per field. The parser now returns a confidences{} map for title, date, time, location, participants, description, and priority — plus two top-level scores: actionable_confidence (is this even an event or task?) and type_confidence (calendar event or task list item?).

Underneath the scores, each item carries a clarifications[] array of pre-formatted, targeted questions. The shape_intent function walks the mandatory fields for the inferred type and builds a question only for fields below threshold. If the title and date are confident but the time is shaky, you get “What time is dinner with Priya?”. If everything is confident, you get no question at all — most screenshots fall into this bucket.
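
To make the per-field walk concrete, here is a minimal sketch of the idea, not the actual shape_intent code. The mandatory-field lists, the 0.7 cutoff, the question templates, and the helper name build_clarifications are all assumptions invented for illustration; only the confidences map and the clarifications list come from the description above.

```python
# Illustrative only: field lists, threshold value, and question wording
# are assumptions, not Scrask's real configuration.
FIELD_THRESHOLD = 0.7

MANDATORY_FIELDS = {
    "event": ["title", "date", "time"],
    "task": ["title"],
}

QUESTION_TEMPLATES = {
    "title": "What should I call this?",
    "date": "What date is {title}?",
    "time": "What time is {title}?",
}


def build_clarifications(item, threshold=FIELD_THRESHOLD):
    """Return one question per mandatory field whose confidence is below threshold."""
    questions = []
    for field in MANDATORY_FIELDS[item["type"]]:
        # A field the parser did not score at all is treated as zero confidence.
        score = item["confidences"].get(field, 0.0)
        if score < threshold:
            template = QUESTION_TEMPLATES[field]
            questions.append(template.format(title=item.get("title") or "this"))
    return questions


item = {
    "type": "event",
    "title": "dinner with Priya",
    "confidences": {"title": 0.95, "date": 0.9, "time": 0.4},
}
print(build_clarifications(item))  # ['What time is dinner with Priya?']
```

When every mandatory field clears the cutoff, the list comes back empty and nothing is asked.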

Why split perception from decisions

Vision models are good at perception and bad at thresholds — ask one to apply a numeric cutoff and you’ll get a different answer depending on the weather. Code is good at thresholds and incapable of perception. Pulling fields out as data and applying cutoffs in Python means I can tune behaviour without re-prompting the model.

Three thresholds drive this — ACTIONABLE_THRESHOLD, TYPE_THRESHOLD, and FIELD_THRESHOLD — all configurable as CLI flags.
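
A rough sketch of how cutoffs like these could be exposed as flags and applied in code. The flag names, default values, and the three outcome labels are assumptions for illustration; the real CLI may spell them differently.

```python
import argparse


def build_arg_parser():
    """Hypothetical flag names and defaults; check the real CLI for the actual ones."""
    parser = argparse.ArgumentParser(description="Threshold flags (illustrative)")
    parser.add_argument("--actionable-threshold", type=float, default=0.6)
    parser.add_argument("--type-threshold", type=float, default=0.6)
    parser.add_argument("--field-threshold", type=float, default=0.7)
    return parser


def gate(item, args):
    """Apply the two top-level cutoffs in plain code, outside the model."""
    if item["actionable_confidence"] < args.actionable_threshold:
        return "ignore"      # assumed outcome: not an event or task
    if item["type_confidence"] < args.type_threshold:
        return "ask_type"    # assumed outcome: ask whether it is an event or a task
    return "proceed"         # per-field checks would run next


args = build_arg_parser().parse_args(["--actionable-threshold", "0.5"])
print(gate({"actionable_confidence": 0.55, "type_confidence": 0.9}, args))  # proceed
```

Loosening or tightening a gate is then a flag change, with no prompt edits involved.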

v4.2 — Gemini is no longer required

This is the most boring change and the one I think will matter most.

A new parse_with_openclaw() function reads OPENCLAW_VISION_PROVIDER, OPENCLAW_VISION_KEY, and an optional OPENCLAW_VISION_MODEL from environment variables OpenClaw injects. You can pin this mode explicitly with --provider openclaw.

If you bring your own Gemini or Claude key, auto-routing still applies: Gemini-first when both are present, Claude-only when only a Claude key is present, OpenClaw’s default when neither is configured. The auto mode is now credential-aware: it routes based on what’s in the environment instead of erroring out when a key is missing.
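
Credential-aware routing of this kind can be sketched in a few lines. This is an illustration under assumptions, not the skill's code: ANTHROPIC_API_KEY is an assumed name for the Claude key, the helper pick_providers is hypothetical, reading "Gemini-first" as "try Gemini, then Claude" is my interpretation, and the Gemini-only case is a guess since the text does not spell it out.

```python
import os


def pick_providers(env=None):
    """Return the vision providers to try, in order, based on which keys exist."""
    env = os.environ if env is None else env
    has_gemini = bool(env.get("GEMINI_API_KEY"))
    has_claude = bool(env.get("ANTHROPIC_API_KEY"))  # assumed variable name

    if has_gemini and has_claude:
        return ["gemini", "claude"]  # Gemini first, Claude as the second attempt
    if has_gemini:
        return ["gemini"]            # assumed behaviour for the Gemini-only case
    if has_claude:
        return ["claude"]
    return ["openclaw"]              # no keys: use the provider OpenClaw injects


print(pick_providers({}))  # ['openclaw']
```

Passing the environment in as a plain mapping keeps the routing decision easy to test without touching real credentials.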

GEMINI_API_KEY moved from requires.env to optional_env. Lowering the floor for “just try it” mattered more than I expected.

v4.3 — Hybrid invocation

The original was a single-purpose script — you’d hand it a screenshot, it would run. Inside an agent platform like OpenClaw, that pattern doesn’t quite hold; the agent needs to know when to invoke Scrask versus another skill.

v4.3 adds a metadata.openclaw.invocation block to the skill manifest declaring mode: hybrid plus an alias list: scrask, scrask this, screenshot, screenshot to calendar. The platform contract is: check aliases first, fall back to implicit routing if no alias matched. The implicit conditions themselves are preserved verbatim, so any agent already routing on the old prose sees no behaviour change.
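
The alias-first, implicit-second contract can be sketched roughly as follows. The alias list is the one quoted above; matching by prefix, the function name route, and the implicit_match callback standing in for the preserved implicit conditions are assumptions for illustration, since the platform's real matching rules are not described here.

```python
ALIASES = ("scrask", "scrask this", "screenshot", "screenshot to calendar")


def route(message_text, has_image, implicit_match):
    """Return how the skill was selected: 'alias', 'implicit', or None."""
    text = message_text.strip().lower()
    # Step 1: explicit aliases win (prefix matching is an assumption).
    if any(text.startswith(alias) for alias in ALIASES):
        return "alias"
    # Step 2: fall back to the implicit conditions, supplied by the caller.
    if implicit_match(message_text, has_image):
        return "implicit"
    return None


def looks_like_plan(text, has_image):
    """Placeholder implicit rule: any message carrying an image qualifies."""
    return has_image


print(route("scrask this", True, looks_like_plan))        # alias
print(route("can you add this?", True, looks_like_plan))  # implicit
```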

The architectural shift: the parser writes nothing

This is the change that doesn’t fit cleanly under any single version bump and matters more than any of them individually.

The v3 parser wrote directly to Google Calendar and Google Tasks. The v4 parser writes nothing. It emits a structured intent JSON — what the event is, where it should go, what confidence level it came in at — and OpenClaw delegates the actual save to whichever destination skill is installed: calctl for system calendars, things-mac for Things, apple-reminders, notion, whatever else you’ve wired up.

One screenshot in, one JSON plan out. Stateless. If you swap your task manager from Things to Notion next month, Scrask doesn’t change — the destination skill does. That’s the property I’m most pleased with, because it means I can ship the screenshot-reading half once and let the ecosystem handle the long tail.
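
For a sense of what "one JSON plan out" might look like, here is a hedged sketch. Every field name below, including destination_hint, and the helper emit_intent are invented for illustration; the real data contract lives in the repo's documentation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Intent:
    """Hypothetical shape of the parser's output; field names are assumptions."""
    type: str                      # "event" or "task"
    title: str
    date: Optional[str] = None     # ISO date string, if one was read
    time: Optional[str] = None
    destination_hint: str = "calendar"
    confidences: dict = field(default_factory=dict)
    clarifications: list = field(default_factory=list)


def emit_intent(intent):
    """Serialise the plan. Nothing is written to any calendar or task API here."""
    return json.dumps(asdict(intent), indent=2, sort_keys=True)


plan = Intent(
    type="event",
    title="dinner with Priya",
    date="2025-06-12",
    confidences={"title": 0.95, "date": 0.9, "time": 0.4},
    clarifications=["What time is dinner with Priya?"],
)
print(emit_intent(plan))
```

Because the function only returns a string, swapping the destination skill never touches this half of the pipeline.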

Documentation

I also rewrote the docs from scratch, partly because the v3 docs no longer matched anything that was true and partly because I wanted a future me (or someone else picking this up) to be able to extend the parser without spelunking through commits.

The architecture overview is in two parts: part one is non-technical (what the skill does, where it lives, what each piece is for), part two is at the code level (file structure, the parser’s data contract, the shape_intent decision logic, why the parser is stateless). The decision flow is documented as a pair of Mermaid flowcharts — one for the parser, one for the bot side — plus a threshold reference table and three narrated end-to-end paths (happy, ambiguous, actionable-gate).

There’s also an interactive standalone HTML version. Mermaid v10 renders the flowcharts client-side, every decision node and threshold-table row is clickable, and the popups explain what each gate does with references back into the code. It’s self-contained — open the HTML file in a browser and it works.

Try it. Tell me what breaks.

The repo is at github.com/devsandip/scrask-bot.

The skill is also on ClawHub: https://clawhub.ai/devsandip/scrask-bot

The most useful thing you can do is run it on a screenshot you’d actually want on your calendar and tell me what doesn’t work. I’m especially interested in the screenshots I haven’t tested well — non-English text, dense flyers, low-contrast captures, half-cropped invites, the Telegram formatting where the date is in one bubble and the time is in the next. Edge cases reported now are worth far more than edge cases discovered quietly six months from now.