api design · Jun 28, 2026 · 4 min read

JSON is not HTML: preserving response shape in a scraper API

If the target returns application/json, your scraper result should not pretend it captured a web page.

Scantir Engineering
json · api design · response bodies

One body field is not enough

A scraper sees many response shapes: HTML pages, JSON APIs, plain text, CSV exports, redirects, and challenge pages. Pushing all of them into a single html field forces clients to guess what they actually received and makes integrations brittle.

The response contract should preserve what the target returned. Content type and body type are not decoration; they are how clients decide whether to parse JSON, render markup, or store a blob as-is.
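As a rough illustration of how a client or server might derive the body type from the target's Content-Type, here is a minimal Python sketch. The function name classify_body_type and the fallback to text for a missing or unrecognized type are assumptions made for this example, not a description of any particular implementation.

```python
from typing import Optional


def classify_body_type(content_type: Optional[str]) -> str:
    """Map a Content-Type header value to "html", "json", or "text"."""
    if not content_type:
        # Assumption: a response with no declared type is stored as plain text.
        return "text"

    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    # "application/json" plus structured-syntax suffixes like "application/ld+json".
    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"

    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"

    # CSV exports, plain text, and anything unrecognized are kept as text.
    return "text"
```

With this mapping, a CSV export is reported as text and stored as-is, while an endpoint returning application/json is never labeled as a web page.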

The minimum useful contract

For a developer-facing scraping API, the response should separate transport metadata from body content. The user needs to know the final URL, target status, method used, upstream headers, content type, and whether the body is html, json, or text.

headers: upstream target headers.
content_type: the target Content-Type value.
body_type: html, json, or text.
body: the raw response body when the target is not HTML (JSON or text).
html: the rendered or fetched markup, present only when the target actually returned HTML.
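The fields above could be modeled roughly as follows. This is a sketch under assumptions: headers, content_type, body_type, body, and html come from the list above, while final_url, status, method, and the build_result helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ALLOWED_BODY_TYPES = ("html", "json", "text")


@dataclass
class ScrapeResult:
    # Transport metadata (final_url, status, method are hypothetical names).
    final_url: str
    status: int
    method: str
    headers: Dict[str, str]
    content_type: Optional[str]
    body_type: str
    # Body content: exactly one of these is set, depending on body_type.
    body: Optional[str] = None
    html: Optional[str] = None


def build_result(final_url: str, status: int, method: str,
                 headers: Dict[str, str], body_type: str, raw: str) -> ScrapeResult:
    """Place the raw payload in `html` only when the target returned HTML."""
    if body_type not in ALLOWED_BODY_TYPES:
        raise ValueError(f"unknown body_type: {body_type!r}")

    # Header names are case-insensitive, so look up Content-Type accordingly.
    content_type = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        None,
    )

    result = ScrapeResult(final_url, status, method, headers, content_type, body_type)
    if body_type == "html":
        result.html = raw
    else:
        result.body = raw
    return result
```

A client can then branch on body_type alone: parse body as JSON when it is json, and read html only when it is html.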

Why this matters in a console

A playground should label the body tab based on the target response. A JSON endpoint deserves a JSON tab with pretty formatting. An HTML page deserves an HTML tab. A screenshot is useful only when the browser tier ran.

Small UI labels carry real operational meaning. They tell the user whether the system made a direct request or rendered a page, and whether the returned body is ready for JSON parsing.
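The tab-labeling rule described in this section might look something like the sketch below. The console_tabs function and the tier and screenshot keys are hypothetical; they stand in for however a given console records that the browser tier ran.

```python
import json
from typing import Any, Dict, List, Tuple


def console_tabs(result: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (label, content) pairs for the tabs a playground would show."""
    tabs: List[Tuple[str, str]] = []
    body_type = result.get("body_type")

    if body_type == "html":
        tabs.append(("HTML", result.get("html") or ""))
    elif body_type == "json":
        raw = result.get("body") or ""
        try:
            # Pretty-print so the JSON tab is readable.
            content = json.dumps(json.loads(raw), indent=2)
        except ValueError:
            # Labeled JSON but not parseable: show the raw body unchanged.
            content = raw
        tabs.append(("JSON", content))
    else:
        tabs.append(("Text", result.get("body") or ""))

    # Hypothetical keys: only offer a screenshot when the browser tier ran.
    if result.get("tier") == "browser" and result.get("screenshot"):
        tabs.append(("Screenshot", result["screenshot"]))

    return tabs
```

A direct request to a JSON endpoint yields a single JSON tab, while a browser-rendered page yields an HTML tab plus a Screenshot tab.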