Suture

Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly — no SDK changes, no retries, and it holds none of your API keys.


The problem

LLM streams don't send one JSON document. They send many delta events that your SDK reassembles, and the tool-call arguments (or structured-output content) are only valid once the whole stream has arrived. When the model hits max_tokens, overflows the context window, or the socket dies, you're left parsing this:

{"city": "Par      // ← unterminated → your parser throws
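A minimal sketch of how that failure shows up on the client side. The fragment strings below are hypothetical stand-ins for the argument pieces an SDK would collect from delta events; they are not taken from any real provider response.

```python
import json

# Hypothetical argument fragments, as an SDK might collect them from
# successive delta events before the stream was cut off.
fragments = ['{"ci', 'ty": ', '"Par']

# The client concatenates the fragments and parses the result only at the end.
reassembled = "".join(fragments)

try:
    json.loads(reassembled)
    error = None
except json.JSONDecodeError as exc:
    # The stream ended inside a string, so the document is not valid JSON.
    error = exc

print(reassembled)
print(error)
```

No single event is malformed here; the document only becomes invalid once the fragments are joined, which is why the problem has to be handled at the level of the reassembled stream.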

If you've seen any of these, Suture is for you:

json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …

json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)

serde_json::Error: EOF while parsing a string / an object

The fix

Suture sits between your app and the provider, watches the stream, and emits exactly the characters needed to close the reassembled JSON — as a final, well-formed delta before the terminator. Your client reassembles valid JSON and never knows anything was wrong. Added overhead is ~10 µs of CPU per chunk.

import os
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])
# base_url is the only change.
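The idea of emitting only the characters needed to close the reassembled JSON can be illustrated with a small sketch. This is not Suture's actual algorithm: the function name closing_suffix is hypothetical, and the sketch only closes open strings, arrays, and objects. It does not handle a stream cut off after a key or colon, a trailing comma, a partial literal such as tru, or a dangling backslash escape.

```python
import json


def closing_suffix(partial: str) -> str:
    """Return the characters that close any open string, array, or object."""
    stack = []          # closing characters for containers that are still open
    in_string = False   # True while scanning inside a JSON string
    escaped = False     # True when the previous character was a backslash

    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()

    # Close the open string first, then the containers from innermost outward.
    suffix = '"' if in_string else ""
    return suffix + "".join(reversed(stack))


partial = '{"city": "Par'
repaired = partial + closing_suffix(partial)
print(json.loads(repaired))  # prints {'city': 'Par'}
```

Because the scan keeps only a stack and two flags, it can run incrementally as chunks arrive, and the suffix can be sent as one extra delta at the end of the stream.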

Highlights

Four providers. OpenAI, Anthropic, Google Vertex AI (Gemini + Claude), AWS Bedrock (ConverseStream).

SSE-aware. Repairs the reassembled tool-call arguments / JSON content across delta events, not raw bytes.

Compression-transparent. Decodes gzip/brotli/deflate, repairs, re-encodes — never buffers the whole body.

Holds no keys. Your credential passes through; for Bedrock, SigV4 means the secret never crosses the wire at all.
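The compression-transparent behaviour can be sketched with Python's standard zlib module. This is an illustration under stated assumptions, not Suture's implementation: it covers gzip only, the function name transcode_gzip_stream and its transform parameter are hypothetical, and the transform is applied to each decoded piece independently.

```python
import gzip
import zlib


def transcode_gzip_stream(chunks, transform):
    """Decode gzip chunks, transform the plaintext, and re-encode as gzip.

    Works chunk by chunk, so the whole body is never held in memory.
    """
    gzip_wbits = 16 + zlib.MAX_WBITS  # selects the gzip container format
    decoder = zlib.decompressobj(wbits=gzip_wbits)
    encoder = zlib.compressobj(wbits=gzip_wbits)

    for chunk in chunks:
        plain = decoder.decompress(chunk)
        if plain:
            # Z_SYNC_FLUSH pushes the compressed bytes out immediately, so
            # the downstream client is not left waiting on the encoder.
            out = encoder.compress(transform(plain))
            out += encoder.flush(zlib.Z_SYNC_FLUSH)
            yield out

    # Drain whatever the decoder still holds, then finish the gzip stream.
    tail = decoder.flush()
    yield encoder.compress(transform(tail)) + encoder.flush()


original = b'data: {"x": 1}\n\n' * 50
compressed = gzip.compress(original)
# Split the compressed body into small pieces to mimic network chunks.
pieces = [compressed[i:i + 7] for i in range(0, len(compressed), 7)]
result = b"".join(transcode_gzip_stream(pieces, lambda data: data))
print(gzip.decompress(result) == original)
```

Flushing after every chunk trades some compression ratio for latency, which is usually the right trade for a streaming response.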

Get it

cargo install suture-repair    # installs the `suture` binary
suture                         # listens on 127.0.0.1:8787

Or use the byte-level repair engine as a library: cargo add suture-repair-core. Written in Rust, dual-licensed MIT / Apache-2.0.