Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly — no SDK changes, no retries, and it holds none of your API keys.
LLM streams don't send one JSON document — they send many delta events your SDK
reassembles, and the tool-call arguments (or structured-output content)
is only valid once the whole stream arrives. When the model hits max_tokens,
blows the context window, or the socket dies, you're left parsing this:
{"city": "Par // ← unterminated → your parser throws
If you've seen any of these, Suture is for you:
Suture sits between your app and the provider, watches the stream, and emits exactly the characters needed to close the reassembled JSON — as a final, well-formed delta before the terminator. Your client reassembles valid JSON and never knows anything was wrong. Added overhead is ~10 µs of CPU per chunk.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])
# that's the only change.
cargo install suture-repair # installs the `suture` binary suture # listens on 127.0.0.1:8787
Or use the byte-level repair engine as a library:
cargo add suture-repair-core. Written in Rust, dual-licensed MIT / Apache-2.0.