Getting LLM outputs in a structured format enables easy parsing and integration into other workflows. This has driven a strong push to produce LLM outputs that conform to universally usable JSON schemas.
In the past, engineers had to write custom parsers by hand to wrangle unstructured LLM text into valid JSON objects. This approach was unreliable: it needed constant tuning for edge cases and re-engineering for each new model, significantly bogging down development.
So, why not just prompt the model to produce outputs in the correct JSON format? Prompt engineering does work for some use cases. However, the same prompt's mileage varies from model to model.
For example, ChatGPT gave a consistent JSON output for the prompt “Give me a recommendation of 3 exercises, stating its sets and reps, give it in a JSON format. No extra fields, be consistent.”
But when we tested the same prompt with Llama 70B-chat, it produced extra sentences that needed to be parsed out. Even slightly modifying the prompt completely changed the JSON schema: exercises were returned as part of a list in a JSON object. The model is clearly sensitive to any changes in the prompt.
What is worse, a single missed comma, quotation mark, or bracket, or an incorrectly typed value, yields invalid output. This fragility places overly restrictive guardrails on the LLM, making it challenging to achieve the consistent, reliable structured outputs required for production workflows.
LLMs are largely based on the transformer architecture, which generates text auto-regressively: one token at a time, each conditioned on everything generated so far. For example, when prompted "Who is Elon Musk", the model emits one token at a time (to keep it simple, we consider each word a token). The LLM cannot go back and correct a token once it is generated, which makes consistently valid JSON output very difficult.
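The decoding loop can be sketched as follows. This is a toy stand-in for a real model: the canned continuation table below replaces an actual forward pass, purely to illustrate the one-token-at-a-time, no-going-back property.

```python
# Toy stand-in for an LLM forward pass: given the tokens generated so far,
# return exactly one next token (a real model would compute this from logits).
CANNED = {
    (): "Elon",
    ("Elon",): "Musk",
    ("Elon", "Musk"): "is",
    ("Elon", "Musk", "is"): "<eos>",
}

def toy_next_token(generated):
    return CANNED[tuple(generated)]

def generate(max_tokens=10):
    out = []
    while len(out) < max_tokens:
        token = toy_next_token(out)
        if token == "<eos>":
            break
        out.append(token)  # once appended, a token is never revised
    return out
```

Because each iteration only appends, a malformed brace or comma emitted early on can never be repaired later in the loop.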
The value of an LLM lies in outputs tailored to its training data and prompt, so the model only needs to populate the field values, not the schema itself. Thus, a common approach is to separate schema generation from model generation, in two parts: 1) insert schema tokens directly; 2) generate the remaining tokens from the model. The inserted tokens enforce the schema.
This approach has two benefits:
1. Speed boost: Less compute is needed, since schema tokens can be added without any model processing.
2. Bulletproof JSON: Output is guaranteed to be a valid JSON object, requiring little to no parsing.
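A minimal sketch of this separation (the `fill_value` stand-in and its canned answers are hypothetical; in a real system it would be constrained model generation): every structural character comes from the schema, and the model is only consulted for values.

```python
import json

def fill_value(field, field_type):
    # Stand-in for constrained model generation of a single field value.
    canned = {"exercise": "push-up", "sets": 3, "reps": 12}
    return canned[field]

def generate_json(schema):
    parts = ["{"]
    for i, (field, ftype) in enumerate(schema.items()):
        if i:
            parts.append(", ")
        parts.append(json.dumps(field) + ": ")          # schema tokens: no model call
        parts.append(json.dumps(fill_value(field, ftype)))  # model fills the value
    parts.append("}")
    return "".join(parts)

schema = {"exercise": str, "sets": int, "reps": int}
print(generate_json(schema))  # {"exercise": "push-up", "sets": 3, "reps": 12}
```

Since the braces, quotes, and commas are inserted rather than generated, `json.loads` on the result cannot fail on structure.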
While open-source solutions like JSONformer can separate schema and model generation, they lack support for batching and streaming, and do not use a KV cache. As a result, the model still has to process each request individually.
We use a state machine to tackle these key challenges:
Batching support: Existing implementations only allow JSON outputs for single inputs, drastically reducing inference speed. Our solution enables parallel processing of batched inputs.
KV cache support: The KV cache is the memory of the transformer, storing key-value pairs. For example, if the key is a word’s position, the value is the corresponding information about that word. Without caching, each output wastes significant compute to re-process known information.
Custom model generation for precise insertion: Hugging Face's model generation doesn't offer token-level granularity: it returns only the completed output, making it difficult to pause token generation and insert schema tokens. Our custom schema generation method enables precise schema insertion; more details below!
Guaranteed JSON output: OpenAI’s JSON mode tries its best but cannot guarantee to match your schema. This leaves unpredictable edge case failures that can break your system. Our approach strictly follows user-defined schemas, providing the reliability required for production workflows.
We designed a state machine to ping-pong between insertion from the schema generator and generation from the model generator. This state machine logic enables us to support batching of multiple prompts, since we are able to maintain the state information of each prompt in the batch.
Processing a batch of requests is simple: add a new state machine for each request in the batch, and clock each state machine separately for each token.
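The ping-pong can be sketched like this (the class and its strict schema/model alternation are simplified assumptions; real schemas may need several model tokens per field before the next schema chunk):

```python
import json

SCHEMA, MODEL, DONE = "schema", "model", "done"

class JsonStateMachine:
    """Tracks one request: alternate between inserting a schema chunk
    and accepting one model-generated token."""
    def __init__(self, schema_chunks):
        self.chunks = schema_chunks  # e.g. ['{"sets": ', ', "reps": ', '}']
        self.i = 0
        self.state = SCHEMA
        self.output = []

    def clock(self, model_token=None):
        if self.state == SCHEMA:
            self.output.append(self.chunks[self.i])  # insertion: no model call
            self.i += 1
            # after the last chunk we are done; otherwise the model fills a value
            self.state = DONE if self.i == len(self.chunks) else MODEL
        elif self.state == MODEL:
            self.output.append(model_token)          # generation: from the model
            self.state = SCHEMA

def clock_batch(machines, model_tokens):
    # Each request keeps its own state, so machines advance independently.
    for machine, token in zip(machines, model_tokens):
        if machine.state != DONE:
            machine.clock(token)
```

Driving one machine through a two-field schema (`clock()` for schema steps, `clock("3")` etc. for model steps) yields `{"sets": 3, "reps": 12}`, valid by construction.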
To leverage caching, we update the KV cache, and other necessary model states, with each new schema token before model generation.
A key challenge with batching is that each prompt may be at a different state: some need schema tokens, while others are filling values using the model generator. It is therefore important to make sure the updated cache matches the state of each prompt.
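One way to picture that bookkeeping (a hypothetical sketch, with a plain token list standing in for the real key-value tensors): each request tracks how far its cache extends, schema tokens are committed directly to the sequence, and the next model step processes only the uncached tail.

```python
from dataclasses import dataclass, field

@dataclass
class RequestState:
    tokens: list = field(default_factory=list)  # every token emitted so far
    cache_len: int = 0  # how many positions the KV cache already covers

def commit_schema_tokens(state, schema_tokens):
    # Schema tokens join the sequence directly, bypassing the model.
    state.tokens.extend(schema_tokens)

def model_step(state):
    # The model only processes tokens not yet covered by this request's cache;
    # if cache_len drifted out of sync with the state, attention would be
    # computed over the wrong prefix.
    new_tokens = state.tokens[state.cache_len:]
    state.cache_len = len(state.tokens)
    return new_tokens  # in a real system, these feed the transformer forward pass
```

Keeping `cache_len` per request is what lets requests in the same batch sit in different states without corrupting each other's caches.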
Last but not least, model generation needed to support different field types: integers, floats, booleans, and strings. The model's raw output is a vector of logits, which the logits processor turns into the final prediction for the next token. We needed to carefully select the logits processor and stopping criteria for each field type.
The logits processor controls which tokens can be generated by zeroing out the probabilities of unwanted tokens. For example, when producing integers, we zero all non-numeric tokens so only numbers are possible. Generating booleans is more complex, as many token combinations can represent them; using the previous tokens, we zero out probabilities that would produce an invalid boolean. However, we must not zero out the end-of-sequence (EOS) token, which the stopping criteria use to signal the end of model generation.
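The integer case can be sketched as follows (the tiny vocabulary and token ids are invented for illustration; a real tokenizer has tens of thousands of tokens). Zeroing a probability corresponds to setting its logit to negative infinity, while digits and EOS pass through untouched:

```python
import math

# Invented toy vocabulary for illustration.
VOCAB = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", ",", "<eos>"]
ALLOWED = {VOCAB.index(d) for d in "0123456789"} | {VOCAB.index("<eos>")}

def integer_logits_processor(logits):
    # -inf in log space == zero probability: non-digit, non-EOS tokens
    # can never be selected.
    return [x if i in ALLOWED else -math.inf for i, x in enumerate(logits)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [0.0] * len(VOCAB)
logits[VOCAB.index("a")] = 5.0  # the model "prefers" a letter here
logits[VOCAB.index("7")] = 2.0
masked = integer_logits_processor(logits)
print(VOCAB[argmax(masked)])  # "7": the best remaining legal token
```

Leaving `<eos>` in the allowed set is what lets the stopping criteria end the field once enough digits have been produced.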
Looking back, this project evolved significantly from our initial conception. Our early experimentation revealed drawbacks in previous methods, which informed our algorithm design.
Among the challenges were the lack of batching support and the difficulty of mapping tokens to words and characters. Tokenization is also sensitive: an extra space ("&nbsp;was" vs. "was") produces completely different tokens, which can hurt model performance.
Tokenization is integral to LLMs, and although we could not alter the tokenization of existing models, we ran many experiments to ensure the model generated correct output without degrading performance.
Try our API to guarantee valid JSON output now: https://lamini-ai.github.io/rest_api/completions/
Going forward, we plan to extend support for more output types, such as lists and objects, enabling even more seamless LLM integration into production workflows. One remaining issue with lists is running up against the token limit. LLMs typically limit the max number of tokens generated. What happens if this limit is reached before the end of the list?
We welcome any feedback, feature requests, or questions you may have!
JSONformer: https://github.com/1rgs/jsonformer
OpenAI JSON Mode: https://platform.openai.com/docs/guides/text-generation/json-mode
Founding Engineer at Lamini
November 21, 2023