Why Does My LLM Structured Output Perform Poorly?

Getting structured output from Large Language Models is crucial for many real-world applications, yet it remains one of the most challenging aspects of working with LLMs.
Through my experience with different structured output methods, I've discovered that there's no universal solution. This article examines three distinct approaches I've implemented: tool calling APIs, XML-based methods, and custom Domain-Specific Languages (DSLs). Each has unique strengths and limitations that make them suitable for different scenarios.

1. Tool Calling API

This is the most common method for generating structured output from LLMs. However, some LLMs do not support the tool calling API well, so we may need a ReAct-style prompting approach instead. I created a Python intermediate layer that exposes a tool-calling-style interface but uses a prompt-based method under the hood: https://github.com/BeautyyuYanli/tooluser.
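A minimal sketch of the prompt-based pattern (this is not the tooluser API itself; call_llm is a hypothetical stand-in for whatever LLM client you use):

```python
import json

# Hypothetical stand-in for your LLM client; any plain text-completion call works.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

TOOL_PROMPT = """You may call one of the tools below by replying with ONLY a JSON object:
{{"name": "<tool name>", "arguments": {{...}}}}

Available tools:
{tools}

User request: {request}"""

def prompt_based_tool_call(request: str, tools: list[dict]) -> dict:
    """Emulate a tool calling API with plain prompting: describe the tools
    in the prompt, then parse the model's JSON reply."""
    prompt = TOOL_PROMPT.format(tools=json.dumps(tools, indent=2), request=request)
    reply = call_llm(prompt)
    return json.loads(reply)  # in practice, validate and retry on malformed JSON
```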
The biggest problem with the tool calling API is that it is trained differently from the rest of the model's behavior. Text responses, reasoning, and tool calling responses come from separate training objectives, so the LLM sometimes thinks in one direction in its reasoning but answers in a completely different one (this was especially visible in early versions of the O-series and DeepSeek R1). Tool calls suffer from the same disconnect, just in a less obvious way.
One example is role-playing applications, where the LLM should write with good literary expression while also emitting structured output for the characters' thoughts, actions, and so on. If we use the tool calling API in these cases, the LLM loses that literary expression entirely.

2. Common XML

To solve the problem described above, I found XML to be an effective format. XML is friendly to free-form text (no quotes, no escaping problems), the same property that makes HTML suitable for marking up prose.
For LLMs, I simply describe the desired XML structure in the prompt and run a validator over the text response. With XML, the LLM keeps its excellent literary expression inside structured output. I created a simple Python library for my own use: https://github.com/BeautyyuYanli/qwq-tag
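The general pattern is to describe the tags in the prompt, then extract them leniently, since LLM output is rarely strictly well-formed XML. This is a sketch of the idea rather than the qwq-tag API, and the tag names are my own example:

```python
import re

# Example tag layout for a role-playing response; adapt the names to your schema.
TAGS = ("thought", "speech", "action")

def parse_tagged_response(text: str) -> dict[str, str]:
    """Extract <tag>...</tag> sections with a lenient regex, since LLM output
    often contains unescaped quotes or stray ampersands that break strict parsers."""
    result = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"missing required <{tag}> section")
        result[tag] = match.group(1).strip()
    return result

reply = """<thought>She seems nervous tonight.</thought>
<speech>"Take your time," he said softly.</speech>
<action>He pours a second cup of tea.</action>"""
print(parse_tagged_response(reply)["speech"])
```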

3. Custom DSL

In contrast to literary expression, another use case is producing complex structured output. If the target is JSON or a mainstream DSL like PostgreSQL's SQL, the LLM can easily produce the desired result. In my case, however, I want to produce ISO GQL (Graph Query Language, a SQL-like language for graph databases). The LLM has limited knowledge of it and can hardly follow my instructions. This raises two problems: how do we validate the output, and how do we help the LLM produce better output in the first place?
The core principle is to create an intermediate representation. I use a two-step workflow: the first step translates the instruction into pseudo-code, and the second produces the DSL from that pseudo-code. It works like a <think> stage before answering, except I prompt the model to use pseudo-code as the <think> content. This significantly improves output quality.
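A sketch of the two-step workflow, with illustrative prompts (call_llm is again a hypothetical client function):

```python
# Hypothetical LLM client; substitute your own.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def generate_gql(instruction: str, schema: str) -> str:
    # Step 1: let the model plan in pseudo-code, which it handles far better
    # than an unfamiliar DSL.
    pseudo = call_llm(
        f"Graph schema:\n{schema}\n\n"
        f"Write step-by-step pseudo-code (no GQL yet) for: {instruction}"
    )
    # Step 2: translate the pseudo-code into ISO GQL, with grammar notes and
    # examples included in the prompt.
    return call_llm(
        "Translate this pseudo-code into a single ISO GQL query.\n"
        f"Pseudo-code:\n{pseudo}"
    )
```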
To validate the LLM output, I map the DSL to JSON and use the tool calling API to produce the JSON. Another approach is to use an AST-based checker.
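As a toy illustration of the mapping: the tool calling API fills a JSON shape, and deterministic code renders it into GQL. The JSON shape below is my own invention, covering only MATCH/WHERE/RETURN:

```python
query_json = {
    "match": "(p:Person)-[:KNOWS]->(f:Person)",
    "where": "p.name = 'Alice'",
    "return": ["f.name", "f.age"],
}

def json_to_gql(q: dict) -> str:
    """Render the validated JSON intermediate into a GQL query string."""
    parts = [f"MATCH {q['match']}"]
    if q.get("where"):
        parts.append(f"WHERE {q['where']}")
    parts.append("RETURN " + ", ".join(q["return"]))
    return "\n".join(parts)

print(json_to_gql(query_json))
```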
However, if the LLM struggles to produce grammatically valid queries, it needs a trial-and-error repair loop, which can degrade the quality of the final result. Additionally, building a good checker with informative, human- and LLM-readable error messages is not trivial.
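The repair loop typically looks something like this (check_gql is a hypothetical checker that returns an error message, or None when the query is valid):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def check_gql(query: str) -> str | None:
    """Hypothetical AST-based checker: error message, or None if valid."""
    raise NotImplementedError

def generate_with_repair(prompt: str, max_attempts: int = 3) -> str:
    query = call_llm(prompt)
    for _ in range(max_attempts):
        error = check_gql(query)
        if error is None:
            return query
        # Feed the checker's message back to the model; vague error messages
        # are exactly what makes this loop spin without converging.
        query = call_llm(
            f"{prompt}\n\nYour previous query:\n{query}\n"
            f"It failed validation with: {error}\nPlease fix it."
        )
    raise RuntimeError("could not produce a valid query")
```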
Guided generation for LLMs, such as that supported by vLLM, may be another promising solution.
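For example, vLLM's OpenAI-compatible server accepts structured-output constraints through extra request fields. The exact parameter names have shifted across vLLM versions, so treat this as a sketch and check the structured-output docs for the version you run:

```python
from openai import OpenAI

# Points at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"match": {"type": "string"}, "return": {"type": "string"}},
    "required": ["match", "return"],
}

completion = client.chat.completions.create(
    model="my-model",  # whatever model the server is serving
    messages=[{"role": "user", "content": "Find everyone Alice knows."}],
    extra_body={"guided_json": schema},  # decoding constrained to the schema
)
print(completion.choices[0].message.content)
```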

In Conclusion

Each structured output approach serves different needs: tool calling is the most widely supported, XML preserves literary expression, and a custom DSL with an intermediate representation handles complex query generation. The choice ultimately depends on your specific requirements.