Summary: The best structured input format for an LLM prompt depends on the shape of the data, but our research found a clear production winner for large evidence-heavy workflows: lossless evidence aliases. They reduced input size while preserving output integrity: accurate answers, complete required fields, stable citations, proper response length, and schema-valid output. TOON-style tables are a strong general-purpose runner-up for uniform object arrays. CSV/TSV works for flat data. XML tags are useful for separating prompt sections. YAML and raw JSON should be used more selectively. Keep JSON at the API boundary, not as the default prompt body.
Quick Answer: What is the best structured prompt format?
For large production workflows, the #1 recommendation is a lossless evidence-alias pattern: define repeated metadata once, encode evidence as compact rows, preserve stable citation IDs, validate that the encoded prompt can reconstruct the original records, and test that model outputs remain accurate, complete, and on target.
This is a pattern, not a universal file-format standard. If your payload is a simple uniform array and you do not need audit-grade citation reconstruction, TOON or CSV/TSV may be the more practical choice. If your payload is small, irregular, or deeply nested, minified JSON may be better. The ranking below is for evidence-heavy LLM inputs where correctness, traceability, and output quality matter as much as token count.
The practical ranking is:
| Rank | Format | Best for | Verdict |
|---|---|---|---|
| 1 | Lossless evidence aliases | Large auditable workflows with files, records, citations, and repeated metadata | Best balance of token reduction and output integrity |
| 2 | TOON-style tables | Uniform arrays of objects | Strong compact format when rows are regular and output schema is explicit |
| 3 | CSV / TSV plus metadata blocks | Flat tabular data | Efficient and accurate for simple tables, weak for nested evidence |
| 4 | XML-tagged prompt sections | Separating instructions, examples, context, and documents | Improves instruction adherence, not a data-compression format |
| 5 | YAML | Human-readable configs and moderate nesting | Useful for humans, weaker for strict machine contracts |
| 6 | Minified JSON | Small, nested, irregular objects | Safe fallback, still pays repeated-key cost |
| 7 | Pretty JSON | Debugging and human review | Worst default for large prompts; wastes context that could support better output |
The ranking criteria were:
- token reduction
- source-data preservation
- output accuracy and completeness
- citation validity
- schema adherence
- response length control
- implementation complexity
- fit for the data shape
The winning architecture is not one format everywhere. It is this:
application JSON
-> lossless evidence-alias encoder for model input
-> compact LLM prompt
-> provider structured output / schema validation
-> application JSON
Structured outputs control what comes out of the model. Evidence-alias encoding controls what you spend sending data in.
Should you send JSON to an LLM?
Use JSON at the boundary. Do not automatically use JSON inside the prompt.
If your application needs a typed response, use provider structured-output features: OpenAI Structured Outputs, Gemini response schemas, Claude JSON outputs, or strict tool use. If your application needs to send a large evidence set into the model, compress the input first with a lossless prompt encoding that removes repeated structure while preserving every source record.
Why engineers default to JSON
JSON is the obvious choice because every production system already understands it. It is portable, machine-readable, schema-compatible, easy to validate, easy to log, and easy to pass across APIs.
That is exactly why it should stay in the system.
The mistake is assuming that the same representation is optimal for the model’s attention and tokenizer. JSON was designed for software interchange, not for stuffing hundreds of near-identical records into a context window.
Consider a common agent workflow:
[
{
"file_id": "f_01H...",
"filename": "invoice_packet.pdf",
"segment_id": "s_01H...",
"page_start": 4,
"page_end": 5,
"schema_id": "document_extraction_v3",
"extraction_model": "example-extraction-model",
"customer_name": "Example Customer Inc.",
"field_name": "invoice_total",
"field_value": "12400",
"confidence": 0.98
},
{
"file_id": "f_01H...",
"filename": "invoice_packet.pdf",
"segment_id": "s_01H...",
"page_start": 4,
"page_end": 5,
"schema_id": "document_extraction_v3",
"extraction_model": "example-extraction-model",
"customer_name": "Example Customer Inc.",
"field_name": "payment_terms",
"field_value": "net_30",
"confidence": 0.97
}
]
Most of that text is repeated scaffolding. The LLM needs the evidence. It does not need the same file name, schema ID, model name, and object keys repeated on every row.
The benchmark
We tested this on a representative document-intelligence payload with extracted records, source files, segment references, evidence references, and policy-evaluation context. The examples here are generalized so the pattern is easier to apply across domains.
The baseline was a raw JSON-style prompt with repeated object keys and repeated metadata.
| Format | Relative input size | Reduction vs. raw JSON | Data preserved | Output integrity |
|---|---|---|---|---|
| Raw repeated JSON prompt | 1.00x | baseline | All extracted records | Accurate on small inputs; more likely to waste context on large repeated payloads |
| Lossless evidence aliases | ~0.25x | ~75% | All records and original data cells | Strongest balance of accuracy, completeness, citations, and length control |
| Evidence aliases plus response schema | ~0.25x | ~75% | All records and original data cells | Strongest production pattern when paired with provider structured outputs |
The key is that the compact versions were not summaries. The manifest preserved:
- every file extraction record
- every file reference
- every segment reference
- every evidence alias
- repeated string aliases
- original data key/value cells
This is prompt compression by representation, not by deletion.
Output integrity matters as much as token count
The goal is not to make prompts smaller at any cost. A compressed prompt is only useful if the answer still lands correctly.
In our testing, each format was evaluated across output quality dimensions, not just input size:
| Integrity check | What it tests | Why it matters |
|---|---|---|
| Source preservation | Can the encoded prompt reconstruct the original records? | Prevents silent data loss |
| Required field coverage | Does the answer include every required output field? | Avoids incomplete downstream payloads |
| Semantic accuracy | Does the model reach the same conclusion as the reference answer? | Measures whether compression changed the meaning |
| Citation validity | Do evidence references map back to real source records? | Keeps the answer auditable |
| Length adherence | Is the answer the right level of detail, not too short or bloated? | Prevents unusable summaries and runaway responses |
| Schema validity | Does the response conform to the expected output contract? | Keeps application parsing reliable |
| Boundary adherence | Does the model avoid unsupported claims outside the evidence? | Reduces hallucinated facts |
This changed the ranking. A format that saves tokens but causes omitted fields, broken citations, or under-specified answers is not a winner. The best formats were the ones that reduced repeated structure while making the model’s job clearer: what evidence exists, how records relate, what must be cited, and what shape the answer must take.
That is why evidence aliases ranked first. They did not merely shrink the input. They preserved the evidence map, gave the model compact handles for claims, and left more room for task instructions, output requirements, and response schema. TOON-style and CSV/TSV formats also performed well when the data was regular, but they needed explicit metadata and schema rules to maintain the same citation and completeness guarantees.
Raw JSON remained useful as a safe interchange format, but it was not automatically better for output quality. On large repeated payloads, raw JSON can spend too much context on serialization scaffolding. That leaves less room for the instructions that keep the answer accurate, complete, properly scoped, and appropriately sized.
What external research says
The broader engineering literature supports the direction of this pattern, but it also adds caveats that matter in production.
First, TOON is real and useful, but its best use case is narrower than some of the hype suggests. The TOON project describes it as a lossless input representation for JSON data, with a sweet spot in uniform arrays of objects. Its own benchmark focuses on comprehension and data retrieval: the model receives formatted data and answers questions about it. That is a reading task, not a test of whether models should generate TOON as an output format.
A 2026 TOON-vs-JSON generation benchmark is more cautious. It found that TOON can have a promising accuracy-per-token profile, but that its advantage can be reduced by the extra prompt instructions needed to teach or constrain the format. The paper also found that plain JSON generation had the best final accuracy in some settings, while constrained JSON had the lowest output token budget with some accuracy tradeoffs. That matches the recommendation here: compact formats are best for model input, but JSON Schema and structured outputs are still the safer production boundary for model output.
Second, the idea of declaring schema once and sending many rows is showing up outside TOON. ONTO, a columnar notation proposed for LLM input optimization, uses the same core design: declare field names once, arrange values in compact rows, and preserve hierarchy with indentation. Its reported reductions versus JSON and comprehension checks support the broader point: repeated object keys are often the main source of JSON overhead, and row-oriented encodings can preserve task accuracy when the format context is clear.
Third, prompt compression research is more skeptical than token-savings blog posts. “Prompt Compression in the Wild” found that real latency gains depend on whether compression overhead is offset by faster decoding. “The Compression Paradox in LLM Inference” found that aggressive compression can cause quality loss and provider-dependent behavior. This is why the benchmark criteria above include output integrity, unsupported claim rate, response length adherence, and cost per successful task. Fewer input tokens are not automatically better.
Fourth, provider structured-output features are output controls, not input-compression strategies. OpenAI’s Structured Outputs documentation distinguishes JSON mode from schema adherence: JSON mode can produce valid JSON without guaranteeing that it matches a schema, while Structured Outputs are designed for schema matching. Gemini’s structured-output docs similarly support a subset of JSON Schema. These tools are exactly what you want at the response boundary, but they do not remove repeated keys from the prompt you send into the model.
Structured-output benchmarks reinforce the same point. StructEval evaluates formats such as JSON, YAML, XML, CSV, HTML, and SVG using syntax and structural-correctness metrics, and SoEval was created specifically because structured-output capability was under-measured in general LLM benchmarks. The implication for engineering teams is straightforward: do not assume “looks structured” means “is valid, complete, and correct.” Parse it, validate it, and score it against the task.
Finally, long-context research gives another reason to avoid bloated prompts, but it should not be overstated. “Lost in the Middle” showed that models can struggle to use information placed in the middle of long contexts. Newer long-context models have improved on simple retrieval tasks, so the argument is not “models cannot read long context.” The practical point is narrower: if repeated serialization scaffolding consumes context, cost, and attention, remove it before it crowds out evidence and instructions.
Why raw JSON burns tokens
Tokenizers do not understand that repeated JSON keys are “free.” They see text. The OpenAI tokenizer guide puts it plainly: models see text as tokens, and token counts determine whether an input fits and what it costs. That means "filename" repeated hundreds of times is not metadata. It is billable context.
JSON bloat usually comes from five places:
- Repeated object keys
- Repeated metadata values
- Nested object structure
- Verbose IDs and URLs
- Pretty-printing and indentation
Minifying JSON helps, but only at the margins. It removes whitespace. It does not remove repeated keys or repeated values.
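As a sanity check, the effect can be measured directly with a tokenizer. The sketch below uses the tiktoken package and a generic encoding; the sample record is a stand-in for the payload above, and exact counts vary by model.

```python
import json
import tiktoken

# One extracted record with the repeated metadata from the example above.
record = {
    "file_id": "f_01H...",
    "filename": "invoice_packet.pdf",
    "schema_id": "document_extraction_v3",
    "field_name": "invoice_total",
    "field_value": "12400",
    "confidence": 0.98,
}
records = [record] * 100  # simulate a large, repetitive payload

enc = tiktoken.get_encoding("cl100k_base")
pretty = json.dumps(records, indent=2)
minified = json.dumps(records, separators=(",", ":"))

print("pretty JSON tokens:  ", len(enc.encode(pretty)))
print("minified JSON tokens:", len(enc.encode(minified)))
# Minification drops whitespace tokens, but every key and every repeated
# metadata value is still encoded once per record.
```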
The better pattern: evidence aliases
Evidence alias encoding normalizes the prompt the same way a database normalizes repeated data.
Instead of repeating file and segment metadata on every record, define it once:
FILE_REF
F001|invoice_packet.pdf|uploaded_document
F002|payment_terms.csv|system_export
SEGMENT_REF
S001|F001|page=1|status=completed|confidence=0.98
S002|F002|page=1|status=completed|confidence=0.97
EVIDENCE_REF
E001|S001
E002|S002
Then send the data as rows:
INVOICE_DATA
eid|field|value|unit|period
E001|invoice_total|12400|USD|2026-04
E001|payment_terms|net_30|text|2026-04
E001|due_date|2026-05-15|date|2026-04
E002|approved_limit|15000|USD|2026-04
The model can still cite E001. The application can still expand E001 back to the exact file, segment, page, confidence, and extraction record. The prompt just stops paying to repeat that metadata every time.
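A minimal sketch of that expansion on the application side, using the table names from the example above; a production decoder would parse the reference blocks out of the prompt text instead of hard-coding them.

```python
# Canonical reference tables held by the application (decoder side).
FILE_REF = {"F001": {"filename": "invoice_packet.pdf", "category": "uploaded_document"}}
SEGMENT_REF = {"S001": {"file": "F001", "page": 1, "status": "completed", "confidence": 0.98}}
EVIDENCE_REF = {"E001": {"segment": "S001"}}

def expand_evidence(eid: str) -> dict:
    """Join an evidence alias back to its segment and file metadata."""
    segment_id = EVIDENCE_REF[eid]["segment"]
    segment = SEGMENT_REF[segment_id]
    file_meta = FILE_REF[segment["file"]]
    return {"evidence_id": eid, "segment_id": segment_id, **segment, **file_meta}

print(expand_evidence("E001"))
# {'evidence_id': 'E001', 'segment_id': 'S001', 'file': 'F001', 'page': 1,
#  'status': 'completed', 'confidence': 0.98, 'filename': 'invoice_packet.pdf', ...}
```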
Why this is different from summarization
Summarization reduces the prompt by discarding detail.
Evidence aliases reduce the prompt by eliminating duplication.
That distinction matters for regulated workflows, financial analysis, legal review, security, healthcare, and any use case where the answer needs to be auditable. If the model says “the invoice amount exceeds the approved limit,” the system must be able to show which source file and value supported that statement. A compressed summary cannot always do that. A lossless alias can.
The encoder should pass a reconstruction test:
raw extraction JSON
-> encode to prompt aliases
-> decode aliases back to canonical JSON
-> compare hashes and record counts
If the decoded version does not match the canonical source, the format is not lossless enough for production.
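One way to write that test, as a sketch: canonicalize both sides and compare digests and counts. The encode_prompt and decode_prompt functions are placeholders for your own encoder and decoder.

```python
import hashlib
import json

def canonical_hash(records: list[dict]) -> str:
    """Hash a canonical serialization of the records (assumes stable record order)."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_lossless(records: list[dict], encode_prompt, decode_prompt) -> None:
    prompt_text = encode_prompt(records)   # raw extraction JSON -> alias prompt
    decoded = decode_prompt(prompt_text)   # alias prompt -> canonical JSON
    assert len(decoded) == len(records), "record count changed"
    assert canonical_hash(decoded) == canonical_hash(records), "content changed"
```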
Where JSON still belongs
This argument is not “replace JSON everywhere.”
JSON is still the right format for:
- API requests and responses
- database persistence
- validation contracts
- typed SDKs
- event streams
- audit exports
- model output schemas
The right split is:
Use compact structured text for model input.
Use JSON Schema for model output.
Use JSON for application boundaries.
OpenAI’s structured-output guidance makes the output side clear: JSON mode only guarantees valid JSON, while Structured Outputs match the response to a schema when supported. Gemini’s structured outputs similarly let developers provide a JSON Schema so the model returns predictable, type-safe JSON. Claude now supports JSON outputs and strict tool use for schema validation.
Those features are excellent. They do not solve input token bloat by themselves.
JSON mode is not the same thing as structured outputs
Many teams conflate three different things:
- “Please respond in JSON.”
- JSON mode.
- Strict schema-constrained structured outputs.
They are not equivalent.
OpenAI documents the distinction directly: JSON mode ensures valid JSON, but it does not guarantee that the output matches a specific schema. Structured Outputs are the stronger feature for schema adherence. Gemini’s docs make a similar point from the provider side: structured outputs can produce syntactically valid JSON matching a provided schema, but application code still needs semantic validation. Claude distinguishes JSON outputs from strict tool use: one controls the final response format, the other validates tool parameters.
In production, the safe pattern is:
Prompt: compact lossless input
Output: provider structured output
Application: validate business semantics anyway
Schema conformance does not mean the model reasoned correctly. It means the result is shaped correctly enough for your code to inspect it.
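To make that split concrete, here is a minimal sketch of the output side using the OpenAI Chat Completions json_schema response format. The model name is a placeholder, compact_prompt is assumed to hold the alias-encoded input, and other providers expose equivalent schema features under different parameter names.

```python
from openai import OpenAI

client = OpenAI()

REVIEW_SCHEMA = {
    "type": "object",
    "required": ["status", "reasoning", "evidence_refs"],
    "additionalProperties": False,
    "properties": {
        "status": {"type": "string", "enum": ["pass", "exception", "needs_review"]},
        "reasoning": {"type": "string"},
        "evidence_refs": {"type": "array", "items": {"type": "string"}},
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Review the evidence and report policy exceptions."},
        {"role": "user", "content": compact_prompt},  # assumed: alias-encoded input
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review", "strict": True, "schema": REVIEW_SCHEMA},
    },
)
# Schema conformance is shape only; business semantics still need application checks.
```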
The ranked recommendations
Here is the practical order we would recommend to an engineering team choosing a structured input format for LLM prompts.
1. Lossless evidence aliases
This is the top recommendation for production agent workflows where the model receives many extracted records, files, citations, policies, or source references.
Evidence aliases win because they remove repeated structure without removing evidence. The encoder declares repeated file metadata, segment metadata, string values, and source references once. The data rows then cite compact IDs such as F001, S001, A001, and E001.
Why this is the best choice:
- It produced the strongest result in our representative benchmark: roughly 75% fewer input tokens while preserving the records and data cells needed for reconstruction.
- It keeps source citations stable, which matters for regulated, financial, legal, healthcare, and security workflows.
- It is lossless when implemented with a decoder and reconstruction test.
- It supported accurate, complete, properly scoped outputs because the model could cite compact evidence IDs instead of scanning repeated JSON scaffolding.
- It improved length control by leaving more context for task instructions, output constraints, and response schemas.
- It lets the model reason over compact evidence while the application keeps canonical JSON for storage, validation, and audit.
The tradeoff is engineering discipline. You need an encoder, a decoder, a manifest, token tests, and clear reconstruction rules. That is worth it for serious workflows because it turns prompt compression into software infrastructure instead of prompt hacking.
2. TOON-style table formats
TOON, or Token-Oriented Object Notation, is the best general-purpose alternative when your input is a uniform array of objects. Its core idea is similar to evidence aliases: declare fields once, then stream rows.
Why it ranks highly:
- It directly targets the biggest JSON problem: repeated keys in arrays of similar objects.
- It is more readable than dense custom encodings.
- It can be easier to adopt than building a domain-specific alias format from scratch.
- It is a strong fit for product catalogs, events, extracted entities, search results, and other repeated object lists.
- It can preserve output accuracy when the row shape is regular and the prompt explains how fields map into the output schema.
The limitation is that TOON is not automatically the best format for every shape. Deeply nested, irregular, or small objects may not benefit enough to justify the extra instructions. It is also a format choice, not a full audit system. If your workflow needs citation-preserving reconstruction, TOON may need additional evidence IDs and manifests layered on top.
3. CSV / TSV plus metadata blocks
CSV and TSV are excellent when the data is actually tabular. They are hard to beat for compactness when every row has the same columns.
Why they work:
- Column names appear once.
- Rows are compact.
- Models generally understand tables.
- They are simple to generate and inspect.
- They can produce concise, on-target answers when the task is a straightforward table evaluation.
The weakness is semantics. CSV does not naturally express nested objects, missing-vs-null distinctions, evidence provenance, data types, or per-cell confidence. For production use, pair tables with explicit metadata blocks:
TABLE: invoice_review
COLUMNS: evidence_id,field,value,units,threshold,status
NULL_RULES: __MISSING__ means absent key; null means explicit source null
UNITS: value and threshold are reported in the units column when applicable
This ranks below TOON and evidence aliases because it needs extra conventions once the workflow becomes more than a flat table.
4. XML-tagged prompt sections
XML tags are a strong prompt-organization tool. They are especially useful when you need to separate instructions, context, examples, documents, tool results, and output rules.
Why they are useful:
- Tags make boundaries explicit.
- They reduce instruction/context confusion.
- They are easy to nest around documents and examples.
- Anthropic explicitly recommends XML tags for structuring complex prompts.
But XML tags are not primarily a token-optimization format. They can make a prompt more reliable while also making large tabular payloads more verbose. Use XML-like tags for prompt boundaries; do not use XML as the main encoding for thousands of repeated data fields unless the task specifically benefits from that structure.
5. YAML
YAML is useful when humans need to read and edit the prompt payload. It can be more compact and readable than pretty JSON for moderate configuration objects.
Why it can help:
- It is easier for humans to scan.
- It avoids some JSON punctuation.
- It is convenient for configuration, rules, and small nested structures.
Why it ranks lower:
- Whitespace and indentation matter.
- Ambiguity around scalars can create parsing surprises.
- It is not as strong as JSON Schema or provider structured outputs for hard machine contracts.
- Large arrays still repeat keys unless you restructure them.
YAML is a good authoring format. It is not the best answer for high-volume evidence encoding.
6. Minified JSON
Minified JSON is the best fallback when the object is small, irregular, deeply nested, or when your team cannot safely introduce another representation yet.
Why it remains useful:
- Every system understands it.
- It preserves exact machine semantics.
- It is easy to validate.
- It avoids whitespace waste.
- It can be the most reliable option for small, irregular payloads where token bloat is not the limiting factor.
The problem is that minification only removes whitespace. It does not solve repeated keys, repeated metadata, or repeated IDs. For a few objects, that is fine. For hundreds of evidence records, it is usually the wrong default.
7. Pretty JSON
Pretty JSON belongs in logs, docs, debugging, and human review. It should be the last choice for large production prompts.
Why it ranks last:
- It repeats every key.
- It adds whitespace and indentation.
- It often repeats metadata across records.
- It can push workflows over context limits while adding no model-relevant information.
Pretty JSON is comfortable for engineers, but comfort is not the same as prompt efficiency.
The winners from this research
The research points to a three-part production pattern:
- Use lossless evidence aliases as the default format for large, auditable LLM inputs.
- Use TOON-style tables or CSV/TSV for simpler uniform arrays where full evidence reconstruction is not required.
- Score every format on output integrity: accuracy, completeness, citation validity, length adherence, and schema validity.
- Use provider structured outputs to return validated JSON at the application boundary.
That is the important distinction. The input format should be optimized for the model’s context window. The output format should be optimized for application validation.
A production encoder design
A production encoder should treat prompt format as an interface with tests.
1. Normalize repeated metadata
Create reference tables:
CONFIG_REF
C001|model=example-model|schema=document_extraction_v2
FILE_REF
F001|file_id=...|filename=...|category=...
SEGMENT_REF
S001|F001|page_start=1|page_end=2|confidence=0.98|C001
2. Alias repeated strings
Customer names, policy names, document categories, reporting periods, and source labels repeat constantly.
STRING_ALIAS
A001|Example Customer Inc.
A002|Payment Terms Review
A003|April 2026
Then rows can use @A001 instead of repeating long strings.
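A sketch of steps 1 and 2 in a few lines: assign sequential aliases to repeated values and render a reference block. The prefixes and block names follow the examples above; escaping, evidence IDs, and null handling come next.

```python
def build_alias_table(values: list[str], prefix: str) -> dict[str, str]:
    """Assign a stable sequential alias (e.g. A001) to each distinct value."""
    table: dict[str, str] = {}
    for value in values:
        if value not in table:
            table[value] = f"{prefix}{len(table) + 1:03d}"
    return table

def render_block(name: str, table: dict[str, str]) -> str:
    """Render a reference block like STRING_ALIAS with one alias|value row per entry."""
    return "\n".join([name] + [f"{alias}|{value}" for value, alias in table.items()])

names = ["Example Customer Inc.", "Example Customer Inc.", "Payment Terms Review"]
print(render_block("STRING_ALIAS", build_alias_table(names, "A")))
# STRING_ALIAS
# A001|Example Customer Inc.
# A002|Payment Terms Review
```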
3. Keep evidence IDs stable
Every row that may support a claim needs an evidence handle:
EVIDENCE_ALIAS
E001|S001|source=invoice_packet.pdf
The LLM should cite E001, not invent a citation.
4. Preserve null vs. missing
Production encoders need boring details:
__MISSING__ = key did not exist in source record
null = key existed with explicit JSON null
If you collapse these, you lose data semantics.
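A small sketch of keeping that distinction intact during encoding, using the sentinel convention above:

```python
MISSING = "__MISSING__"

def encode_cell(record: dict, key: str) -> str:
    """Distinguish an absent key from an explicit JSON null."""
    if key not in record:
        return MISSING      # key did not exist in the source record
    value = record[key]
    if value is None:
        return "null"       # key existed with an explicit null
    return str(value)

print(encode_cell({"amount": None}, "amount"))    # -> null
print(encode_cell({"amount": None}, "currency"))  # -> __MISSING__
```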
5. Validate round-trip integrity
The encoder should emit a manifest:
{
"source_sha256": "...",
"records_preserved": "...",
"data_cells_preserved": "...",
"file_refs": "...",
"segment_refs": "...",
"evidence_aliases": "...",
"string_aliases": "..."
}
This makes token optimization testable instead of aesthetic.
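A sketch of how the encoder might emit that manifest, assuming it already tracks its reference tables as dictionaries; the field names follow the example above, and the cell count here simply counts top-level keys per record.

```python
import hashlib
import json

def build_manifest(source_records: list[dict], file_refs: dict, segment_refs: dict,
                   evidence_aliases: dict, string_aliases: dict) -> dict:
    """Summarize what the encoder preserved so round-trip tests have a target."""
    canonical = json.dumps(source_records, sort_keys=True, separators=(",", ":"))
    return {
        "source_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "records_preserved": len(source_records),
        "data_cells_preserved": sum(len(r) for r in source_records),
        "file_refs": len(file_refs),
        "segment_refs": len(segment_refs),
        "evidence_aliases": len(evidence_aliases),
        "string_aliases": len(string_aliases),
    }
```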
Prompt template
A compact prompt should make the reconstruction rules obvious:
TASK
Review the document package and identify any policy exceptions.
Use only the evidence rows and reference tables below.
Return evidence IDs for every material claim.
RECONSTRUCTION RULES
- E### joins to EVIDENCE_ALIAS.eid.
- EVIDENCE_ALIAS.sid joins to SEGMENT_REF.sid.
- SEGMENT_REF.fid joins to FILE_REF.fid.
- @A### expands through STRING_ALIAS.
- __MISSING__ means the source key was absent.
- null means the source key was explicitly null.
FILE_REF
...
SEGMENT_REF
...
STRING_ALIAS
...
DATA_TABLES
...
Then enforce the response at the API boundary:
{
"type": "object",
"required": ["status", "reasoning", "evidence_refs"],
"additionalProperties": false,
"properties": {
"status": {
"type": "string",
"enum": ["pass", "exception", "needs_review"]
},
"reasoning": {
"type": "string"
},
"evidence_refs": {
"type": "array",
"items": { "type": "string" }
}
}
}
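After the provider responds, the application can check the same contract again and add semantic rules no schema can express, such as whether every cited evidence ID actually appeared in the prompt. A sketch using the Python jsonschema package; the schema argument is the contract above, and known_evidence_ids is assumed to come from the encoder.

```python
import json
from jsonschema import validate

def parse_and_check(raw_response: str, schema: dict, known_evidence_ids: set[str]) -> dict:
    """Validate the response shape, then the evidence references it cites."""
    payload = json.loads(raw_response)
    validate(instance=payload, schema=schema)  # raises on schema violations
    unknown = set(payload["evidence_refs"]) - known_evidence_ids
    if unknown:
        raise ValueError(f"response cites evidence IDs not in the prompt: {sorted(unknown)}")
    return payload
```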
What to benchmark
Do not choose a prompt format from a blog post, including this one. Run your own benchmark.
Track:
- input tokens
- output tokens
- total latency
- time to first token
- schema-valid response rate
- required field coverage
- semantic correctness
- citation correctness
- response length adherence
- unsupported claim rate
- reconstruction success
- cost per successful task
Prompt-compression research has the same warning. LLMLingua showed that prompt compression can reach very high compression ratios with limited quality loss in benchmark settings. A 2026 “Prompt Compression in the Wild” study found that real speedups depend on whether compression overhead is offset by faster decoding. In other words: fewer tokens are usually good, but token count is not the only metric.
For structured business workflows, add one more metric: audit survivability.
If the model makes a claim, can the application trace that claim back to a source document after compression?
Also test answer shape directly. A response can be schema-valid and still be too thin, too verbose, or off target. For production workflows, define a target output envelope before benchmarking:
OUTPUT_QUALITY_TARGETS
- include every required decision field
- include citations for every material claim
- keep summary length within the requested range
- avoid claims not supported by evidence IDs
- preserve required tables, sections, or action items
- return machine-readable JSON at the boundary
This is where compact input formats can help output quality. When repeated prompt scaffolding is removed, the model has more usable context for the instructions that keep the response complete, appropriately detailed, and parseable.
Where this matters most
This pattern is most valuable when:
- You send many records into a model.
- Records share common keys and metadata.
- You need evidence citations.
- You need downstream structured JSON.
- You are near context-window limits.
- You run the workflow often enough for token cost to matter.
Examples:
- document extraction review
- loan underwriting packets
- contract compliance monitoring
- insurance claim files
- medical necessity review
- compliance testing
- security event triage
- RAG systems with many retrieved passages
- multi-agent workflows that pass state between steps
This is less valuable when:
- The object is small.
- The data is deeply nested and irregular.
- A model already consumes a file natively.
- The LLM call is infrequent.
- Human readability is more important than token cost.
The rule of thumb
Use JSON for machines. Use compact evidence formats for models. Use schemas for outputs.
Raw JSON is a good system boundary and a bad default prompt body. If your prompt includes hundreds of repeated JSON objects, you are probably paying the model to read your serialization format instead of your data.
The fix is not to remove evidence. The fix is to encode evidence better.
Related implementation context
This article focuses on prompt input formats, but the format only works when it sits inside a larger production architecture. For regulated workflows, the surrounding system also needs a data engine that preserves source lineage, a policy engine that defines what evidence is required, and audit trails that connect every model-assisted decision back to source documents.
The same principle applies to document-heavy workflows such as document processing, policy evaluation, and building AI agents that do not hallucinate: compact inputs help, but only when extraction, validation, policy logic, and output schemas all preserve the evidence chain.
Implementation checklist
- Identify LLM calls that paste large JSON arrays into prompts.
- Count tokens for representative requests before changing anything.
- Build a lossless encoder for repeated metadata, evidence IDs, and row data.
- Add a decoder and hash-based round-trip test.
- Keep stable evidence aliases that the model can cite.
- Move output JSON schemas to provider structured-output APIs where supported.
- Validate semantic correctness after schema validation.
- Add token regression tests to CI (a minimal test sketch follows this checklist).
- Benchmark latency and cost, not only token count.
- Document format rules so prompt changes do not break reconstruction.
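The token regression test mentioned in the checklist can be as small as the sketch below, assuming tiktoken, a representative fixture file, and your own encode_prompt function; the budget is a placeholder taken from your measured baseline.

```python
import json
import tiktoken

TOKEN_BUDGET = 8_000  # placeholder: set from your measured baseline

def count_tokens(text: str) -> int:
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def test_encoded_prompt_stays_within_budget():
    with open("tests/fixtures/representative_payload.json") as f:  # hypothetical fixture
        records = json.load(f)
    prompt = encode_prompt(records)  # assumed: your lossless alias encoder
    assert count_tokens(prompt) <= TOKEN_BUDGET, "prompt encoding regressed past budget"
```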
Sources and further reading
- OpenAI: Structured model outputs
- OpenAI Cookbook: How to count tokens with tiktoken
- Google Gemini API: Structured outputs
- Claude API Docs: Structured outputs
- Claude API Docs: Prompting best practices and XML tags
- TOON: Token-Oriented Object Notation
- Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation
- ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
- Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
- The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression
- StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs
- Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs
- Lost in the Middle: How Language Models Use Long Contexts