feat(core): tolerant JSON parsing for LLM output (repair_json, loads_robust)#3120
Open
Bartok9 wants to merge 1 commit into
…ads_robust)

LLM responses frequently emit JSON wrapped in Markdown code fences, with // or /* */ comments, or with trailing commas before } or ], all of which break json.loads.

Add two stdlib-only helpers in dbgpt.util.json_utils:
- repair_json(text): strips a surrounding code fence, removes comments, and removes trailing commas, all while preserving double-quoted string contents.
- loads_robust(text, **kwargs): json.loads first, then retries on the repaired text only if strict parsing fails, so well-formed input is unaffected.

Covered by 12 new parametrized/unit tests. No breaking changes.
Problem
LLM responses very often emit JSON that is almost valid but breaks json.loads:
- wrapped in a Markdown code fence (```json … ```),
- containing // line comments or /* … */ block comments,
- with a trailing comma before } or ].

These defects are a recurring source of parse failures in the agent / output-parsing paths. Today callers must hand-roll cleanup (and the existing _format_json_str only handles newlines/tabs).
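A quick way to see the problem is to feed each defect to strict json.loads. The snippet below is only an illustration: the SAMPLES dict and the strict_failures helper are hypothetical names invented for this example and are not part of the PR.

```python
import json

# Hypothetical sample inputs, one per defect listed above.
SAMPLES = {
    "fenced": '```json\n{"a": 1}\n```',
    "line comment": '{"a": 1 // note\n}',
    "block comment": '{"a": /* note */ 1}',
    "trailing comma": '{"a": [1, 2,],}',
}


def strict_failures(samples):
    """Return the labels of the samples that strict json.loads rejects."""
    failed = []
    for label, text in samples.items():
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failed.append(label)
    return failed


if __name__ == "__main__":
    # Every sample is rejected by the strict parser.
    print(strict_failures(SAMPLES))
```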
Solution
Two small, stdlib-only helpers added to dbgpt.util.json_utils:
- repair_json(text): best-effort cleanup that, in order, strips a surrounding code fence, removes // and /* */ comments, and removes trailing commas before } / ]. All transformations preserve the contents of double-quoted strings (escape-aware), so commas/slashes inside string values are never touched.
- loads_robust(text, **kwargs): tries strict json.loads first and only falls back to parsing repair_json(text) if that raises JSONDecodeError. Well-formed input parses identically to json.loads; **kwargs are forwarded.

Tests
Added 12 tests in packages/dbgpt-core/src/dbgpt/util/tests/test_json_utils.py covering trailing commas (object/array), fenced blocks (with/without language tag), line/block comments, a combined case, string-content preservation, valid-JSON passthrough, and the unrepairable-input raise path.
Compatibility
loads_robust only diverges from json.loads when strict parsing already fails.
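For readers who want a feel for the behaviour without opening the diff, here is a minimal stdlib-only sketch of the two helpers. It is not the code in this PR: only the names repair_json and loads_robust come from the description, while the regex-based approach and the _FENCE_RE, _COMMENT_RE and _TRAILING_COMMA_RE constants are assumptions made for illustration, and the real implementation may differ in approach and edge cases.

```python
import json
import re

# Assumed helper patterns (hypothetical, not taken from the PR).
# A surrounding Markdown fence with an optional language tag on the first line.
_FENCE_RE = re.compile(r"\A\s*```[^\n`]*\n(.*?)\n?[ \t]*```\s*\Z", re.DOTALL)
# Either a double-quoted string (escape-aware, captured) or a comment.
_COMMENT_RE = re.compile(r'("(?:\\.|[^"\\])*")|//[^\n]*|/\*.*?\*/', re.DOTALL)
# Either a double-quoted string (captured) or a comma followed by } or ].
_TRAILING_COMMA_RE = re.compile(r'("(?:\\.|[^"\\])*")|,(\s*[}\]])')


def repair_json(text):
    """Best-effort cleanup: fence, then comments, then trailing commas."""
    match = _FENCE_RE.match(text)
    if match:
        text = match.group(1)
    # Strings match first at their opening quote and are put back unchanged,
    # so slashes and commas inside string values are never touched.
    text = _COMMENT_RE.sub(lambda m: m.group(1) or "", text)
    text = _TRAILING_COMMA_RE.sub(lambda m: m.group(1) or m.group(2), text)
    return text


def loads_robust(text, **kwargs):
    """Strict json.loads first; retry on the repaired text only on failure."""
    try:
        return json.loads(text, **kwargs)
    except json.JSONDecodeError:
        # If the repaired text is still invalid, JSONDecodeError propagates.
        return json.loads(repair_json(text), **kwargs)


if __name__ == "__main__":
    raw = '```json\n{"a": [1, 2,], // c\n "b": "x, } // y" /* z */,}\n```'
    print(loads_robust(raw))
```

Because the strict parse is attempted first, valid input never reaches the repair path in this sketch, which mirrors the compatibility claim above.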