Event-level StAX cursor + isbits Token — v0.4 streaming layer (#61) #1
Open
mathieu17g wants to merge 8 commits into
Adds `Cursor`: a single mutable wrapper advanced in place over the token stream (the cursor-based StAX direction from JuliaData#61). Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by mutating one object instead of materializing a node per child.
Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer foundation. The cursor's accessors rest on the token-layer primitives (tag_name, attr_value, pi_target, unescape) — they never call LazyNode or its accessors, so DOM-layer changes don't affect the cursor. The token→value logic is intentionally duplicated rather than shared, to keep this purely additive (a later refactor can factor it out).
API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof, get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way snapshot bridge for the aliasing contract (the cursor is reused in place; reads are synchronous-safe, retention requires a snapshot).
Tests: test/test_cursor.jl (46 cases) — depth model on hand-counted docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity values, accessor agreement with LazyNode node-for-node, snapshot survival, iterator protocol. Full suite passes.
Perf (N=100k synth, vs the lazy-walk techniques in JuliaData#61): Cursor next!() DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive ~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does not yet reach the v0.3.8+JuliaData#59 next!()-DFS class (57 ms / 123 MiB): the residual ~1 alloc/token is the non-isbits Token tuple at the iterate boundary, which a follow-up bitstype-Token change removes.
Ref: JuliaData#61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
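The aliasing contract above (one cursor reused in place, retention requires a snapshot) can be illustrated with a toy model. This is only a sketch: `ToyCursor`, `toy_next!`, `toy_value`, `ToySnapshot` and `toy_snapshot` are hypothetical names invented for the illustration and are not the `Cursor` / `LazyNode(c)` code in this PR.

```julia
# Toy model of an in-place cursor; NOT the XML.jl implementation.
mutable struct ToyCursor
    events::Vector{Pair{Symbol,String}}  # pre-tokenized (kind => text) events
    pos::Int                              # index of the current event; 0 = before the first
end
ToyCursor(events) = ToyCursor(events, 0)

# Advance in place. Returns the same cursor object, or `nothing` at end of stream.
function toy_next!(c::ToyCursor)
    c.pos >= length(c.events) && return nothing
    c.pos += 1
    return c
end

toy_value(c::ToyCursor) = c.events[c.pos].second

# Immutable snapshot: copies what must survive later advances.
struct ToySnapshot
    kind::Symbol
    text::String
end
toy_snapshot(c::ToyCursor) = ToySnapshot(c.events[c.pos].first, c.events[c.pos].second)

c = ToyCursor([:text => "first", :text => "second"])
held = toy_next!(c)      # `held` is the same object as `c`
snap = toy_snapshot(c)   # independent copy of the current event
toy_next!(c)             # advancing mutates the one shared object

toy_value(held)          # "second": the retained cursor moved along with `c`
snap.text                # "first": the snapshot did not
```

Reading `toy_value(c)` immediately after an advance is safe; keeping `held` around and reading it later is not, which is the case the snapshot bridge exists for.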
…ocation
Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range —
`(kind, has_entities, offset::Int, ncodeunits::Int)` — making `Token`
non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple
returned by `iterate` is now isbits, so it returns in registers/sret with no
heap allocation even though the tokenizer body is too large to inline. This
removes the per-token allocation that was the cursor's residual cost (see JuliaData#61).
Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source.
Multibyte-safe: it lands the end index on the START of the last char via
`prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8
continuation byte as the end index and throws — verified on "aé"/"日本").
`_token_root` resolves `data::SubString` to its parent (offsets are
root-relative). This matters for the UTF-16 path of JuliaData#62, whose fix
transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)`
keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter
(the buffered `pending` Token is non-parametric); `has_pending` tests
`pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.
Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager
_parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end`
simplify to direct field access; `LazyAttrIterator` reaches the source via a
small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl.
xpath.jl needs no change (it uses a distinct `XPathToken` type).
Tests:
- Revives test/test_tokenizer.jl (was orphaned — not in runtests, and its
`using XML.XMLTokenizer` did not import the names so it could not run).
Fixed imports, migrated all `.raw`/accessor sites to thread the source,
updated the `show` test (no longer prints text), and wired it into
runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the
`raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175,
XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf +
195/940 not-wf (unchanged counts — Token is representational, the
accept/reject scan logic is untouched).
Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB/4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms/305 MiB → 83 ms / 30.5 MiB / 1.0M allocs,
below the tech-4 memory figure (123 MiB), achieving JuliaData#61's memory goal;
the time (83 ms) is still above tech-4's 57 ms. The
residual 30 MiB is the `value()::Union{SubString,String}` boxing (one per
text node) — orthogonal, a separate monomorphization micro-opt.
This modifies the core `Token` type, so it is NOT orthogonal/additive: it
needs coordination with the maintainer and rebasing onto JuliaData#54 before any
upstream merge. Develop in parallel on this stacked branch.
Ref: JuliaData#61, JuliaData#62
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
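To make the byte-range representation and the `prevind` end-index point concrete, here is a minimal sketch. `ToyToken` and `toy_raw` are hypothetical stand-ins: the field names follow the commit message, but the `kind` type, the exact layout and the function body are assumptions, not the code in src/.

```julia
# Toy stand-in for the byte-range token described above; not the PR's Token.
struct ToyToken
    kind::UInt8          # assumed: some small enum-like tag
    has_entities::Bool
    offset::Int          # 0-based byte offset into the source
    ncodeunits::Int      # length in bytes
end

# Rebuild the text view. SubString wants the index of the START of the last
# character, so step back from one-past-the-end with prevind.
function toy_raw(t::ToyToken, data::String)
    first_idx = t.offset + 1
    last_idx = prevind(data, t.offset + t.ncodeunits + 1)
    return SubString(data, first_idx, last_idx)
end

data = "aé"                                    # 'é' is 2 bytes, so 3 code units
tok = ToyToken(0x01, false, 0, ncodeunits(data))

isbitstype(ToyToken)     # true: the struct holds no heap references
sizeof(ToyToken)         # 24 on a 64-bit platform
toy_raw(tok, data)       # "aé"

# The naive end index lands on a UTF-8 continuation byte and throws.
naive_threw = try
    SubString(data, tok.offset + 1, tok.offset + tok.ncodeunits)
    false
catch err
    err isa StringIndexError
end
```

Because the token stores only integers, it never needs to know the string type, which is why the `{S}` parameter can go away; the cost is that every accessor has to be handed `data`.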
Nested for_each_child silently skipped a parent's second (and later) subtrees when the source had no inter-element whitespace (minified XML): the inner sweep broke on the boundary node by consuming it (next!() at the top of the loop), then the enclosing sweep's next!() advanced past that same node. Whitespace text nodes between elements accidentally masked the bug by serving as a throwaway boundary; minified machine-generated XML (common for KML) has none.
Fix: make the cursor peekable via a `held` flag. On reaching the end of its subtree a sweep sets `c.held` instead of consuming the boundary node; the next `next!` re-yields the held node without advancing, so the enclosing sweep sees it. Composition is then correct for full DFS at any depth, independent of whitespace.
Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS); full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754).
This is a correctness fix for the Phase-1 cursor; it is committed here on the stacked bitstype-Token branch but logically belongs on feature-cursor — move or reorder when restructuring for the upstream PR stack.
Ref: JuliaData#61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
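The `held` mechanism can be sketched on a toy cursor over a flat list of `(depth, name)` nodes. Everything here (`PeekCursor`, `advance!`, `each_child`, `node_depth`, `node_name`) is a hypothetical simplification for illustration; it is not the cursor code in src/cursor.jl, only a model of the same idea.

```julia
# Toy peekable cursor over (depth, name) nodes in document order.
mutable struct PeekCursor
    nodes::Vector{Tuple{Int,String}}
    pos::Int       # index of the current node; 0 = before the first
    held::Bool     # true when the current node must be re-yielded
end
PeekCursor(nodes) = PeekCursor(nodes, 0, false)

function advance!(c::PeekCursor)
    if c.held                   # re-yield the held node without advancing
        c.held = false
        return true
    end
    c.pos >= length(c.nodes) && return false
    c.pos += 1
    return true
end

node_depth(c::PeekCursor) = c.nodes[c.pos][1]
node_name(c::PeekCursor) = c.nodes[c.pos][2]

# Call `f` on each direct child of the current node.
function each_child(f, c::PeekCursor)
    parent_depth = node_depth(c)
    while advance!(c)
        if node_depth(c) <= parent_depth
            c.held = true       # boundary node: leave it for the enclosing sweep
            return
        end
        node_depth(c) == parent_depth + 1 && f(c)
    end
end

# Minified <a><b><x/></b><c><y/></c></a>: no whitespace nodes between elements.
nodes = [(0, "a"), (1, "b"), (2, "x"), (1, "c"), (2, "y")]
c = PeekCursor(nodes)
advance!(c)                     # position on the root
seen = String[]
each_child(c) do child
    push!(seen, node_name(child))
    each_child(child) do grandchild
        push!(seen, node_name(grandchild))
    end
end
seen                            # ["b", "x", "c", "y"]
```

In this toy, deleting the `c.held = true` line reproduces the reported symptom: the inner sweep consumes `c` as its boundary, the outer sweep advances past it, and only `b` and `x` are visited.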
…ch_child
Support for driving the cursor from a known subtree position (Phase 3 wiring):
- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a byte offset instead of the document start — for walking a subtree whose start is known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable convenience over it (the only place Cursor mentions LazyNode), the inverse of the LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.
- @for_each_child c child body: macro form of for_each_child that INLINES the body (not a closure), so a body accumulating into enclosing locals avoids the capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk: 80 B (macro) vs 237 KB (for_each_child do-block) — the latter is one Core.Box per mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.
7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer 122, W3C 754).
Ref: JuliaData#61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
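The closure-versus-macro difference is a general Julia behaviour and can be shown without any XML. The sketch below uses a hypothetical `@toy_for_each` macro over a plain collection; it takes `(collection, variable, body)` in the same argument order as `@for_each_child c child body`, but it is not that macro.

```julia
# Closure form: the do-block body is a separate function, so `total` is captured.
function sum_closure(xs)
    total = 0
    foreach(xs) do x
        total += x       # reassigning a captured local is what triggers boxing
    end
    return total
end

# Macro form: the body is spliced into a loop in the caller; no closure exists.
macro toy_for_each(coll, var, body)
    quote
        for $(esc(var)) in $(esc(coll))
            $(esc(body))
        end
    end
end

function sum_macro(xs)
    total = 0
    @toy_for_each xs x begin
        total += x       # plain local assignment inside the caller's own loop
    end
    return total
end

sum_closure(1:10)        # 55
sum_macro(1:10)          # 55
```

Both return the same result; the difference shows up in the lowered code. On the Julia versions I am aware of, `@code_warntype sum_closure(1:10)` reports `total` as a `Core.Box`, while `sum_macro` keeps it a concretely typed local, which is consistent with the 80 B vs 237 KB figures quoted above.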
…ral walks
next!/for_each_child advance token-by-token, so a structural walk that classifies a node but doesn't need its contents still tokenizes every skipped subtree. skip_element! advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw, + _scan_tag_end): counts element-nesting depth and respects CDATA / comment / PI / quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter loop than full tokenization (no token emission, no SubString construction).
Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms tokenizing the subtrees — ×3.4, and faster than the v0.3.8 next!() walk (~32 ms) too.
Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr value, nested same-name, self-close, PI, minified) confirm skip lands exactly where for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175, XMLTokenizer 122, W3C 754).
For structural walks like FastKML's layer discovery (the WRS-2 deficit).
Ref: JuliaData#61
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
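A depth-counting byte scan of the kind described above might look roughly like the following. This is a simplified sketch under stated assumptions: it assumes well-formed input, and `bytes_at`, `skip_past`, `toy_scan_tag_end` and `toy_skip_element` are hypothetical names, not `XMLTokenizer._skip_element_raw` or `_scan_tag_end`.

```julia
# True if the bytes of `pat` occur in `s` starting at byte index `i`.
function bytes_at(s::String, i::Int, pat::String)
    i + ncodeunits(pat) - 1 <= ncodeunits(s) || return false
    for k in 1:ncodeunits(pat)
        codeunit(s, i + k - 1) == codeunit(pat, k) || return false
    end
    return true
end

# Index of the first byte after the next `closer` at or after `i` (`i` sits on '<').
function skip_past(s::String, i::Int, closer::String)
    r = findnext(closer, s, i)
    r === nothing && error("unterminated construct")
    return last(r) + 1
end

# Scan a start tag beginning at `i`; a `>` inside a quoted attribute value is ignored.
# Returns (index after the closing '>', whether the tag is self-closing).
function toy_scan_tag_end(s::String, i::Int)
    quote_byte = 0x00
    j = i + 1
    while j <= ncodeunits(s)
        b = codeunit(s, j)
        if quote_byte != 0x00
            b == quote_byte && (quote_byte = 0x00)
        elseif b == UInt8('"') || b == UInt8('\'')
            quote_byte = b
        elseif b == UInt8('>')
            return j + 1, codeunit(s, j - 1) == UInt8('/')
        end
        j += 1
    end
    error("unterminated tag")
end

# Skip the element whose start tag begins at byte `i`; return the index just past it.
function toy_skip_element(s::String, i::Int)
    depth = 0
    while i <= ncodeunits(s)
        if codeunit(s, i) != UInt8('<')
            i += 1                                  # text content
        elseif bytes_at(s, i, "<!--")
            i = skip_past(s, i, "-->")              # comment
        elseif bytes_at(s, i, "<![CDATA[")
            i = skip_past(s, i, "]]>")              # CDATA section
        elseif bytes_at(s, i, "<?")
            i = skip_past(s, i, "?>")               # processing instruction
        elseif bytes_at(s, i, "</")
            i = skip_past(s, i, ">")                # end tag
            depth -= 1
            depth == 0 && return i
        else
            i, selfclosing = toy_scan_tag_end(s, i) # start or self-closing tag
            selfclosing || (depth += 1)
            depth == 0 && return i
        end
    end
    error("unterminated element")
end

doc = "<r><a x=\"1>2\"><![CDATA[</a>]]><!-- </a> --><a/></a><b/></r>"
start = first(findfirst("<a", doc))
stop = toy_skip_element(doc, start)
doc[stop:end]                                       # "<b/></r>"
```

The example document packs in the cases the commit lists: a `>` inside an attribute value, a literal `</a>` in CDATA and in a comment, a nested same-name element, and a self-closing tag. No token or `SubString` is built along the way; the scan only moves an integer index.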
A note on CI: GitHub doesn't run Actions for a fork-based PR, so the checks area here stays empty. I ran the full CI workflow on this PR's exact head commit (059e494) from my fork — green on all 15 matrix jobs (Julia lts / 1 / nightly across Linux, Windows and macOS): https://github.com/mathieu17g/XML.jl/actions/runs/26911360192
Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the current element (or `nothing` if it has attributes / isn't a single-text element). Non-destructive — reads via `_rescan`, so the cursor position is unchanged and callers still advance with `for_each_child` / `skip_element!`. Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`) with no per-element `LazyNode` snapshot.
Measured downstream on XLSX.jl's read path (building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory). Output byte-identical (checksum-verified).
test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttribute allocations by 36%
Builds on the streaming discussion in JuliaData#61 — this is the event-level (cursor-based) StAX primitive proposed there, plus the `Token` change that makes it allocation-free. Branched on this PR's head (e532a28) and targeting `joshday/XML.jl:main`, so merging folds it into JuliaData#54; happy to retarget to `JuliaComputing/XML.jl:main` once JuliaData#54 lands.
JuliaData#54 removes `next`/`prev`, and FastKML.jl + XLSX.jl's sheet reader lean on `next` for forward streaming — a forward pull cursor is the replacement.
Two layers (5 commits):
- `Cursor` — additive (new `src/cursor.jl` + exports + `test/test_cursor.jl`; no core changes): `next!`, `for_each_child` / `@for_each_child`, `tag`/`value`/`attributes`/`get`, the `Cursor(data, startpos)` / `Cursor(LazyNode)` bridges, and `skip_element!` (byte-level subtree skip). Merges on its own.
- `Token` — the core change (your call): `raw::SubString` → `(kind, has_entities, offset, ncodeunits)`, so the `(Token, TokenizerState)` iterate tuple stops heap-allocating per token; `raw(token, data)` reconstructs the view. Representational — the scan logic is untouched.
The rationale, the allocation/perf numbers, and the FastKML-vs-ArchGDAL benchmark table are in JuliaData#61. Full XML.jl suite stays green and byte-identical to baseline.
Notes:
- …`next!`); no current consumer uses `prev`, so a backward `prev!` is a possible API-parity follow-up, not included here.