Event-level StAX cursor + isbits Token — v0.4 streaming layer (#61) by mathieu17g · Pull Request #1 · joshday/XML.jl

mathieu17g · 2026-06-03T20:10:57Z

Builds on the streaming discussion in JuliaData#61 — this is the event-level (cursor-based) StAX primitive proposed there, plus the Token change that makes it allocation-free. Branched on this PR's head (e532a28) and targeting joshday/XML.jl:main, so merging folds it into JuliaData#54; happy to retarget to JuliaComputing/XML.jl:main once JuliaData#54 lands.

JuliaData#54 removes next/prev, and FastKML.jl + XLSX.jl's sheet reader lean on next for forward streaming — a forward pull cursor is the replacement.

Two layers (5 commits):

- Cursor (additive: new src/cursor.jl + exports + test/test_cursor.jl; no core changes): next!, for_each_child / @for_each_child, tag / value / attributes / get, the Cursor(data, startpos) / Cursor(LazyNode) bridges, and skip_element! (byte-level subtree skip). Merges on its own.
- isbits Token (the core change, your call): raw::SubString → (kind, has_entities, offset, ncodeunits), so the (Token, TokenizerState) iterate tuple stops heap-allocating per token; raw(token, data) reconstructs the view. Representational: the scan logic is untouched.

The rationale, the allocation/perf numbers, and the FastKML-vs-ArchGDAL benchmark table are in JuliaData#61. Full XML.jl suite stays green and byte-identical to baseline.

Notes:

Forward-only (next!); no current consumer uses prev, so a backward prev! is a possible API-parity follow-up, not included here.

Adds `Cursor`: a single mutable wrapper advanced in place over the token stream (the cursor-based StAX direction from JuliaData#61). Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by mutating one object instead of materializing a node per child.

Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer foundation. The cursor's accessors rest on the token-layer primitives (tag_name, attr_value, pi_target, unescape); they never call LazyNode or its accessors, so DOM-layer changes don't affect the cursor. The token→value logic is intentionally duplicated rather than shared, to keep this purely additive (a later refactor can factor it out).

API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof, get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way snapshot bridge for the aliasing contract (the cursor is reused in place; reads are synchronous-safe, retention requires a snapshot).

Tests: test/test_cursor.jl (46 cases): depth model on hand-counted docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity values, accessor agreement with LazyNode node-for-node, snapshot survival, iterator protocol. Full suite passes.

Perf (N=100k synth, vs the lazy-walk techniques in JuliaData#61): Cursor next!() DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive ~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does not yet reach the v0.3.8+JuliaData#59 next!()-DFS class (57 ms / 123 MiB): the residual ~1 alloc/token is the non-isbits Token tuple at the iterate boundary, which a follow-up bitstype-Token change removes.

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


…ocation

Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range, `(kind, has_entities, offset::Int, ncodeunits::Int)`, making `Token` non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple returned by `iterate` is now isbits, so it returns in registers/sret with no heap allocation even though the tokenizer body is too large to inline. This removes the per-token allocation that was the cursor's residual cost (see JuliaData#61).

Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source. Multibyte-safe: it lands the end index on the START of the last char via `prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8 continuation byte as the end index and throws; verified on "aé"/"日本"). `_token_root` resolves `data::SubString` to its parent (offsets are root-relative). This matters for the UTF-16 path of JuliaData#62, whose fix transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)` keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter (the buffered `pending` Token is non-parametric); `has_pending` tests `pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.

Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager _parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end` simplify to direct field access; `LazyAttrIterator` reaches the source via a small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl. xpath.jl needs no change (it uses a distinct `XPathToken` type).

Tests:
- Revives test/test_tokenizer.jl (was orphaned: not in runtests, and its `using XML.XMLTokenizer` did not import the names so it could not run). Fixed imports, migrated all `.raw`/accessor sites to thread the source, updated the `show` test (no longer prints text), and wired it into runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the `raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175, XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf + 195/940 not-wf (unchanged counts: Token is representational, the accept/reject scan logic is untouched).

Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB / 4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms / 305 MiB → 83 ms / 30.5 MiB / 1.0M allocs. Memory is now below the tech-4 target (57 ms / 123 MiB), achieving JuliaData#61's memory goal, while time remains above it. The residual 30 MiB is the `value()::Union{SubString,String}` boxing (one per text node), which is orthogonal: a separate monomorphization micro-opt.

This modifies the core `Token` type, so it is NOT orthogonal/additive: it needs coordination with the maintainer and rebasing onto JuliaData#54 before any upstream merge. Develop in parallel on this stacked branch.

Ref: JuliaData#61, JuliaData#62

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
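The byte-range representation and the `prevind` end-index rule described in this commit can be illustrated with a small standalone sketch. `ToyToken` and `toy_raw` are hypothetical names and the field types are assumptions; this is not the code in XML.jl, only an example of the same indexing idea.

```julia
# Toy stand-in for a byte-range token. Field names follow the commit message;
# the concrete field types are assumptions made for this sketch.
struct ToyToken
    kind::UInt8
    has_entities::Bool
    offset::Int       # zero-based byte offset into the source string
    ncodeunits::Int   # token length in bytes
end

# Rebuild the text view. SubString expects the index of the START of the last
# character, so step back from one-past-the-end with prevind.
function toy_raw(tok::ToyToken, data::String)
    first_index = tok.offset + 1
    last_index = prevind(data, tok.offset + tok.ncodeunits + 1)
    return SubString(data, first_index, last_index)
end

data = "aé"   # 'é' occupies two bytes in UTF-8, so ncodeunits(data) == 3
tok = ToyToken(0x01, false, 0, ncodeunits(data))

@assert isbitstype(ToyToken)          # no pointer fields, so no per-token heap box
@assert toy_raw(tok, data) == "aé"

# The naive end index lands on a continuation byte and is rejected.
naive_failed = try
    SubString(data, tok.offset + 1, tok.offset + tok.ncodeunits)
    false
catch err
    err isa StringIndexError
end
@assert naive_failed
```

Storing only integers keeps the struct isbits, at the cost of having to pass the source string to every accessor that needs the text.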

Nested for_each_child silently skipped a parent's second (and later) subtrees when the source had no inter-element whitespace (minified XML): the inner sweep broke on the boundary node by consuming it (next!() at the top of the loop), then the enclosing sweep's next!() advanced past that same node. Whitespace text nodes between elements accidentally masked the bug by serving as a throwaway boundary; minified machine-generated XML (common for KML) has none.

Fix: make the cursor peekable via a `held` flag. On reaching the end of its subtree a sweep sets `c.held` instead of consuming the boundary node; the next `next!` re-yields the held node without advancing, so the enclosing sweep sees it. Composition is then correct for full DFS at any depth, independent of whitespace.

Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS); full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754).

This is a correctness fix for the Phase-1 cursor; it is committed here on the stacked bitstype-Token branch but logically belongs on feature-cursor. Move or reorder when restructuring for the upstream PR stack.

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
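The `held` mechanism can be sketched on a toy cursor that walks a pre-flattened list of `(depth, name)` element events. Everything here (`ToyCursor`, the event list, the simplified `next!` and `for_each_child`) is a hypothetical stand-in that shares only the idea with the real cursor, not its implementation.

```julia
# Toy cursor over element-open events of a minified document, in document order.
mutable struct ToyCursor
    events::Vector{Tuple{Int,String}}   # (depth, element name)
    pos::Int
    held::Bool                          # true: the current event must be re-yielded
end
ToyCursor(events) = ToyCursor(events, 0, false)

depth(c::ToyCursor) = c.events[c.pos][1]
name(c::ToyCursor) = c.events[c.pos][2]

# Advance by one event, unless a sweep left its boundary node held.
function next!(c::ToyCursor)
    if c.held
        c.held = false
        return true
    end
    c.pos += 1
    return c.pos <= length(c.events)
end

# Call f on each direct child of the current element. On leaving the subtree,
# hold the boundary node instead of consuming it, so an enclosing sweep sees it.
function for_each_child(f, c::ToyCursor)
    parent_depth = depth(c)
    while next!(c)
        d = depth(c)
        if d <= parent_depth
            c.held = true
            break
        end
        d == parent_depth + 1 && f(c)
    end
    return nothing
end

# Events for the minified document <a><b><x/><y/></b><c><z/></c></a>
doc = [(0, "a"), (1, "b"), (2, "x"), (2, "y"), (1, "c"), (2, "z")]

seen = String[]
c = ToyCursor(doc)
next!(c)                                  # position on the root <a>
for_each_child(c) do child
    push!(seen, name(child))
    for_each_child(child) do grandchild
        push!(seen, name(grandchild))
    end
end

@assert seen == ["b", "x", "y", "c", "z"]
```

Without the flag, the inner sweep over `<b>` would consume `<c>` as its boundary and the outer sweep's next advance would step past it, so `c` would never be reported.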

…ch_child

Support for driving the cursor from a known subtree position (Phase 3 wiring):
- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a byte offset instead of the document start, for walking a subtree whose start is known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable convenience over it (the only place Cursor mentions LazyNode), the inverse of the LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.
- @for_each_child c child body: macro form of for_each_child that INLINES the body (not a closure), so a body accumulating into enclosing locals avoids the capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk: 80 B (macro) vs 237 KB (for_each_child do-block); the latter is one Core.Box per mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.

7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer 122, W3C 754).

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
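The difference between a do-block and an inlining macro can be seen in a small generic sketch. `@each_item`, `sum_closure`, and `sum_macro` are hypothetical names for illustration; the macro only mimics the argument order of `@for_each_child` and iterates a plain collection rather than a cursor.

```julia
# Hypothetical inlining macro: expands to a plain `for` loop in the caller's
# scope, so the body is ordinary code rather than a closure.
macro each_item(xs, var, body)
    return esc(quote
        for $var in $xs
            $body
        end
    end)
end

# Do-block version: `total` is captured by the closure AND reassigned, which
# makes Julia store it in a Core.Box (heap-allocated, type-unstable).
function sum_closure(xs)
    total = 0
    foreach(xs) do x
        total += x
    end
    return total
end

# Macro version: `total` stays an ordinary local of the enclosing function.
function sum_macro(xs)
    total = 0
    @each_item xs x begin
        total += x
    end
    return total
end

@assert sum_closure(1:10) == 55
@assert sum_macro(1:10) == 55
```

Both functions return the same value; inspecting `sum_closure` with `@code_warntype` should show the `Core.Box` for `total`, which is the per-local cost the commit message refers to.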

…ral walks

next!/for_each_child advance token-by-token, so a structural walk that classifies a node but doesn't need its contents still tokenizes every skipped subtree. skip_element! advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw, + _scan_tag_end): it counts element-nesting depth and respects CDATA / comment / PI / quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter loop than full tokenization (no token emission, no SubString construction).

Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms tokenizing the subtrees (×3.4), and faster than the v0.3.8 next!() walk (~32 ms) too.

Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr value, nested same-name, self-close, PI, minified) confirm skip lands exactly where for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175, XMLTokenizer 122, W3C 754). For structural walks like FastKML's layer discovery (the WRS-2 deficit).

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
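A byte-level subtree skip of this kind can be sketched as follows. The helper names (`at`, `past`, `tag_end`, `skip_element`) are hypothetical and the code assumes well-formed input; it is a simplified illustration of depth counting with CDATA / comment / PI / quoted-`>` handling, not the `_skip_element_raw` implementation.

```julia
# True when the bytes of `pat` occur in `b` starting at index `i`.
at(b, i, pat) = i + ncodeunits(pat) - 1 <= length(b) &&
    all(b[i + k - 1] == codeunit(pat, k) for k in 1:ncodeunits(pat))

# Index just past the next occurrence of `pat` at or after `i`.
function past(b, i, pat)
    while i <= length(b)
        at(b, i, pat) && return i + ncodeunits(pat)
        i += 1
    end
    return length(b) + 1
end

# Index just past the '>' that ends the tag starting at `i`, ignoring any '>'
# inside a quoted attribute value. Also reports whether the tag self-closes.
function tag_end(b, i)
    quote_byte = 0x00
    while i <= length(b)
        c = b[i]
        if quote_byte != 0x00
            c == quote_byte && (quote_byte = 0x00)
        elseif c == UInt8('"') || c == UInt8('\'')
            quote_byte = c
        elseif c == UInt8('>')
            return i + 1, b[i - 1] == UInt8('/')
        end
        i += 1
    end
    return length(b) + 1, false
end

# Given the index of the '<' opening an element, return the index just past
# the end of that element's whole subtree.
function skip_element(data::String, i::Int)
    b = codeunits(data)
    depth = 0
    while i <= length(b)
        if b[i] != UInt8('<')
            i += 1
        elseif at(b, i, "<!--")
            i = past(b, i + 4, "-->")
        elseif at(b, i, "<![CDATA[")
            i = past(b, i + 9, "]]>")
        elseif at(b, i, "<?")
            i = past(b, i + 2, "?>")
        elseif at(b, i, "</")
            i, _ = tag_end(b, i)
            depth -= 1
            depth == 0 && return i
        else
            i, selfclosing = tag_end(b, i)
            if !selfclosing
                depth += 1
            elseif depth == 0
                return i
            end
        end
    end
    return i
end

xml = "<a k=\"x>y\"><!-- </a> --><b/><![CDATA[</a>]]><a>t</a></a><next/>"
@assert skip_element(xml, 1) == first(findfirst("<next/>", xml))
```

The loop never builds a token or a `SubString`; it only compares bytes and tracks one integer depth, which is the property the commit credits for the tighter loop.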

mathieu17g · 2026-06-03T20:44:35Z

A note on CI: GitHub doesn't run Actions for a fork-based PR, so the checks area here stays empty. I ran the full CI workflow on this PR's exact head commit (059e494) from my fork — green on all 15 matrix jobs (Julia lts / 1 / nightly across Linux, Windows and macOS):

https://github.com/mathieu17g/XML.jl/actions/runs/26911360192

Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the current element (or `nothing` if it has attributes / isn't a single-text element). Non-destructive: it reads via `_rescan`, so the cursor position is unchanged and callers still advance with `for_each_child` / `skip_element!`. Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`) with no per-element `LazyNode` snapshot.

Measured downstream on XLSX.jl's read path (building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory). Output byte-identical (checksum-verified).

test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


mathieu17g and others added 5 commits June 2, 2026 19:46

mathieu17g and others added 3 commits June 16, 2026 19:47

perf: replace Ref{Bool} with Bool in LazyAttrIterator, reducing eachattribute allocations by 36% (4d56ed3)

perf: add foreach_attr zero-allocation callback API for attribute scanning (3d5a806)


mathieu17g wants to merge 8 commits into joshday:main from mathieu17g:feature-cursor-bitstype-token
