Event-level StAX cursor + isbits Token — v0.4 streaming layer (#61)#1

Open
mathieu17g wants to merge 8 commits into joshday:main from mathieu17g:feature-cursor-bitstype-token
@mathieu17g

Builds on the streaming discussion in JuliaData#61 — this is the event-level (cursor-based) StAX primitive proposed there, plus the Token change that makes it allocation-free. Branched on this PR's head (e532a28) and targeting joshday/XML.jl:main, so merging folds it into JuliaData#54; happy to retarget to JuliaComputing/XML.jl:main once JuliaData#54 lands.

JuliaData#54 removes next/prev, and FastKML.jl + XLSX.jl's sheet reader lean on next for forward streaming — a forward pull cursor is the replacement.

Two layers (5 commits):

  • Cursor — additive (new src/cursor.jl + exports + test/test_cursor.jl; no core changes): next!, for_each_child / @for_each_child, tag / value / attributes / get, the Cursor(data, startpos) / Cursor(LazyNode) bridges, and skip_element! (byte-level subtree skip). Merges on its own.
  • isbits Token: the core change (your call). The raw::SubString field is replaced by a plain byte range, (kind, has_entities, offset, ncodeunits), so the (Token, TokenizerState) iterate tuple stops heap-allocating per token; raw(token, data) reconstructs the view. Representational: the scan logic is untouched.

The rationale, the allocation/perf numbers, and the FastKML-vs-ArchGDAL benchmark table are in JuliaData#61. Full XML.jl suite stays green and byte-identical to baseline.

Notes:

  • Forward-only (next!); no current consumer uses prev, so a backward prev! is a possible API-parity follow-up, not included here.

mathieu17g and others added 5 commits June 2, 2026 19:46
Adds `Cursor`: a single mutable wrapper advanced in place over the token
stream (the cursor-based StAX direction from JuliaData#61).
Closes the per-child `LazyNode` allocation gap of the lazy DOM walk by
mutating one object instead of materializing a node per child.

Orthogonal/additive design:
- New file src/cursor.jl; seams are one include + 4 exports in XML.jl.
- `Cursor` and `LazyNode` are siblings on the shared XMLTokenizer
  foundation. The cursor's accessors rest on the token-layer primitives
  (tag_name, attr_value, pi_target, unescape) — they never call LazyNode
  or its accessors, so DOM-layer changes don't affect the cursor. The
  token→value logic is intentionally duplicated rather than shared, to
  keep this purely additive (a later refactor can factor it out).

API: next!, for_each_child, nodetype/tag/value/attributes/depth/eof,
get, the Base.iterate pull-mode surface, and LazyNode(c) as a one-way
snapshot bridge for the aliasing contract (the cursor is reused in
place; reads are synchronous-safe, retention requires a snapshot).
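
The aliasing contract can be illustrated with a toy model. This is only a sketch in plain Julia: `ToyCur`, `ToySnapshot`, `current`, `snapshot` and `step!` are hypothetical stand-ins, not the `Cursor` / `LazyNode(c)` API of this PR.

```julia
# Toy stand-ins (hypothetical): a cursor mutated in place and an immutable snapshot.
mutable struct ToyCur
    items::Vector{String}
    pos::Int
end

struct ToySnapshot
    value::String
end

current(c::ToyCur) = c.items[c.pos]
snapshot(c::ToyCur) = ToySnapshot(current(c))   # copy out what must outlive the advance
step!(c::ToyCur) = (c.pos += 1; c)

c = ToyCur(["first", "second"], 1)
kept_cursor = c               # retains the cursor itself, so it aliases future state
kept_snapshot = snapshot(c)   # retains the value as of now
step!(c)

current(kept_cursor)    # "second": the retained reference moved with the cursor
kept_snapshot.value     # "first": the snapshot survived the advance
```

Reading from the cursor before the next advance is safe; anything that has to be kept across an advance needs a copy.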

Tests: test/test_cursor.jl (46 cases) — depth model on hand-counted
docs, for_each_child, attributes/get, CData/Comment/PI/DTD/entity
values, accessor agreement with LazyNode node-for-node, snapshot
survival, iterator protocol. Full suite passes.

Perf (N=100k synth, vs the lazy-walk techniques in JuliaData#61): Cursor next!()
DFS = 103 ms / 305 MiB / 4.0M allocs, vs v0.4 eachchildnode/recursive
~310-390 ms / ~1 GiB / 12-15M (×3 faster, ×3.4 less memory). It does
not yet reach the v0.3.8+JuliaData#59 next!()-DFS class (57 ms / 123 MiB): the
residual ~1 alloc/token is the non-isbits Token tuple at the iterate
boundary, which a follow-up bitstype-Token change removes.

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocation

Replaces `Token{S}`'s `raw::SubString{S}` field with a plain byte range —
`(kind, has_entities, offset::Int, ncodeunits::Int)` — making `Token`
non-parametric and isbits (24 bytes). The `(Token, TokenizerState)` tuple
returned by `iterate` is now isbits, so it returns in registers/sret with no
heap allocation even though the tokenizer body is too large to inline. This
removes the per-token allocation that was the cursor's residual cost (see JuliaData#61).
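
A minimal sketch of the representational change, assuming simplified field types. `SketchKind`, `ViewToken` and `RangeToken` are hypothetical names for illustration; the real definitions live in XML.jl's tokenizer.

```julia
# Hypothetical stand-in for the token kind enum.
@enum SketchKind::UInt8 SK_TEXT SK_OPEN_TAG SK_CLOSE_TAG

# Old shape: the token carries a view, which references its parent String.
struct ViewToken
    kind::SketchKind
    raw::SubString{String}
end

# New shape: the token carries only a byte range into the source.
struct RangeToken
    kind::SketchKind
    has_entities::Bool
    offset::Int        # 0-based byte offset of the token in the source
    ncodeunits::Int    # token length in code units
end

isbitstype(ViewToken)                # false: SubString{String} holds a String reference
isbitstype(RangeToken)               # true: only plain-bits fields
isbitstype(Tuple{RangeToken, Int})   # true: a tuple of isbits types is itself isbits
```

The last check is the one that matters for `iterate`: an isbits return tuple does not need a heap box, whereas a tuple containing a `SubString{String}` may.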

Token API:
- `raw(token, data) -> SubString` reconstructs the text view from the source.
  Multibyte-safe: it lands the end index on the START of the last char via
  `prevind` (a naive `SubString(data, off+1, off+ncu)` passes a UTF-8
  continuation byte as the end index and throws — verified on "aé"/"日本").
  `_token_root` resolves `data::SubString` to its parent (offsets are
  root-relative). This matters for the UTF-16 path of JuliaData#62, whose fix
  transcodes to a UTF-8 String upstream of the tokenizer → dense multibyte.
- Emit-site constructors `Token(kind, view)` / `Token(kind, has_amp, view)`
  keep only the view's range, so all 22 tokenizer emit sites are unchanged.
- `tag_name` / `attr_value` / `pi_target` now take `(token, data)`.
- `TokenizerState` and `StatefulTokenizer.state` drop the `{S}` parameter
  (the buffered `pending` Token is non-parametric); `has_pending` tests
  `pending.ncodeunits != 0`; `show(::Token)` prints `KIND @offset+len`.
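
The multibyte point can be reproduced in plain Julia. `raw_sketch` below is a hypothetical helper written for this illustration, assuming a 0-based byte offset and a length in code units; it is not the PR's `raw` and ignores the `_token_root` parent resolution.

```julia
# Hypothetical helper: rebuild a view from a byte range.
function raw_sketch(data::String, offset::Int, ncu::Int)
    ncu == 0 && return SubString(data, 1, 0)       # empty view
    first_idx = offset + 1                         # 1-based index of the first char
    last_idx = prevind(data, offset + ncu + 1)     # START of the last char
    return SubString(data, first_idx, last_idx)
end

s = "aé"                      # 3 code units: 'a' is 1 byte, 'é' is 2 bytes
whole = raw_sketch(s, 0, 3)   # covers both characters
tail = raw_sketch(s, 1, 2)    # covers only the 2-byte character

# The naive form passes the last BYTE as the end index. Byte 3 is the
# continuation byte of 'é', which is not a valid character index.
naive_failed = try
    SubString(s, 1, 3)
    false
catch err
    err isa StringIndexError
end
```

`SubString` takes the index of the last character, not of the last byte, which is why the end of the range has to be walked back with `prevind`.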

Consumers thread `data` (`tok.raw` → `raw(tok, data)`): src/XML.jl (eager
_parse), src/lazynode.jl (LazyNode + iterators; `_lazy_pos`/`_token_end`
simplify to direct field access; `LazyAttrIterator` reaches the source via a
small `_src(iter)` helper since it carries only the tokenizer), src/cursor.jl.
xpath.jl needs no change (it uses a distinct `XPathToken` type).

Tests:
- Revives test/test_tokenizer.jl (was orphaned — not in runtests, and its
  `using XML.XMLTokenizer` did not import the names so it could not run).
  Fixed imports, migrated all `.raw`/accessor sites to thread the source,
  updated the `show` test (no longer prints text), and wired it into
  runtests.jl. Its multibyte cases (café/über/héllo/日本語) now guard the
  `raw()` round-trip in CI.
- Full suite green, byte-identical to baseline: LazyNode 175/175,
  XMLTokenizer 122/122, Cursor 46/46, XPath 66/66, W3C 559/577 wf +
  195/940 not-wf (unchanged counts — Token is representational, the
  accept/reject scan logic is untouched).

Measured (N=100k synth placemarks, @benchmark seconds=3, Julia 1.12.6):
- Cursor advance-only: 305 MiB/4.0M allocs → 0.00 MiB / 1 alloc.
- Cursor full value-extraction: 103 ms/305 MiB → 83 ms / 30.5 MiB / 1.0M.
  Memory is now below the tech-4 target (123 MiB), achieving JuliaData#61's memory
  goal; time (83 ms) remains above that target's 57 ms. The residual 30 MiB is
  the `value()::Union{SubString,String}` boxing (one per text node), which is
  orthogonal: a separate monomorphization micro-opt.

This modifies the core `Token` type, so it is NOT orthogonal/additive: it
needs coordination with the maintainer and rebasing onto JuliaData#54 before any
upstream merge. Develop in parallel on this stacked branch.

Ref: JuliaData#61, JuliaData#62

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Nested for_each_child silently skipped a parent's second (and later) subtrees
when the source had no inter-element whitespace (minified XML): the inner sweep
broke on the boundary node by consuming it (next!() at the top of the loop),
then the enclosing sweep's next!() advanced past that same node. Whitespace
text nodes between elements accidentally masked the bug by serving as a
throwaway boundary; minified machine-generated XML (common for KML) has none.

Fix: make the cursor peekable via a `held` flag. On reaching the end of its
subtree a sweep sets `c.held` instead of consuming the boundary node; the next
`next!` re-yields the held node without advancing, so the enclosing sweep sees
it. Composition is then correct for full DFS at any depth, independent of
whitespace.
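
The hold-and-re-yield idea can be sketched on a toy event stream. Everything here is hypothetical (`ToyCursor`, `advance!`, `each_child`), and the model only has element-start events with a depth, so it illustrates the `held` mechanism rather than the PR's implementation.

```julia
# Toy model: each event is an element start tagged with its nesting depth.
mutable struct ToyCursor
    events::Vector{Tuple{Int,String}}   # (depth, tag)
    pos::Int                            # index of the current event, 0 before the first
    held::Bool                          # true: the next advance re-yields the current event
end
ToyCursor(events) = ToyCursor(events, 0, false)

# Advance to the next event; return false at end of stream.
function advance!(c::ToyCursor)
    if c.held              # a sweep stopped on this event without consuming it
        c.held = false
        return true
    end
    c.pos += 1
    return c.pos <= length(c.events)
end

depth(c::ToyCursor) = c.events[c.pos][1]
tag(c::ToyCursor) = c.events[c.pos][2]

# Visit the direct children of the current element.
function each_child(f, c::ToyCursor)
    parent_depth = depth(c)
    while advance!(c)
        d = depth(c)
        if d <= parent_depth     # boundary: this event belongs to an enclosing sweep
            c.held = true        # hand it back instead of consuming it
            return
        end
        d == parent_depth + 1 && f(c)
    end
end

# Models the minified document <a><b><x/></b><c><y/></c></a>
events = [(1, "a"), (2, "b"), (3, "x"), (2, "c"), (3, "y")]
c = ToyCursor(events)
advance!(c)                      # position on <a>
seen = String[]
each_child(c) do child
    push!(seen, tag(child))
    each_child(child) do grandchild
        push!(seen, tag(grandchild))
    end
end
```

In this model the inner sweep under `b` stops on `c` and sets `held`; the outer sweep's next `advance!` returns `c` again rather than stepping past it, so `c` and its child are visited.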

Verified by 3 new test_cursor cases (minified + whitespaced + 3-level DFS);
full suite green (Cursor 49, LazyNode 175, XMLTokenizer 122, W3C 754).

This is a correctness fix for the Phase-1 cursor; it is committed here on the
stacked bitstype-Token branch but logically belongs on feature-cursor — move
or reorder when restructuring for the upstream PR stack.

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ch_child

Support for driving the cursor from a known subtree position (Phase 3 wiring):

- Cursor(data, startpos::Integer): primitive cursor whose token stream starts at a
  byte offset instead of the document start — for walking a subtree whose start is
  known. LazyNode-agnostic. Cursor(node::LazyNode) becomes a thin, removable
  convenience over it (the only place Cursor mentions LazyNode), the inverse of the
  LazyNode(c) snapshot. for_each_child auto-stops at the subtree boundary.

- @for_each_child c child body: macro form of for_each_child that INLINES the body
  (not a closure), so a body accumulating into enclosing locals avoids the
  capture-boxing a do-block incurs. Measured on a 5k-placemark accumulating walk:
  80 B (macro) vs 237 KB (for_each_child do-block) — the latter is one Core.Box per
  mutated captured local. Mirrors why node-based code uses @for_each_immediate_child.
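
The closure-versus-macro difference can be shown with a small stand-in. `each_item` and `@each_item` are hypothetical names that mirror the argument order described above; this is a sketch of the technique, not the PR's `for_each_child` / `@for_each_child`.

```julia
# Closure form: the body is a function that is called once per item.
function each_item(f, xs)
    for x in xs
        f(x)
    end
    return nothing
end

# Macro form: the body is spliced into a plain loop at the call site.
macro each_item(xs, var, body)
    quote
        for $(esc(var)) in $(esc(xs))
            $(esc(body))
        end
    end
end

function sum_with_closure(xs)
    total = 0
    each_item(xs) do x
        total += x        # reassigns a captured local, so `total` gets boxed
    end
    return total
end

function sum_with_macro(xs)
    total = 0
    @each_item xs x begin
        total += x        # ordinary local of the enclosing function, no capture
    end
    return total
end
```

Both functions return the same result. Comparing `@allocated sum_with_closure(1:1000)` with `@allocated sum_with_macro(1:1000)` after a warm-up call is one way to observe the boxing cost; the exact figures depend on the Julia version.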

7 new test_cursor cases (subtree bridge via offset + LazyNode; inlined nested
accumulation, minified); full suite green (Cursor 56, LazyNode 175, XMLTokenizer
122, W3C 754).

Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ral walks

next!/for_each_child advance token-by-token, so a structural walk that classifies a
node but doesn't need its contents still tokenizes every skipped subtree. skip_element!
advances past an element's whole subtree in one byte scan (XMLTokenizer._skip_element_raw,
+ _scan_tag_end): counts element-nesting depth and respects CDATA / comment / PI /
quoted-`>` boundaries, emitting no internal tokens. O(subtree-bytes) but a far tighter
loop than full tokenization (no token emission, no SubString construction).
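
A rough sketch of such a byte-level skip, written from the description above. `skip_element_sketch`, `scan_tag_end_sketch` and `index_after` are hypothetical; the sketch assumes well-formed input and that `pos` is the index of the `<` of an opening tag, and it is not the code of `_skip_element_raw`.

```julia
# Return the index just past the first match of `needle` at or after `from`.
function index_after(data::String, needle, from::Int)
    r = findnext(needle, data, from)
    r === nothing && error("unterminated construct")
    return last(r) + 1
end

# Scan a tag from `i` to its closing '>', ignoring any '>' inside quotes.
# Returns (index of '>', whether the tag is self-closing).
function scan_tag_end_sketch(bytes, i::Int)
    quote_byte = 0x00
    while i <= length(bytes)
        b = bytes[i]
        if quote_byte != 0x00
            b == quote_byte && (quote_byte = 0x00)      # closing quote
        elseif b == UInt8('"') || b == UInt8('\'')
            quote_byte = b                               # opening quote
        elseif b == UInt8('>')
            return i, bytes[i - 1] == UInt8('/')
        end
        i += 1
    end
    error("unterminated tag")
end

# Return the index of the first byte after the element that starts at `pos`.
function skip_element_sketch(data::String, pos::Int)
    bytes = codeunits(data)
    n = length(bytes)
    depth = 0
    i = pos
    while i <= n
        if bytes[i] != UInt8('<')
            i += 1                                   # text content: nothing to track
            continue
        end
        rest = SubString(data, i)                    # '<' is ASCII, so `i` is a valid index
        if startswith(rest, "<!--")
            i = index_after(data, "-->", i + 4)      # comment: skip as a unit
        elseif startswith(rest, "<![CDATA[")
            i = index_after(data, "]]>", i + 9)      # CDATA: skip as a unit
        elseif startswith(rest, "<?")
            i = index_after(data, "?>", i + 2)       # processing instruction
        elseif i < n && bytes[i + 1] == UInt8('/')
            i = index_after(data, '>', i)            # close tag
            depth -= 1
            depth == 0 && return i
        else
            j, self_closing = scan_tag_end_sketch(bytes, i + 1)
            i = j + 1
            if !self_closing
                depth += 1
            elseif depth == 0
                return i                             # the skipped element was <x/>
            end
        end
    end
    error("unterminated element")
end

doc = "<r><a k=\"x>y\"><![CDATA[</a>]]><!-- </a> --><a/></a><b/></r>"
stop = skip_element_sketch(doc, 4)    # skip the outer <a ...> element
```

No tokens or views are built per node in this loop; only a depth counter and an index move, which is the property the commit message attributes to the real scan.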

Measured (WRS-2 Document, 28k flat Placemarks): classify WITH skip 21 ms vs 70 ms
tokenizing the subtrees — ×3.4, and faster than the v0.3.8 next!() walk (~32 ms) too.
Robust: 16 new test_cursor cases (literal </tag> in CDATA/comments, > inside an attr
value, nested same-name, self-close, PI, minified) confirm skip lands exactly where
for_each_child's full walk does. Full suite green (Cursor 72, LazyNode 175,
XMLTokenizer 122, W3C 754).

For structural walks like FastKML's layer discovery (the WRS-2 deficit).
Ref: JuliaData#61

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mathieu17g

Author

A note on CI: GitHub doesn't run Actions for a fork-based PR, so the checks area here stays empty. I ran the full CI workflow on this PR's exact head commit (059e494) from my fork — green on all 15 matrix jobs (Julia lts / 1 / nightly across Linux, Windows and macOS):

https://github.com/mathieu17g/XML.jl/actions/runs/26911360192

mathieu17g and others added 3 commits June 16, 2026 19:47
Cursor mirror of `is_simple_value(::LazyNode)`: returns the lone Text/CData value of the
current element (or `nothing` if it has attributes / isn't a single-text element).
Non-destructive — reads via `_rescan`, so the cursor position is unchanged and callers
still advance with `for_each_child` / `skip_element!`.

Lets hot streaming paths read a single-text element's value (e.g. an XLSX cell's `<v>`)
with no per-element `LazyNode` snapshot. Measured downstream on XLSX.jl's read path
(building `Cell` from the cursor instead of a per-cell `LazyNode`): readtable/eachrow on
numeric_only & dates_heavy drop ~40% allocations / ~35% memory, taking the v0.4 read
regression vs EzXML v0.10.4 from +15–18% back to ~parity (and below v0.10.4 in memory).
Output byte-identical (checksum-verified).

test/test_cursor.jl: +1 testset (matches LazyNode on text/entity/CDATA; `nothing` for
attrs/element-child/empty/mixed/non-element; non-destructive). Cursor suite 72 → 87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>