
SIGSEGV on parse of 5-byte input '{𱡀' (U+31860 after open brace) on Linux #337

@jdidion

Description

Repro

printf '{\xF0\xB1\xA1\x80' | xxd
# 00000000: 7bf0 b1a1 80                             {....

Five bytes: an open brace { (0x7B) followed by the 4-byte UTF-8 encoding of U+31860 (CJK Unified Ideographs Extension H).
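Because the input is described both as raw bytes and as a code point, it is worth confirming that the two agree before filing or triaging. Below is a minimal sketch using only the standard library; `utf8_hex` and `code_points` are hypothetical helper names, not part of tree-sitter.

```rust
/// Hypothetical helper: render a string's UTF-8 bytes as space-separated hex.
fn utf8_hex(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Hypothetical helper: decode raw bytes as UTF-8 and list the code points
/// in U+XXXX notation. Returns None if the bytes are not valid UTF-8.
fn code_points(bytes: &[u8]) -> Option<Vec<String>> {
    let s = std::str::from_utf8(bytes).ok()?;
    Some(s.chars().map(|c| format!("U+{:04X}", c as u32)).collect())
}
```

Comparing `utf8_hex("{\u{31860}")` with the xxd dump, and `code_points(&[0x7b, 0xf0, 0xb1, 0xa1, 0x80])` with the code point named in the report, checks that both routes describe the same five bytes.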

Parsing this input in Rust with tree-sitter = 0.25 and tree-sitter-bash = 0.25.1 on Linux x86_64 (GitHub Actions ubuntu-latest runner) crashes the process with SIGSEGV (exit code 139). The same bytes parse cleanly on macOS aarch64 and yield an ordinary error tree (tree.root_node().has_error() == true).

Minimal Rust reproducer

use tree_sitter::Parser;

fn main() {
    let mut p = Parser::new();
    p.set_language(&tree_sitter_bash::LANGUAGE.into()).unwrap();
    let input = "{\u{31860}";  // 5 bytes
    let _ = p.parse(input, None);
    println!("did not crash");
}

Output on Linux: the process is killed by signal 11 (SIGSEGV) before println! runs. The same crate versions (tree-sitter = 0.25, tree-sitter-bash = 0.25.1) on macOS succeed and print "did not crash".
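To observe the crash without taking down a test harness, the reproducer can be run as a child process and its exit status inspected. The sketch below assumes a Unix host; `describe_exit` and `run_and_describe` are hypothetical names, and the path to the compiled reproducer is left to the caller. Exit code 139 as reported by a shell is the 128 + 11 convention for death by signal 11. Rust reports that case as `code() == None` together with `signal() == Some(11)`.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

/// Hypothetical helper: summarize how a child process ended (Unix only).
fn describe_exit(status: ExitStatus) -> String {
    match (status.code(), status.signal()) {
        // Normal exit: the process returned an exit code.
        (Some(code), _) => format!("exited with code {code}"),
        // No exit code: the process was terminated by a signal.
        (None, Some(sig)) => format!("killed by signal {sig}"),
        (None, None) => "ended for an unknown reason".to_string(),
    }
}

/// Hypothetical helper: run `program` with `args` and report how it ended.
fn run_and_describe(program: &str, args: &[&str]) -> std::io::Result<String> {
    let status = Command::new(program).args(args).status()?;
    Ok(describe_exit(status))
}
```

Pointing `run_and_describe` at the compiled reproducer should distinguish a clean run from a signal death, and the parent survives either way.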

Discovery path

Found by property-based fuzzing of a classifier that wraps tree_sitter::Parser::parse. The fuzzer generated arbitrary UTF-8 inputs of up to 2000 chars; roughly 1 in 60 inputs tripped a Linux-only SIGSEGV. A byte-granular bisect of one captured 2863-byte crasher narrowed the triggering bytes to positions 2485-2489, which are exactly { + U+31860. A separate classifier sweep confirmed:

| Input | Linux result |
| --- | --- |
| { + U+31860 | SIGSEGV |
| { + U+10000 (Linear B Syllable B008 A) | clean error state |
| { + U+1F600 (😀) | clean error state |
| { + U+FFFD (BMP) | clean error state |
| { + U+4E00 (CJK BMP) | clean error state |
| ( + U+31860 | clean error state |
| [ + U+31860 | clean error state |
| " + U+31860 | clean error state |
| + U+31860 | parses as word (Ok) |
| a + U+31860 | parses as word (Ok) |
| U+31860 alone | parses as word (Ok) |

So the trigger is very narrow: { immediately followed by U+31860, with no separator. An open brace followed by the other astral code points tested does not crash, and U+31860 after other characters or on its own does not crash. Both are required.
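For reference, a byte-level reduction of the kind described above can be sketched as a simplified delta-debugging loop. This is an illustration, not necessarily the procedure used for this report. `minimize` is a hypothetical name, and `crashes` stands in for whatever check decides that a candidate still triggers the fault. In practice that check would have to run the parser in a child process (a SIGSEGV would otherwise kill the reducer) and would have to cope with candidates that are no longer valid UTF-8.

```rust
/// Hypothetical helper: shrink `input` to a smaller byte sequence that still
/// satisfies `crashes`, by repeatedly trying to delete chunks of decreasing size.
fn minimize(input: &[u8], crashes: impl Fn(&[u8]) -> bool) -> Vec<u8> {
    let mut current = input.to_vec();
    let mut chunk = current.len() / 2;
    while chunk >= 1 {
        let mut start = 0;
        while start < current.len() {
            let end = (start + chunk).min(current.len());
            // Candidate with bytes [start, end) removed.
            let mut candidate = current[..start].to_vec();
            candidate.extend_from_slice(&current[end..]);
            if crashes(&candidate) {
                // Deletion kept the crash: adopt it and retry at the same offset.
                current = candidate;
            } else {
                // This chunk is needed: leave it in place and move on.
                start = end;
            }
        }
        chunk /= 2;
    }
    current
}
```

The final pass with a chunk size of one byte removes every byte whose deletion keeps the predicate true, which is how a multi-kilobyte crasher can collapse to a handful of bytes.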

Hypothesis

Given the specificity, my guess is a lookup into a generated parser table (or external scanner state) that reads out of bounds only for this specific pair. The input is 5 bytes, so nothing upstream has had a chance to accumulate state. The macOS/Linux divergence is probably down to ASLR or memory-layout luck: macOS happens to have a readable page where Linux has nothing mapped.

Barbican context

This surfaced in a classifier shipped as a safety layer for AI-generated shell commands, so a SIGSEGV in the classifier is a crash in the safety floor. I've worked around it locally with a 10-line pre-flight scan that denies {<U+31860> before calling tree_sitter::Parser::parse, but that doesn't scale if the parser table contains other dangerous pairings, so the real resolution is a proper upstream fix.
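A pre-flight scan of this kind might look roughly like the following. It is a sketch rather than the actual workaround: `DENIED_PAIRS` and `contains_denied_pair` are hypothetical names, and the deny-list holds only the one pair identified in this report.

```rust
/// Hypothetical deny-list of adjacent character pairs known to crash the parser.
/// Only the pair from this report is listed.
const DENIED_PAIRS: &[(char, char)] = &[('{', '\u{31860}')];

/// Returns true if `input` contains any denied pair as adjacent characters.
fn contains_denied_pair(input: &str) -> bool {
    let mut prev: Option<char> = None;
    for c in input.chars() {
        if let Some(p) = prev {
            if DENIED_PAIRS.contains(&(p, c)) {
                return true;
            }
        }
        prev = Some(c);
    }
    false
}
```

A caller would reject the command when `contains_denied_pair(input)` is true and only otherwise hand the input to the parser. The scan works on characters rather than bytes, so it cannot match in the middle of a multi-byte sequence.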

Happy to test a patched branch if useful.
