Repro
printf '{\xF0\xB1\xA1\x80' | xxd
# 00000000: 7bf0 b1a1 80 {....
Five bytes: open brace { (0x7B) followed by the 4-byte UTF-8 encoding of U+31860 (CJK Unified Ideograph Extension G).
Parsing this input in Rust via tree-sitter + tree-sitter-bash = 0.25.1 on Linux x86_64 (Ubuntu-latest GitHub Actions) crashes the process with SIGSEGV (exit code 139). The same bytes parse cleanly on macOS aarch64 as a normal tree.root_node().has_error() == true state.
Minimal Rust reproducer
use tree_sitter::Parser;
fn main() {
let mut p = Parser::new();
p.set_language(&tree_sitter_bash::LANGUAGE.into()).unwrap();
let input = "{\u{31860}"; // 5 bytes
let _ = p.parse(input, None);
println!("did not crash");
}
Output on Linux: process exits with signal 11 (SIGSEGV) before println! runs. Same crate versions (tree-sitter = 0.25, tree-sitter-bash = 0.25.1) on macOS succeed and print did not crash.
Discovery path
Found by property-based fuzzing of a classifier that wraps tree_sitter::Parser::parse. The fuzzer generated arbitrary UTF-8 inputs up to 2000 chars; ~1 in every 60 inputs tripped a Linux-only SIGSEGV. Byte-grained bisect of one captured 2863-byte crasher narrowed the triggering bytes to positions 2485-2489 — exactly { + U+31860. A separate classifier sweep confirmed:
| Input |
Linux result |
{ + U+31860 |
SIGSEGV |
{ + U+10000 (Gothic A) |
clean error state |
{ + U+1F600 (😀) |
clean error state |
{ + U+FFFD (BMP) |
clean error state |
{ + U+4E00 (CJK BMP) |
clean error state |
( + U+31860 |
clean error state |
[ + U+31860 |
clean error state |
" + U+31860 |
clean error state |
+ U+31860 |
parses as word (Ok) |
a + U+31860 |
parses as word (Ok) |
| U+31860 alone |
parses as word (Ok) |
So the crash is very narrow: specifically the pair { + U+31860 (adjacent, no separator). Not brace + any astral codepoint, not U+31860 in any context — both required.
Hypothesis
Given the specificity, my guess is a lookup into a generated parser table (or external scanner state) that overruns a bounds check only on this specific token pair. The input is 5 bytes; nothing upstream has had a chance to accumulate state. The macOS/Linux divergence is probably ASLR/memory layout luck — macOS happens to put a readable page where Linux has nothing mapped.
Barbican context
This is a bug in a classifier shipped as a safety layer for AI-generated shell commands; a SIGSEGV in the classifier is a crash in the safety floor. I've worked around it locally with a 10-line pre-flight scan that denies {<U+31860> before calling tree_sitter::Parser::parse, but that doesn't scale if there are other dangerous pairings in the parser table, so a proper upstream fix is the real resolution.
Happy to test a patched branch if useful.
Repro
Five bytes: open brace
{(0x7B) followed by the 4-byte UTF-8 encoding of U+31860 (CJK Unified Ideograph Extension G).Parsing this input in Rust via
tree-sitter+tree-sitter-bash = 0.25.1on Linux x86_64 (Ubuntu-latest GitHub Actions) crashes the process with SIGSEGV (exit code 139). The same bytes parse cleanly on macOS aarch64 as a normaltree.root_node().has_error() == truestate.Minimal Rust reproducer
Output on Linux: process exits with signal 11 (SIGSEGV) before
println!runs. Same crate versions (tree-sitter = 0.25,tree-sitter-bash = 0.25.1) on macOS succeed and printdid not crash.Discovery path
Found by property-based fuzzing of a classifier that wraps
tree_sitter::Parser::parse. The fuzzer generated arbitrary UTF-8 inputs up to 2000 chars; ~1 in every 60 inputs tripped a Linux-only SIGSEGV. Byte-grained bisect of one captured 2863-byte crasher narrowed the triggering bytes to positions 2485-2489 — exactly{+ U+31860. A separate classifier sweep confirmed:{+ U+31860{+ U+10000 (Gothic A){+ U+1F600 (😀){+ U+FFFD (BMP){+ U+4E00 (CJK BMP)(+ U+31860[+ U+31860"+ U+31860+ U+31860Ok)a+ U+31860Ok)Ok)So the crash is very narrow: specifically the pair
{+ U+31860 (adjacent, no separator). Not brace + any astral codepoint, not U+31860 in any context — both required.Hypothesis
Given the specificity, my guess is a lookup into a generated parser table (or external scanner state) that overruns a bounds check only on this specific token pair. The input is 5 bytes; nothing upstream has had a chance to accumulate state. The macOS/Linux divergence is probably ASLR/memory layout luck — macOS happens to put a readable page where Linux has nothing mapped.
Barbican context
This is a bug in a classifier shipped as a safety layer for AI-generated shell commands; a SIGSEGV in the classifier is a crash in the safety floor. I've worked around it locally with a 10-line pre-flight scan that denies
{<U+31860>before callingtree_sitter::Parser::parse, but that doesn't scale if there are other dangerous pairings in the parser table, so a proper upstream fix is the real resolution.Happy to test a patched branch if useful.