BEA 2026 Shared Task - Lexical Difficulty Prediction

Team SurreyCTS 🦌 | University of Surrey, Centre for Translation Studies

A comprehensive framework for predicting lexical difficulty (GLMM scores) for L2 language learners. Built for the BEA 2026 Shared Task on Word Complexity Prediction.

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     YAML Config                              │
│  (backbone, features, loss, hyperparameters, layers)         │
└───────────────────────────┬─────────────────────────────────┘
                            │
                     ┌──────▼──────┐
                     │   train.py  │ ◄── Orchestrator
                     └──────┬──────┘
                            │
          ┌─────────────────┼──────────────────┐
          │                 │                  │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌───────▼───────┐
   │ data_utils  │  │hybrid_model │  │   losses.py   │
   │ Text+Offsets│  │ HybridReg.  │  │ MSE/Huber/CCC │
   │ Features    │  │ Entropy+MLP │  │ Log-Cosh      │
   └─────────────┘  └──────┬──────┘  └───────────────┘
                           │
                    ┌──────▼───────┐
                    │model_registry│
                    │ Auto-sizing  │
                    │ Validation   │
                    └──────────────┘

Two Operating Modes

Mode	Config	Architecture	Use Case
Standard	`use_hybrid_model: false`	`AutoModelForSequenceClassification`	Any HF encoder, simple regression head
Hybrid	`use_hybrid_model: true`	`BEA2026HybridRegressor`	Deep MLP with entropy, string, and categorical features

📁 Repository Structure

bea2026/
├── src/
│   ├── train.py              # Main training script
│   ├── hybrid_model.py       # BEA2026HybridRegressor
│   ├── data_utils.py         # Data loading & feature engineering
│   ├── losses.py             # Loss functions (MSE, Huber, Log-Cosh, CCC)
│   ├── model_registry.py     # Model metadata & auto-configuration
│   ├── evaluate_test.py      # Checkpoint evaluation
│   └── predict.py            # Generate submission predictions
├── configs/
│   ├── sweeps_v2/            # V2 framework sweep configs
│   └── sweep/                # Legacy V1 sweep configs
├── data/                     # Datasets (all.csv, en.csv, de.csv, cn.csv)
├── models/                   # Local model checkpoints (e.g., COMET-22)
├── outputs/                  # Training outputs & checkpoints
├── results/                  # experiment_results.csv
├── run_sweeps_v2.sh          # V2 sweep runner
├── requirements.txt
└── README.md

🚀 Quick Start

Setup

# Using conda (recommended)
conda activate lm_eval

# Or install the dependencies into a new environment
pip install -r requirements.txt

# Optional dependencies
pip install pypinyin      # Chinese pinyin romanization
pip install "peft>=0.7.0" # LoRA adapter support (quoted so the shell does not treat > as a redirect)

Run an Experiment

python src/train.py --config configs/sweeps_v2/rembert-huber.yaml

Run All Sweeps

chmod +x run_sweeps_v2.sh
./run_sweeps_v2.sh

⚙️ Configuration Reference

All experiments are defined via YAML configs. Below is a complete reference:

Core Settings

Key	Type	Default	Description
`experiment_name`	str	config filename	Unique experiment identifier
`model_name`	str	`google/rembert`	HuggingFace model or local path
`data_path`	str	required	Path to CSV dataset
`use_hybrid_model`	bool	`false`	Enable hybrid architecture
`loss_function`	str	`mse`	`mse`, `huber`, `log_cosh`, `ccc`
`loss_kwargs`	dict	`{}`	Loss-specific params (e.g., `delta: 1.0`)

Hyperparameters

Key	Type	Default	Description
`learning_rate`	float	`2e-5`	Optimizer learning rate
`train_batch_size`	int	`16`	Per-device train batch size
`num_train_epochs`	float	`5`	Number of training epochs
`weight_decay`	float	`0.01`	AdamW weight decay
`max_length`	int	`256`	Max tokenizer sequence length
`gradient_accumulation_steps`	int	`1`	Gradient accumulation
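
As a rough illustration of how the defaults in the Core Settings and Hyperparameters tables might be applied, the sketch below overlays a user config on a defaults dictionary. `DEFAULTS`, `VALID_LOSSES`, and `resolve_config` are hypothetical names and are not taken from `train.py`. In practice the user dictionary would come from parsing the YAML file, and using the filename without its extension as the experiment name is an assumption.

```python
from pathlib import Path

# Defaults copied from the Core Settings and Hyperparameters tables.
DEFAULTS = {
    "model_name": "google/rembert",
    "use_hybrid_model": False,
    "loss_function": "mse",
    "loss_kwargs": {},
    "learning_rate": 2e-5,
    "train_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "max_length": 256,
    "gradient_accumulation_steps": 1,
}

VALID_LOSSES = {"mse", "huber", "log_cosh", "ccc"}


def resolve_config(user_cfg: dict, config_path: str) -> dict:
    """Hypothetical helper: overlay user settings on the documented defaults."""
    cfg = {**DEFAULTS, **user_cfg}
    cfg["loss_kwargs"] = dict(cfg["loss_kwargs"])  # do not share the default dict
    if "data_path" not in cfg:
        raise ValueError("data_path is required")
    if cfg["loss_function"] not in VALID_LOSSES:
        raise ValueError(f"unknown loss_function: {cfg['loss_function']}")
    # Assumption: experiment_name falls back to the config filename stem.
    cfg.setdefault("experiment_name", Path(config_path).stem)
    return cfg


cfg = resolve_config(
    {"data_path": "data/all.csv", "loss_function": "huber", "loss_kwargs": {"delta": 1.0}},
    "configs/sweeps_v2/rembert-huber.yaml",
)
print(cfg["experiment_name"], cfg["loss_function"], cfg["train_batch_size"])
```

Keys that the user config omits keep their documented defaults, while `data_path` has no default and is rejected when missing.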

Feature Flags (Hybrid Model Only)

Key	Default	Description
`deep_mlp`	`false`	Use 4-layer MLP vs linear head
`use_attention`	`true`	Enable attention-based features
`use_levenshtein`	`false`	String distance feature
`use_normalized_levenshtein`	`false`	Normalize by max length
`use_jaccard`	`false`	Bigram Jaccard distance
`use_pinyin`	`false`	Romanize Chinese before distance
`use_subword_ratio`	`false`	Subword tokenization ratio
`use_cefr`	`false`	CEFR level embedding
`use_lid`	`false`	Language ID embedding
`prepend_lid`	`false`	Prepend `[L1: xx]` to input
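
To make the `prepend_lid` flag concrete, here is a minimal sketch of tagging an input with the learner's L1. The function name `prepend_l1_tag` and the single space after the tag are assumptions; only the `[L1: xx]` format comes from the table above.

```python
def prepend_l1_tag(text: str, l1_code: str) -> str:
    """Prefix the input text with an L1 marker in the `[L1: xx]` format."""
    # Assumption: tag and text are separated by one space.
    return f"[L1: {l1_code}] {text}"


tagged = prepend_l1_tag("The committee reached a verdict.", "de")
print(tagged)
```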

Entropy Features

Key	Default	Description
`use_target_context_entropy`	`false`	Target→Context attention entropy
`use_target_source_alignment`	`false`	Target→Source attention mass
`use_target_source_entropy`	`false`	Target→Source attention entropy
`use_context_target_entropy`	`false`	Context→Target attention entropy
`use_multi_layer_entropy`	`false`	Pool entropy from multiple layers
`use_global_cls_entropy`	`false`	[CLS] attention entropy
`use_attention_variance`	`false`	Head variance (⚠️ hurts performance)

Layer Configuration

Key	Default	Description
`multi_layer_funneling`	`false`	Pool from multiple transformer layers
`funneling_layers`	auto	Explicit layer indices (e.g., `[12, 20, 26, -1]`)
`num_funneling_layers`	`4`	Auto-compute N evenly-spaced layers
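
One plausible reading of "evenly-spaced" is sketched below: split the layer stack into N equal steps and take the layer at the end of each step, so the last selected layer is always the final one. This is an assumption about the auto-computation, not the logic in `model_registry.py`, and `evenly_spaced_layers` is a hypothetical name. Indices are 1-based hidden-state indices, with index 0 reserved for the embedding output.

```python
from typing import List


def evenly_spaced_layers(num_hidden_layers: int, num_funneling_layers: int = 4) -> List[int]:
    """Pick N layer indices spread evenly through the stack, ending at the last layer."""
    if not 1 <= num_funneling_layers <= num_hidden_layers:
        raise ValueError("num_funneling_layers must be between 1 and num_hidden_layers")
    step = num_hidden_layers / num_funneling_layers
    return [round(step * (i + 1)) for i in range(num_funneling_layers)]


# Layer counts taken from the Supported Models table.
for name, layers in [("google/rembert", 32), ("xlm-roberta-base", 12)]:
    print(name, evenly_spaced_layers(layers))
```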

LoRA (Optional)

Key	Default	Description
`use_lora`	`false`	Enable LoRA adapters
`lora_r`	`16`	LoRA rank
`lora_alpha`	`32`	LoRA scaling factor
`lora_dropout`	`0.1`	LoRA dropout
`lora_target_modules`	`["query", "value"]`	Target modules for LoRA

🧠 Feature Descriptions

Attention Entropy

Measures how "confused" the transformer is when attending between specific token groups:

Target→Context (TC): High entropy = target word's attention is spread across many context tokens
Target→Source (TSE): High entropy = weak cognate alignment signal
Context→Target (CTE): High entropy = context finds the target word broadly salient
Target→Source Alignment (TSA): Raw attention mass from target to source (magnitude signal)
Global CLS Entropy: Dispersion of [CLS] token's attention across the sequence
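
The entropy and alignment signals listed above can be sketched with plain Python on a single attention row. This is an illustration under stated assumptions rather than the code in `hybrid_model.py`: the function names are hypothetical, the row is assumed to be one query token's attention distribution (already averaged over heads), and the weights are renormalized within the token group before the entropy is taken.

```python
import math
from typing import Sequence


def attention_entropy(attn_row: Sequence[float], group: Sequence[int]) -> float:
    """Shannon entropy (in nats) of the attention restricted to `group` positions."""
    weights = [attn_row[j] for j in group]
    total = sum(weights)
    if total <= 0.0:
        return 0.0  # no attention on this group, so there is nothing to spread
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_mass(attn_row: Sequence[float], group: Sequence[int]) -> float:
    """Raw attention mass placed on `group` positions (the alignment signal)."""
    return sum(attn_row[j] for j in group)


# Toy attention row for a target token over four key positions.
target_row = [0.1, 0.4, 0.4, 0.1]
context_positions = [1, 2]  # hypothetical context token indices
source_positions = [3]      # hypothetical L1 source token index

tc_entropy = attention_entropy(target_row, context_positions)
ts_mass = attention_mass(target_row, source_positions)
print(tc_entropy, ts_mass)
```

Attention split equally over two context tokens gives the maximum entropy for a group of that size, whereas attention concentrated on one token gives zero.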

Semantic Funneling

Pools target word embeddings from multiple transformer depths to capture surface-level (early layers), syntactic (middle), and semantic (late) information simultaneously.
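
A minimal sketch of the pooling idea, using nested lists in place of tensors. It assumes the target word's subword vectors are mean-pooled within each selected layer and the per-layer results are concatenated; the actual pooling and combination used by `BEA2026HybridRegressor` may differ, and `funnel_target_embedding` is a hypothetical name.

```python
from typing import List, Sequence


def funnel_target_embedding(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    target_positions: Sequence[int],
    layers: Sequence[int],
) -> List[float]:
    """Mean-pool the target tokens in each chosen layer, then concatenate."""
    pooled: List[float] = []
    for layer in layers:
        vectors = [hidden_states[layer][pos] for pos in target_positions]
        dim = len(vectors[0])
        pooled.extend(sum(v[d] for v in vectors) / len(vectors) for d in range(dim))
    return pooled


# Toy input: 3 layers, 3 tokens, hidden size 2; values encode (layer, token).
hidden_states = [[[float(l), float(t)] for t in range(3)] for l in range(3)]
features = funnel_target_embedding(hidden_states, target_positions=[1, 2], layers=[0, -1])
print(features)
```

With this scheme the output dimension is the hidden size times the number of selected layers.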

String Similarity

Levenshtein Distance: Edit operations between L1 source and English target
Jaccard Distance: Character bigram overlap
Pinyin Romanization: Converts Chinese source words to pinyin before distance computation
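
The distance features above can be illustrated with standard textbook definitions. The sketch below is not the code in `data_utils.py`; the function names are hypothetical, and the normalized variant divides by the longer string's length, as the `use_normalized_levenshtein` description suggests. Pinyin romanization is left out because it needs the optional `pypinyin` package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def bigram_jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard overlap of the two character-bigram sets."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


# German source word vs. English target word.
print(levenshtein("haus", "house"))
print(normalized_levenshtein("haus", "house"))
print(bigram_jaccard_distance("haus", "house"))
```

For a cognate pair such as "haus" and "house" the edit distance is small relative to the word length, which is the kind of signal these features are meant to expose.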

📊 Evaluation

Primary Metric: RMSE (used for model selection)

Metric	Description
RMSE	Root Mean Squared Error (primary)
Pearson	Linear correlation
Spearman	Rank correlation
Kendall τ	Rank concordance
MAE	Mean Absolute Error
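
For reference, three of the metrics in the table can be written in a few lines of standard-library Python. These are the usual definitions, shown as a sketch and not as the repository's evaluation code; Spearman and Kendall τ are omitted because a correct treatment of ties takes more space.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def pearson(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson linear correlation (undefined if either input is constant)."""
    n = len(gold)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


pred, gold = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(rmse(pred, gold), mae(pred, gold), pearson(pred, gold))
```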

Results are saved to:

results/experiment_results.csv — comprehensive CSV with all flags
outputs/<name>/run_summary.json — per-run JSON
W&B — full config + metrics dashboard

🏛️ Supported Models

Model	Hidden	Layers	Type
`google/rembert` ⭐	1152	32	encoder
`xlm-roberta-base`	768	12	encoder
`FacebookAI/xlm-roberta-large`	1024	24	encoder
`microsoft/mdeberta-v3-base`	768	12	encoder
`microsoft/infoxlm-large`	1024	24	encoder
`models/comet-xlmr`	768	12	encoder (local)

The MLP head auto-scales based on backbone dimensions. Any HuggingFace AutoModel-compatible encoder can be used.

📜 Loss Functions

Loss	Formula	Best For
MSE	`(ŷ-y)²`	Standard baseline
Huber ⭐	Quadratic near 0, linear beyond δ	Robust to GLMM outliers
Log-Cosh	`log(cosh(ŷ-y))`	Smooth MAE approximation
CCC	`1 - CCC(ŷ,y)`	Direct agreement optimization
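
The four losses in the table can be sketched on plain lists as follows. These follow the standard definitions and are not copied from `losses.py`: the function names are hypothetical, the CCC uses population (biased) variance and covariance, and the zero-denominator guard is an assumption.

```python
import math
from typing import Sequence


def mse_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def huber_loss(pred: Sequence[float], gold: Sequence[float], delta: float = 1.0) -> float:
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for p, g in zip(pred, gold):
        err = abs(p - g)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(gold)


def log_cosh_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean of log(cosh(error)), in a form that avoids overflow for large errors."""
    total = 0.0
    for p, g in zip(pred, gold):
        err = abs(p - g)
        total += err + math.log1p(math.exp(-2.0 * err)) - math.log(2.0)
    return total / len(gold)


def ccc_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """1 - concordance correlation coefficient."""
    n = len(gold)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        return 0.0  # assumption: identical constant inputs count as perfect agreement
    return 1.0 - 2.0 * cov / denom


pred, gold = [0.0, 3.0], [0.0, 0.0]
print(mse_loss(pred, gold), huber_loss(pred, gold), log_cosh_loss(pred, gold))
print(ccc_loss([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

On the toy input the single large error contributes 9 to the squared-error sum but only 2.5 to the Huber sum with `delta = 1.0`, which is the robustness to outliers that the table refers to.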

📋 Citation

If you use this framework, please cite.

Team SurreyCTS — University of Surrey, Centre for Translation Studies
