Team SurreyCTS 🦌 | University of Surrey, Centre for Translation Studies
A comprehensive framework for predicting lexical difficulty (GLMM scores) for L2 language learners. Built for the BEA 2026 Shared Task on Word Complexity Prediction.
    ┌─────────────────────────────────────────────────────────────┐
    │                         YAML Config                         │
    │     (backbone, features, loss, hyperparameters, layers)     │
    └───────────────────────────┬─────────────────────────────────┘
                                │
                         ┌──────▼──────┐
                         │  train.py   │ ◄── Orchestrator
                         └──────┬──────┘
                                │
              ┌─────────────────┼──────────────────┐
              │                 │                  │
       ┌──────▼──────┐   ┌──────▼──────┐   ┌───────▼───────┐
       │ data_utils  │   │hybrid_model │   │   losses.py   │
       │ Text+Offsets│   │ HybridReg.  │   │ MSE/Huber/CCC │
       │ Features    │   │ Entropy+MLP │   │ Log-Cosh      │
       └─────────────┘   └──────┬──────┘   └───────────────┘
                                │
                         ┌──────▼──────┐
                         │model_registry│
                         │ Auto-sizing │
                         │ Validation  │
                         └─────────────┘
| Mode | Config | Architecture | Use Case |
|---|---|---|---|
| Standard | `use_hybrid_model: false` | `AutoModelForSequenceClassification` | Any HF encoder, simple regression head |
| Hybrid | `use_hybrid_model: true` | `BEA2026HybridRegressor` | Deep MLP with entropy, string, and categorical features |
    bea2026/
    ├── src/
    │   ├── train.py              # Main training script
    │   ├── hybrid_model.py       # BEA2026HybridRegressor
    │   ├── data_utils.py         # Data loading & feature engineering
    │   ├── losses.py             # Loss functions (MSE, Huber, Log-Cosh, CCC)
    │   ├── model_registry.py     # Model metadata & auto-configuration
    │   ├── evaluate_test.py      # Checkpoint evaluation
    │   └── predict.py            # Generate submission predictions
    ├── configs/
    │   ├── sweeps_v2/            # V2 framework sweep configs
    │   └── sweep/                # Legacy V1 sweep configs
    ├── data/                     # Datasets (all.csv, en.csv, de.csv, cn.csv)
    ├── models/                   # Local model checkpoints (e.g., COMET-22)
    ├── outputs/                  # Training outputs & checkpoints
    ├── results/                  # experiment_results.csv
    ├── run_sweeps_v2.sh          # V2 sweep runner
    ├── requirements.txt
    └── README.md
    # Using conda (recommended)
    conda activate lm_eval

    # Or install the dependencies into a new environment
    pip install -r requirements.txt

    # Optional dependencies
    pip install pypinyin          # Chinese pinyin romanization
    pip install "peft>=0.7.0"     # LoRA adapter support (quoted so the shell does not treat >= as a redirect)

    # Train a single experiment
    python src/train.py --config configs/sweeps_v2/rembert-huber.yaml

    # Run the V2 sweep
    chmod +x run_sweeps_v2.sh
    ./run_sweeps_v2.sh

All experiments are defined via YAML configs. Below is a complete reference:
| Key | Type | Default | Description |
|---|---|---|---|
| `experiment_name` | str | config filename | Unique experiment identifier |
| `model_name` | str | `google/rembert` | HuggingFace model or local path |
| `data_path` | str | required | Path to CSV dataset |
| `use_hybrid_model` | bool | `false` | Enable hybrid architecture |
| `loss_function` | str | `mse` | `mse`, `huber`, `log_cosh`, `ccc` |
| `loss_kwargs` | dict | `{}` | Loss-specific params (e.g., `delta: 1.0`) |
| Key | Type | Default | Description |
|---|---|---|---|
| `learning_rate` | float | `2e-5` | Optimizer learning rate |
| `train_batch_size` | int | `16` | Per-device train batch size |
| `num_train_epochs` | float | `5` | Number of training epochs |
| `weight_decay` | float | `0.01` | AdamW weight decay |
| `max_length` | int | `256` | Max tokenizer sequence length |
| `gradient_accumulation_steps` | int | `1` | Gradient accumulation |
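
As a rough illustration of how the documented defaults could be applied, the sketch below overlays a user config on the defaults from the two tables above and checks the two constraints they state (a required `data_path` and one of four loss names). `DEFAULTS`, `VALID_LOSSES`, and `resolve_config` are hypothetical names for this example, not the actual API of `train.py`, and the `experiment_name` default (the config filename) is left out.

```python
# Hypothetical sketch: DEFAULTS mirrors the documented defaults only.
# The real train.py may resolve and validate its config differently.
DEFAULTS = {
    "model_name": "google/rembert",
    "use_hybrid_model": False,
    "loss_function": "mse",
    "loss_kwargs": {},
    "learning_rate": 2e-5,
    "train_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "max_length": 256,
    "gradient_accumulation_steps": 1,
}
VALID_LOSSES = {"mse", "huber", "log_cosh", "ccc"}


def resolve_config(user_config: dict) -> dict:
    """Overlay user-supplied keys on the documented defaults, then validate."""
    config = {**DEFAULTS, **user_config}
    if "data_path" not in config:
        raise ValueError("data_path is required")
    if config["loss_function"] not in VALID_LOSSES:
        raise ValueError(f"unknown loss_function: {config['loss_function']!r}")
    return config


config = resolve_config({
    "data_path": "data/all.csv",
    "loss_function": "huber",
    "loss_kwargs": {"delta": 1.0},
})
print(config["model_name"], config["loss_function"], config["learning_rate"])
```

Keys that the user does not set fall back to the table values, so a minimal config only needs `data_path`.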
| Key | Default | Description |
|---|---|---|
| `deep_mlp` | `false` | Use 4-layer MLP vs linear head |
| `use_attention` | `true` | Enable attention-based features |
| `use_levenshtein` | `false` | String distance feature |
| `use_normalized_levenshtein` | `false` | Normalize by max length |
| `use_jaccard` | `false` | Bigram Jaccard distance |
| `use_pinyin` | `false` | Romanize Chinese before distance |
| `use_subword_ratio` | `false` | Subword tokenization ratio |
| `use_cefr` | `false` | CEFR level embedding |
| `use_lid` | `false` | Language ID embedding |
| `prepend_lid` | `false` | Prepend `[L1: xx]` to input |
| Key | Default | Description |
|---|---|---|
| `use_target_context_entropy` | `false` | Target→Context attention entropy |
| `use_target_source_alignment` | `false` | Target→Source attention mass |
| `use_target_source_entropy` | `false` | Target→Source attention entropy |
| `use_context_target_entropy` | `false` | Context→Target attention entropy |
| `use_multi_layer_entropy` | `false` | Pool entropy from multiple layers |
| `use_global_cls_entropy` | `false` | [CLS] attention entropy |
| `use_attention_variance` | `false` | Head variance ( |
| Key | Default | Description |
|---|---|---|
| `multi_layer_funneling` | `false` | Pool from multiple transformer layers |
| `funneling_layers` | auto | Explicit layer indices (e.g., `[12, 20, 26, -1]`) |
| `num_funneling_layers` | `4` | Auto-compute N evenly-spaced layers |
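
The README does not spell out how the N evenly-spaced layers are chosen, so the following is only one plausible reading. It assumes hidden states are indexed as in HuggingFace `output_hidden_states` (index 0 is the embedding output, index `num_hidden_layers` is the last layer); `evenly_spaced_layers` and `resolve_layers` are hypothetical helpers, not functions from `model_registry.py`.

```python
def evenly_spaced_layers(num_hidden_layers: int, num_funneling_layers: int = 4) -> list[int]:
    """Pick N strictly increasing layer indices that end at the last layer."""
    if not 1 <= num_funneling_layers <= num_hidden_layers:
        raise ValueError("num_funneling_layers must be between 1 and num_hidden_layers")
    # Integer arithmetic keeps the indices exact and strictly increasing.
    return [
        (i + 1) * num_hidden_layers // num_funneling_layers
        for i in range(num_funneling_layers)
    ]


def resolve_layers(indices: list[int], num_hidden_layers: int) -> list[int]:
    """Turn negative indices (e.g. -1 for the last layer) into absolute ones."""
    # There are num_hidden_layers + 1 hidden states (embeddings + each layer).
    return [i if i >= 0 else num_hidden_layers + 1 + i for i in indices]


print(evenly_spaced_layers(32))               # [8, 16, 24, 32]
print(resolve_layers([12, 20, 26, -1], 32))   # [12, 20, 26, 32]
```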
| Key | Default | Description |
|---|---|---|
| `use_lora` | `false` | Enable LoRA adapters |
| `lora_r` | `16` | LoRA rank |
| `lora_alpha` | `32` | LoRA scaling factor |
| `lora_dropout` | `0.1` | LoRA dropout |
| `lora_target_modules` | `["query", "value"]` | Target modules for LoRA |
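
To give a feel for what the default LoRA settings imply, here is a small arithmetic sketch. It uses the standard LoRA formulation (update scaled by `lora_alpha / lora_r`, one `r x d_in` and one `d_out x r` matrix per adapted weight) and assumes, for illustration only, a square 1152 x 1152 projection matching the hidden size of `google/rembert`; biases are ignored and the helper names are hypothetical.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_params_per_matrix(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one adapted weight matrix (A: r x d_in, B: d_out x r)."""
    return lora_r * (d_in + d_out)


hidden = 1152  # assumed projection size, taken from the rembert hidden size
adapter = lora_params_per_matrix(hidden, hidden, lora_r=16)
full = hidden * hidden

print(lora_scaling(lora_alpha=32, lora_r=16))   # 2.0
print(adapter, full)                            # 36864 1327104
print(f"{adapter / full:.2%} of the full matrix")
```

With the defaults, each adapted projection trains under 3% of the parameters that full fine-tuning of that matrix would update.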
The attention entropy features measure how "confused" the transformer is when attending between specific token groups:
- Target→Context (TC): High entropy = target word's attention is spread across many context tokens
- Target→Source (TSE): High entropy = weak cognate alignment signal
- Context→Target (CTE): High entropy = context finds the target word broadly salient
- Target→Source Alignment (TSA): Raw attention mass from target to source (magnitude signal)
- Global CLS Entropy: Dispersion of [CLS] token's attention across the sequence
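
The signals above can be sketched on a single attention row. This is a minimal illustration, not the code in `hybrid_model.py`: it assumes the attention over a token group is renormalized before taking Shannon entropy (in nats), and it ignores averaging over heads, layers, and target subword tokens. `group_entropy` and `group_mass` are hypothetical helper names.

```python
import math


def group_mass(attn_row: list[float], group: list[int]) -> float:
    """Raw attention mass that one query token places on a group of positions."""
    return sum(attn_row[j] for j in group)


def group_entropy(attn_row: list[float], group: list[int]) -> float:
    """Shannon entropy (nats) of the attention renormalized over `group`."""
    total = group_mass(attn_row, group)
    if total <= 0:
        return 0.0
    probs = [attn_row[j] / total for j in group]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy attention row for one target token over 6 positions:
# 0 = [CLS], 1-2 = L1 source word, 3-5 = context tokens.
attn_row = [0.1, 0.6, 0.0, 0.1, 0.1, 0.1]
source_idx, context_idx = [1, 2], [3, 4, 5]

print("TSA mass:   ", group_mass(attn_row, source_idx))
print("TSE entropy:", group_entropy(attn_row, source_idx))
print("TC entropy: ", group_entropy(attn_row, context_idx))
```

In this toy row the target attends sharply to one source token (zero Target→Source entropy, high alignment mass) while spreading evenly over the context, which gives the maximum Target→Context entropy for three tokens, ln 3.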
Multi-layer funneling pools target word embeddings from multiple transformer depths to capture surface-level (early layers), syntactic (middle layers), and semantic (late layers) information simultaneously.
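
A toy sketch of the idea, using plain Python lists in place of tensors. How the per-layer vectors are combined is not stated here, so mean-pooling the target span within each layer and concatenating across layers is an assumption, and `funnel_target` is a hypothetical name.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def funnel_target(hidden_states, layers, target_span):
    """Concatenate the mean-pooled target-span vector from each selected layer.

    hidden_states[layer][token] is a vector; target_span is (start, end),
    with end exclusive.
    """
    start, end = target_span
    pooled = []
    for layer in layers:
        pooled.extend(mean_pool(hidden_states[layer][start:end]))
    return pooled


# 3 layers x 4 tokens x 2 dims; each vector is [layer, token] so the
# result is easy to check by eye.
hidden_states = [[[float(l), float(t)] for t in range(4)] for l in range(3)]
print(funnel_target(hidden_states, layers=[0, 2], target_span=(1, 3)))
# [0.0, 1.5, 2.0, 1.5]
```

Under this assumption the pooled vector grows linearly with the number of selected layers, so the regression head's input size has to follow the layer count.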
- Levenshtein Distance: Number of edit operations between the L1 source word and the English target
- Jaccard Distance: Distance based on character bigram overlap between the two words
- Pinyin Romanization: Converts Chinese source words to pinyin before distance computation
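
For reference, the two distances can be written in a few lines of plain Python. This is a minimal sketch of the textbook definitions, not necessarily the implementation in `data_utils.py`; the function names are illustrative, and pinyin romanization is left out because it needs the optional `pypinyin` dependency.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def bigrams(word: str) -> set[str]:
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard overlap of the two words' character bigram sets."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0


# German "haus" vs. English "house": one substitution plus one insertion.
print(levenshtein("haus", "house"))             # 2
print(normalized_levenshtein("haus", "house"))  # 0.4
print(round(jaccard_distance("haus", "house"), 3))
```

Close cognates such as this pair get a small edit distance, while the bigram distance stays fairly high because only one bigram ("us") is shared.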
Primary Metric: RMSE (used for model selection)
| Metric | Description |
|---|---|
| RMSE | Root Mean Squared Error (primary) |
| Pearson | Linear correlation |
| Spearman | Rank correlation |
| Kendall τ | Rank concordance |
| MAE | Mean Absolute Error |
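
The sketch below spells out four of these metrics in plain Python so the definitions are unambiguous; the evaluation scripts may well rely on a library instead. Spearman is computed as Pearson on average ranks, and Kendall τ is omitted for brevity. The function names and toy values are illustrative only.

```python
import math


def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def mae(pred, gold):
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom else float("nan")  # undefined for constant input


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


gold = [1.0, 2.0, 3.0, 4.0]   # toy GLMM scores
pred = [1.5, 1.5, 3.5, 3.5]   # toy predictions
print(rmse(pred, gold), mae(pred, gold))   # 0.5 0.5
print(round(pearson(pred, gold), 4), round(spearman(pred, gold), 4))
```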
Results are saved to:
- `results/experiment_results.csv`: comprehensive CSV with all flags
- `outputs/<name>/run_summary.json`: per-run JSON
- W&B: full config + metrics dashboard
| Model | Hidden | Layers | Type |
|---|---|---|---|
| `google/rembert` ⭐ | 1152 | 32 | encoder |
| `xlm-roberta-base` | 768 | 12 | encoder |
| `FacebookAI/xlm-roberta-large` | 1024 | 24 | encoder |
| `microsoft/mdeberta-v3-base` | 768 | 12 | encoder |
| `microsoft/infoxlm-large` | 1024 | 24 | encoder |
| `models/comet-xlmr` | 768 | 12 | encoder (local) |
The MLP head auto-scales based on backbone dimensions. Any HuggingFace AutoModel-compatible encoder can be used.
| Loss | Formula | Best For |
|---|---|---|
| MSE | `(ŷ-y)²` | Standard baseline |
| Huber ⭐ | Quadratic near 0, linear beyond δ | Robust to GLMM outliers |
| Log-Cosh | `log(cosh(ŷ-y))` | Smooth MAE approximation |
| CCC | `1 - CCC(ŷ,y)` | Direct agreement optimization |
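
The four losses can be written out in plain Python to make the formulas concrete. This is a sketch of the standard definitions averaged over a batch, not the tensor code in `losses.py`; CCC is Lin's concordance correlation coefficient with population variances, and the naive `log(cosh(x))` used here would overflow for very large errors.

```python
import math


def mse(pred, gold):
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def huber(pred, gold, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for p, g in zip(pred, gold):
        err = abs(p - g)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(gold)


def log_cosh(pred, gold):
    return sum(math.log(math.cosh(p - g)) for p, g in zip(pred, gold)) / len(gold)


def ccc_loss(pred, gold):
    """1 - CCC, where CCC = 2*cov / (var_pred + var_gold + (mean_pred - mean_gold)^2)."""
    n = len(gold)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # Undefined (division by zero) if both inputs are constant with equal means.
    return 1.0 - 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


gold = [1.0, 2.0, 3.0, 4.0]   # toy targets
pred = [1.5, 1.5, 3.5, 3.5]   # toy predictions
print(mse(pred, gold), huber(pred, gold, delta=1.0))   # 0.25 0.125
print(round(log_cosh(pred, gold), 4), round(ccc_loss(pred, gold), 4))
print(huber([4.0], [1.0], delta=1.0))                   # 2.5, vs. 9.0 for MSE
```

The last line shows why Huber is less sensitive to outliers: an error of 3 costs 2.5 instead of the 9.0 that MSE would assign.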
If you use this framework, please cite.
Team SurreyCTS — University of Surrey, Centre for Translation Studies