surrey-nlp/bea2026-surrey

BEA 2026 Shared Task - Lexical Difficulty Prediction

Team SurreyCTS 🦌 | University of Surrey, Centre for Translation Studies

A comprehensive framework for predicting lexical difficulty (GLMM scores) for L2 language learners. Built for the BEA 2026 Shared Task on Word Complexity Prediction.


🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     YAML Config                              │
│  (backbone, features, loss, hyperparameters, layers)         │
└───────────────────────────┬─────────────────────────────────┘
                            │
                     ┌──────▼──────┐
                     │   train.py  │ ◄── Orchestrator
                     └──────┬──────┘
                            │
          ┌─────────────────┼──────────────────┐
          │                 │                  │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌───────▼───────┐
   │ data_utils  │  │hybrid_model │  │   losses.py   │
   │ Text+Offsets│  │ HybridReg.  │  │ MSE/Huber/CCC │
   │ Features    │  │ Entropy+MLP │  │ Log-Cosh      │
   └─────────────┘  └──────┬──────┘  └───────────────┘
                           │
                    ┌──────▼───────┐
                    │model_registry│
                    │ Auto-sizing  │
                    │ Validation   │
                    └──────────────┘

Two Operating Modes

| Mode | Config | Architecture | Use Case |
| --- | --- | --- | --- |
| Standard | use_hybrid_model: false | AutoModelForSequenceClassification | Any HF encoder, simple regression head |
| Hybrid | use_hybrid_model: true | BEA2026HybridRegressor | Deep MLP with entropy, string, and categorical features |

📁 Repository Structure

bea2026/
├── src/
│   ├── train.py              # Main training script
│   ├── hybrid_model.py       # BEA2026HybridRegressor
│   ├── data_utils.py         # Data loading & feature engineering
│   ├── losses.py             # Loss functions (MSE, Huber, Log-Cosh, CCC)
│   ├── model_registry.py     # Model metadata & auto-configuration
│   ├── evaluate_test.py      # Checkpoint evaluation
│   └── predict.py            # Generate submission predictions
├── configs/
│   ├── sweeps_v2/            # V2 framework sweep configs
│   └── sweep/                # Legacy V1 sweep configs
├── data/                     # Datasets (all.csv, en.csv, de.csv, cn.csv)
├── models/                   # Local model checkpoints (e.g., COMET-22)
├── outputs/                  # Training outputs & checkpoints
├── results/                  # experiment_results.csv
├── run_sweeps_v2.sh          # V2 sweep runner
├── requirements.txt
└── README.md

🚀 Quick Start

Setup

# Using conda (recommended): activate an existing environment
conda activate lm_eval

# Or install the dependencies into an environment of your choice
pip install -r requirements.txt

# Optional dependencies
pip install pypinyin      # Chinese pinyin romanization
pip install "peft>=0.7.0" # LoRA adapter support (quoted so the shell does not treat > as a redirect)

Run an Experiment

python src/train.py --config configs/sweeps_v2/rembert-huber.yaml

Run All Sweeps

chmod +x run_sweeps_v2.sh
./run_sweeps_v2.sh

⚙️ Configuration Reference

All experiments are defined via YAML configs. Below is a complete reference:

Core Settings

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_name | str | config filename | Unique experiment identifier |
| model_name | str | google/rembert | HuggingFace model or local path |
| data_path | str | required | Path to CSV dataset |
| use_hybrid_model | bool | false | Enable hybrid architecture |
| loss_function | str | mse | One of mse, huber, log_cosh, ccc |
| loss_kwargs | dict | {} | Loss-specific params (e.g., delta: 1.0) |
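
As a rough illustration of how the defaults in this table could be applied, here is a minimal sketch that merges user-supplied keys over them. `CORE_DEFAULTS`, `VALID_LOSSES`, and `resolve_core_settings` are hypothetical names, not functions from `src/train.py`. The sketch also assumes that "config filename" means the file name without its extension, and it takes an already-parsed dict rather than reading YAML.

```python
from pathlib import Path

# Defaults copied from the Core Settings table. The names below are
# hypothetical and are not taken from src/train.py.
CORE_DEFAULTS = {
    "model_name": "google/rembert",
    "use_hybrid_model": False,
    "loss_function": "mse",
    "loss_kwargs": {},
}
VALID_LOSSES = {"mse", "huber", "log_cosh", "ccc"}


def resolve_core_settings(user_cfg: dict, config_path: str) -> dict:
    """Merge user-supplied keys over the documented defaults."""
    cfg = {**CORE_DEFAULTS, **user_cfg}
    cfg["loss_kwargs"] = dict(cfg["loss_kwargs"])  # do not share the default dict
    if "data_path" not in cfg:
        raise ValueError("data_path is required")
    if cfg["loss_function"] not in VALID_LOSSES:
        raise ValueError(f"unknown loss_function: {cfg['loss_function']!r}")
    # Assumption: experiment_name falls back to the config file stem.
    cfg.setdefault("experiment_name", Path(config_path).stem)
    return cfg


cfg = resolve_core_settings(
    {"data_path": "data/all.csv", "loss_function": "huber", "loss_kwargs": {"delta": 1.0}},
    "configs/sweeps_v2/rembert-huber.yaml",
)
print(cfg["experiment_name"], cfg["model_name"], cfg["loss_function"])
```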

Hyperparameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| learning_rate | float | 2e-5 | Optimizer learning rate |
| train_batch_size | int | 16 | Per-device train batch size |
| num_train_epochs | float | 5 | Number of training epochs |
| weight_decay | float | 0.01 | AdamW weight decay |
| max_length | int | 256 | Max tokenizer sequence length |
| gradient_accumulation_steps | int | 1 | Gradient accumulation steps |

Feature Flags (Hybrid Model Only)

| Key | Default | Description |
| --- | --- | --- |
| deep_mlp | false | Use 4-layer MLP vs linear head |
| use_attention | true | Enable attention-based features |
| use_levenshtein | false | String distance feature |
| use_normalized_levenshtein | false | Normalize by max length |
| use_jaccard | false | Bigram Jaccard distance |
| use_pinyin | false | Romanize Chinese before distance |
| use_subword_ratio | false | Subword tokenization ratio |
| use_cefr | false | CEFR level embedding |
| use_lid | false | Language ID embedding |
| prepend_lid | false | Prepend [L1: xx] to input |

Entropy Features

| Key | Default | Description |
| --- | --- | --- |
| use_target_context_entropy | false | Target→Context attention entropy |
| use_target_source_alignment | false | Target→Source attention mass |
| use_target_source_entropy | false | Target→Source attention entropy |
| use_context_target_entropy | false | Context→Target attention entropy |
| use_multi_layer_entropy | false | Pool entropy from multiple layers |
| use_global_cls_entropy | false | [CLS] attention entropy |
| use_attention_variance | false | Head variance (⚠️ hurts performance) |

Layer Configuration

| Key | Default | Description |
| --- | --- | --- |
| multi_layer_funneling | false | Pool from multiple transformer layers |
| funneling_layers | auto | Explicit layer indices (e.g., [12, 20, 26, -1]) |
| num_funneling_layers | 4 | Auto-compute N evenly-spaced layers |

LoRA (Optional)

| Key | Default | Description |
| --- | --- | --- |
| use_lora | false | Enable LoRA adapters |
| lora_r | 16 | LoRA rank |
| lora_alpha | 32 | LoRA scaling factor |
| lora_dropout | 0.1 | LoRA dropout |
| lora_target_modules | ["query", "value"] | Target modules for LoRA |
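
To give a feel for what the default LoRA settings imply, the sketch below works through the standard LoRA arithmetic. The low-rank update is scaled by `lora_alpha / lora_r`, and each adapted weight matrix adds `r * (d_in + d_out)` trainable parameters. The helper names are hypothetical. The example assumes the `query` and `value` projections are square `hidden x hidden` matrices, as in BERT-style encoders, and it ignores the regression head.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r


def lora_params_per_matrix(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one weight matrix: A is r x d_in, B is d_out x r."""
    return lora_r * (d_in + d_out)


# Defaults from the LoRA table above.
lora_r, lora_alpha = 16, 32
target_modules = ["query", "value"]

# Hidden size and depth of FacebookAI/xlm-roberta-large from the Supported Models table.
hidden_size, num_layers = 1024, 24

per_matrix = lora_params_per_matrix(hidden_size, hidden_size, lora_r)
total = per_matrix * len(target_modules) * num_layers

print(lora_scaling(lora_alpha, lora_r))  # 2.0
print(per_matrix)                        # 32768
print(total)                             # 1572864
```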

🧠 Feature Descriptions

Attention Entropy

Measures how "confused" the transformer is when attending between specific token groups:

  • Target→Context (TC): High entropy = target word's attention is spread across many context tokens
  • Target→Source (TSE): High entropy = weak cognate alignment signal
  • Context→Target (CTE): High entropy = context finds the target word broadly salient
  • Target→Source Alignment (TSA): Raw attention mass from target to source (magnitude signal)
  • Global CLS Entropy: Dispersion of [CLS] token's attention across the sequence
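
The entropy and mass signals above can be illustrated with a minimal sketch on a single attention row. It does not reproduce `src/hybrid_model.py`. The function names, the renormalization over the token group, and the use of natural logarithms are assumptions, and averaging over heads, layers, or target subwords is left out.

```python
import math


def group_attention_mass(attn_row: list[float], group_idx: list[int]) -> float:
    """Total attention one query token places on a group of key tokens."""
    return sum(attn_row[i] for i in group_idx)


def group_attention_entropy(attn_row: list[float], group_idx: list[int]) -> float:
    """Shannon entropy (in nats) of the attention restricted to group_idx.

    The weights are renormalized to sum to 1 over the group, so the result
    measures how spread out the attention is inside the group.
    """
    weights = [attn_row[i] for i in group_idx]
    mass = sum(weights)
    if mass <= 0.0:
        return 0.0
    probs = [w / mass for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


context = [1, 2, 3]  # hypothetical positions of the context tokens

# Attention row of a target token: index 0 is [CLS], the rest is context.
spread = [0.10, 0.30, 0.30, 0.30]   # evenly spread over the context
peaked = [0.10, 0.88, 0.01, 0.01]   # focused on a single context token

high = group_attention_entropy(spread, context)
low = group_attention_entropy(peaked, context)
print(round(high, 4), round(low, 4), high > low)
print(round(group_attention_mass(spread, context), 2))
```

With these toy rows, the evenly spread attention reaches the maximum entropy for three tokens, `log(3)`, while the peaked row scores much lower even though both place the same mass on the context.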

Semantic Funneling

Pools target word embeddings from multiple transformer depths to capture surface-level (early layers), syntactic (middle), and semantic (late) information simultaneously.
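
A minimal sketch of the idea, assuming the auto mode picks evenly spaced layers that end at the last layer and that the per-layer target vectors are concatenated. Both choices are assumptions; the repository may space or pool the layers differently. The function names are hypothetical. Indexing follows the Hugging Face convention where `hidden_states[0]` is the embedding output and `hidden_states[k]` is the output of layer `k`.

```python
def evenly_spaced_layers(num_hidden_layers: int, n: int) -> list[int]:
    """Pick n layer indices in 1..num_hidden_layers, ending at the last layer."""
    if not 1 <= n <= num_hidden_layers:
        raise ValueError("n must be between 1 and num_hidden_layers")
    return [(num_hidden_layers * (k + 1)) // n for k in range(n)]


def funnel(hidden_states: list[list[float]], layers: list[int]) -> list[float]:
    """Concatenate the target-word vector taken from each selected layer."""
    pooled: list[float] = []
    for layer in layers:
        pooled.extend(hidden_states[layer])  # negative indices count from the end
    return pooled


# google/rembert has 32 layers; num_funneling_layers defaults to 4.
layers = evenly_spaced_layers(32, 4)
print(layers)  # [8, 16, 24, 32]

# Toy stand-in: 33 entries (embeddings + 32 layers) of 2-dimensional vectors.
toy_states = [[float(i), float(-i)] for i in range(33)]
print(funnel(toy_states, layers))
```

Under the concatenation assumption, four layers of a 1152-dimensional backbone would yield a 4608-dimensional target representation.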

String Similarity

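  • Levenshtein Distance: Minimum number of character insertions, deletions, and substitutions needed to turn the L1 source word into the English target word
  • Jaccard Distance: One minus the overlap (intersection over union) of the two words' character bigram sets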
  • Pinyin Romanization: Converts Chinese source words to pinyin before distance computation
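
The distances above can be sketched in plain Python as follows. This is an illustrative implementation, not the code in `src/data_utils.py`. The function names are hypothetical, and details such as lowercasing or how empty strings are handled are assumptions. Pinyin romanization is omitted because it needs the optional `pypinyin` package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def char_bigrams(s: str) -> set[str]:
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_bigram_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the character bigram sets of a and b."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


# German source word vs. English target word
print(levenshtein("haus", "house"))                        # 2
print(normalized_levenshtein("haus", "house"))             # 0.4
print(round(jaccard_bigram_distance("haus", "house"), 4))  # 0.8333
```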

📊 Evaluation

Primary Metric: RMSE (used for model selection)

| Metric | Description |
| --- | --- |
| RMSE | Root Mean Squared Error (primary) |
| Pearson | Linear correlation |
| Spearman | Rank correlation |
| Kendall τ | Rank concordance |
| MAE | Mean Absolute Error |
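
For reference, most of these metrics are simple to compute by hand. The sketch below uses only the standard library and is not the evaluation code from `src/evaluate_test.py`. The function names are hypothetical, and Kendall τ is left out for brevity. In practice `scipy.stats` offers `pearsonr`, `spearmanr`, and `kendalltau`.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: list[float], y_pred: list[float]) -> float:
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 2.5, 2.0]
print(round(rmse(y_true, y_pred), 4))      # 0.6124
print(mae(y_true, y_pred))                 # 0.5
print(round(spearman(y_true, y_pred), 4))  # 0.8
```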

Results are saved to:

  • results/experiment_results.csv — comprehensive CSV with all flags
  • outputs/<name>/run_summary.json — per-run JSON
  • W&B — full config + metrics dashboard

🏛️ Supported Models

| Model | Hidden size | Layers | Type |
| --- | --- | --- | --- |
| google/rembert | 1152 | 32 | encoder |
| xlm-roberta-base | 768 | 12 | encoder |
| FacebookAI/xlm-roberta-large | 1024 | 24 | encoder |
| microsoft/mdeberta-v3-base | 768 | 12 | encoder |
| microsoft/infoxlm-large | 1024 | 24 | encoder |
| models/comet-xlmr | 768 | 12 | encoder (local) |

The MLP head auto-scales based on backbone dimensions. Any HuggingFace AutoModel-compatible encoder can be used.


📜 Loss Functions

| Loss | Formula | Best For |
| --- | --- | --- |
| MSE | (ŷ-y)² | Standard baseline |
| Huber | Quadratic near 0, linear beyond δ | Robust to GLMM outliers |
| Log-Cosh | log(cosh(ŷ-y)) | Smooth MAE approximation |
| CCC | 1 - CCC(ŷ,y) | Direct agreement optimization |
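
The four losses can be written out on plain Python lists as a reference. This sketch does not reproduce `src/losses.py`, which presumably operates on tensors. The function names are hypothetical, and the conventional 0.5 factor in the Huber loss and the population-variance form of CCC are assumptions.

```python
import math


def mse_loss(y_pred: list[float], y_true: list[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


def huber_loss(y_pred: list[float], y_true: list[float], delta: float = 1.0) -> float:
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        e = abs(p - t)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)


def log_cosh_loss(y_pred: list[float], y_true: list[float]) -> float:
    """Mean of log(cosh(error)), using a form that does not overflow for large errors."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        e = abs(p - t)
        total += e + math.log1p(math.exp(-2.0 * e)) - math.log(2.0)
    return total / len(y_true)


def ccc_loss(y_pred: list[float], y_true: list[float], eps: float = 1e-8) -> float:
    """1 - concordance correlation coefficient (population statistics)."""
    n = len(y_true)
    mp, mt = sum(y_pred) / n, sum(y_true) / n
    vp = sum((p - mp) ** 2 for p in y_pred) / n
    vt = sum((t - mt) ** 2 for t in y_true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(y_pred, y_true)) / n
    return 1.0 - 2.0 * cov / (vp + vt + (mp - mt) ** 2 + eps)


y_true = [0.0, 1.0, 2.0]
y_pred = [1.0, 2.0, 3.0]  # perfectly correlated, but shifted by 1
print(mse_loss(y_pred, y_true))                # 1.0
print(huber_loss([0.5, 2.0], [0.0, 0.0]))      # 0.8125
print(round(log_cosh_loss(y_pred, y_true), 4))
print(round(ccc_loss(y_pred, y_true), 4))      # 0.4286
```

The last line shows why CCC differs from a pure correlation objective: the shifted predictions have a Pearson correlation of 1, yet the CCC loss is still about 0.43 because of the constant offset.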

📋 Citation

If you use this framework, please cite.


Team SurreyCTS — University of Surrey, Centre for Translation Studies
