verl-project · ehejin · Feb 27, 2026 · Mar 9, 2026 · Mar 11, 2026 · Mar 11, 2026
diff --git a/humanlm/README.md b/humanlm/README.md
@@ -0,0 +1,146 @@
+# Recipe: HumanLM
+
+Train user simulators by aligning on psychological state dimensions (belief, emotion, stance, value, goal, communication) instead of imitating response text.
+
+**Paper:** [HUMANLM: Simulating Users with State Alignment Beats Response Imitation](https://humanlm.stanford.edu/HumanLM_paper.pdf)
+
+**Project Page:** [https://humanlm.stanford.edu/](https://humanlm.stanford.edu/)
+
+## Environment Setup
+
+### Install Dependencies
+
+```bash
+# Install verl with trainer support
+pip install -e /path/to/verl  # verl repo with trainer module
+
+# Additional dependencies
+pip install litellm datasets polars
+```
+
+### Configure API Keys
+```bash
+# Required for LLM-as-judge rewards (RL training)
+export ANTHROPIC_API_KEY="your-key"
+export OPENAI_API_KEY="your-key"
+```
+
+---
+
+## Datasets
+
+Official HuggingFace datasets:
+
+| Dataset | HuggingFace Repo | `--dataset` arg |
+|---------|------------------|-----------------|
+| Humanual-Books | `snap-stanford/humanual-book` | `amazon` |
+| Humanual-Opinion | `snap-stanford/humanual-opinion` | `reddit` |
+| Humanual-Politics | `snap-stanford/humanual-politics` | `medium` |
+| Humanual-News | `snap-stanford/humanual-news` | `youtube` |
+| Humanual-Chat | `snap-stanford/humanual-chat` | `wildchat_english` |
+| Humanual-Email | `snap-stanford/humanual-email` | `enron` |
+
+---
+
+## SFT Training
+
+### Step 1: Process Dataset
+
+Convert a HuggingFace dataset to SFT format:
+
+```bash
+# No-thinking mode (response only)
+python -m humanlm.process_dataset \
+    --dataset amazon \
+    --raw_dataset_repo snap-stanford/humanual-book \
+    --save_data_dir ./data/humanual-book \
+    --sft \
+    --no_tag
+
+# With thinking traces (requires API key for trace generation)
+python -m humanlm.process_dataset \
+    --dataset amazon \
+    --raw_dataset_repo snap-stanford/humanual-book \
+    --save_data_dir ./data/humanual-book \
+    --sft \
+    --thinking_sft \
+    --thinking_model gpt-4o-mini
+```
+
+The first (no-thinking) command creates:
+```
+./data/humanual-book/
+└── sft/
+    └── r_no_tag/
+        ├── train.parquet
+        ├── val.parquet
+        └── test.parquet
+```
+
+### Step 2: Run SFT Training
+
+```bash
+DATASET_DIR=./data/humanual-book bash humanlm/train_sft_humanlm.sh \
+    "0,1,2,3,4,5,6,7" \
+    amazon \
+    no_thinking
+```
+
+**Arguments:**
+| Position | Name | Example | Description |
+|----------|------|---------|-------------|
+| 1 | GPU_LIST | `"0,1,2,3,4,5,6,7"` | Comma-separated GPU IDs |
+| 2 | DATASET_NAME | `amazon` | Dataset identifier |
+| 3 | THINKING_MODE | `no_thinking` or `thinking` | Whether to use thinking traces |
+
+**Environment Variables:**
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DATASET_DIR` | (required) | Path to processed data |
+| `OUTPUT_ROOT` | `./outputs` | Where to save checkpoints |
+| `HF_CACHE_DIR` | system default | HuggingFace cache location |
+
+**Output:**
+- Model checkpoints: `./outputs/sft_amazon_no_thinking_r_no_tag/`
+- WandB project: `humanlm`
+
+---
+
+## RL Training (GRPO)
+
+Before training, update `cluster_config.sh` with your project paths and the path to your `.env` file.
+
+### Train HumanLM
+```bash
+bash humanlm/train_rl_humanlm.sh \
+    "0,1,2,3,4,5,6,7" \
+    amazon \
+    train_humanlm \
+    "" \
+    base
+```
+
+### Evaluation
+```bash
+bash humanlm/train_rl_humanlm.sh \
+    "0,1,2,3,4,5,6,7" \
+    amazon \
+    eval_only \
+    "/path/to/checkpoint" \
+    humanlm
+```
+
+---
+
+## Citation
+
+```bibtex
+@article{wu2026humanlm,
+  title={HUMANLM: Simulating Users with State Alignment Beats Response Imitation},
+  url={https://humanlm.stanford.edu/},
+  author={Wu, Shirley and Choi, Evelyn and Khatua, Arpandeep and
+          Wang, Zhanghan and He-Yueya, Joy and Weerasooriya, Tharindu Cyril and
+          Wei, Wei and Yang, Diyi and Leskovec, Jure and Zou, James},
+  year={2026}
+}
+```
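The dataset table and the Step 1 command in the README above pair each `--dataset` value with a HuggingFace repo. As a minimal sketch, the Python snippet below copies that table into a dictionary and assembles the no-thinking `process_dataset` invocation from it. The helper name `build_process_cmd` is hypothetical and not part of the recipe; the flags are the ones shown in the README, and the command is only printed, not executed.

```python
import shlex
from typing import List

# `--dataset` argument -> HuggingFace repo, copied from the README table.
DATASET_REPOS = {
    "amazon": "snap-stanford/humanual-book",
    "reddit": "snap-stanford/humanual-opinion",
    "medium": "snap-stanford/humanual-politics",
    "youtube": "snap-stanford/humanual-news",
    "wildchat_english": "snap-stanford/humanual-chat",
    "enron": "snap-stanford/humanual-email",
}


def build_process_cmd(dataset: str, save_data_dir: str) -> List[str]:
    """Assemble the no-thinking SFT preprocessing command (hypothetical helper)."""
    if dataset not in DATASET_REPOS:
        raise ValueError(f"unknown --dataset value: {dataset!r}")
    return [
        "python", "-m", "humanlm.process_dataset",
        "--dataset", dataset,
        "--raw_dataset_repo", DATASET_REPOS[dataset],
        "--save_data_dir", save_data_dir,
        "--sft",
        "--no_tag",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded here.
    print(shlex.join(build_process_cmd("amazon", "./data/humanual-book")))
```

Keeping the mapping in one place avoids passing a `--dataset` value together with a repo from a different row of the table.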
diff --git a/humanlm/chat_templates/qwen3_multi_role_template_think.jinja b/humanlm/chat_templates/qwen3_multi_role_template_think.jinja
@@ -0,0 +1,101 @@
+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {{- messages[0].content + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- if message.content is string %}
+        {%- set content = message.content %}
+    {%- else %}
+        {%- set content = '' %}
+    {%- endif %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.name %}<name>{{ message.name }}</name>
+        {%- endif -%}
+        {{- '\n' + message.content | trim + '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.name %}<name>{{ message.name }}</name>
+            {{- '\n' }}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {%- if loop.last or (not loop.last and reasoning_content) %}
+                {{- '<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+            {%- else %}
+                {{- content }}
+            {%- endif %}
+        {%- else %}
+            {{- content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>user\n' }}
+    {%- if speak_as is defined and speak_as %}<name>{{ speak_as }}</name>
+    {%- endif -%}
+    {{- '\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- elif enable_thinking is defined and enable_thinking is true %}
+        {{- '<think>' }}
+    {%- endif %}
+{%- endif %}
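When an assistant message has no separate `reasoning_content` field, the template above recovers it from an inline `<think>…</think>` span using two chained string expressions. The Python sketch below mirrors those expressions so the behaviour can be checked outside Jinja. The function name `split_reasoning` is hypothetical; only the string operations are taken from the template.

```python
from typing import Tuple


def split_reasoning(content: str) -> Tuple[str, str]:
    """Mirror the template's inline <think> handling.

    Returns (reasoning_content, visible_content). Hypothetical helper;
    the split/strip calls are copied from the Jinja template.
    """
    if "</think>" not in content:
        # No closing tag: the template leaves the content untouched.
        return "", content
    # Take the text before the first </think>, then keep what follows
    # the last <think> inside it, trimming surrounding newlines.
    reasoning = (
        content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    )
    # Everything after the last </think> is the visible reply.
    visible = content.split("</think>")[-1].lstrip("\n")
    return reasoning, visible


if __name__ == "__main__":
    raw = "<think>\nThey liked the plot.\n</think>\n\nGreat book, would read again."
    print(split_reasoning(raw))
```

Note that the reasoning is taken from before the first `</think>` while the visible reply is taken from after the last one, so any text between multiple `</think>` tags is dropped by both expressions.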
diff --git a/humanlm/cluster_config.sh b/humanlm/cluster_config.sh
@@ -0,0 +1,18 @@
+# Paths for RL training
+export PROJECT_DIR="/path/to/shared/project"
+export SCRATCH_DIR="/path/to/your/scratch/$USER"
+
+# .env file holding your LLM-judge API keys and your WandB API key
+export ENV_FILE="$PROJECT_DIR/.env"
+
+export DATASET_DIR="$PROJECT_DIR/llm_twin/processed_data"
+export MODEL_PATH="$PROJECT_DIR/llm_twin/models/Qwen3-8B"
+export CACHE_DIR="$PROJECT_DIR/llm_twin/verl_cache"
+export OUTPUT_DIR="$SCRATCH_DIR/humanlm_outputs/$EXP_NAME"  # EXP_NAME is not set in this file; export it before sourcing
+
+# Set cache directories
+export HF_HOME="$SCRATCH_DIR/hf"
+export HF_DATASETS_CACHE="$HF_HOME/datasets"
+export TRANSFORMERS_CACHE="$HF_HOME/transformers"
+export HUGGINGFACE_HUB_CACHE="$HF_HOME/hub"
+export XDG_CACHE_HOME="$HF_HOME/xdg"
diff --git a/humanlm/configs/humanlm_agent_loop_config.yaml b/humanlm/configs/humanlm_agent_loop_config.yaml
@@ -0,0 +1,2 @@
+- name: humanlm_agent
+  _target_: recipe.humanlm.humanlm_agent_loop.HumanLMAgentLoop