[recipe] feat: HumanLM integration for user simulation by ehejin · Pull Request #63 · verl-project/verl-recipe

ehejin · 2026-03-11T06:02:15Z

What does this PR do?

This PR adds HumanLM as a training recipe. It includes the following:

A customized HumanLMAgentLoop inheriting from AgentLoopBase that enables chain-of-thought only during response generation and applies n-gram repetition penalty
A customized HumanLMAgentLoopWorker that extends AgentLoopWorker to enable batched reward computation so all n outputs in a GRPO group are judged together by the LLM judge in a single call
A customized HumanLMRewardManager inheriting from RewardManagerBase routes to a different LLM judge system prompt depending on the latent user state being predicted (stance, emotion, belief, etc.)
A customized StateDataset that extends RLHFDataset and augments each data instance with distinct system prompts for each user state. This prompts the model to predict the user's stance, emotion, belief, etc. for each instance.

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

The training rewards from a GRPO run on the HumanLM recipe are shown below. The critic/rewards/mean increases steadily from ~0.30 to ~0.45 over 150 steps, and the held-out validation score (val-core/reddit/reward/mean@1) improves steadily. This run produces a model that achieves 25.9 on the HUMANUAL-Opinion benchmark, matching the paper's reported number.

Model Samples

The following example is drawn from a Reddit AITA post in the HUMANUAL-Opinion benchmark where a woman confronts her husband after he sides with his traditional parents and refuses to cook dinner per their usual arrangement. The task is to predict what the user would say in response to the thread.

Ground truth user response:

"Spell this out for him and mommy. Today. They can't even live up to their ideals, why should OP???"

Trained HumanLM output:

"NTA. If a 'traditional' man is supposed to run the household, why is he taking his mom's advice over his wife's? He's not living up to tradition in any way. A real traditionalist would follow his wife's lead, not his mother's. Also, he's too broke to be sexist, but he's still letting his parents dictate his values. Classic."

Qwen base model output (no training):

"NTA. Traditional man? He took his mom's advice over his wife's. Classic double standard. If he's too broke to be sexist, maybe he should rethink his priorities. Kid'll be 30 and still in detention for not learning."

The trained model produces a more coherent, contextually grounded response and picks up on the irony of the situation and mirrors the reasoning of the ground truth. Unlike the base model, it stays on topic, and doesn't lose coherence.

gemini-code-assist · 2026-03-11T06:07:28Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the HumanLM training recipe, a novel approach to user simulation that focuses on aligning with psychological state dimensions rather than merely imitating response text. The changes introduce specialized components for agent behavior, reward evaluation, and data processing, allowing for more nuanced and human-like user simulations. This enhancement aims to produce models capable of generating more coherent and contextually grounded responses, as demonstrated by improved benchmark performance.

Highlights

HumanLMAgentLoop Customization: Implemented a customized HumanLMAgentLoop that inherits from AgentLoopBase, enabling chain-of-thought during response generation and applying an n-gram repetition penalty to improve output quality.
Batched Reward Computation: Introduced a customized HumanLMAgentLoopWorker that extends AgentLoopWorker to facilitate batched reward computation, allowing all 'n' outputs in a GRPO group to be judged together by the LLM judge in a single call.
Dynamic LLM Judge Prompting: Developed a customized HumanLMRewardManager inheriting from RewardManagerBase that intelligently routes to different LLM judge system prompts based on the specific latent user state being predicted (e.g., stance, emotion, belief).
State-Augmented Dataset: Created a customized StateDataset that extends RLHFDataset to augment each data instance with distinct system prompts for various user states, prompting the model to predict psychological dimensions like stance, emotion, and belief for each instance.

Changelog

humanlm/README.md
- Added comprehensive documentation for the HumanLM recipe, covering environment setup, dataset usage, SFT training, RL training, and citation.
humanlm/chat_templates/qwen3_multi_role_template_think.jinja
- Added a new Jinja template to support multi-role chat with integrated thinking capabilities.
humanlm/cluster_config.sh
- Added a new shell script for cluster configuration, defining environment variables and paths for LLM judge API keys, project directories, and cache locations.
humanlm/configs/humanlm_agent_loop_config.yaml
- Added a new YAML configuration file to register the HumanLMAgentLoop.
humanlm/humanlm_agent_loop.py
- Added a custom HumanLMAgentLoop class that inherits from AgentLoopBase, incorporating n-gram repetition penalty and handling custom chat templates and thinking modes.
humanlm/humanlm_agent_loop_worker.py
- Added a custom HumanLMAgentLoopWorker and HumanLMAgentLoopManager to extend AgentLoopWorker and AgentLoopManager respectively, enabling batched reward computation and injecting global steps.
humanlm/logging_utils.py
- Added a new utility file for logging and debugging, including functions for boolean conversion, string shortening, and debug printing.
humanlm/metrics/state_reward.py
- Added a new metric file for computing batched scores based on psychological state alignment, using LLM-as-judge.
humanlm/metrics/state_reward_on_response.py
- Added a new metric file for computing scores on responses based on multiple psychological aspects using an LLM-as-judge.
humanlm/process_dataset.py
- Added a new script for processing datasets, including persona formatting, thinking trace generation, and mapping to SFT/RL formats for various platforms.
humanlm/reward_function.py
- Added a custom HumanLMRewardManager that extends RewardManagerBase, handling asynchronous reward computation, metric aggregation, and state-specific system prompts.
humanlm/state_config/r.json
- Added a JSON configuration file defining the 'response' state.
humanlm/state_config/r_no_tag.json
- Added a JSON configuration file defining the 'response' state without tags.
humanlm/state_config/sebvgc.json
- Added a JSON configuration file defining multiple psychological states (stance, emotion, belief, value, goal, communication) with descriptions and thresholds.
humanlm/state_config/sebvgcr.json
- Added a JSON configuration file defining multiple psychological states (stance, emotion, belief, value, goal, communication, response) with descriptions and thresholds.
humanlm/state_config/think_r.json
- Added a JSON configuration file defining 'think' and 'response' states.
humanlm/state_dataset.py
- Added a custom StateDataset class that extends RLHFDataset, supporting state augmentation, dynamic system prompt substitution, and heterogeneous thinking.
humanlm/system_prompts/base.txt
- Added a base system prompt template for the HumanLM recipe.
humanlm/system_prompts/r_no_tag.txt
- Added a system prompt template for the 'response' state without tags.
humanlm/system_prompts/sebvgcr_belief.txt
- Added a system prompt template for the 'belief' state.
humanlm/system_prompts/sebvgcr_communication.txt
- Added a system prompt template for the 'communication' state.
humanlm/system_prompts/sebvgcr_emotion.txt
- Added a system prompt template for the 'emotion' state.
humanlm/system_prompts/sebvgcr_goal.txt
- Added a system prompt template for the 'goal' state.
humanlm/system_prompts/sebvgcr_response.txt
- Added a system prompt template for the 'response' state.
humanlm/system_prompts/sebvgcr_stance.txt
- Added a system prompt template for the 'stance' state.
humanlm/system_prompts/sebvgcr_value.txt
- Added a system prompt template for the 'value' state.
humanlm/train_rl_humanlm.sh
- Added a new shell script for training HumanLM using GRPO, including configuration for various modes and datasets.
humanlm/train_sft_humanlm.sh
- Added a new shell script for SFT training of HumanLM, with options for thinking mode and dataset percentages.
humanlm/utils.py
- Added a new utility file with functions for stripping non-response tags, extracting responses, and parsing messages and JSON.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant new feature: the HumanLM recipe for training user simulators. The changes are comprehensive, including new agent loops, reward functions, dataset processing scripts, and training configurations. The implementation shows a deep understanding of the training framework. My review focuses on ensuring the code is robust, portable, and maintainable. I've identified a few critical issues that would prevent the scripts from running correctly, some high-severity issues related to hardcoded paths and performance, and some medium-severity suggestions for improving code quality. Overall, this is a great contribution.

_{Note: Security Review did not run due to the size of the PR.}

wuxibin89 · 2026-03-31T05:31:39Z

Is the json and text file necessary or added accidentally?

ehejin · 2026-04-20T18:27:05Z

The JSON and text files were added intentionally. They store the system prompts used to generate the HumanLM-specific latent states.

ehejin and others added 3 commits February 26, 2026 20:03

added humanlm recipe

6a7dbd3

Merge branch 'verl-project:main' into humanlm

7b2702c

Merge branch 'verl-project:main' into humanlm

9b78cd3

gemini-code-assist Bot reviewed Mar 11, 2026

View reviewed changes

ehejin added 2 commits March 10, 2026 23:37

fixed pre commit error

8962a22

made code-assist changes

b2a1c71

Delete humanlm/state_config/sebvgc.json

82b4b58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[recipe] feat: HumanLM integration for user simulation#63

[recipe] feat: HumanLM integration for user simulation#63
ehejin wants to merge 6 commits into
verl-project:mainfrom
ehejin:humanlm

ehejin commented Mar 11, 2026

Uh oh!

gemini-code-assist Bot commented Mar 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wuxibin89 commented Mar 31, 2026

Uh oh!

ehejin commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ehejin commented Mar 11, 2026

What does this PR do?

Test

Model Samples

Uh oh!

gemini-code-assist Bot commented Mar 11, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wuxibin89 commented Mar 31, 2026

Uh oh!

ehejin commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants