Skip to content

[recipe] feat: HumanLM integration for user simulation#63

Open
ehejin wants to merge 6 commits into
verl-project:mainfrom
ehejin:humanlm
Open

[recipe] feat: HumanLM integration for user simulation#63
ehejin wants to merge 6 commits into
verl-project:mainfrom
ehejin:humanlm

Conversation

@ehejin

@ehejin ehejin commented Mar 11, 2026

Copy link
Copy Markdown

What does this PR do?

This PR adds HumanLM as a training recipe. It includes the following:

  • A customized HumanLMAgentLoop inheriting from AgentLoopBase that enables chain-of-thought only during response generation and applies n-gram repetition penalty
  • A customized HumanLMAgentLoopWorker that extends AgentLoopWorker to enable batched reward computation so all n outputs in a GRPO group are judged together by the LLM judge in a single call
  • A customized HumanLMRewardManager inheriting from RewardManagerBase routes to a different LLM judge system prompt depending on the latent user state being predicted (stance, emotion, belief, etc.)
  • A customized StateDataset that extends RLHFDataset and augments each data instance with distinct system prompts for each user state. This prompts the model to predict the user's stance, emotion, belief, etc. for each instance.

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

The training rewards from a GRPO run on the HumanLM recipe are shown below. The critic/rewards/mean increases steadily from ~0.30 to ~0.45 over 150 steps, and the held-out validation score (val-core/reddit/reward/mean@1) improves steadily. This run produces a model that achieves 25.9 on the HUMANUAL-Opinion benchmark, matching the paper's reported number.

Screen Shot 2026-03-10 at 10 01 53 PM Screen Shot 2026-03-10 at 10 01 22 PM

Model Samples

The following example is drawn from a Reddit AITA post in the HUMANUAL-Opinion benchmark where a woman confronts her husband after he sides with his traditional parents and refuses to cook dinner per their usual arrangement. The task is to predict what the user would say in response to the thread.

Ground truth user response:

"Spell this out for him and mommy. Today. They can't even live up to their ideals, why should OP???"

Trained HumanLM output:

"NTA. If a 'traditional' man is supposed to run the household, why is he taking his mom's advice over his wife's? He's not living up to tradition in any way. A real traditionalist would follow his wife's lead, not his mother's. Also, he's too broke to be sexist, but he's still letting his parents dictate his values. Classic."

Qwen base model output (no training):

"NTA. Traditional man? He took his mom's advice over his wife's. Classic double standard. If he's too broke to be sexist, maybe he should rethink his priorities. Kid'll be 30 and still in detention for not learning."

The trained model produces a more coherent, contextually grounded response and picks up on the irony of the situation and mirrors the reasoning of the ground truth. Unlike the base model, it stays on topic, and doesn't lose coherence.

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the HumanLM training recipe, a novel approach to user simulation that focuses on aligning with psychological state dimensions rather than merely imitating response text. The changes introduce specialized components for agent behavior, reward evaluation, and data processing, allowing for more nuanced and human-like user simulations. This enhancement aims to produce models capable of generating more coherent and contextually grounded responses, as demonstrated by improved benchmark performance.

Highlights

  • HumanLMAgentLoop Customization: Implemented a customized HumanLMAgentLoop that inherits from AgentLoopBase, enabling chain-of-thought during response generation and applying an n-gram repetition penalty to improve output quality.
  • Batched Reward Computation: Introduced a customized HumanLMAgentLoopWorker that extends AgentLoopWorker to facilitate batched reward computation, allowing all 'n' outputs in a GRPO group to be judged together by the LLM judge in a single call.
  • Dynamic LLM Judge Prompting: Developed a customized HumanLMRewardManager inheriting from RewardManagerBase that intelligently routes to different LLM judge system prompts based on the specific latent user state being predicted (e.g., stance, emotion, belief).
  • State-Augmented Dataset: Created a customized StateDataset that extends RLHFDataset to augment each data instance with distinct system prompts for various user states, prompting the model to predict psychological dimensions like stance, emotion, and belief for each instance.
Changelog
  • humanlm/README.md
    • Added comprehensive documentation for the HumanLM recipe, covering environment setup, dataset usage, SFT training, RL training, and citation.
  • humanlm/chat_templates/qwen3_multi_role_template_think.jinja
    • Added a new Jinja template to support multi-role chat with integrated thinking capabilities.
  • humanlm/cluster_config.sh
    • Added a new shell script for cluster configuration, defining environment variables and paths for LLM judge API keys, project directories, and cache locations.
  • humanlm/configs/humanlm_agent_loop_config.yaml
    • Added a new YAML configuration file to register the HumanLMAgentLoop.
  • humanlm/humanlm_agent_loop.py
    • Added a custom HumanLMAgentLoop class that inherits from AgentLoopBase, incorporating n-gram repetition penalty and handling custom chat templates and thinking modes.
  • humanlm/humanlm_agent_loop_worker.py
    • Added a custom HumanLMAgentLoopWorker and HumanLMAgentLoopManager to extend AgentLoopWorker and AgentLoopManager respectively, enabling batched reward computation and injecting global steps.
  • humanlm/logging_utils.py
    • Added a new utility file for logging and debugging, including functions for boolean conversion, string shortening, and debug printing.
  • humanlm/metrics/state_reward.py
    • Added a new metric file for computing batched scores based on psychological state alignment, using LLM-as-judge.
  • humanlm/metrics/state_reward_on_response.py
    • Added a new metric file for computing scores on responses based on multiple psychological aspects using an LLM-as-judge.
  • humanlm/process_dataset.py
    • Added a new script for processing datasets, including persona formatting, thinking trace generation, and mapping to SFT/RL formats for various platforms.
  • humanlm/reward_function.py
    • Added a custom HumanLMRewardManager that extends RewardManagerBase, handling asynchronous reward computation, metric aggregation, and state-specific system prompts.
  • humanlm/state_config/r.json
    • Added a JSON configuration file defining the 'response' state.
  • humanlm/state_config/r_no_tag.json
    • Added a JSON configuration file defining the 'response' state without tags.
  • humanlm/state_config/sebvgc.json
    • Added a JSON configuration file defining multiple psychological states (stance, emotion, belief, value, goal, communication) with descriptions and thresholds.
  • humanlm/state_config/sebvgcr.json
    • Added a JSON configuration file defining multiple psychological states (stance, emotion, belief, value, goal, communication, response) with descriptions and thresholds.
  • humanlm/state_config/think_r.json
    • Added a JSON configuration file defining 'think' and 'response' states.
  • humanlm/state_dataset.py
    • Added a custom StateDataset class that extends RLHFDataset, supporting state augmentation, dynamic system prompt substitution, and heterogeneous thinking.
  • humanlm/system_prompts/base.txt
    • Added a base system prompt template for the HumanLM recipe.
  • humanlm/system_prompts/r_no_tag.txt
    • Added a system prompt template for the 'response' state without tags.
  • humanlm/system_prompts/sebvgcr_belief.txt
    • Added a system prompt template for the 'belief' state.
  • humanlm/system_prompts/sebvgcr_communication.txt
    • Added a system prompt template for the 'communication' state.
  • humanlm/system_prompts/sebvgcr_emotion.txt
    • Added a system prompt template for the 'emotion' state.
  • humanlm/system_prompts/sebvgcr_goal.txt
    • Added a system prompt template for the 'goal' state.
  • humanlm/system_prompts/sebvgcr_response.txt
    • Added a system prompt template for the 'response' state.
  • humanlm/system_prompts/sebvgcr_stance.txt
    • Added a system prompt template for the 'stance' state.
  • humanlm/system_prompts/sebvgcr_value.txt
    • Added a system prompt template for the 'value' state.
  • humanlm/train_rl_humanlm.sh
    • Added a new shell script for training HumanLM using GRPO, including configuration for various modes and datasets.
  • humanlm/train_sft_humanlm.sh
    • Added a new shell script for SFT training of HumanLM, with options for thinking mode and dataset percentages.
  • humanlm/utils.py
    • Added a new utility file with functions for stripping non-response tags, extracting responses, and parsing messages and JSON.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant new feature: the HumanLM recipe for training user simulators. The changes are comprehensive, including new agent loops, reward functions, dataset processing scripts, and training configurations. The implementation shows a deep understanding of the training framework. My review focuses on ensuring the code is robust, portable, and maintainable. I've identified a few critical issues that would prevent the scripts from running correctly, some high-severity issues related to hardcoded paths and performance, and some medium-severity suggestions for improving code quality. Overall, this is a great contribution.

Note: Security Review did not run due to the size of the PR.

Comment thread humanlm/humanlm_agent_loop.py Outdated
Comment thread humanlm/train_sft_humanlm.sh Outdated
Comment thread humanlm/humanlm_agent_loop.py Outdated
Comment thread humanlm/metrics/state_reward.py Outdated
Comment thread humanlm/metrics/state_reward_on_response.py Outdated
Comment thread humanlm/reward_function.py Outdated
Comment thread humanlm/process_dataset.py
Comment thread humanlm/reward_function.py
Comment thread humanlm/reward_function.py Outdated
Comment thread humanlm/train_sft_humanlm.sh
@wuxibin89

Copy link
Copy Markdown
Contributor

Is the json and text file necessary or added accidentally?

@ehejin

ehejin commented Apr 20, 2026

Copy link
Copy Markdown
Author

The JSON and text files were added intentionally. They store the system prompts used to generate the HumanLM-specific latent states.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants