[recipe] feat: HumanLM integration for user simulation #63
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates the HumanLM training recipe, a novel approach to user simulation that focuses on aligning with psychological state dimensions rather than merely imitating response text. The changes introduce specialized components for agent behavior, reward evaluation, and data processing, allowing for more nuanced and human-like user simulations. This enhancement aims to produce models capable of generating more coherent and contextually grounded responses, as demonstrated by improved benchmark performance.
Highlights
Changelog
Code Review
This pull request introduces a significant new feature: the HumanLM recipe for training user simulators. The changes are comprehensive, including new agent loops, reward functions, dataset processing scripts, and training configurations. The implementation shows a deep understanding of the training framework. My review focuses on ensuring the code is robust, portable, and maintainable. I've identified a few critical issues that would prevent the scripts from running correctly, some high-severity issues related to hardcoded paths and performance, and some medium-severity suggestions for improving code quality. Overall, this is a great contribution.
Note: Security Review did not run due to the size of the PR.
Are the JSON and text files necessary, or were they added accidentally?
The JSON and text files were added intentionally. They store the system prompts used to generate the HumanLM-specific latent states.
What does this PR do?
This PR adds HumanLM as a training recipe. It includes the following:
Test
The training rewards from a GRPO run on the HumanLM recipe are shown below. The critic/rewards/mean metric increases steadily from ~0.30 to ~0.45 over 150 steps, and the held-out validation score (val-core/reddit/reward/mean@1) improves as well. This run produces a model that achieves 25.9 on the HUMANUAL-Opinion benchmark, matching the number reported in the paper.
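For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage normalization that the algorithm is generally described as using: rewards for several responses sampled from the same prompt are centered on the group mean and scaled by the group standard deviation. It is not taken from this PR. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the reward values are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration only; the recipe's actual
    implementation may differ (e.g. in which standard deviation it uses).
    """
    mu = mean(rewards)        # group mean reward
    sigma = pstdev(rewards)   # population std (an assumption of this sketch)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four sampled responses (made-up values)
group = [0.30, 0.45, 0.40, 0.25]
advantages = group_relative_advantages(group)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so a rising mean reward across steps suggests the policy is shifting toward higher-scoring responses.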
Model Samples
The following example is drawn from a Reddit AITA post in the HUMANUAL-Opinion benchmark where a woman confronts her husband after he sides with his traditional parents and refuses to cook dinner per their usual arrangement. The task is to predict what the user would say in response to the thread.
Ground truth user response:
Trained HumanLM output:
Qwen base model output (no training):
The trained model produces a more coherent, contextually grounded response: it picks up on the irony of the situation and mirrors the reasoning of the ground truth. Unlike the base model, it stays on topic and doesn't lose coherence.