fix: mindspeed-mm backend adopt custom trainer's gradient clipping an…#84
Zhang1Sheng wants to merge 1 commit into
…d optimizer methods
Code Review
This pull request updates the MMActorRolloutRefWorker to ensure model parameters are correctly moved to the GPU before rollout and introduces a registered init_model method for handling optimizer steps. The review feedback identifies a critical regression: removing init_context from trainer initialization could lead to out-of-memory errors. It also recommends a dedicated actor_trainer attribute instead of a generic trainer attribute, so that state is not overwritten when the worker manages both actor and reference models.
      )

    - with init_context(), warnings.catch_warnings():
    + with warnings.catch_warnings():
The init_context manager is defined on line 283 but is not used in the with statement. This is a regression that may lead to high memory usage or OOM during model initialization, as the meta-tensor context or other initialization optimizations won't be applied to the Trainer instantiation.
    - with warnings.catch_warnings():
    + with init_context, warnings.catch_warnings():
    self.mm_args.parallel.fsdp_plan.cpu_offload = True

    trainer = Trainer(args=self.mm_args, dataloader_provider=self._dataloader)
    self.trainer = trainer
Using a single self.trainer attribute is problematic if the worker manages both an actor and a reference model (e.g., in Role.ActorRolloutRef mode). The reference model's trainer will overwrite the actor's trainer, causing the _optimizer_step (defined in init_model) to use the wrong model and a None optimizer. It is safer to explicitly store the actor's trainer.
    - self.trainer = trainer
    + if role == "actor":
    +     self.actor_trainer = trainer
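The overwrite described above can be reproduced with a small self-contained sketch. `StubTrainer`, `SharedAttrWorker`, and `PerRoleWorker` are invented for illustration and are not part of the PR; the assumption that only the actor's trainer carries an optimizer is taken from the review comment.

```python
class StubTrainer:
    """Invented stand-in for the real Trainer; only the fields needed here."""

    def __init__(self, role):
        self.role = role
        # Assumption from the review comment: the reference trainer has no optimizer.
        self.optimizer = object() if role == "actor" else None


class SharedAttrWorker:
    """Stores every trainer on one attribute, as in the diff under review."""

    def __init__(self):
        self.trainer = None

    def build(self, role):
        self.trainer = StubTrainer(role)


class PerRoleWorker:
    """Keeps the actor's trainer on its own attribute, as the review suggests."""

    def __init__(self):
        self.actor_trainer = None

    def build(self, role):
        trainer = StubTrainer(role)
        if role == "actor":
            self.actor_trainer = trainer
        return trainer


shared = SharedAttrWorker()
per_role = PerRoleWorker()
for role in ("actor", "ref"):  # build the actor first, then the reference model
    shared.build(role)
    per_role.build(role)

# The shared attribute now points at the reference trainer, whose optimizer is None.
print(shared.trainer.role, shared.trainer.optimizer)
# The dedicated attribute still points at the actor's trainer.
print(per_role.actor_trainer.role, per_role.actor_trainer.optimizer is not None)
```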
    grad_norm = clip_grad_norm(self.trainer.model, max_norm=self.mm_args.training.clip_grad, foreach=self.mm_args.training.clip_grad_foreach)
    # Update parameters
    if self._is_actor:
        self.trainer.optimizer.step()
Update this to use self.actor_trainer to ensure the correct model and optimizer are used, especially when both actor and reference models are present in the same worker. This prevents using the reference model's trainer (which has no optimizer) for the actor's update step.
    - grad_norm = clip_grad_norm(self.trainer.model, max_norm=self.mm_args.training.clip_grad, foreach=self.mm_args.training.clip_grad_foreach)
    - # Update parameters
    - if self._is_actor:
    -     self.trainer.optimizer.step()
    + grad_norm = clip_grad_norm(self.actor_trainer.model, max_norm=self.mm_args.training.clip_grad, foreach=self.mm_args.training.clip_grad_foreach)
    + # Update parameters
    + if self._is_actor:
    +     self.actor_trainer.optimizer.step()
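For readers unfamiliar with the clip-then-step pattern in this hunk, the following pure-Python sketch shows the usual global-norm scheme on a flat list of gradients. It is not the `clip_grad_norm` used by the PR, whose implementation is not shown here; `clip_grad_norm_sketch`, `StubOptimizer`, and the `eps` value are assumptions made for illustration.

```python
import math


def clip_grad_norm_sketch(grads, max_norm, eps=1e-6):
    """Scale grads in place so their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale gradients up; only shrink them when the norm exceeds max_norm.
    scale = min(1.0, max_norm / (total_norm + eps))
    for i, g in enumerate(grads):
        grads[i] = g * scale
    return total_norm


class StubOptimizer:
    """Invented plain-SGD stand-in for the actor trainer's optimizer."""

    def __init__(self, params, grads, lr):
        self.params = params
        self.grads = grads
        self.lr = lr

    def step(self):
        for i, g in enumerate(self.grads):
            self.params[i] -= self.lr * g


params = [1.0, 2.0]
grads = [3.0, 4.0]  # global L2 norm is 5.0
optimizer = StubOptimizer(params, grads, lr=0.1)
is_actor = True

# Clip first, then update parameters only on the actor.
grad_norm = clip_grad_norm_sketch(grads, max_norm=1.0)
if is_actor:
    optimizer.step()

print(grad_norm, grads, params)
```

Clipping runs before the step so that the optimizer only ever sees the rescaled gradients, and the returned pre-clip norm remains available for logging.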
    class MMActorRolloutRefWorker(ActorRolloutRefWorker):
        def __init__(self, config: DictConfig, role: str, **kwargs):
            super().__init__(config, role, **kwargs)
            self.trainer = None
mindspeed-mm backend adopt custom trainer's gradient clipping and optimizer methods