fix: restore global_step from .states when resuming training #18112

Open
ltianyi992 wants to merge 1 commit into PaddlePaddle:main from ltianyi992:fix/issue-18108-warning-resume-training-loses-global-ste

Conversation

@ltianyi992

Summary

Closes #18108

load_model() in ppocr/utils/save_load.py correctly restores start_epoch from the .states file, but silently drops global_step. Because program.train() initialises global_step from pre_best_model_dict["global_step"] (lines 232–234), the counter resets to 0 on every resume — corrupting log timestamps, eval scheduling, VisualDL step indices, and checkpoint metadata.
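The consumer side described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in program.train(): the function name initial_global_step and the sample dictionaries are hypothetical, and only the key name "global_step" and the default of 0 come from the description.

```python
def initial_global_step(pre_best_model_dict):
    # Hypothetical stand-in for the consumer logic in program.train():
    # start at 0 unless the loaded dict carries a saved step count.
    global_step = 0
    if "global_step" in pre_best_model_dict:
        global_step = pre_best_model_dict["global_step"]
    return global_step


# Before the fix: load_model() returns a dict without "global_step".
before_fix = {"start_epoch": 5}
# After the fix: the key is carried over from the .states file.
after_fix = {"start_epoch": 5, "global_step": 1200}

print(initial_global_step(before_fix))  # 0
print(initial_global_step(after_fix))   # 1200
```

With the key missing, the consumer has no way to tell a fresh run from a resumed one, which is why the reset is silent.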

Root cause

Both code paths inside load_model() read the .states file and populate best_model_dict, but neither included:

if "global_step" in states_dict:
    best_model_dict["global_step"] = states_dict["global_step"]

Fix

Add the two-line restore to both paths:

  1. The KIE path (metric.states — line ~92)
  2. The main path (.states — line ~162)

No other logic changed. program.train() already has the consumer code; this patch just feeds it the correct value.
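As a rough sketch of what each path does after the patch, assuming the .states content has already been loaded into a plain dict: the helper name restore_from_states is hypothetical, and the "epoch" key with its + 1 offset is an assumption about how start_epoch is derived, not something stated in this PR.

```python
def restore_from_states(states_dict):
    # Hypothetical helper; the real logic lives inline in load_model().
    best_model_dict = {}

    # Assumed pre-existing behaviour: resume from the epoch after the saved one.
    if "epoch" in states_dict:
        best_model_dict["start_epoch"] = states_dict["epoch"] + 1

    # The two lines this PR adds to both code paths.
    if "global_step" in states_dict:
        best_model_dict["global_step"] = states_dict["global_step"]

    return best_model_dict


print(restore_from_states({"epoch": 3, "global_step": 1200}))
# {'start_epoch': 4, 'global_step': 1200}
print(restore_from_states({"epoch": 3}))
# {'start_epoch': 4}
```

The membership check keeps older checkpoints that lack the key loadable; they simply fall back to the consumer's default.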

Test plan

  • Resume a training run from a checkpoint saved at step N; verify logs show global_step starting at N+1 instead of 1
  • Verify no regression for fresh-start training (no .states file present)
  • Verify KIE model resume also restores global_step correctly
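The first two checks could be approximated without a full training run, along these lines. This is only a sketch: it assumes the .states file is a pickled dict, and the helper names, the file name latest.states, and the single increment standing in for one training iteration are all hypothetical.

```python
import os
import pickle
import tempfile


def load_states(path):
    """Return the saved states dict, or {} when no .states file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def resume_global_step(states_dict):
    # Mirrors the patched restore plus the consumer's default of 0.
    best_model_dict = {}
    if "global_step" in states_dict:
        best_model_dict["global_step"] = states_dict["global_step"]
    return best_model_dict.get("global_step", 0)


def check_resume(n):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.states")

        # Fresh start: no .states file, so the counter begins at 0.
        assert resume_global_step(load_states(path)) == 0

        # Simulate a checkpoint saved at step n.
        with open(path, "wb") as f:
            pickle.dump({"epoch": 3, "global_step": n}, f)

        # Resume, then run one simulated training iteration.
        step = resume_global_step(load_states(path))
        step += 1
        return step


print(check_resume(500))  # 501
```

A check like this covers only the dict handling; the KIE path and the real log output still need a run against actual checkpoints.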

load_model() restored start_epoch from the .states file but silently
dropped global_step, causing resumed training to restart the step
counter at 0. This broke log timestamps, eval scheduling, VisualDL
steps, and checkpoint metadata.

program.train() already reads global_step from pre_best_model_dict
(lines 232-234) — this patch simply populates that key in both code
paths inside load_model():
- the KIE (metric.states) path
- the main (.states) path

Closes PaddlePaddle#18108
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


james seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI left a comment

Pull request overview

This PR fixes resume-training state propagation by restoring global_step from the checkpoint .states/metric.states files inside ppocr/utils/save_load.py, preventing tools/program.py from resetting the step counter to 0 on resume.

Changes:

  • Restore global_step from states_dict into best_model_dict in the KIE resume path (metric.states).
  • Restore global_step from states_dict into best_model_dict in the standard resume path (.states).

Development

Successfully merging this pull request may close these issues.

Warning:Resume training loses global_step because load_model does not restore global_step from .states

3 participants