Skip to content

Avoid half-deleted cache trees on concurrent pip-compile --rebuild#2384

Open
gaborbernat wants to merge 2 commits into
jazzband:mainfrom
gaborbernat:fix/clear-caches-parallel-safe
Open

Avoid half-deleted cache trees on concurrent pip-compile --rebuild#2384
gaborbernat wants to merge 2 commits into
jazzband:mainfrom
gaborbernat:fix/clear-caches-parallel-safe

Conversation

@gaborbernat
Copy link
Copy Markdown
Contributor

@gaborbernat gaborbernat commented May 8, 2026

Two pip-compile --rebuild invocations sharing a cache_dir race inside PyPIRepository.clear_caches. Both call shutil.rmtree(self._download_dir, ignore_errors=True) on the same path, and the second resolver walks a tree the first started removing. ignore_errors=True masks the error but not the corruption.

The patch swaps the in-place delete for an atomic os.replace to a per-PID sibling directory, then deletes the renamed copy. os.replace is atomic on POSIX and NTFS, so a concurrent peer sees the original directory or no directory, never a half-deleted tree.

Scope is intentionally narrow per discussion with @sirosen below: this fixes one observable symptom of #2393, not the wider race where a process is already using the cache and a peer --rebuild later renames the directory out from under it. Full concurrent-rebuild safety needs locking around the resolve and is left for future work.

Refs #2393.

@gaborbernat gaborbernat marked this pull request as draft May 8, 2026 06:00
Copy link
Copy Markdown
Member

@webknjaz webknjaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how I feel about adding more mocks. The general preference is having more integration tests..

Comment thread tests/test_repository_pypi.py Outdated
Comment thread changelog.d/+clear-caches-parallel.bugfix.md Outdated
Comment thread changelog.d/+clear-caches-parallel.bugfix.md Outdated
@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch 3 times, most recently from ddd020c to 9eb0755 Compare May 8, 2026 16:37
@sirosen
Copy link
Copy Markdown
Member

sirosen commented May 8, 2026

Something I'm slightly unclear about is the ordering on this and safety offered. Isn't it a problem if the second process does the rename operation while the first process is using the cache directory?
Did I miss some detail which makes the collision I describe below impossible?

Since it's a concurrency issue, I put together a sequence diagram.

sequence diagram source
title \-\-rebuild collision

participant Process1
participant Process2
participant SharedCacheDir
participant Proc1CacheDir
participant Proc2CacheDir

activate SharedCacheDir
box over Process1:start
Process1->SharedCacheDir: clear_caches
SharedCacheDir->Proc1CacheDir: rename (by proc1)
deactivate SharedCacheDir
Process1->Proc1CacheDir: rmtree
Process1->SharedCacheDir: start using cache
activate SharedCacheDir
box over Process2:start
Process2->SharedCacheDir: clear_caches
SharedCacheDir-[#red]->Proc2CacheDir: rename (by proc2)
image

My thought is that once a process is running and using the cache directory, the second process would still move it, not being aware. We've guaranteed that concurrent removals appear to be atomic, but not that concurrent use of a shared cache directory is really safe.

@gaborbernat
Copy link
Copy Markdown
Contributor Author

You're right, the PR title oversells what the change actually delivers. The atomic rename only collapses the specific window where rmtree partially deletes a tree another resolver is reading; it does not protect a process that has already started using the freshly-cleared cache from a peer that arrives later, calls clear_caches again, and renames the directory out from under it.

The wider concurrent---rebuild story really wants a lock around the whole resolve, not just around the cache wipe. That's a bigger change than this PR scope and I'd rather not bolt it on here.

Three options I see, happy to take direction:

  1. Re-scope this PR to just "make rmtree atomic so fork CI / test runs cannot trip over partial trees", retitle accordingly, and open a separate issue/PR for proper file-based locking around the whole resolve.
  2. Close this and bundle the proper fix into the locking work later.
  3. Expand here to add fcntl / msvcrt.locking around clear_caches plus the resolve pass so --rebuild is genuinely concurrent-safe.

I lean toward (1) because it keeps the PR small and reviewable while still fixing a real (if narrower) symptom; (3) is the right end-state but a meaningfully larger surface. Which would you prefer?

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch 3 times, most recently from 96a590d to 7f6e67f Compare May 8, 2026 21:57
@gaborbernat gaborbernat marked this pull request as ready for review May 8, 2026 21:58
@sirosen
Copy link
Copy Markdown
Member

sirosen commented May 9, 2026

I lean toward (1) because it keeps the PR small and reviewable while still fixing a real (if narrower) symptom; (3) is the right end-state but a meaningfully larger surface. Which would you prefer?

I fully agree. Let's clarify that this is not a full fix for the identified issue, and we can go after a complete solution in the future.

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 7f6e67f to d117ea7 Compare May 9, 2026 19:33
@gaborbernat gaborbernat changed the title 🐛 fix(repo): make clear_caches safe under concurrent rebuild Avoid half-deleted cache trees on concurrent pip-compile --rebuild May 9, 2026
@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from d117ea7 to 5820e76 Compare May 9, 2026 19:38
@gaborbernat gaborbernat requested review from sirosen and webknjaz May 9, 2026 19:38
@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 5820e76 to 2a6030e Compare May 9, 2026 19:42
Copy link
Copy Markdown
Member

@webknjaz webknjaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall shape looks good, but I've got a few more suggestions and requests below.

Comment thread changelog.d/2393.bugfix.md
Comment thread changelog.d/2393.bugfix.md Outdated
@@ -0,0 +1,6 @@
Atomically rename the download directory before deleting it in
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using imperative mood, rephrase this to use a comparison with the previous release (X now does Y) or the past tense (X was done/changed to do Y). Imperative mood is good for commit messages but in a change log, it makes it look like a TODO-list, which is not at all what we want.

try:
os.replace(self._download_dir, stale)
except OSError:
return
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs to be logged in the very verbose mode. Especially since any arbitrary OSError is suppressed w/o checking for specific error codes that may change over time.

cc @sirosen WDYT?

Comment thread piptools/repositories/pypi.py Outdated
# around the resolve, which is out of scope here.
if not os.path.exists(self._download_dir):
return
stale = f"{self._download_dir}.stale-{os.getpid()}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a more specific var name? stale is quite generic. old_downloads_dir, perhaps?


@pytest.fixture
def repo_with_cache_dir(tmp_path: Path) -> PyPIRepository:
return PyPIRepository([], cache_dir=str(tmp_path / "cache"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: cache_dir could've accepted pathlib objects already if we were to just relax typing in the initializer… An idea for a follow-up.

) -> None:
Path(repo_with_cache_dir._download_dir).mkdir(parents=True, exist_ok=True)
mocker.patch(
"piptools.repositories.pypi.os.replace",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're already integrating pytest-mock, perhaps also apply a mocker.spy() on top, to confirm the args being passed?

Comment thread pyproject.toml Outdated
[project.optional-dependencies]
testing = [
"pytest >= 7.2.0",
"pytest-mock >= 3.15.1",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this version really the lowest compatible? Let's find out which one is necessary and set it to that. Or drop the lower bound if any version works.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only version we tested agaisnt and know to work, unless we have a CI that runs against lowest version we're really just hoping that those low relaxed versions work.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the lower boundaries should reflect APIs relied upon (or known broken things). Without such knowledge, we shouldn't assume that something won't work on older versions, this would only cause problems for downstream packagers for no good reason. After all, the depresolvers would optimize for latest releases anyway.

On a related note, we should look into switching from extras to depgroups once that's supported (I know tox does support these, but I wanted to wait until pip-tools itself does too).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I heard that #2380 can help with that 😉

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I did lower it to 3.3.0 that's the first compatible version.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's too big to consider at the moment + I was hoping #2175 would go in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😊 one can dream and hope 😊

@webknjaz
Copy link
Copy Markdown
Member

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 254ea5b to dccaadd Compare May 14, 2026 17:09
Two ``pip-compile --rebuild`` invocations sharing a ``cache_dir`` race
inside ``PyPIRepository.clear_caches``. Both call
``shutil.rmtree(self._download_dir, ignore_errors=True)`` on the same
path, and the second resolver walks a tree the first started
removing. ``ignore_errors=True`` masks the error but not the
corruption.

The patch swaps the in-place delete for an atomic ``os.replace`` to a
per-PID sibling directory followed by ``rmtree`` of the renamed copy.
``os.replace`` is atomic on POSIX and NTFS, so a concurrent peer sees
either the old directory or no directory, never a half-deleted tree.

Scope is intentionally narrow: this fixes one observable symptom of
jazzband#2393, not the wider race where a process is already using the
cache and a peer ``--rebuild`` later renames the directory out from
under it. Full concurrent-rebuild safety needs locking around the
resolve and is left for future work.
@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from dccaadd to f129240 Compare May 14, 2026 17:20
@gaborbernat gaborbernat requested review from webknjaz May 14, 2026 17:33
@webknjaz
Copy link
Copy Markdown
Member

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from f129240 to e05cb04 Compare May 15, 2026 15:17
Comment thread piptools/repositories/pypi.py Outdated
old_downloads_dir = f"{self._download_dir}.stale-{os.getpid()}"
try:
os.replace(self._download_dir, old_downloads_dir)
except OSError as exc:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use an os_err instead of a generic var name?

Comment thread piptools/repositories/pypi.py Outdated
# move (cross-device, permissions). The directory is no longer
# ours to clear; log at -v since the suppressed OSError is
# otherwise invisible and its error codes shift across platforms.
log.debug(f"clear_caches skipped {self._download_dir!r}: {exc}")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be

Suggested change
log.debug(f"clear_caches skipped {self._download_dir!r}: {exc}")
log.debug(f"clear_caches skipped {self._download_dir!s}: {os_err}")

? We probably don't need to call that repr interface.

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from e05cb04 to 828dace Compare May 15, 2026 15:22
Comment thread tests/test_cli_compile.py Outdated
assert "file:small_fake_with_deps-0.1-py2.py3-none-any.whl" in out.stderr


@pytest.mark.skipif(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this directly related to the patch? How? Why do you think it's deterministic?

Comment thread tests/test_cli_compile.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/jazzband/pip-tools/actions/runs/25870600925/job/76024490776#step:9:1111 proves that this test change is not related to this PR as it was failing on main before. Meaning this shouldn't be included in the scope.

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 828dace to 87816c9 Compare May 15, 2026 15:28
Comment thread pyproject.toml
testing = [
"pytest >= 7.2.0",
"pytest-mock >= 3.15.1",
"pytest-mock >= 3.3.0",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to a separate PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is needed for this PR. I don't know man this feels to me like we're being pedantic and generating lots of new work for negligable benefit 🤷

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good Git hygiene means being able to git revert things w/o additional edits by having atomic changes bundled. The addition of that dep with a wrong version spec slipped into main from another PR, so this PR doesn't need to change it, but the change still needs to happen separately.

I guess we could keep it in this PR if it moves to an atomic commit within the branch.

Copy link
Copy Markdown
Contributor Author

@gaborbernat gaborbernat May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I can sympathize with the sentiment here, at the same time there is a trade-off to be made about chasing perfection and balancing the return on the time investment we are asking of people. My personal stance is that we should aim for good enough. Not everything needs to be a revert, and often, when something goes wrong in my experience, it's easier to just follow it up with the pull request that either fixes that issue or reverses the part of a bigger PR that didn't work out. This very easily could be just a difference of style when contributing and maintaining projects between us.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, it's just not good enough. Good shape of Git is important to me. Otherwise, we'd be allowing fake squash merges and other questionable approaches.

@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 87816c9 to 9f09086 Compare May 15, 2026 15:44
@webknjaz
Copy link
Copy Markdown
Member

The spy question stands. Also waiting for Stephen's feedback on questions. Perhaps, this should be marked as a draft while being constantly force-pushed?

@gaborbernat
Copy link
Copy Markdown
Contributor Author

Perhaps, this should be marked as a draft while being constantly force-pushed?

It's ready from POV, I'm just addressing your constant feedback, hence constant force-push.

@webknjaz
Copy link
Copy Markdown
Member

FYI, it's a good tone to follow https://hynek.me/til/easier-crediting-contributors-github/.

webknjaz's review left several requests on the parallel-rebuild fix.

The renamed-aside directory variable moves from the generic `stale` to
`old_downloads_dir` so the name says what it holds. The suppressed
`OSError` now logs at `-v` rather than vanishing: a racing peer or an
OS-level rename refusal is worth a breadcrumb, and the error codes that
trigger it shift across platforms, so silently swallowing it hides a
real signal.

The changelog fragment moves out of imperative mood into the "X now
does Y" form the towncrier convention expects, routes the second
`pip-compile --rebuild` mention through the `{command}` role, and gains
the `by` keyword in the byline. `changelog.d/2384.bugfix.md` is a
symlink to `2393.bugfix.md` so the note renders under both the issue
and the PR.

The `pytest-mock` lower bound drops to match the unbounded style of the
sibling test dependencies (`pytest-rerunfailures`, `pytest-xdist`,
`tomli-w`); pinning one test helper to a single tested version while
the rest float was inconsistent. `test_clear_caches_swallows_replace_
failure` now asserts the rename was attempted against the live download
directory before the OSError short-circuits the rmtree.
@gaborbernat gaborbernat force-pushed the fix/clear-caches-parallel-safe branch from 9f09086 to b58c4bb Compare May 15, 2026 15:46
@webknjaz
Copy link
Copy Markdown
Member

This was almost ready, I'd rather see it completed (maybe by my or Stephen).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants