[Dataset] CC3M #37

Open
snkv04 wants to merge 20 commits into galilai-group:main from snkv04:main
snkv04 commented Jan 26, 2026

What does this PR do?

It implements the retrieval script for CC3M and tests that script.

A usage example is at stable_datasets/tests/images/test_cc3m.py.

Note: The training split is large and can take a while to download, even at over 1 GB/second. For reference, here are the statistics from my run:

  • train split
    • Space:
      • 270GB of space for the downloads
      • 300MB of space for the processed cache
    • Time:
      • 4 hours to download
      • 21 minutes to yield all the examples to build the dataset
  • val split
    • Space:
      • 1.4GB of space for the downloads
      • 1.6MB of space for the processed cache
    • Time:
      • 1-2 minutes to download
      • A few seconds to yield all the examples to build the dataset
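
For a rough sense of scale, the figures above imply an average download rate that can be worked out directly. The sketch below assumes decimal units (1 GB = 1000 MB) and treats the reported sizes and times as exact, which they are not. The function name is illustrative only and is not part of this PR.

```python
def avg_throughput_mb_s(size_gb: float, seconds: float) -> float:
    """Average download rate in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / seconds


# train split: 270 GB downloaded in about 4 hours
train_rate = avg_throughput_mb_s(270, 4 * 3600)

# val split: 1.4 GB downloaded in about 1 to 2 minutes
val_rate_fast = avg_throughput_mb_s(1.4, 60)
val_rate_slow = avg_throughput_mb_s(1.4, 120)

print(f"train: {train_rate:.2f} MB/s")
print(f"val:   {val_rate_slow:.2f} to {val_rate_fast:.2f} MB/s")
```

Both splits come out at an effective rate in the tens of MB/s, so the wall-clock time is probably dominated by something other than raw bandwidth, though the numbers above cannot say what.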

Before submitting

  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines
  • Did you write any new necessary tests?

Who can review?

@RandallBalestriero @Leon-Leyang

snkv04 commented Feb 1, 2026

Author

By the way, @Leon-Leyang, if you want to test this dataset without retrieving the full set of data items so that the tests run faster, you can set FIRST_N_IMAGES_PER_SPLIT in cc3m.py to a smaller value (e.g., 2000).
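
As a minimal sketch of how a per-split cap like this can work in general, the snippet below truncates a stream of (caption, URL) rows with `itertools.islice`. This is an assumption for illustration, not the actual logic in cc3m.py: the constant here is a local stand-in, and `limit_rows` and `fake_rows` are hypothetical helpers.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, str]  # (caption, image URL)

# Local stand-in for the constant in cc3m.py (assumed semantics: None = no cap).
FIRST_N_IMAGES_PER_SPLIT: Optional[int] = 2000


def limit_rows(rows: Iterable[Row], first_n: Optional[int]) -> Iterator[Row]:
    """Yield at most first_n rows; islice with stop=None yields every row."""
    return islice(rows, first_n)


def fake_rows(n: int) -> Iterator[Row]:
    """Hypothetical placeholder data standing in for a split's metadata file."""
    for i in range(n):
        yield f"caption {i}", f"https://example.com/{i}.jpg"


subset = list(limit_rows(fake_rows(10_000), FIRST_N_IMAGES_PER_SPLIT))
print(len(subset))
```

Because the cap is applied lazily, rows past the limit are never generated, which is what keeps a shortened test run cheap.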

@Leon-Leyang

Collaborator

> By the way, @Leon-Leyang, if you want to test this dataset without retrieving the full set of data items so that the tests run faster, you can set FIRST_N_IMAGES_PER_SPLIT in cc3m.py to a smaller value (e.g., 2000).

Got it. Thanks!
