DiCoRGI

Environment Setup

First, set up the conda environment on NYU GREENE:

Read the JUPYTER.md file for detailed setup instructions
```
cat JUPYTER.md
```
Follow the instructions in JUPYTER.md to create and activate your conda environment

Package Installation

After setting up the conda environment, install the required packages:

Run the req.sbatch script to download and install all necessary packages for the conda environment
```
sbatch req.sbatch
```
Check the status of your job using squeue -u $USER. Once the job no longer appears in the queue, inspect its output log to confirm that it completed successfully
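
Because squeue only lists pending and running jobs, it cannot say whether a finished job succeeded. The following is a minimal sketch of one way to check, assuming Slurm accounting (sacct) is available on the cluster. The helper names is_job_ok and job_state are hypothetical and not part of this repository.

```shell
# Succeed only if a Slurm state string means the job finished cleanly.
# Other final states include FAILED, TIMEOUT and OUT_OF_MEMORY.
is_job_ok() {
  [ "$1" = "COMPLETED" ]
}

# Print the final state of a job, using the job's allocation line only.
# $1 = job ID (the number sbatch printed at submission).
job_state() {
  sacct -j "$1" -X -n -o State | awk 'NR == 1 { print $1 }'
}

# Typical use on the cluster, after the job has left the queue:
#   is_job_ok "$(job_state JOBID)" && echo "package installation finished"
```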

Data Generation

Generate the training data:

Navigate to the data_gen folder
```
cd data_gen
```
Run the rearc.sbatch script to generate 10,000 examples for each of the 400 ARC problems
```
sbatch rearc.sbatch
```
This process may take some time to complete. You can monitor the job status using:
```
squeue -u $USER
```
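
For a long-running job it can be convenient to block until the job has left the queue instead of re-running squeue by hand. This is a rough sketch only. The function name wait_for_job and the default 60-second polling interval are assumptions chosen for illustration.

```shell
# Block until the given job no longer appears in squeue.
# $1 = job ID, $2 = seconds between polls (default 60).
# Caveat: a failed squeue call also prints nothing, so a transient
# scheduler error would end the wait early.
wait_for_job() {
  while [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]; do
    sleep "${2:-60}"
  done
}

# Typical use:
#   wait_for_job JOBID 120 && echo "job has left the queue"
```

Leaving the queue does not imply success, so the job's output log should still be checked afterwards.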

Data Verification

Verify that the data generation was successful:

Open and run the visualization.ipynb notebook to inspect the generated data
```
jupyter notebook visualization.ipynb
```
Ensure that the notebook shows the correct number of examples (10,000 for each problem) and that the data format is as expected
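
A quick check from the command line can complement the notebook. The sketch below assumes the generator writes one .json file per ARC problem into a single directory. That layout is an assumption, as are the function name count_task_files and the placeholder path, so adapt it to whatever rearc.sbatch actually produces.

```shell
# Count the .json files directly inside a directory.
# $1 = directory holding the generated data (layout is an assumption).
count_task_files() {
  find "$1" -maxdepth 1 -type f -name '*.json' | wc -l | tr -d ' '
}

# Typical use (replace the path with the real output directory):
#   [ "$(count_task_files path/to/generated/data)" -eq 400 ] \
#     && echo "found files for all 400 problems"
```

This only confirms that every problem produced a file. The number of examples per problem still needs to be checked in the notebook.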

Model Training

After data generation and verification, proceed with the model training:

Navigate to the llada folder
```
cd ../llada
```
Run the parallel processing script
```
sbatch llada_parallel.sbatch
```
After the parallel processing completes, run the Supervised Fine-Tuning (SFT) script
```
sbatch llada_sft.sbatch
```
Monitor both jobs using squeue -u $USER and check the output logs for any errors
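
The two training jobs can also be chained so that the SFT job starts only if the parallel processing job succeeds. This is one possible approach using Slurm job dependencies, not necessarily how the scripts were intended to be run. The helper name afterok_flag is hypothetical.

```shell
# Build the sbatch flag that makes a job wait for another job to succeed.
# $1 = output of `sbatch --parsable`, which is "JOBID" or "JOBID;CLUSTER".
afterok_flag() {
  printf '%s\n' "--dependency=afterok:${1%%;*}"
}

# Typical use from the llada folder:
#   prep_id=$(sbatch --parsable llada_parallel.sbatch)
#   sbatch "$(afterok_flag "$prep_id")" llada_sft.sbatch
```

With afterok, the dependent job will not run if the first job fails, so squeue should still be checked for a job stuck in the pending state.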

Troubleshooting

If you encounter any issues:

Check the job output logs in the Slurm output files (typically named slurm-JOBID.out)
Verify that all paths in the sbatch scripts are correct
Ensure that the conda environment is properly activated in each sbatch script
Check for sufficient disk space and compute resources
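
The log and disk-space checks above can be sketched as follows. The function name find_failed_logs is hypothetical, and matching on the words "error" and "traceback" is a heuristic that may miss some failures or flag harmless messages.

```shell
# List Slurm log files that mention an error or a Python traceback.
# $1 = directory to search (default: current directory).
find_failed_logs() {
  grep -l -i -E 'error|traceback' "${1:-.}"/slurm-*.out 2>/dev/null
}

# Typical use:
#   find_failed_logs .          # which logs need attention
#   tail -n 50 slurm-JOBID.out  # read the end of one of them
#   df -h .                     # free space on the current filesystem
```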

Notes

The data generation step (rearc.sbatch) creates 10,000 examples for each of the 400 ARC problems, which will require significant disk space
The llada_parallel.sbatch and llada_sft.sbatch scripts use GPUs, so make sure your allocation has sufficient GPU time available
Depending on your resource allocation, you may need to adjust the resource requests in the sbatch scripts
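
Resource requests can be inspected, and overridden for a single submission, without editing the scripts, since options given on the sbatch command line take precedence over #SBATCH directives in the script. In this sketch the helper name show_requests is hypothetical and the override values are placeholders, not recommendations.

```shell
# Print the #SBATCH directives of a script so the current requests are visible.
# $1 = path to an sbatch script.
show_requests() {
  grep '^#SBATCH' "$1"
}

# Typical use (the values are placeholders, not recommendations):
#   show_requests llada_sft.sbatch
#   sbatch --time=04:00:00 --mem=32G llada_sft.sbatch
```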
