ElasticTrain is a local Docker simulation of a small Slurm-managed cluster that runs Ray workloads. It is useful for experimenting with burst-style worker orchestration before moving the same ideas to a real HPC or cloud environment.
The project starts one Slurm controller container and two Slurm worker containers. Ray runs on the controller as the head node, while Ray worker processes are launched through Slurm allocations.
- Docker
- Docker Compose, either as docker compose or docker-compose
- A shell that can run the included Bash scripts
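Since Compose may be installed either as the docker compose plugin or as the standalone docker-compose binary, a wrapper script can probe for both. The following is a minimal sketch, not part of the project: the function name compose_command and its injectable which/run parameters are assumptions made so the logic can be exercised without Docker installed.

```python
import shutil
import subprocess


def compose_command(which=shutil.which, run=subprocess.run):
    """Return the argv prefix for whichever Compose flavor is installed."""
    # Prefer the Compose plugin, invoked as `docker compose`.
    if which("docker"):
        probe = run(["docker", "compose", "version"], capture_output=True)
        if probe.returncode == 0:
            return ["docker", "compose"]
    # Fall back to the standalone `docker-compose` binary.
    if which("docker-compose"):
        return ["docker-compose"]
    raise RuntimeError("Docker Compose was not found on PATH")
```

Calling compose_command() + ["down", "-v"] would then yield the teardown command for either installation.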
.
├── Dockerfile # Ubuntu image with Slurm, Munge, Ray, and Python dependencies
├── compose.yaml # Local controller and worker containers
├── entrypoint.sh # Starts Munge and the correct Slurm daemon per container role
├── run.sh # Builds and starts the cluster, then attaches Ray workers
├── simulate-autoscale.sh # Demonstrates scaling Ray workers through Slurm jobs
├── slurm.conf # Slurm controller, worker, node, and partition config
└── shared-data/
    ├── autoscale_load.py # Long-running Ray load generator
    └── elastic_task.py # Simple Ray task placement example
Start the local cluster:
./run.sh
When the script completes, the Ray dashboard is available at:
http://localhost:8265
Run the simple Ray workload:
docker exec -it slurm-head python3 /data/elastic_task.py
You should see Ray cluster resources, task placement by container hostname, and completed task messages.
After starting the cluster with ./run.sh, run:
./simulate-autoscale.sh
This script:
- Cancels existing Ray worker Slurm jobs so the scale-up is visible.
- Starts a long-running Ray workload with no worker capacity attached.
- Adds worker-1 through a Slurm allocation.
- Adds worker-2 through a second Slurm allocation.
- Streams workload progress and final task placement.
- Cancels the worker jobs to scale back down.
You can tune the workload with environment variables:
TASKS=80 SECONDS_PER_TASK=30 OBSERVE_SECONDS=10 ./simulate-autoscale.sh
Defaults:
- TASKS=40
- SECONDS_PER_TASK=20
- OBSERVE_SECONDS=8
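To illustrate how a workload script could pick up these variables, here is a minimal sketch. It is not taken from autoscale_load.py: the names DEFAULTS and load_settings and the positive-integer check are assumptions; only the variable names and default values come from the list above.

```python
import os

# Defaults as documented for simulate-autoscale.sh.
DEFAULTS = {"TASKS": 40, "SECONDS_PER_TASK": 20, "OBSERVE_SECONDS": 8}


def load_settings(env=None):
    """Read the tuning variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        value = int(env.get(name, default))  # raises ValueError on non-numeric input
        if value <= 0:
            # Assumption: zero or negative values are treated as mistakes.
            raise ValueError(f"{name} must be a positive integer, got {value}")
        settings[name] = value
    return settings
```

Passing a plain dict instead of os.environ makes the function easy to try out, for example load_settings({"TASKS": "80"}).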
Check Slurm nodes:
docker exec -it slurm-head sinfo -Nel
Check Slurm jobs:
docker exec -it slurm-head squeue
Check Ray status:
docker exec -it slurm-head ray status
Open a shell in the controller:
docker exec -it slurm-head bash
Stop and remove the local cluster:
docker compose down -v
If your system uses the standalone Compose binary:
docker-compose down -v
compose.yaml creates three containers on a shared Docker network:
- slurm-head: Slurm controller and Ray head node
- worker-1: Slurm worker node
- worker-2: Slurm worker node
run.sh rebuilds the image, starts the containers, waits for Slurm to register the nodes, starts Ray on slurm-head, and launches Ray workers with:
srun --job-name=ray-workers --nodes=2 --ntasks=2 ray start --address=<head-ip>:6379 --block
The --block flag keeps each Ray worker process tied to its Slurm allocation, which makes worker lifetime visible and controllable through Slurm.
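The launch command can also be assembled programmatically, which helps when the node count or head address varies. This is a minimal sketch: the helper name ray_worker_srun, its parameters, and the example IP address are hypothetical, while the flags themselves are the ones shown in the command.

```python
import shlex


def ray_worker_srun(head_ip, nodes=2, port=6379, job_name="ray-workers"):
    """Build the srun argv that starts one blocking Ray worker per node."""
    return [
        "srun",
        f"--job-name={job_name}",
        f"--nodes={nodes}",
        f"--ntasks={nodes}",  # one task, and so one Ray worker, per node
        "ray", "start",
        f"--address={head_ip}:{port}",
        "--block",  # stay in the foreground so the Slurm job owns the worker
    ]


command = ray_worker_srun("172.18.0.2")  # hypothetical head IP
print(shlex.join(command))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting concerns.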
- The deterministic Munge key in the Docker image is only for this local simulation. Real clusters must use a privately generated key distributed securely to every node.
- slurm.conf declares each worker with CPUs=10 and RealMemory=2048 (in MB). Adjust those values if you change the simulation size or want different Ray capacity.
- The project mounts ./shared-data to /data in every container, so Python workload files can be edited locally and run inside the cluster without rebuilding the image.
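To see what the CPUs and RealMemory values add up to across the two workers, the node declaration can be parsed and summed. This is a rough sketch: the SAMPLE line is hypothetical (only CPUs=10 and RealMemory=2048 come from the notes above), and the totals describe what Slurm advertises, which may differ from what Ray detects unless resources are passed to ray start explicitly.

```python
def parse_node_line(line):
    """Split a slurm.conf NodeName line into a {key: value} dict."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


# Hypothetical line; the real slurm.conf may declare nodes differently.
SAMPLE = "NodeName=worker-1 CPUs=10 RealMemory=2048 State=UNKNOWN"

node = parse_node_line(SAMPLE)
workers = 2  # worker-1 and worker-2
total_cpus = workers * int(node["CPUs"])
total_mem_mb = workers * int(node["RealMemory"])
print(f"{total_cpus} CPUs, {total_mem_mb} MB across {workers} workers")
# prints "20 CPUs, 4096 MB across 2 workers"
```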
If Slurm nodes do not become ready, inspect node state:
docker exec -it slurm-head sinfo -Nel
If Ray workers do not join, check Ray status:
docker exec -it slurm-head ray status
If you want a clean restart, run:
docker compose down -v
./run.sh
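The node-state check can also be scripted. The sketch below assumes node states are collected with sinfo -h -N -o "%N %T" (one "name state" pair per line); the helper name unready_nodes, the READY_STATES set, and the sample output are assumptions for illustration, not part of the project.

```python
# Assumption: these states mean a node can accept or is running work.
READY_STATES = {"idle", "mixed", "allocated"}


def unready_nodes(sinfo_output):
    """Return the names of nodes whose state is not in READY_STATES."""
    bad = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # A suffix such as '*' (node not responding) fails this exact
        # match, so the node is reported as not ready.
        if state.lower() not in READY_STATES:
            bad.append(name)
    return bad


sample = "worker-1 idle\nworker-2 down*\n"  # hypothetical sinfo output
print(unready_nodes(sample))  # prints ['worker-2']
```

An empty result means every listed node is ready; a non-empty one names the nodes worth inspecting before retrying ./run.sh.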