ElasticTrain

ElasticTrain is a local Docker simulation of a small Slurm-managed cluster that runs Ray workloads. It is useful for experimenting with burst-style worker orchestration before moving the same ideas to a real HPC or cloud environment.

The project starts one Slurm controller container and two Slurm worker containers. Ray runs on the controller as the head node, while Ray worker processes are launched through Slurm allocations.

Requirements

Docker
Docker Compose, either as docker compose or docker-compose
A shell that can run the included Bash scripts
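
Because Compose may be installed in either form, a wrapper script can detect which one is present before calling it. The function below is a minimal sketch of that check; detect_compose is a hypothetical helper name and is not part of the repository.

```shell
# Print the Compose command available on this machine.
# Prefers the plugin form (docker compose) over the standalone binary.
# detect_compose is a hypothetical helper, not a script shipped with the project.
detect_compose() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "Docker Compose not found" >&2
    return 1
  fi
}
```

A caller could then run COMPOSE=$(detect_compose) && $COMPOSE ps to list the containers with whichever form was found.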

Project Layout

.
├── Dockerfile                 # Ubuntu image with Slurm, Munge, Ray, and Python dependencies
├── compose.yaml               # Local controller and worker containers
├── entrypoint.sh              # Starts Munge and the correct Slurm daemon per container role
├── run.sh                     # Builds and starts the cluster, then attaches Ray workers
├── simulate-autoscale.sh      # Demonstrates scaling Ray workers through Slurm jobs
├── slurm.conf                 # Slurm controller, worker, node, and partition config
└── shared-data/
    ├── autoscale_load.py      # Long-running Ray load generator
    └── elastic_task.py        # Simple Ray task placement example

Quick Start

Start the local cluster:

./run.sh

When the script completes, the Ray dashboard is available at:

http://localhost:8265

Run the simple Ray workload:

docker exec -it slurm-head python3 /data/elastic_task.py

The output should list the Ray cluster resources, show which container hostname ran each task, and print a message as each task completes.

Autoscale Simulation

After starting the cluster with ./run.sh, run:

./simulate-autoscale.sh

This script:

1. Cancels existing Ray worker Slurm jobs so the scale-up is visible.
2. Starts a long-running Ray workload with no worker capacity attached.
3. Adds worker-1 through a Slurm allocation.
4. Adds worker-2 through a second Slurm allocation.
5. Streams workload progress and final task placement.
6. Cancels the worker jobs to scale back down.
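
The scale-up and scale-down steps above can be sketched as one Slurm job per worker. This is an illustration of the idea, not the contents of simulate-autoscale.sh. It assumes that the Slurm node names match the container names (worker-1, worker-2). The job names ray-worker-1 and ray-worker-2 and the helper functions are hypothetical. With DRY_RUN=1, the default here, each command is printed instead of executed.

```shell
# Sketch only: one Slurm job per Ray worker, so each can be added or removed alone.
# Assumptions: Slurm node names equal the container names; job names are hypothetical.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Attach one Ray worker on a specific Slurm node.
# $1 = node name, $2 = address of the Ray head (ip:port)
add_worker() {
  run docker exec -d slurm-head \
    srun --job-name="ray-$1" --nodelist="$1" --nodes=1 --ntasks=1 \
    ray start --address="$2" --block
}

# Remove the worker again by cancelling its Slurm job by name.
remove_worker() {
  run docker exec slurm-head scancel --name="ray-$1"
}
```

Calling add_worker worker-1 <head-ip>:6379 followed later by remove_worker worker-1 mirrors steps 3 and 6 for a single node.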

You can tune the workload with environment variables:

TASKS=80 SECONDS_PER_TASK=30 OBSERVE_SECONDS=10 ./simulate-autoscale.sh

Defaults:

TASKS=40
SECONDS_PER_TASK=20
OBSERVE_SECONDS=8
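
A common way for a Bash script to honor overrides like these is default-value parameter expansion. The sketch below shows that pattern with the documented defaults. It is an assumption about the mechanism, not code copied from simulate-autoscale.sh, and load_settings and total_work are hypothetical names.

```shell
# Assumed pattern: take each tunable from the environment, falling back to the
# documented default when the variable is unset or empty.
load_settings() {
  TASKS="${TASKS:-40}"
  SECONDS_PER_TASK="${SECONDS_PER_TASK:-20}"
  OBSERVE_SECONDS="${OBSERVE_SECONDS:-8}"
  # Total task-seconds of work queued by the load generator.
  total_work=$((TASKS * SECONDS_PER_TASK))
}

load_settings
echo "tasks=$TASKS seconds_per_task=$SECONDS_PER_TASK observe=$OBSERVE_SECONDS total_work=${total_work}s"
```

With the defaults, the queued work is 40 x 20 = 800 task-seconds. How long that takes on the wall clock depends on how many tasks Ray runs in parallel.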

Useful Commands

Check Slurm nodes:

docker exec -it slurm-head sinfo -Nel

Check Slurm jobs:

docker exec -it slurm-head squeue

Check Ray status:

docker exec -it slurm-head ray status

Open a shell in the controller:

docker exec -it slurm-head bash

Stop and remove the local cluster:

docker compose down -v

If your system uses the standalone Compose binary:

docker-compose down -v

How It Works

compose.yaml creates three containers on a shared Docker network:

slurm-head: Slurm controller and Ray head node
worker-1: Slurm worker node
worker-2: Slurm worker node

run.sh rebuilds the image, starts the containers, waits for Slurm to register the nodes, starts Ray on slurm-head, and launches Ray workers with:

srun --job-name=ray-workers --nodes=2 --ntasks=2 ray start --address=<head-ip>:6379 --block

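--block keeps ray start running in the foreground instead of returning once the worker is launched. Each Ray worker process therefore stays tied to its Slurm allocation, which makes worker lifetime visible and controllable through Slurm: cancelling the job stops the worker.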
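
The step in which run.sh waits for Slurm to register the nodes can be pictured as a bounded retry loop around sinfo. The sketch below is an assumption about how such a wait might look, not the script's actual code. wait_until and nodes_idle are hypothetical helpers, and the container name slurm-head is taken from compose.yaml.

```shell
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds to sleep between attempts, rest = command
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds when at least $1 Slurm nodes report the idle state.
# Assumes the controller container is named slurm-head.
nodes_idle() {
  count="$(docker exec slurm-head sinfo -h -N -t idle -o '%N' 2>/dev/null | wc -l | tr -d ' ')"
  [ "$count" -ge "$1" ]
}
```

For example, wait_until 30 2 nodes_idle 2 polls every two seconds, for up to 30 attempts, until both workers are idle.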

Notes

The deterministic Munge key baked into the Docker image is only for this local simulation. Real clusters must use a randomly generated private key that is distributed securely to every node.
slurm.conf declares each worker with CPUs=10 and RealMemory=2048 (Slurm reads RealMemory in megabytes). Adjust those values if you change the simulation size or want different Ray capacity.
The project mounts ./shared-data to /data in every container, so Python workload files can be edited locally and run inside the cluster without rebuilding the image.
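
For the first note, a random Munge key for a real cluster can be created from /dev/urandom, which is the approach described in the MUNGE installation guide. The sketch below writes a 1024-byte key that only its owner can read. make_munge_key and the output path are hypothetical; a real installation normally keeps the key at /etc/munge/munge.key.

```shell
# Generate a random 1024-byte key, readable only by its owner.
# $1 = output path (hypothetical; real installs typically use /etc/munge/munge.key)
make_munge_key() {
  ( umask 077 && dd if=/dev/urandom of="$1" bs=1 count=1024 2>/dev/null ) \
    && chmod 400 "$1"
}
```

The same key file then has to be copied to every node over a secure channel before the Munge daemons are started.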

Troubleshooting

If Slurm nodes do not become ready, inspect node state:

docker exec -it slurm-head sinfo -Nel
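
When the full sinfo listing is hard to scan, the node states can be filtered down to the nodes that need attention. The function below is a small sketch of that filter. unhealthy_nodes is a hypothetical helper, and the set of states treated as healthy is an assumption.

```shell
# Read "name state" lines on stdin and print nodes that are not usable.
# A state with a trailing * (node not responding) does not match the healthy
# patterns, so it is reported as well.
unhealthy_nodes() {
  while read -r name state; do
    case "$state" in
      idle|allocated|mixed|completing) ;;
      "") ;;
      *) echo "$name $state" ;;
    esac
  done
}
```

It can be fed with docker exec slurm-head sinfo -h -N -o '%N %T' | unhealthy_nodes. For a node it reports, scontrol show node <name> shows the recorded reason, and scontrol update nodename=<name> state=resume returns a down node to service once the cause is fixed.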

If Ray workers do not join, check Ray status:

docker exec -it slurm-head ray status

If you want a clean restart, run:

docker compose down -v
./run.sh
