jake-ogle/ElasticTrain

ElasticTrain

ElasticTrain is a local Docker simulation of a small Slurm-managed cluster that runs Ray workloads. It is useful for experimenting with burst-style worker orchestration before moving the same ideas to a real HPC or cloud environment.

The project starts one Slurm controller container and two Slurm worker containers. Ray runs on the controller as the head node, while Ray worker processes are launched through Slurm allocations.

Requirements

  • Docker
  • Docker Compose, either as the docker compose plugin or the standalone docker-compose binary
  • A shell that can run the included Bash scripts

Project Layout

.
├── Dockerfile                 # Ubuntu image with Slurm, Munge, Ray, and Python dependencies
├── compose.yaml               # Local controller and worker containers
├── entrypoint.sh              # Starts Munge and the correct Slurm daemon per container role
├── run.sh                     # Builds and starts the cluster, then attaches Ray workers
├── simulate-autoscale.sh      # Demonstrates scaling Ray workers through Slurm jobs
├── slurm.conf                 # Slurm controller, worker, node, and partition config
└── shared-data/
    ├── autoscale_load.py      # Long-running Ray load generator
    └── elastic_task.py        # Simple Ray task placement example

Quick Start

Start the local cluster:

./run.sh

When the script completes, the Ray dashboard is available at:

http://localhost:8265

Run the simple Ray workload:

docker exec -it slurm-head python3 /data/elastic_task.py

You should see Ray cluster resources, task placement by container hostname, and completed task messages.

Autoscale Simulation

After starting the cluster with ./run.sh, run:

./simulate-autoscale.sh

This script:

  1. Cancels existing Ray worker Slurm jobs so the scale-up is visible.
  2. Starts a long-running Ray workload with no worker capacity attached.
  3. Adds worker-1 through a Slurm allocation.
  4. Adds worker-2 through a second Slurm allocation.
  5. Streams workload progress and final task placement.
  6. Cancels the worker jobs to scale back down.
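The sequence above can be sketched as a dry-run script. This is not the contents of simulate-autoscale.sh; it is a minimal illustration under stated assumptions. The per-worker job names (ray-worker-1, ray-worker-2), the HEAD_ADDR and EXECUTE variables, and the use of ray status as the observation step are all hypothetical. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run sketch of the scale-up/scale-down sequence.
# Commands are printed, not executed, unless EXECUTE=1 is set.

HEAD_ADDR="${HEAD_ADDR:-slurm-head:6379}"   # assumed Ray head address

run() {
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

add_worker() {
  # Attach one Ray worker on the named Slurm node; --block ties it to the job.
  run docker exec -d slurm-head srun --job-name="ray-$1" --nodelist="$1" \
    --nodes=1 --ntasks=1 ray start --address="$HEAD_ADDR" --block
}

autoscale_demo() {
  run docker exec slurm-head scancel --name=ray-workers          # 1. clear existing workers
  run docker exec -d slurm-head python3 /data/autoscale_load.py  # 2. start the workload
  add_worker worker-1                                            # 3. first allocation
  run sleep "${OBSERVE_SECONDS:-8}"
  add_worker worker-2                                            # 4. second allocation
  run sleep "${OBSERVE_SECONDS:-8}"
  run docker exec slurm-head ray status                          # 5. observe capacity
  run docker exec slurm-head scancel --name=ray-worker-1         # 6. scale back down
  run docker exec slurm-head scancel --name=ray-worker-2
}

autoscale_demo
```

Giving each worker its own Slurm job means one worker can be cancelled without touching the other, which is what makes the scale-down step selective.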

You can tune the workload with environment variables:

TASKS=80 SECONDS_PER_TASK=30 OBSERVE_SECONDS=10 ./simulate-autoscale.sh

Defaults:

  • TASKS=40
  • SECONDS_PER_TASK=20
  • OBSERVE_SECONDS=8
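A script can apply defaults like these with Bash parameter expansion. The sketch below is an assumption about the pattern, not a copy of simulate-autoscale.sh, and resolve_settings is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: use each variable from the environment if set,
# otherwise fall back to the documented default.
resolve_settings() {
  printf 'TASKS=%s SECONDS_PER_TASK=%s OBSERVE_SECONDS=%s\n' \
    "${TASKS:-40}" "${SECONDS_PER_TASK:-20}" "${OBSERVE_SECONDS:-8}"
}

resolve_settings
```

With this pattern, a variable set on the command line for a single invocation overrides the default without changing the shell environment afterwards.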

Useful Commands

Check Slurm nodes:

docker exec -it slurm-head sinfo -Nel

Check Slurm jobs:

docker exec -it slurm-head squeue

Check Ray status:

docker exec -it slurm-head ray status

Open a shell in the controller:

docker exec -it slurm-head bash

Stop and remove the local cluster:

docker compose down -v

If your system uses the standalone Compose binary:

docker-compose down -v
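If you script around both variants, a small helper can pick whichever Compose command is installed. This is a minimal sketch; compose_cmd is a hypothetical function name and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the available Compose command.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"          # Compose plugin
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"          # standalone binary
  else
    echo "Docker Compose not found" >&2
    return 1
  fi
}

# Example usage (left unquoted on purpose so "docker compose" splits into two words):
#   $(compose_cmd) down -v
```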

How It Works

compose.yaml creates three containers on a shared Docker network:

  • slurm-head: Slurm controller and Ray head node
  • worker-1: Slurm worker node
  • worker-2: Slurm worker node

run.sh rebuilds the image, starts the containers, waits for Slurm to register the nodes, starts Ray on slurm-head, and launches Ray workers with:

srun --job-name=ray-workers --nodes=2 --ntasks=2 ray start --address=<head-ip>:6379 --block

--block keeps ray start running in the foreground, so each Ray worker process lives exactly as long as its Slurm allocation. Worker lifetime is therefore visible in squeue and controllable with scancel.
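The "waits for Slurm to register the nodes" step could look roughly like the sketch below. It is an illustration, not the code in run.sh. The function names, the retry count, and the 2-second poll interval are assumptions, and it assumes each node appears in only one partition.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll sinfo until enough nodes report the "idle" state.

# Count nodes reported as idle in "name state" lines read from stdin.
count_idle_nodes() {
  awk '$2 == "idle" { n++ } END { print n + 0 }'
}

# wait_for_nodes EXPECTED [ATTEMPTS]: succeed once EXPECTED nodes are idle.
wait_for_nodes() {
  expected="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -h drops the header, -N lists one line per node, %N is the name, %T the state.
    ready="$(docker exec slurm-head sinfo -h -N -o '%N %T' 2>/dev/null | count_idle_nodes)"
    if [ "$ready" -ge "$expected" ]; then
      return 0
    fi
    sleep 2
    i=$((i + 1))
  done
  return 1
}

# Example usage:
#   wait_for_nodes 2 || echo "Slurm nodes did not become ready" >&2
```

Matching the state exactly against "idle" excludes nodes shown as "idle*", which Slurm uses for nodes that are not responding.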

Notes

  • The deterministic Munge key in the Docker image is only for this local simulation. Real clusters must use a private generated key distributed securely to every node.
  • slurm.conf declares each worker with CPUs=10 and RealMemory=2048 (RealMemory is in megabytes). Adjust those values if you change the simulation size or want different Ray capacity.
  • The project mounts ./shared-data to /data in every container, so Python workload files can be edited locally and run inside the cluster without rebuilding the image.
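For the Munge note above, a private key for a real cluster can be produced from random bytes. The sketch below follows the long-standing dd-based approach from the Munge installation guide. generate_munge_key and the output path are hypothetical, and distributing the key securely to every node is out of scope here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: write a 1024-byte random Munge key readable only by its owner.
generate_munge_key() {
  key_path="$1"
  # Restrictive umask so the file is never briefly readable by other users.
  ( umask 077 && dd if=/dev/urandom of="$key_path" bs=1 count=1024 2>/dev/null )
  chmod 400 "$key_path"
}

# Example usage:
#   generate_munge_key ./munge.key
```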

Troubleshooting

If Slurm nodes do not become ready, inspect node state:

docker exec -it slurm-head sinfo -Nel

If Ray workers do not join, check Ray status:

docker exec -it slurm-head ray status

If you want a clean restart, run:

docker compose down -v
./run.sh

About

Experimentation with elastic compute pools on Slurm
