## Table of Contents

- 1 Executive Summary
- 2 Core Design Principles
- 3 Non-Goals
- 4 System Overview
- 5 Hardware Architecture (Recommended)
  - 5.1 Goals (hardware)
  - 5.2 Option A: 2U total (recommended)
  - 5.3 Option B: 1U total (aggressive)
  - 5.4 Option C: "DGX Spark as Debug Box"
  - 5.5 Storage in colo
  - 5.6 Power Budget (planning)
  - 5.7 Space Budget (rack units)
  - 5.8 Colo notes
- 6 Network & Communication Diagram (Textual)
- 7 Storage Model
  - 7.1 Invariant
  - 7.2 Checkpoint Flow
  - 7.3 Artifacts & Logs
- 8 Unified Checkpointing (JAX Pytrees)
- 9 Repository Structure (Tech Spec)
- 10 Execution Model
  - 10.1 Job Lifecycle
  - 10.2 Backend Interface
- 11 DAG / "Ray-lite" Model
- 12 Example YAML Specifications
  - 12.1 Backend Inventory
  - 12.2 Storage
  - 12.3 Single Run Spec
  - 12.4 DAG Spec (Tokenize → Train → Rollouts)
- 13 Implementation Plan
  - 13.1 Phase 0 (2–3 weeks)
  - 13.2 Phase 1 (4–6 weeks)
  - 13.3 Phase 2 (3–4 weeks)
  - 13.4 Phase 3 (optional)
- 14 Cost Estimates
  - 14.1 One-Time Hardware (Target ~$50k)
  - 14.2 Ongoing
- 15 Risks & Mitigations
- 16 Success Criteria
- 17 Conclusion

After a conversation with an LM (https://chatgpt.com/share/697143db-c3e0-8000-b56c-07cf7ca43795), the following proposal was generated.

## 1 Executive Summary

Research labs consistently suffer from fragile, bespoke infrastructure that fails under preemption, heterogeneous clusters, and rapid iteration. This project proposes AdventureTime: a minimal, JAX-first training and experimentation infrastructure designed for small frontier research groups (~10 people) with access to heterogeneous compute.

The system prioritizes two non-negotiable guarantees:

1. If an experiment runs locally, it runs on any cluster by changing only the submit command.
2. No experiment ever loses progress; all workloads are preemption-safe and resumable.

AdventureTime achieves this by unifying all workloads (training, eval, tokenization, API rollouts, DAG workflows) under a single abstraction: checkpointable JAX pytrees, paired with a lightweight control-plane scheduler and an object-store-backed artifact system. The total infrastructure cost target is ~$50k, with modest ongoing operational costs.

## 2 Core Design Principles

- JAX-first (Flax, Optax, Orbax)
- All resumable state is a JAX pytree
- Restart is cheap; elasticity is optional
- No Kubernetes, no Ray, no containers
- UV for dependency management
- SSH + SLURM + object storage as the primary integration points
- Control plane orchestrates; workers are stateless

## 3 Non-Goals

- Enterprise multi-tenant auth
- Perfect elastic world-size training
- Replacing SLURM or cloud schedulers
- Building a general-purpose data lake
- Long-lived actors or services on workers

## 4 System Overview

AdventureTime consists of three layers:

1. Control Plane (colo-hosted)
2. Execution Backends (heterogeneous clusters)
3. Runtime Library + CLI (monorepo)

All interaction is mediated via job specs, checkpoint manifests, and object storage. A minimal sketch of the "everything is a pytree" principle follows.
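To make "all resumable state is a JAX pytree" concrete, here is a minimal sketch of a resumable training state. The `TrainState` fields and the `init_state` helper are illustrative only, not part of the spec; the point is that params, optimizer state, RNG, and step counter all live in one pytree that serializes and restores uniformly.

```python
# Illustrative sketch only: what "all resumable state is a JAX pytree"
# means in practice. TrainState fields and init_state are hypothetical.
import jax
import jax.numpy as jnp
import optax
from flax import struct


@struct.dataclass
class TrainState:
    # Every field is a pytree (or a leaf), so the whole object can be
    # saved/restored as one unit by the checkpoint layer.
    params: dict              # model parameters
    opt_state: optax.OptState # optimizer state
    rng: jax.Array            # PRNG key
    step: jnp.int32           # step counter


def init_state(rng: jax.Array) -> TrainState:
    params = {"w": jnp.zeros((4, 4)), "b": jnp.zeros((4,))}
    tx = optax.adamw(1e-3)
    return TrainState(
        params=params,
        opt_state=tx.init(params),
        rng=rng,
        step=jnp.int32(0),
    )


state = init_state(jax.random.PRNGKey(0))
# JAX sees the whole state as a single pytree:
print(jax.tree_util.tree_structure(state))
```

Because non-training workloads (rollouts, tokenization, DAG nodes) reduce to equally small pytrees, one checkpoint code path covers everything.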
## 5 Hardware Architecture (Recommended)

### 5.1 Goals (hardware)

- One always-on control node (scheduler + state + UI + adapters)
- One interactive "debug box" for SSH ingress and editing (e.g., Emacs/tmux)
- No GPU requirement in the colo; GPUs live in external clusters
- Small footprint: ~2U of rack total is acceptable; 1U is possible with tradeoffs

### 5.2 Option A: 2U total (recommended)

#### 5.2.1 1U Control Plane Server (always-on)

Role: scheduler/controller, DB, logging UI, artifact index, backend adapters

Specs (baseline):

- CPU: 16–32 cores (e.g., EPYC / Xeon)
- RAM: 128–256 GB
- Storage:
  - OS: mirrored SSD (e.g., 2×1TB)
  - Local scratch/cache: 2–8TB NVMe (single or mirrored; not canonical)
- Network: 10GbE preferred (1GbE workable)

Notes: this node should be stable, boring, and easy to replace.

#### 5.2.2 1U Debug / SSH Bastion (interactive)

Role: SSH endpoint, "human box," editor, small-scale local runs, diagnostics

Specs:

- CPU: 8–16 cores
- RAM: 64–128 GB
- Storage: 1–2TB SSD

Notes: can also host small services (docs preview, dashboards) if desired.

### 5.3 Option B: 1U total (aggressive)

- A single 1U machine runs everything (control + debug)
- Risk: interrupts, reboots, and maintenance hit both scheduling and your "human box"
- Acceptable only if you are okay with occasional coordination pauses

### 5.4 Option C: "DGX Spark as Debug Box"

If a DGX Spark is available, treat it as:

- The debug / interactive SSH box
- Not mandatory for control-plane correctness

The control plane remains a boring 1U server.

### 5.5 Storage in colo

Weka is optional. For this project's goals, treat Weka as a hot cache/staging layer, not a dependency. Canonical storage is S3-compatible object storage.

### 5.6 Power Budget (planning)

Exact draw depends on the chosen servers; below is a conservative sizing guide.

- Control plane 1U server: ~150–350W typical, ~500W peak
- Debug/bastion 1U server: ~100–250W typical, ~400W peak
- 10GbE switch (small): ~20–60W typical
- Total typical: ~300–660W
- Total peak (safe provision): ~900–1,200W

Power provisioning recommendation:

- Budget 1.2kW on the PDU for comfort
- A single 120V/15A circuit can be tight at peak; prefer 120V/20A or 208V if available

### 5.7 Space Budget (rack units)

- Option A: 2×1U servers + optional 1U switch = 2U–3U total
- Option B: 1U total + optional 1U switch = 1U–2U total
- Cabling: plan for front-to-back airflow and short DACs for 10GbE where possible

### 5.8 Colo notes

- Put the control plane on UPS-backed power (the colo's UPS, or your own small UPS if permitted)
- Maintain remote serial / out-of-band management (iDRAC/iLO) for recovery

## 6 Network & Communication Diagram (Textual)

```
[Dev Laptop]
  |
  | adventuretime run/submit (SSH/HTTPS)
  v
[Debug/Bastion Box]  (Emacs/tmux, human ingress)
  |
  | (SSH, internal)
  v
[Control Plane Server]  (scheduler, state, UI)
  |
  |-- SSH ----> [SLURM login A] --> sbatch --> compute nodes
  |-- SSH ----> [SLURM login B] --> sbatch --> compute nodes
  |-- SSH ----> [SLURM login C] --> sbatch --> compute nodes
  |
  |-- HTTPS --> [S3-compatible Object Store]
  |
  |-- LAN ----> [Optional Hot Cache (Weka/NAS)]
```

Workers never communicate with each other, or with the control plane beyond optional heartbeats.

## 7 Storage Model

### 7.1 Invariant

All resumable state is a JAX pytree plus a small JSON manifest.

### 7.2 Checkpoint Flow

```
[Worker Scratch Disk]
└── ckpt.tmp/
    ├── orbax blobs
    └── metadata.json
        |
        | upload blobs
        v
[S3://runs/<run_id>/ckpt/<step>/...]
        |
        | upload manifest.json LAST
        v
Checkpoint committed atomically
```

The presence of a manifest indicates a valid checkpoint. A sketch of this commit protocol follows.
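The manifest-last commit can be implemented with plain S3 primitives. The sketch below uses boto3 against the endpoint/bucket/prefix shown in Section 12.2; the helper names (`commit_checkpoint`, `latest_valid_step`) are illustrative, not the spec's API. All blobs are uploaded first and `manifest.json` last, so any reader that finds a manifest can trust the blobs it lists, and a crash mid-upload leaves only ignorable partial blobs.

```python
# Illustrative sketch of the manifest-last commit protocol using boto3.
# Helper names are assumptions; bucket layout matches Section 12.2.
import json
import pathlib

import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.com")
BUCKET = "adventuretime"


def commit_checkpoint(local_dir: pathlib.Path, run_id: str, step: int) -> str:
    prefix = f"runs/{run_id}/ckpt/{step}"
    blobs = []

    # 1) Upload every blob first (orbax blobs, metadata.json, ...).
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue
        key = f"{prefix}/{path.relative_to(local_dir)}"
        s3.upload_file(str(path), BUCKET, key)
        blobs.append({"key": key, "size": path.stat().st_size})

    # 2) Upload manifest.json LAST; its presence marks the checkpoint valid.
    manifest = {"run_id": run_id, "step": step, "blobs": blobs}
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{prefix}/manifest.json",
        Body=json.dumps(manifest).encode(),
    )
    return prefix


def latest_valid_step(run_id: str) -> int | None:
    # A checkpoint counts only if its manifest exists.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"runs/{run_id}/ckpt/")
    steps = [
        int(obj["Key"].split("/")[-2])
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith("manifest.json")
    ]
    return max(steps, default=None)
```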
### 7.3 Artifacts & Logs

- Artifacts (JSONL, images, tables) are written in parts
- Each part is immutable
- A manifest tracks completion
- The same mechanism covers API rollouts and training

## 8 Unified Checkpointing (JAX Pytrees)

All workloads checkpoint a pytree.

Training:

- model params
- optimizer state
- RNG state
- step counters

API rollouts / DAG nodes:

- cursor / index
- RNG seed
- cached responses (optional)
- progress metadata

CheckpointManager API:

- save(pytree, step) -> CheckpointRef
- latest() -> Optional[CheckpointRef]
- restore(target_pytree) -> pytree

Orbax is used under the hood; transport is abstracted. (A sketch of this interface appears after Section 11.)

## 9 Repository Structure (Tech Spec)

```
monorepo/
  adventuretime/
    cli/
      main.py          ; run/submit/status/logs
    core/
      env.py           ; RunEnv
      spec.py          ; JobSpec, DAGSpec, ResourceSpec
      registry.py      ; experiment discovery
      heartbeat.py     ; liveness + preemption hooks
    ckpt/
      pytree.py        ; pytree API
      orbax.py         ; orbax adapters
      manifest.py      ; atomic checkpoint manifests
      transport.py     ; local <-> object store
    io/
      datasets.py      ; dataset refs + caching
      artifacts.py     ; artifact refs
      cache.py         ; scratch cache mgmt
    log/
      events.py        ; structured metrics
      sink_wandb.py    ; optional wandb sink
      sink_selfhost.py ; self-host UI client
    backends/
      base.py          ; Backend interface
      slurm.py         ; sbatch emitter + watcher
      ssh.py           ; direct SSH executor
      gcp.py           ; cloud fallback
    sched/
      controller.py    ; reconcile loop
      planner.py       ; backend selection
      state.py         ; sqlite/pg run state
      queue.py         ; DAG execution
    dag/
      model.py         ; Node, Edge, Resources
      exec.py          ; node runner
  experiments/
    <exp>.py           ; returns Job or DAG
  lego/
    datasets/
    layers/
    models/
    optimizers/
  configs/
    backends.yaml
    storage.yaml
    clusters/
      <cluster>.yaml
```

## 10 Execution Model

### 10.1 Job Lifecycle

- Each run has a stable run_id
- Each submission attempt increments attempt_id
- The scheduler reconciles desired vs. observed state
- On failure or preemption:
  - find the latest checkpoint
  - resubmit on the next viable backend

### 10.2 Backend Interface

Each backend implements:

- submit(JobSpec) -> JobHandle
- poll(JobHandle) -> state
- cancel(JobHandle)
- tail_logs(JobHandle)

The SLURM adapter parses exit codes and reasons to detect preemption.

## 11 DAG / "Ray-lite" Model

- DAG nodes are jobs, not actors
- Nodes request resources
- Nodes checkpoint state
- Retries happen at node granularity

Within-node parallelism uses:

- JAX multihost
- Python multiprocessing

This avoids Ray's complexity while retaining fault tolerance. (A sketch of a checkpointable DAG node appears below.)
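The CheckpointManager API from Section 8 can be pinned down as a structural type. A minimal sketch follows; the spec fixes only the three method signatures, so the `CheckpointRef` fields shown here are an assumption for illustration.

```python
# Minimal sketch of the Section 8 CheckpointManager API as a Protocol.
# CheckpointRef's fields are assumptions; only the methods are specified.
from dataclasses import dataclass
from typing import Any, Optional, Protocol

PyTree = Any  # any JAX pytree


@dataclass(frozen=True)
class CheckpointRef:
    run_id: str
    step: int
    uri: str  # e.g., s3://runs/<run_id>/ckpt/<step>/


class CheckpointManager(Protocol):
    def save(self, pytree: PyTree, step: int) -> CheckpointRef:
        """Upload blobs, then commit the manifest last (atomic)."""
        ...

    def latest(self) -> Optional[CheckpointRef]:
        """Return the newest checkpoint whose manifest exists, if any."""
        ...

    def restore(self, target_pytree: PyTree) -> PyTree:
        """Restore the latest checkpoint into target_pytree's structure."""
        ...
```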
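To show that a DAG node is a job that checkpoints state rather than a long-lived actor, here is a hypothetical API-rollout node whose resumable state is just a cursor plus an RNG seed. Every name here (`run_env` and its methods, `ckpt`, `call_api`) is an assumption for illustration, not part of the spec.

```python
# Hypothetical API-rollout node: the same pytree checkpointing training
# uses, applied to a cursor + seed. run_env/ckpt/call_api are illustrative.
import jax.numpy as jnp


def node(run_env, ckpt, prompts, call_api):
    # Resumable state is a tiny pytree: cursor + RNG seed.
    state = {"cursor": jnp.int32(0), "seed": jnp.int32(run_env.seed)}
    if ckpt.latest() is not None:
        state = ckpt.restore(state)  # resume exactly where we stopped

    for i in range(int(state["cursor"]), len(prompts)):
        response = call_api(prompts[i], seed=int(state["seed"]) + i)
        # Artifact parts are immutable; a manifest tracks completion (7.3).
        run_env.write_artifact_part(f"rollouts/{i:06d}.jsonl", response)

        state["cursor"] = jnp.int32(i + 1)
        if (i + 1) % 100 == 0 or run_env.preemption_requested():
            ckpt.save(state, step=i + 1)  # node-granularity retry point

    ckpt.save(state, step=len(prompts))
```

On preemption, the scheduler resubmits the node on the next viable backend; the node restores its cursor and continues, so retries never redo completed API calls.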
## 12 Example YAML Specifications

### 12.1 Backend Inventory

```yaml
backends:
  slurm_a:
    type: slurm
    ssh_host: login-a.example.edu
    ssh_user: houjun
    sbatch_defaults:
      partition: gpu
      time: "24:00:00"
  slurm_b:
    type: slurm
    ssh_host: login-b.example.org
    ssh_user: houjun
    sbatch_defaults:
      partition: preempt
      time: "12:00:00"
  gcp:
    type: gcp
    project: myproj
    region: us-central2
```

### 12.2 Storage

```yaml
storage:
  object:
    type: s3
    endpoint: "https://s3.example.com"
    bucket: "adventuretime"
    prefix: "runs"
  hot_cache:
    type: weka
    mount: "/mnt/weka"
    enabled: true
```

### 12.3 Single Run Spec

```yaml
run:
  id: "fork-mid-2026-01-21-001"
  experiment: "experiments/fork_mid.py:build"
  resources:
    gpus: 8
    gpu_type: "H100|A100|any"
  policy:
    preemptible: true
    checkpoint_interval_sec: 120
  backend_selector:
    order: ["slurm_a", "slurm_b", "gcp"]
```

### 12.4 DAG Spec (Tokenize → Train → Rollouts)

```yaml
dag:
  id: "ragdoll-2026-01-21"
  nodes:
    - id: tokenize
      entry: "experiments/tokenize.py:node"
      resources: { cpus: 32 }
    - id: train
      entry: "experiments/fork_mid.py:build"
      needs: [tokenize]
      resources: { gpus: 16 }
    - id: rollouts
      entry: "experiments/ragdoll_api.py:node"
      needs: [train]
      resources: { cpus: 16 }
```

## 13 Implementation Plan

### 13.1 Phase 0 (2–3 weeks)

- CLI skeleton
- RunEnv
- Single SLURM backend
- Pytree checkpoint manager (local + S3)

### 13.2 Phase 1 (4–6 weeks)

- Scheduler reconcile loop
- Multi-backend failover
- Preemption detection
- Unified logging

### 13.3 Phase 2 (3–4 weeks)

- DAG execution
- API rollout support
- Dataset + artifact caching

### 13.4 Phase 3 (optional)

- Shrink-only topology changes
- Smarter backend planning
- UI polish

## 14 Cost Estimates

### 14.1 One-Time Hardware (Target ~$50k)

| Item                               | Cost (USD)  |
|------------------------------------|-------------|
| 1U Control Plane server            | 6–12k       |
| 1U Debug/SSH Bastion (interactive) | 3–8k        |
| 10GbE switch + DACs                | 0.5–2k      |
| Colo + networking (1 yr)           | 5–10k       |
| Optional hot cache (NAS/Weka-like) | 3–12k       |
| Buffer / spares / rails / misc     | 2–5k        |
| **Total**                          | **~20–50k** |

Notes:

- This budget intentionally does NOT include GPUs.
- If you already have colo networking/space, the bottom end is realistic.
- If you actually deploy Weka proper, it can push you toward the top end.

### 14.2 Ongoing

- Object storage: low to moderate (checkpoints + artifacts; depends on retention)
- Colo: recurring monthly fee (varies widely)
- Maintenance: ~0.25–0.5 FTE of systems effort

## 15 Risks & Mitigations

- Heterogeneous cluster quirks → adapter isolation + retry semantics
- Checkpoint corruption → manifest-based atomic commits
- Scheduler complexity → narrow scope; no intra-cluster scheduling
- Research velocity slowdown → local-first workflow preserved

## 16 Success Criteria

- Local debug → cluster run requires no code changes
- Preempted jobs resume automatically
- No lost experiments over 3+ months
- Researchers add lego modules without infra changes

## 17 Conclusion

AdventureTime is intentionally narrow, opinionated infrastructure that trades breadth for reliability. By unifying all workloads under JAX pytree checkpointing and delegating scheduling to a lightweight control plane, it provides frontier-grade robustness at a cost and complexity appropriate for small research labs.