
Chapter 9: Your First Experiment


Let's Do This For Real

You've built the entire system. SSH keys configured. Tmux sessions persisting. Proxy tunnels forwarding. Claude Code installed, authenticated, and running inside a persistent session. CLAUDE.md files teaching it about your servers, your GPUs, your conda environments, your research conventions. Hooks firing automatically. Watchdog monitoring GPU utilization. Cron jobs reading health summaries. The research workflow defined from idea to paper.

Eight chapters of infrastructure. Now let's use it.

We're going to run a real experiment end-to-end. Not a toy example. Not a hypothetical walkthrough. A real training run on a real GPU, with real metrics logged to WandB, monitored by the system you built. You'll watch the whole thing from your phone.

The task: fine-tune DistilBERT on SST-2 sentiment classification using LoRA. It's small enough to finish in under 30 minutes on a single GPU. But it's complex enough to exercise every piece of the pipeline — code generation, environment setup, data downloading, training, monitoring, failure recovery, and results delivery.

By the end of this chapter, you'll have run your first AI-managed experiment. And you'll understand, viscerally, what it feels like when the system works.


Step 1: Define the Task

Before we touch the keyboard, let's be precise about what we're doing.

What: Fine-tune a pretrained language model for binary sentiment classification.

Model: DistilBERT — a smaller, faster version of BERT. It has 66 million parameters, about 40% fewer than BERT-base, but retains 97% of its performance. Small enough to train quickly, large enough to be a real model.

Dataset: SST-2 (Stanford Sentiment Treebank, binary classification). Movie review sentences labeled as positive or negative. About 67,000 training examples and 872 validation examples. It's part of the GLUE benchmark and has been the standard sentiment classification dataset for years.

Method: LoRA (Low-Rank Adaptation). Instead of fine-tuning all 66 million parameters, LoRA freezes the pretrained weights and injects small trainable matrices into the attention layers. This reduces the number of trainable parameters from 66 million to roughly 300,000 — less than 0.5% of the original model. Training is faster, memory usage is lower, and the results are nearly as good as full fine-tuning.
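The parameter arithmetic is easy to check. For a frozen d_out × d_in weight matrix, LoRA adds two small matrices — A of shape (r × d_in) and B of shape (d_out × r) — so each adapted layer gains r·(d_in + d_out) trainable parameters. A quick sketch (the 768-dimensional projections and r=8 match this chapter's run; the "query and value in every block" targeting is a common default, and the stated 296,450 total also includes a small classification head):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one frozen d_out x d_in linear layer:
    A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# One 768x768 attention projection with rank r=8:
per_layer = lora_param_count(768, 768, 8)
print(per_layer)       # 12288

# Query and value projections in each of DistilBERT's 6 blocks:
print(12 * per_layer)  # 147456 -- plus the classification head on top
```

Either way you count it, the adapters are a rounding error next to the 66 million frozen weights.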

Why this task: Three reasons. First, it's fast — 3 epochs on SST-2 with LoRA takes about 15-25 minutes on a single 4090. You won't be waiting hours to see results. Second, it's easy to evaluate — accuracy on a binary classification task is unambiguous. You either got it right or you didn't. Third, it touches every part of the pipeline: downloading a model from HuggingFace, downloading a dataset, configuring training hyperparameters, logging to WandB, saving checkpoints, and evaluating on a held-out set. If the system can handle this, it can handle anything.


Step 2: Tell CC What to Do

Open your phone. Launch Termius. SSH into your local machine. Attach to the tmux session where Claude Code is running:

bash
tmux attach -t claude

Now give Claude Code the instruction. Here's the exact prompt:

Create a project to fine-tune DistilBERT on SST-2 using LoRA.
- Use HuggingFace transformers + peft
- Log everything to WandB (project name: "vibe-research-demo")
- Training: 3 epochs, batch size 16, learning rate 2e-4
- Evaluate on the validation set after training
- Save the best checkpoint

That's it. Five lines. You typed this on your phone in about 30 seconds.

Now watch Claude Code work. It doesn't ask clarifying questions — the instruction is specific enough. It starts generating the project structure:

~/Claude/Research/sst2-lora-demo/
├── CLAUDE.md          # Project dashboard and constraints
├── train.py           # Training script
├── requirements.txt   # Dependencies
└── tools/
    └── watchdog.py    # Monitoring script (copied from template)

It writes train.py — a complete training script with HuggingFace's Trainer API, PEFT's LoRA configuration, WandB integration, dataset loading, tokenization, evaluation metrics, and checkpoint saving. Not a skeleton. Not a template with TODOs. A working script.
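The heart of that script is the LoRA wiring. A sketch of what it plausibly looks like with peft — not the exact code Claude Code generates; the target_modules names are DistilBERT's attention projection layers, and the hyperparameters follow the prompt:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT's query/value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # a fraction of a percent trainable
```

Everything else in the script — tokenization, the Trainer loop, WandB callbacks — is standard HuggingFace plumbing around this core.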

It writes requirements.txt: transformers, peft, datasets, accelerate, wandb, scikit-learn, pinned to compatible versions.

It writes the project CLAUDE.md — Pipeline Status set to implementation, the idea described, training constraints noted.

All of this happens in about 60-90 seconds. You watch the code scroll by on your phone screen. You don't need to read every line. You're checking that it's doing the right thing at a high level: yes, it's using LoRA; yes, it's logging to WandB; yes, it's evaluating on the validation set. The details are Claude Code's job.


Step 3: Sync and Setup

Claude Code doesn't stop after writing code. It knows the next step: get the code to the server and set up the environment.

First, it checks which server has free GPUs:

bash
ssh b2 "nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv,noheader"
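The selection logic amounts to parsing that CSV output and ranking by free memory — a minimal sketch (the sample strings and the "most free memory" heuristic are illustrative):

```python
def pick_gpu(csv_output: str) -> int:
    """Return the index of the GPU with the most free memory, given nvidia-smi
    CSV: 'index, memory.used [MiB], memory.total [MiB], utilization.gpu [%]'."""
    best_idx, best_free = -1, -1
    for line in csv_output.strip().splitlines():
        idx, used, total, _util = [f.strip().split()[0] for f in line.split(",")]
        free = int(total) - int(used)
        if free > best_free:
            best_idx, best_free = int(idx), free
    return best_idx

sample = "0, 21000 MiB, 24576 MiB, 97 %\n1, 800 MiB, 24576 MiB, 3 %"
print(pick_gpu(sample))  # 1
```

The same ranking works across servers: run the query on each host, keep the host and index with the most headroom.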

It picks the server with the most free GPUs. Let's say it's b2. It creates the remote directory and syncs the code:

bash
ssh b2 "mkdir -p /mnt/bit/Jingxuan/Research/sst2-lora-demo"
rsync -avz --delete --filter=':- .gitignore' \
  --exclude='.git/' --exclude='wandb/' --exclude='outputs/' \
  ~/Claude/Research/sst2-lora-demo/ b2:/mnt/bit/Jingxuan/Research/sst2-lora-demo/

Then it SSHes in, sets up the conda environment, and installs dependencies:

bash
ssh b2 "cd /mnt/bit/Jingxuan/Research/sst2-lora-demo && \
  /home/hqyy/anaconda3/bin/conda create -n sst2-demo python=3.10 -y && \
  /home/hqyy/anaconda3/envs/sst2-demo/bin/pip install -r requirements.txt"

This takes a few minutes — conda resolving dependencies, pip downloading packages. Claude Code waits for it to finish. You don't need to watch this part. Check your phone again in 5 minutes, or don't. The system doesn't need you.

Once the environment is ready, Claude Code verifies that the model and dataset can be downloaded. If the server needs the proxy (which you set up in Chapter 4), it's already available through the SSH tunnel. Claude Code knows this — you told it in your CLAUDE.md.


Step 4: Training

Here's where the training actually starts. But Claude Code doesn't just throw the script into tmux and walk away. It follows the protocol you defined in your CLAUDE.md: new code gets a foreground smoke test first.

Claude Code runs the training script directly over SSH, watching the output in real time:

bash
ssh b2 "cd /mnt/bit/Jingxuan/Research/sst2-lora-demo && \
  /home/hqyy/anaconda3/envs/sst2-demo/bin/python train.py"

It watches the first few training steps scroll by:

[2026-03-23 14:32:15] Loading model: distilbert-base-uncased
[2026-03-23 14:32:18] Loading dataset: sst2
[2026-03-23 14:32:20] Applying LoRA config: r=8, alpha=16, dropout=0.1
[2026-03-23 14:32:20] Trainable parameters: 296,450 / 66,955,010 (0.44%)
[2026-03-23 14:32:21] WandB run initialized: vibe-research-demo/run-abc123
[2026-03-23 14:32:22] Epoch 1/3 | Step 10/12564 | Loss: 0.693 | LR: 2.00e-04
[2026-03-23 14:32:24] Epoch 1/3 | Step 20/12564 | Loss: 0.641 | LR: 1.99e-04
[2026-03-23 14:32:26] Epoch 1/3 | Step 30/12564 | Loss: 0.578 | LR: 1.99e-04

Loss is decreasing. No errors. No NaN. No CUDA OOM. The smoke test passes.

Claude Code kills the foreground process and restarts it properly inside tmux, with all the environment variables set:

bash
ssh b2 "tmux new -d -s sst2-train ' \
  export WANDB_API_KEY=wandb_v1_... && \
  export http_proxy=http://127.0.0.1:10808 && \
  export https_proxy=http://127.0.0.1:10808 && \
  cd /mnt/bit/Jingxuan/Research/sst2-lora-demo && \
  /home/hqyy/anaconda3/envs/sst2-demo/bin/python train.py \
'"

The training is now running in a persistent tmux session. It will survive SSH disconnections, terminal closures, network hiccups. Nothing short of a server reboot will stop it.

Claude Code starts the watchdog and sets up cron monitoring — all automatically, triggered by the PostToolUse hook you configured in Chapter 7. You didn't ask it to do this. The system does it because you told it to, once, in the configuration.


Step 5: Monitor From Your Phone

The training is running. Claude Code is monitoring it. Now let's see what you see.

Option 1: Raw terminal output. Open Termius on your phone. SSH into b2 directly. Attach to the training session:

bash
ssh b2
tmux attach -t sst2-train

You see the live training output scrolling by. Loss values, learning rate, steps per second. This is the most direct view — exactly what you'd see if you were sitting in front of the server.

Detach with Ctrl-B d when you're done looking. The training keeps running.

Option 2: WandB dashboard. Open your phone's browser. Go to wandb.ai. Navigate to the vibe-research-demo project. You see:

  • A loss curve that's steadily decreasing
  • Learning rate schedule showing the warmup and decay
  • GPU utilization hovering around 85-95%
  • Memory usage well within the 24GB limit
  • Estimated time to completion

WandB is the more useful view for quick checks. You can see at a glance whether things are going well. The loss curve tells you more than a wall of text.

Option 3: Ask Claude Code. From your phone, attach to the Claude Code tmux session on your local machine and just ask:

What's the status of the SST-2 training?

Claude Code checks the watchdog summary, reads the latest training metrics, and gives you a one-line answer:

Training on b2, epoch 2/3, step 8400/12564, loss 0.312,
GPU util 91%. ETA ~8 minutes. No issues.

Three ways to check. All from your phone. Pick whichever feels right in the moment.

Now put the phone down. Go make dinner. Walk the dog. The system is working.


Step 6: Something Goes Wrong

Let's talk about failure. Not hypothetical failure — the kind of failure that happens in every real training run sooner or later.

Here's the scenario. You're making dinner. The training has been running for about 15 minutes. Everything looked fine the last time you checked. But something has gone wrong.

The learning rate of 2e-4, which worked fine for the first epoch, causes instability in the second epoch as the loss landscape sharpens. The gradients spike. The loss goes to NaN. The training script catches it and exits with an error:

RuntimeError: Loss is NaN at step 5230. Training terminated.

The tmux session is still alive, but the Python process is dead. The GPUs are idle.
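A guard like the one that raised that error is a few lines inside the training loop. A sketch of the pattern — the check_loss helper is illustrative, not the generated script's actual code:

```python
import math

def check_loss(loss: float, step: int) -> float:
    """Fail fast on numerical blow-ups instead of training on garbage."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(f"Loss is NaN at step {step}. Training terminated.")
    return loss

check_loss(0.312, 5229)             # fine, returns the loss
try:
    check_loss(float("nan"), 5230)
except RuntimeError as e:
    print(e)                        # Loss is NaN at step 5230. Training terminated.
```

Failing fast matters here: a process that dies loudly is exactly what lets the watchdog notice something is wrong.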

Here's what happens next — without you doing anything.

11:47:23 PM — The watchdog script, which checks GPU utilization every 30 seconds, detects that all GPUs assigned to the sst2-train session are at 0% utilization. It writes a status update:

json
{"task": "sst2-train", "status": "DEAD", "reason": "GPU util 0% for >60s",
 "last_active": "2026-03-23T23:46:51"}

The summary file updates: sst2-train: DEAD (GPU idle >60s)
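The watchdog's core decision is simple — a minimal sketch of the idle-detection logic, with function and field names that are illustrative rather than the exact Chapter 7 script:

```python
import json, time

IDLE_THRESHOLD_S = 60  # declare DEAD after this long at 0% utilization

def task_status(task, util_samples, now):
    """util_samples: (unix_timestamp, gpu_util_percent) pairs, newest last."""
    last_active = max((t for t, u in util_samples if u > 0), default=None)
    if last_active is None or now - last_active > IDLE_THRESHOLD_S:
        return {"task": task, "status": "DEAD",
                "reason": f"GPU util 0% for >{IDLE_THRESHOLD_S}s"}
    return {"task": task, "status": "RUNNING"}

now = time.time()
samples = [(now - 90, 91), (now - 60, 0), (now - 30, 0), (now, 0)]
print(json.dumps(task_status("sst2-train", samples, now)))
# {"task": "sst2-train", "status": "DEAD", "reason": "GPU util 0% for >60s"}
```

The real script also records the last-active timestamp and rewrites the summary file, but the DEAD/RUNNING decision is this one comparison.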

11:48:00 PM — Claude Code's cron job fires. It reads summary.txt. It sees DEAD. The cron job triggers Claude Code to investigate.

11:48:05 PM — Claude Code SSHes into b2 and reads the training output:

bash
ssh b2 "tmux capture-pane -t sst2-train -p | tail -20"

It sees the NaN loss error. It reads the full traceback. It identifies the problem: the loss went to NaN at step 5230, which is partway through epoch 2. The learning rate at that point was still relatively high.

11:48:15 PM — Claude Code diagnoses the root cause. Learning rate 2e-4 is too aggressive for LoRA fine-tuning on this model when the loss landscape becomes sharper in later training steps. The standard fix: reduce the learning rate. It edits train.py on the local machine, changing the learning rate from 2e-4 to 5e-5. It also adds a cosine learning rate scheduler with warmup to prevent this from happening again.
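The scheduler it adds is a few lines of math: linear warmup to the peak rate, then cosine decay to zero. A sketch — the warmup length is an illustrative choice, not the generated script's value:

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup_steps=600):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 12564
print(f"{lr_at(0, total):.2e}")      # 0.00e+00 (start of warmup)
print(f"{lr_at(600, total):.2e}")    # 5.00e-05 (peak, end of warmup)
print(f"{lr_at(total, total):.2e}")  # 0.00e+00 (fully decayed)
```

Warmup keeps the earliest, noisiest gradient steps small; the cosine tail shrinks the rate exactly when the loss landscape sharpens — the regime that killed the first run.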

11:48:30 PM — Claude Code syncs the updated code to b2:

bash
rsync -avz --delete --filter=':- .gitignore' \
  --exclude='.git/' --exclude='wandb/' --exclude='outputs/' \
  ~/Claude/Research/sst2-lora-demo/ b2:/mnt/bit/Jingxuan/Research/sst2-lora-demo/

11:48:40 PM — Claude Code kills the dead tmux session and starts a fresh one:

bash
ssh b2 "tmux kill-session -t sst2-train"
ssh b2 "tmux new -d -s sst2-train ' \
  export WANDB_API_KEY=wandb_v1_... && \
  export http_proxy=http://127.0.0.1:10808 && \
  export https_proxy=http://127.0.0.1:10808 && \
  cd /mnt/bit/Jingxuan/Research/sst2-lora-demo && \
  /home/hqyy/anaconda3/envs/sst2-demo/bin/python train.py \
'"

11:48:50 PM — Claude Code watches the first few steps of the restarted training to confirm the fix works:

Step 10/12564 | Loss: 0.687 | LR: 5.00e-05
Step 20/12564 | Loss: 0.654 | LR: 5.00e-05
Step 30/12564 | Loss: 0.612 | LR: 5.00e-05

Loss is decreasing. No NaN. The fix worked.

11:49:00 PM — Claude Code logs the incident to findings.md:

markdown
## [2026-03-23] NaN loss during SST-2 LoRA training
- LR 2e-4 caused NaN at step 5230 (epoch 2)
- Root cause: LR too aggressive for LoRA on DistilBERT
- Fix: reduced to 5e-5, added cosine scheduler with warmup
- Training restarted successfully

It updates the project CLAUDE.md with the new training status. It updates WandB with a note on the run.

Total time from crash to recovery: 97 seconds.

You were making dinner. You didn't see any of this. The next time you check your phone — maybe before bed, maybe the next morning — you'll see a note from Claude Code:

Training crashed at 11:47pm (NaN loss, LR too high).
Fixed: reduced LR from 2e-4 to 5e-5, added cosine scheduler.
Restarted at 11:49pm. Currently epoch 2/3, loss 0.34, running normally.

That's it. A crash that would have cost you 8 hours of idle GPU time — the kind of crash described in Chapter 1 — was detected, diagnosed, and fixed in under two minutes. While you were doing something else entirely.

This is the whole point. Not that failures don't happen. They always happen. The point is that failures get handled.


Step 7: Results

The training finishes around 12:15 AM. You're asleep. That's fine.

Claude Code detects the training completion — the watchdog sees the process exit cleanly, and the final log line says Training completed successfully. Claude Code reads the final metrics:

Best validation accuracy: 91.3% (epoch 2, step 8400)
Final validation accuracy: 90.8% (epoch 3, step 12564)
Best checkpoint saved to: outputs/checkpoint-best/
WandB run: https://wandb.ai/your-username/vibe-research-demo/runs/abc123

91.3% accuracy on SST-2 with LoRA. That's in line with expectations — full fine-tuning of DistilBERT typically gets 91-92%, and LoRA gets within a percentage point of that. The model learned to classify sentiment correctly on over 91% of unseen movie review sentences. With 0.44% of the parameters trainable. In under 30 minutes.

Claude Code rsyncs the results back to your local machine:

bash
rsync -avz --exclude='checkpoint-*/' --exclude='*.safetensors' --exclude='*.bin' \
  b2:/mnt/bit/Jingxuan/Research/sst2-lora-demo/outputs/ \
  ~/Claude/Research/sst2-lora-demo/outputs/

It pulls the training logs, the evaluation metrics, the WandB artifacts — everything except the model weights themselves, which stay on the server.

It updates Experiment.md:

markdown
## SST-2 LoRA Fine-tuning (2026-03-23)

**Task:** Binary sentiment classification on SST-2
**Model:** distilbert-base-uncased + LoRA (r=8, alpha=16)
**Trainable params:** 296,450 / 66,955,010 (0.44%)
**Training:** 3 epochs, batch 16, lr 5e-5 (cosine w/ warmup)
**Best accuracy:** 91.3% (epoch 2)
**Final accuracy:** 90.8% (epoch 3)
**WandB:** vibe-research-demo/run-abc123
**Server:** b2, GPU 0, ~25 min total training time
**Note:** Initial lr 2e-4 caused NaN at step 5230;
reduced to 5e-5 with cosine scheduler.

It updates the project CLAUDE.md Pipeline Status:

yaml
stage: training
idea: "LoRA fine-tune DistilBERT on SST-2"
training_status: completed
next: results delivered, experiment complete

You wake up the next morning. You check your phone. Everything is there: the accuracy numbers, the WandB link, the incident report about the NaN fix, the complete experiment record. You didn't write any code. You didn't set up any environments. You didn't stay up monitoring the training. You didn't even know about the crash until you read the summary.


The Payoff

Let's step back and look at what just happened.

You typed five lines into your phone. Claude Code did everything else:

  1. Wrote the code. A complete training script with model loading, LoRA configuration, dataset preprocessing, WandB logging, evaluation, and checkpoint management.

  2. Set up the server. Picked the right machine, created the conda environment, installed all dependencies, verified the model and dataset were accessible.

  3. Managed the training. Ran a foreground smoke test, then launched the real training in tmux with proper environment variables and monitoring.

  4. Detected and fixed a failure. When the learning rate caused NaN loss, the watchdog detected it, Claude Code diagnosed the root cause, applied a fix, and restarted training — all in under two minutes, while you were making dinner.

  5. Delivered the results. Synced metrics back to your local machine, wrote the experiment report, logged everything to WandB, updated the project dashboard.

You did this from your phone. From your couch. While making dinner and sleeping. Your GPU was never idle for more than two minutes. You never cancelled plans. You never stayed up late watching a loss curve.

This is what eight chapters of infrastructure buys you. Not a faster way to do the same old workflow — a fundamentally different relationship with your compute. The GPU works for you. The AI manages the GPU. You manage the AI. From wherever you are, whenever you want, with whatever device is in your hand.

You'll never babysit a GPU again.


What's Next

This was a simple task. One model, one dataset, one GPU, 30 minutes. The real power of the system you've built shows up when the experiments get bigger and the stakes get higher.

Multi-GPU training across multiple servers. Your CLAUDE.md tells Claude Code about all your servers — their GPUs, their memory, their available datasets. When you need to train a larger model, Claude Code picks the right server (or multiple servers), configures distributed training, and manages the runs across machines. You still type five lines on your phone.

Overnight ablation studies. You want to test 12 hyperparameter configurations. Claude Code launches them across available GPUs, monitors all of them simultaneously, kills the ones that are clearly failing, and presents you with a comparison table in the morning. You sleep through the whole thing.

Paper-driven research. You point Claude Code at a paper and say "reproduce the main result." It reads the paper, identifies the model architecture, finds the dataset, writes the training code, and runs the experiment. When the results don't match, it debugs. When they do match, it moves on to your proposed improvement.

Complete research pipelines. The research workflow from Chapter 8 — idea discovery, implementation, training, paper writing — runs end-to-end with Claude Code orchestrating every stage. You provide the research direction and make the key decisions. Claude Code handles the execution.

The infrastructure you built in eight chapters supports all of this. The SSH tunnels, the tmux sessions, the proxy configuration, the CLAUDE.md files, the hooks, the watchdog, the monitoring — none of that changes. You scale up from here by giving Claude Code bigger tasks, not by rebuilding the system.

Start with something real. Pick a paper you've been meaning to reproduce. Pick a dataset you've been curious about. Open your phone, tell Claude Code what to do, and see what happens.

The system is ready. Use it.


Checkpoint

Training completed. Metrics logged to WandB. You checked it from your phone. The system fixed a bug while you were away. Congratulations — you've built an automated research workflow. You'll never babysit a GPU again.
