Chapter 1: Philosophy
The bottleneck is not AI capability — it's your tooling and workflow. We have models that can write code, debug errors, and reason about complex systems. Yet most researchers still SSH into a server, launch a script, and pray. The gap between what AI can do and what researchers actually use it for is enormous. This guide exists to close that gap.
— Inspired by Andrej Karpathy
The GPU Babysitting Problem
You know this routine.
It's Thursday evening. You've been tuning hyperparameters all day and you finally have a configuration that looks promising. You kick off the training run. Estimated time: 14 hours. That means results by Friday morning — if everything goes well.
You want to go home. You want to eat dinner with your family, watch a movie, get a full night's sleep. But you don't. You stay. Or you leave and spend the entire evening checking your phone, refreshing WandB, SSHing in from your laptop on the couch to make sure the loss is still going down.
Your friends invite you out on Saturday. You hesitate. The ablation study is running on three different seeds. What if one of them crashes? What if the data loader hits a corrupted sample and throws an exception at epoch 40? You've seen it happen before. You cancel.
This is the GPU babysitting problem. It's not a technical problem — your code works, your server works, your data is fine. It's a workflow problem. The entire pipeline depends on a human being present, alert, and ready to intervene at any moment. You are the monitoring system. You are the crash detector. You are the error handler. And you are terrible at all three of these jobs, because you also need to sleep, eat, think, read papers, and occasionally remember that life exists outside the lab.
Every experienced deep learning researcher has internalized this tax. They don't even complain about it anymore. It's just how things are. You launch a run, you check on it, you hope for the best. The anxiety becomes background noise — always there, never fully acknowledged.
But it doesn't have to be this way.
Silent Failures
Here's the scenario that really hurts.
It's a Tuesday night. You've launched a large-scale training run before leaving the lab — 4 GPUs, a dataset you spent two weeks preprocessing, and a novel architecture you've been iterating on for a month. You go home feeling good. Tomorrow morning, you'll have the first real results.
You wake up at 7am. You SSH in. The tmux session is still there, but the process is dead. You scroll up through the output. There it is: CUDA out of memory at epoch 3. The timestamp says 11:47pm. That means your four GPUs have been sitting completely idle for over seven hours. Seven hours of compute, wasted. Not because the experiment failed — failures are fine, failures are information — but because nobody was there to notice, diagnose, and restart.
Now here's the worse version. You're at a conference. Or visiting your parents. Or on a rare vacation. You get a nagging feeling around midnight and check your phone. You can SSH in from Termius, and you can see the training crashed. You can even read the error log. But fixing it requires editing a config file, changing the batch size, and restarting the script. From your phone. With your thumbs. At 1am in a hotel room. So you don't. You tell yourself you'll deal with it in the morning. Another eight hours of idle GPUs. Another day lost.
The cruelest part of silent failures isn't the wasted compute. It's the wasted time. Your GPU hours are expensive, yes. But your calendar is more expensive. That training run was supposed to give you results for the Friday meeting. Now you're a day behind. And a day behind means you're making decisions under pressure, cutting corners on the next experiment, rushing the analysis. One silent failure cascades through your entire research timeline.
What If Someone Was Always Watching?
Now imagine something different.
It's Tuesday night again. Same training run, same 4 GPUs, same novel architecture. You go home. At 11:47pm, the training crashes with the same OOM error. But this time, something happens.
An AI agent — running on your local machine, connected to the server over SSH — detects that the GPU utilization has dropped to zero. It reads the error log. It identifies the problem: batch size too large for the activation memory at that particular layer depth. It edits the config, reduces the batch size, adjusts the gradient accumulation steps to compensate, and restarts the training. The whole process takes ninety seconds. By 11:49pm, training is running again.
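The recovery step described above follows a simple policy: halve the per-step batch size and double the gradient accumulation so the effective batch size stays constant. A minimal sketch of that policy, assuming a flat dictionary config (the function name and config keys are illustrative, not from any real tool):

```python
def recover_from_oom(config: dict) -> dict:
    """Hypothetical OOM-recovery policy: halve the per-step batch size
    and double gradient accumulation steps so the effective batch size
    (batch_size * grad_accum_steps) is unchanged."""
    if config["batch_size"] < 2:
        raise RuntimeError("batch size already minimal; needs a human")
    new = dict(config)
    new["batch_size"] = config["batch_size"] // 2
    new["grad_accum_steps"] = config["grad_accum_steps"] * 2
    return new

# The crash in the story: batch size 32 -> 16, accumulation 1 -> 2.
cfg = {"batch_size": 32, "grad_accum_steps": 1}
print(recover_from_oom(cfg))  # {'batch_size': 16, 'grad_accum_steps': 2}
```

The point is not the three lines of arithmetic; it's that the rule is mechanical enough for an agent to apply safely at midnight, while anything it can't handle (batch size already at 1) gets escalated to you.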
You wake up at 7am. You pull out your phone, open Termius, and type: "status?" The AI responds: training crashed once at 11:47pm due to OOM, restarted with batch size 16 instead of 32, currently at epoch 27, loss looks normal. You nod, put your phone away, and go make coffee.
That's the promise of this guide.
Not artificial general intelligence. Not a system that does your research for you. Something much more practical: an AI assistant that watches your training runs 24/7, detects crashes, reads the error logs, fixes the obvious problems, and restarts. It syncs your code, manages your environments, and keeps your GPUs busy. All you need is a phone and an occasional glance.
You stay the researcher. You make the decisions — which experiments to run, which hypotheses to test, which results matter. But you stop being the babysitter. The midnight alarm checks, the cancelled weekends, the anxious refreshing — all of that becomes someone else's job. And that someone never sleeps, never gets distracted, and never forgets to check.
Architecture Overview
The system has three nodes. Every piece of this guide connects to this architecture, so it's worth understanding now.
Your Phone (Termius)

This is your window into the system. From anywhere — a coffee shop, a conference hall, your bed at 2am — you can SSH into your local machine and talk to Claude Code. You check status, give high-level instructions, and make decisions. You don't write code on your phone. You don't debug on your phone. You tell the AI what to do, and it does it.

Your Local Machine (Claude Code)

This is the brain. Claude Code runs here, inside a persistent tmux session. It manages your git repositories, edits code, syncs files to the server, and coordinates everything. It connects to the GPU server over SSH and issues commands. When something breaks on the server, Claude Code is the one that detects it, diagnoses it, and fixes it. Your local machine doesn't need a GPU — it just needs a stable internet connection and the ability to run Claude Code.
Why does the local machine sit in the middle? Because Claude Code needs a persistent, always-on environment to run in, and your phone can't provide that. Your phone connects and disconnects. Your laptop opens and closes. But a desktop or a small home server, running tmux, is always there. It's the anchor of the entire system.
Your GPU Server (Training)

This is where the actual computation happens. Training runs inside tmux sessions. A lightweight watchdog script monitors GPU utilization and process health. Conda environments hold your dependencies. The server doesn't know or care about Claude Code — it just receives SSH commands and runs them. All the intelligence lives on your local machine.
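The watchdog's core check is simple: ask nvidia-smi for per-GPU utilization and flag the run when every GPU sits idle. A minimal sketch of that logic, assuming the output of nvidia-smi's CSV query mode (the function name and the 5% threshold are illustrative choices, not part of any standard tool):

```python
def gpus_idle(nvidia_smi_csv: str, threshold: int = 5) -> bool:
    """Return True when every GPU's utilization is at or below `threshold` percent.
    Expects the output of:
        nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
    which is one integer per line, one line per GPU."""
    utils = [int(line.strip()) for line in nvidia_smi_csv.strip().splitlines()]
    return all(u <= threshold for u in utils)

# Healthy 4-GPU run vs. the crashed run from the story:
print(gpus_idle("97\n98\n96\n99\n"))  # False — training is alive
print(gpus_idle("0\n0\n0\n0\n"))      # True — time to read the logs
```

In practice you'd sample this every minute or so and only alert after several consecutive idle readings, since utilization legitimately dips during validation and checkpointing.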
The data flows like this: you talk to Claude Code from your phone. Claude Code talks to the GPU server over SSH. When training crashes, the watchdog detects it, Claude Code reads the logs and fixes the problem, and you get a one-line status update the next time you check in.
Three nodes. Phone for control. Local machine for intelligence. Server for compute. Simple, robust, and it works whether you're in the lab or on a beach.
What You'll Build in This Guide
This guide walks you through building this system from scratch, one chapter at a time.
Chapter 2: Remote Access — You'll set up SSH keys, configure your connections, install Tailscale for mesh networking, and connect to your server from your phone using Termius. By the end, you can reach your machines from anywhere.
Chapter 3: Persistent Sessions — You'll learn tmux, the tool that makes everything else possible. Your processes survive disconnections, your sessions stay alive for weeks at a time, and you never lose a training run to a dropped SSH connection again.
Chapter 4: Network & Proxy — You'll configure SSH multiplexing so multiple connections share a single tunnel, and set up port forwarding for servers that need a proxy to reach the outside world.
Chapter 5: Claude Code Setup — You'll install Claude Code, authenticate with the API, understand the costs, and get it running inside tmux so it's always available.
Chapter 6: Teaching Your AI — You'll write CLAUDE.md files that teach Claude Code about your specific servers, GPU layouts, conda environments, and research conventions. This is where a generic AI becomes your research assistant.
Chapter 7: Automation — You'll set up hooks, watchdog monitoring, and periodic health checks. This is the chapter where Claude Code goes from a tool you talk to into a system that acts on its own.
Chapter 8: Research Workflow — You'll see the complete pipeline: from idea to code to training to results, with Claude Code orchestrating every step. This is where everything comes together.
Chapter 9: First Experiment — You'll run a real experiment end-to-end. Claude Code writes the training script, syncs it to the server, launches training, monitors it, and delivers results — all while you watch from your phone.
Checkpoint
Draw the three-node architecture on a piece of paper. Label each node: Phone, Local Machine, GPU Server. Write down what each one does. Draw the arrows showing how data flows between them.
If you can explain why the local machine sits in the middle — why the phone doesn't connect directly to the GPU server for AI-assisted research — you understand the system. You'll build it in the next eight chapters.