How to set up a new computational cluster
One of the most common questions I get from students just starting out in AI/ML research or computational biology is how to get set up on that new computational cluster they just got access to. Over the years, I’ve done this enough times that I started keeping a step-by-step guide I follow every time I get access to a new system, so that nothing gets missed. After getting a few requests to help with this in the last few weeks, I decided to publish my guide as a note here.
There are a few important things to keep in mind when reading this:
- These are all just preferences! There are many tools you can use and many ways to get set up on a cluster. This is mostly just what I prefer.
- This is not a static document and I don’t treat it as such! Over the years I’ve added quite a bit to it as my practices change or as I find new shortcuts, and with the incredible new abilities of modern AI tools, it’s been changing much more rapidly.
I hope you find this helpful or learn a trick or two. Additionally, if you have thoughts on a better way to do anything in this note, please don’t hesitate to tell me about it here! Cheers!
Step 0: Set up a coding agent of your choice
Before we do anything with the cluster, the number one piece of advice I can give is use a coding agent. They are amazingly helpful and can likely do a lot of the steps in this document for you if you give them this note as an input. My personal favorite is Claude Code! I would set this up on your local machine before we even get started. We’ll also set it up on the cluster itself in Step 4.
Step 1: Test your SSH connection
Before we start configuring anything, let’s make sure you can actually connect to the cluster. You’ll need two things: your username (the one assigned by the cluster, which may not be the same as your local machine’s username) and the cluster’s hostname. The hostname is the network address of the cluster’s login server, and it’s usually provided in the welcome email or onboarding docs you received when your account was created. Some examples from Stanford are sc.stanford.edu (SAIL) and login.sherlock.stanford.edu (Sherlock).
Throughout this doc, I’ll use a few different Stanford clusters as examples. Open a terminal on your local machine and try connecting:
# ssh <username>@<hostname>
ssh viggiano@login.sherlock.stanford.edu
You should be prompted for your password. Enter it, and if everything is working you’ll land in a shell on the cluster. Type exit to disconnect and come back to your local machine. If this doesn’t work, double check the hostname and username with whoever gave you access! (Running ssh -v <username>@<hostname> prints verbose debugging output that can help pinpoint where the connection is failing.)
Now that we know the connection works, we can move on to getting set up!
Step 2: Add the cluster to your SSH config
Typing out ssh <username>@<hostname> every time you want to connect gets old fast, and hostnames are hard to remember. Instead, we can define short nicknames for our clusters in an SSH config file.
- On your local machine, open (or create) the ~/.ssh/config file:

vim ~/.ssh/config

- Add an entry for the cluster you are getting set up that looks like this:

Host sherlock                              # <-- This is the nickname the cluster will use!
    HostName login.sherlock.stanford.edu   # <-- The hostname
    User viggiano                          # <-- Your cluster username (may not be the same as your local username)
Now we can connect with just:
ssh sherlock
This nickname also works with scp, rsync, sftp, and any other tool that uses SSH under the hood. IDEs like VSCode/Cursor will also automatically pick up all of your configured hosts, so the cluster will appear in their Remote SSH dropdown as an option to connect to.
You can add as many entries as you want:
Host sumo
HostName sumo.stanford.edu
User viggiano
Host sail
HostName sc.stanford.edu
User viggiano
Step 3: Set up SSH keys
Next we will set up our SSH keys! Every time you connect to a cluster, the server needs to verify that you are who you say you are. By default, this means typing your password on every single connection. SSH keys let you skip that entirely. They work using a key pair: a private key that stays on your local machine and a public key that you give to the cluster. Once the cluster has your public key, it can verify your identity automatically. NOTE: Some clusters do not allow this step or handle it for you automatically. This is common on shared systems where admins enforce password-based or two-factor authentication for security. Notably, Stanford’s Sherlock cluster does not support SSH key authentication.
Quick Aside: Setting up SSH Keys
SSH keys live in the ~/.ssh/ directory on your local machine. You can check what’s there by running:
$ ls -al ~/.ssh/
If you already have SSH keys, you will most likely see something like the following:

- id_ed25519 is your private key. Never share this with anyone.
- id_ed25519.pub is your public key. This is what you copy to remote servers.
- known_hosts is an auto-maintained file that keeps track of servers you’ve connected to before.
If you don’t see any keys, you should create one using the ssh-keygen command:
ssh-keygen -t ed25519
(When running ssh-keygen, press Enter through the prompts to accept the default file location and skip the passphrase.)
Your public key is just a single line of text:
$ cat ~/.ssh/id_ed25519.pub
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... bviggiano@local-machine
Your private key will look like this and should never be shared:
$ cat ~/.ssh/id_ed25519
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAA...
...many more lines...
-----END OPENSSH PRIVATE KEY-----
Once you have a key pair, you can run ssh-copy-id from your local machine to automatically copy your public key to the cluster:
# ssh-copy-id <cluster-nickname>
ssh-copy-id sail
You’ll be asked for your password one last time. After that, you should be able to connect without it.
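If ssh-copy-id isn’t available on your machine, you can append the key by hand. A sketch, assuming the sail nickname from Step 2 and the default key path:

```shell
# Append your public key to the cluster's authorized_keys file
# (creates ~/.ssh on the cluster with the right permissions if it doesn't exist yet)
cat ~/.ssh/id_ed25519.pub | ssh sail \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```

The chmod calls matter: SSH refuses to use authorized_keys if the permissions are too open.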
Step 4: Set up Claude Code
Claude Code is Anthropic’s CLI tool for working with Claude directly in your terminal. It can help you navigate unfamiliar codebases, debug issues, write scripts, and even run through the remaining steps in this guide interactively.
Some clusters have it pre-installed as a module. On Sherlock, for example, you can simply run:
module load claude-code
Otherwise, you can install it directly with a single command:
curl -fsSL https://claude.ai/install.sh | bash
Run claude to launch it and follow the authentication prompts. Once set up, you can ask it to help with anything from “set up my conda environment” to “why is my CUDA job failing.”
I highly recommend helping Claude get familiar with your new cluster by giving it links to the cluster’s documentation. Claude Code has a memory file called CLAUDE.md that it reads at the start of every session. You can ask it to save useful cluster details there so it remembers them for future conversations:
Hello Claude! You are currently being run on the <cluster-name-here> cluster. Please take a look at this documentation website <url-here> to get familiar with how the cluster is set up. Add a summary of what you find to your global CLAUDE.md on this system.
Step 5: Install Miniconda
Miniconda is a lightweight installer for conda, the package and environment manager most commonly used in the ML and data science communities. It lets you create isolated Python environments for different projects, which is essential on shared clusters where you don’t have root access. I also use alternatives such as Mamba (a faster drop-in replacement) and uv (a newer, Rust-based pip replacement), but I’d recommend using Miniconda to start since it’s the most widely used and is generally what most users already have set up.
Here is a step-by-step guide for getting Miniconda set up:
- First, we need to download and run the installer on the cluster. The example below is for Linux x86_64, which is the most common cluster architecture. You can check yours by running uname -m on the cluster. If it’s something other than x86_64, find the right installer at the Miniconda downloads page.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
- The installer will show you a very long license agreement. Scroll through (hold Enter) and accept at the end with y.
- You will see the following message. Press ENTER to accept the default location, or type a different path (e.g., a larger filesystem) at the prompt, as in the example below:

Miniconda3 will now be installed into this location:
/sailhome/viggiano/miniconda3

- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below

[/sailhome/viggiano/miniconda3] >>> /scr-ssd/viggiano/miniconda3

- When prompted “Do you wish to update your shell profile to automatically initialize conda?”, select yes. This ensures conda is available every time you open a new terminal.
- Miniconda is now installed. We can now remove the installer file:

rm Miniconda3-latest-Linux-x86_64.sh
You can now create new environments via:
conda create -n env-name python=3.10
Step 6: Manage storage
Most clusters give your home directory (~) a limited quota. For example, Sherlock limits $HOME to just 15 GB. When you use packages like HuggingFace Transformers, PyTorch, or Weights & Biases, model weights and cache files are automatically downloaded to subdirectories of your home directory (e.g., ~/.cache/huggingface/, ~/.cache/torch/, ~/.cache/wandb/), which can fill up your quota quickly.
Clusters typically offer multiple storage locations, and it’s important to understand what your options are so you can place files accordingly.
On Sherlock, the main options are:
| Filesystem | Path | Speed | Quota |
|---|---|---|---|
| Home ($HOME) | /home/users/$USER/ | Fast | 15 GB |
| Group Home ($GROUP_HOME) | /home/groups/<PI>/ | Fast | 1 TB shared |
| Scratch ($SCRATCH) | /scratch/users/$USER/ | Fast | Large (auto-purges after 90 days if you don’t use the file) |
| Oak ($OAK) | /oak/stanford/groups/<PI>/ | Slow | Large |
Your cluster will have its own storage layout; check with your admin or docs. The key principle is the same: if you have limited storage, keep large, growing directories off of $HOME. Shared storage locations usually have a convention for how users organize their directories. Take a look at what others on your cluster have done (e.g., ls $GROUP_HOME) and create a directory for yourself following the same pattern. A common convention is to use your username (e.g., mkdir $GROUP_HOME/$USER).
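When a quota fills up, the first step is finding out where the space went. Most clusters have their own quota command (check your docs), but a quick check with standard tools works anywhere:

```shell
# Show the largest top-level directories in your home, biggest first
du -h --max-depth=1 ~ 2>/dev/null | sort -hr | head -n 10
```

Hidden cache directories like ~/.cache and ~/.conda are usually near the top of this list.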
Set up symlinks for cache locations
A symlink (symbolic link) is a special file that acts as a pointer to another location on the filesystem. To any program reading or writing files, a symlink looks and behaves exactly like a normal directory; the difference is that the actual data lives elsewhere. This means you can keep ~/.cache in your home directory as far as any tool is concerned, while the files are really stored on a larger filesystem.
Here are some common directories I’d recommend redirecting with symlinks:
# ln -s <target-on-larger-storage> <path-in-home>
ln -s $GROUP_HOME/viggiano/.cache ~/.cache
ln -s $GROUP_HOME/viggiano/.local ~/.local
ln -s $GROUP_HOME/viggiano/.conda ~/.conda
ln -s $GROUP_HOME/viggiano/.cursor-server ~/.cursor-server
ln -s $GROUP_HOME/viggiano/.vscode-server ~/.vscode-server
You can verify the symlinks are set up correctly by running ls -al ~. You should see something like this:
$ ls -al ~
lrwxrwxrwx 1 viggiano viggiano 35 Mar 29 12:00 .cache -> /home/groups/mylab/viggiano/.cache
lrwxrwxrwx 1 viggiano viggiano 35 Mar 29 12:00 .local -> /home/groups/mylab/viggiano/.local
lrwxrwxrwx 1 viggiano viggiano 35 Mar 29 12:00 .conda -> /home/groups/mylab/viggiano/.conda
lrwxrwxrwx 1 viggiano viggiano 42 Mar 29 12:00 .cursor-server -> /home/groups/mylab/viggiano/.cursor-server
lrwxrwxrwx 1 viggiano viggiano 42 Mar 29 12:00 .vscode-server -> /home/groups/mylab/viggiano/.vscode-server
The -> arrow tells you each directory is a symlink pointing to its target on the larger filesystem.
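One gotcha: if a directory like ~/.cache already exists, ln -s will create the link inside it rather than replacing it. Move the existing directory to the larger filesystem first, then link back. A sketch, assuming the $GROUP_HOME/$USER layout from above:

```shell
# Move the existing cache to the larger filesystem, then symlink it back
mv ~/.cache "$GROUP_HOME/$USER/.cache"
ln -s "$GROUP_HOME/$USER/.cache" ~/.cache
```

Nothing is lost this way: tools keep reading and writing ~/.cache, and the data lands on the big filesystem.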
Redirect where conda environments are saved
Conda environments can also grow large quickly. You can tell conda to store them on a larger filesystem by default:
conda config --prepend envs_dirs /home/groups/<PI>/viggiano/envs
Now conda create -n my-env python=3.10 will create the environment on the larger filesystem, and conda activate my-env still works by name from anywhere.
Other useful conda config commands
- Show all configured environment directories:
conda config --show envs_dirs
- Add a directory to the end of the list (lowest priority):
conda config --append envs_dirs /path/to/your/new/envs/location
- Remove a specific directory from the list:
conda config --remove envs_dirs /path/to/remove
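Conda’s download cache of package tarballs (~/.conda/pkgs by default) grows quickly too. The pkgs_dirs setting redirects it the same way envs_dirs does; the path below follows the same placeholder convention as above:

```shell
# Store downloaded package tarballs on the larger filesystem as well
conda config --prepend pkgs_dirs /home/groups/<PI>/viggiano/pkgs
```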
Step 7: Customize shell functionality
One of the most powerful things I have started doing is writing custom shell functions that automate repetitive tasks. Your ~/.bashrc (or ~/.zshrc) runs every time you open a new shell session, making it the perfect place to add aliases, helper functions, and environment variables that make your life easier on the cluster. Here are a few I’ve found really useful.
.bashrc functions
create_workspace_file - Generate a VS Code/Cursor workspace file for the current directory
Setting up workspace files for VS Code/Cursor by hand is tedious. This function generates a .code-workspace file for your current directory and saves it to ~/workspaces. You can then open it directly from the terminal with cursor ~/workspaces/my-project.code-workspace.
Show function
create_workspace_file() {
local current_dir=$(basename "$PWD")
local current_path="$PWD"
local target_dir="$HOME/workspaces"
mkdir -p "$target_dir"
local workspace_file="$target_dir/$current_dir.code-workspace"
cat > "$workspace_file" <<EOL
{
"folders": [
{
"path": "$current_path"
}
],
"settings": {}
}
EOL
echo "Workspace file created at: $workspace_file"
}
gpu-users - See who is using which GPUs
On clusters that don’t use SLURM for GPU allocation (where users share GPUs directly), it can be helpful to see which users are running processes on which GPUs before you launch a job. This function displays which users currently have processes using the various visible GPUs.
Show function
gpu-users() {
echo -e "GPU\tPID\tUSER\tPROCESS"
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader,nounits |
while IFS=',' read -r gpu_uuid pid pname; do
pid=$(echo "$pid" | tr -d '[:space:]')
[ -z "$pid" ] && continue
if [[ "$pid" =~ ^[0-9]+$ ]]; then
user=$(ps -o user= -p "$pid" 2>/dev/null)
if [ -n "$user" ]; then
gpu_id=$(nvidia-smi --query-gpu=gpu_uuid,index --format=csv,noheader | grep "$gpu_uuid" | awk -F',' '{print $2}' | xargs)
echo -e "$gpu_id\t$pid\t$user\t$pname"
fi
fi
done | sort -k1,1n
}
tmux
tmux is a terminal multiplexer that lets you run persistent sessions on the cluster. If your SSH connection drops, your tmux session keeps running and you can reattach to it. I’d recommend adding the following to ~/.tmux.conf to enable mouse scrolling and selection:
# Enable mouse support (allows clicking and dragging to select)
set -g mouse on
Basic tmux commands
| Command | Description |
|---|---|
| tmux new -s mysession | Create a new named session |
| tmux attach -t mysession | Reattach to an existing session |
| tmux ls | List all active sessions |
| Ctrl+b d | Detach from current session (session keeps running) |
| Ctrl+b c | Create a new window within a session |
| Ctrl+b n / Ctrl+b p | Switch to next / previous window |
| Ctrl+b % | Split pane vertically |
| Ctrl+b " | Split pane horizontally |
| Ctrl+b arrow keys | Switch between panes |
| exit | Close the current pane or window |
A typical workflow: start a tmux session, kick off a long-running training job, detach with Ctrl+b d, close your laptop, and reattach later with tmux attach -t mysession to check on progress.
Launch scripts (Advanced)
Once you’re comfortable with tmux, a natural next step is writing small launch scripts that combine tmux with resource allocation. On clusters that use SLURM, I’ll often write a script that requests GPU resources and drops me into a tmux session in one command. Here’s an example:
Example: GPU launch script
#!/usr/bin/env bash
# Usage: launch_gpu [NUM_GPUS] [PARTITION]
# Example: launch_gpu 2 gpu
NUM_GPUS="${1:-1}"
PARTITION="${2:-gpu}"
# Auto-name sessions: gpu_session_1, gpu_session_2, etc.
SESSION_COUNT=$(tmux list-sessions 2>/dev/null | wc -l)
SESSION_NAME="gpu_session_$((SESSION_COUNT + 1))"
# Create a tmux session and request an interactive SLURM job inside it
tmux new-session -d -s "$SESSION_NAME"
tmux send-keys -t "$SESSION_NAME" "srun --partition=$PARTITION --gpus=$NUM_GPUS --cpus-per-task=8 --mem=32G --time=8:00:00 --pty bash" C-m
echo "Launched session: $SESSION_NAME"
tmux attach -t "$SESSION_NAME"
This script auto-names each session, requests an interactive GPU job via SLURM, and attaches you to the tmux session. Adjust the partition name, memory, CPU, and time limits to match your cluster. Save it somewhere on your PATH (e.g., ~/.local/bin/launch_gpu) and make it executable with chmod +x.
Heads up: The resource values above (GPUs, CPUs, memory, time) are just defaults. Before running something like this, talk to your advisor or ask Claude Code about what resources are appropriate for your specific workflow and cluster. Requesting too much wastes shared resources; requesting too little means your job may fail or run slowly.
Step 8: Git and GitHub setup
You’ll almost certainly want to push and pull code on the cluster. This requires authenticating with GitHub from the cluster, which is a separate step from the SSH key setup in Step 3 (that was for connecting to the cluster; this is for connecting from the cluster to GitHub).
Install the GitHub CLI
The GitHub CLI (gh) is one of my favorite tools. It lets you interact with GitHub entirely from the terminal: create PRs, browse repos, review issues, manage workflows, and more. Most importantly for our purposes, it handles git authentication seamlessly so you never have to manually deal with tokens or credentials when pushing and pulling.
The easiest way to install is with brew install gh, but you usually don’t have permission to use brew on a cluster. I’d recommend downloading the binary directly and placing it on your PATH. This way gh is always available regardless of which conda environment is active:
# Check https://github.com/cli/cli/releases for the latest version
wget https://github.com/cli/cli/releases/download/v2.89.0/gh_2.89.0_linux_amd64.tar.gz
tar xzf gh_*.tar.gz
# Move the binary somewhere on your PATH
mkdir -p ~/.local/bin
cp gh_*/bin/gh ~/.local/bin/
# Clean up
rm -rf gh_*
Make sure ~/.local/bin is on your PATH by adding this to your ~/.bashrc if it isn’t already:
export PATH="$HOME/.local/bin:$PATH"
Alternative: install via conda
You can also install gh through conda, but note that it will only be available when that specific environment is active:
conda activate base
conda install gh --channel conda-forge
Authenticate with GitHub
The gh CLI can handle authentication for you. Run:
gh auth login
Select HTTPS when prompted, then choose “Paste an authentication token” and generate a new token at github.com/settings/tokens. Once authenticated, run:
gh auth setup-git
This configures git to use gh as a credential helper, so you won’t be asked for credentials when pushing or pulling.
Alternative: SSH authentication with GitHub
The HTTPS approach above is all you need, but if you prefer using git@github.com:... SSH URLs instead, you can add your SSH key to GitHub. Generate one on the cluster if you don’t have one already:
ssh-keygen -t ed25519 -C "viggiano@stanford.edu"
Copy your public key:
cat ~/.ssh/id_ed25519.pub
Then add it at github.com/settings/keys by clicking “New SSH Key” and pasting the contents. Test the connection:
ssh -T git@github.com
Configure git
Set your identity and preferences. Here are some I like to set:
git config --global user.email "viggiano@stanford.edu"
git config --global user.name "Ben Viggiano"
git config --global init.defaultBranch main
git config --global core.editor "vim"
Set up a global .gitignore
A global gitignore lets you exclude files across all repos without adding them to each project’s .gitignore. Create the file:
vim ~/.gitignore_global
Example ~/.gitignore_global:
dev_notebook.ipynb
Then tell git to use it:
git config --global core.excludesfile '~/.gitignore_global'
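You can confirm the global ignore is active from inside any repo with git check-ignore, which reports the rule (and the file it came from) that matched:

```shell
# -v prints the source of the matching rule; exits 0 if the path is ignored
git check-ignore -v dev_notebook.ipynb
```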
Step 9: Authenticate with ML services
If you’re doing ML work, you’ll likely need to authenticate with a few external services to download models or track experiments.
HuggingFace
HuggingFace is the largest open-source hub for ML models, datasets, and tools. Many popular models (like Llama, Mistral, and Gemma) are “gated,” meaning you need to create a HuggingFace account and accept the model’s license on its page before you can download it. Once you’ve done that, you’ll need to authenticate on the cluster.
Install the HuggingFace CLI and log in:
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
Alternatively, you can set the token directly in your environment (useful for non-interactive scripts):
export HF_TOKEN=hf_...
You can generate a token at huggingface.co/settings/tokens. Verify it worked:
hf auth whoami
Weights & Biases
Weights & Biases (W&B) is a popular platform for tracking ML experiments. It logs your training metrics, hyperparameters, and outputs to a web dashboard, making it easy to compare runs and share results with collaborators. Install and log in:
pip install wandb
wandb login
You’ll be prompted for your API key, which you can find at wandb.ai/authorize.
Summary
That’s it! Here’s a quick recap of everything we covered:
| Step | What | Why |
|---|---|---|
| 0 | Set up a coding agent | Helps you through the rest of the steps |
| 1 | Test your SSH connection | Make sure you can actually reach the cluster |
| 2 | Add cluster to SSH config | Connect with a short nickname instead of the full hostname |
| 3 | Set up SSH keys | Skip password entry on every connection |
| 4 | Set up Claude Code | AI assistance directly on the cluster |
| 5 | Install Miniconda | Manage Python environments without root access |
| 6 | Manage storage | Keep large files off your limited home directory |
| 7 | Customize your shell | Automate repetitive tasks with shell functions |
| 8 | Git and GitHub setup | Push and pull code from the cluster |
| 9 | Authenticate with ML services | Download gated models and track experiments |
If you have suggestions for improving this guide, please open an issue!