First, make sure you are a member of the right SCITAS group: https://groups.epfl.ch
Also consult the SCITAS documentation, as things might change.
Kuma has two GPU types: H100 and L40s. For Dr.TVAM the L40s are faster and cheaper. Log in with (replace gaspar with your own username):
ssh gaspar@kuma.hpc.epfl.ch
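The tunnel script later in this document calls ssh scitas, which only works if your local machine knows a host alias with that name. The following is a minimal sketch of how such an alias could be generated; the helper name make_ssh_alias is hypothetical, and mapping the alias to kuma.hpc.epfl.ch with the username gaspar is an assumption based on the login command above.

```shell
#!/bin/bash
# Hypothetical helper: print an SSH config entry for a host alias.
# Assumption: the alias "scitas" should point at the Kuma login node.
make_ssh_alias() {
    local alias_name="$1" host="$2" user="$3"
    printf 'Host %s\n    HostName %s\n    User %s\n' "$alias_name" "$host" "$user"
}

# Print the entry; replace gaspar with your own username.
make_ssh_alias scitas kuma.hpc.epfl.ch gaspar
```

Run this on your local machine and, if the output looks right, append it to ~/.ssh/config (for example by redirecting with >>).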
Create a file container.def:
Bootstrap: docker
From: nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04
%post -c /bin/bash
apt-get -y update
apt-get -y install libpython3-dev python3-setuptools python3-pip python3-venv git
python3 -m venv /opt/venv
source /opt/venv/bin/activate
pip install --upgrade pip
pip install numpy scipy matplotlib jax jaxopt tqdm "jax[cuda12_pip]" jaxlib notebook
pip install drtvam
%environment
export VIRTUAL_ENV=/opt/venv
export PATH="/opt/venv/bin:$PATH"
%runscript
#!/bin/bash
exec "$@"
Build the container with:
srun --pty -p l40s -n 1 --cpus-per-task=8 --gpus-per-task=1 --qos=debug --time=00:10:00 apptainer build --force container.sif container.def
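Once the build has finished, it may be worth checking that JAX inside the container actually sees a GPU before starting longer jobs. A minimal sketch, assuming container.sif sits in the current directory and reusing the partition and QOS from the build command; the 5-minute time limit is an arbitrary choice:

```shell
#!/bin/bash
# Sketch: ask JAX inside the container which devices it can see.
# Assumption: container.sif is in the current directory.
SIF="${SIF:-container.sif}"

cmd=(srun -p l40s -n 1 --gpus-per-task=1 --qos=debug --time=00:05:00
     apptainer exec --nv "$SIF"
     python3 -c 'import jax; print(jax.devices())')

if command -v srun >/dev/null 2>&1; then
    # On the cluster: run the check on a GPU node
    "${cmd[@]}"
else
    # Anywhere else: only show what would be run
    echo "srun not found; on the login node this would run:" >&2
    printf '%q ' "${cmd[@]}" >&2
    echo >&2
fi
```

If the printed device list contains only CPU devices, the CUDA-enabled JAX install in the container most likely did not succeed.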
Create a file called jupyter.sh:
#!/bin/bash
# Default values
TIME="03:10:00"
GPU="h100"
# Parse command line arguments
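while [[ $# -gt 0 ]]; do
    case $1 in
        --time=*)
            TIME="${1#*=}"
            shift
            ;;
        --time)
            # Guard against a missing value; a failing "shift 2" would loop forever
            [[ $# -ge 2 ]] || { echo "Error: --time requires a value" >&2; exit 1; }
            TIME="$2"
            shift 2
            ;;
        --gpu=*)
            GPU="${1#*=}"
            shift
            ;;
        --gpu)
            [[ $# -ge 2 ]] || { echo "Error: --gpu requires a value" >&2; exit 1; }
            GPU="$2"
            shift 2
            ;;
        *)
            echo "Warning: ignoring unknown argument: $1" >&2
            shift
            ;;
    esac
done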
# Validate GPU option
if [[ "$GPU" != "h100" && "$GPU" != "l40s" ]]; then
    echo "Error: Invalid GPU option. Use --gpu=h100 or --gpu=l40s"
    exit 1
fi
# Create temporary SBATCH script
TEMP_FILE=$(mktemp temp_file_XXXXXX.sh)
cat > "$TEMP_FILE" << EOF
#!/bin/bash
#SBATCH --job-name=jax_jupyter
#SBATCH --output=job_output.log
#SBATCH --error=job_error.log
#SBATCH --partition=${GPU}
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem=30G
#SBATCH --time=${TIME}
#SBATCH --cpus-per-task=16
folder_name=\$(basename "\$(dirname "\$1")")
LOGFILE="\$(pwd)/logs/\$(date '+%Y-%m-%d_%H-%M-%S')_\$folder_name.log"
mkdir -p logs
# Run the command
ipnport=\$(shuf -i8000-9999 -n1)
# Adapt the bind path and the container path to your own account and image name
apptainer run --bind /scratch/wechsler --nv /home/wechsler/container_jax.sif jupyter notebook --no-browser --port=\${ipnport} --ip=\$(hostname -i)
EOF
# Submit the job
sbatch "$TEMP_FILE"
# Clean up temp file after submission
rm "$TEMP_FILE"
And make it executable with chmod +x jupyter.sh.
Then, to run a job, call:
./jupyter.sh --time=02:00:00 --gpu=h100
Keep in mind that the job is billed by the hour (roughly 0.5 CHF per hour).
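Because the job is billed for as long as it runs, it can help to check on it and cancel it once you are done. A minimal sketch, assuming the job name jax_jupyter set in jupyter.sh and the standard Slurm client tools on the login node:

```shell
#!/bin/bash
# Sketch: inspect (and optionally cancel) the notebook job.
# The job name is the one set via --job-name in jupyter.sh.
JOB_NAME="jax_jupyter"

if command -v squeue >/dev/null 2>&1; then
    # Show job id, name, state, elapsed time, and time limit of your jobs
    squeue -u "$USER" --name="$JOB_NAME" --format="%i %j %T %M %l"
    # Uncomment to cancel all of your jobs with that name:
    # scancel -u "$USER" --name="$JOB_NAME"
else
    echo "squeue not found; run this on the cluster login node" >&2
fi
```

Cancelling by name stops every job of yours called jax_jupyter, so pass an explicit job id to scancel instead if you run several notebooks at once.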
To create the tunnel from your local machine to SCITAS, you can use the following script. It assumes that scitas is an SSH host alias for the cluster on your local machine.
Create it, make it executable again, and then call ./open_tunnel.sh:
#!/bin/bash
# Fetch last 10 lines from remote log file and extract the first URL
URL=$(ssh scitas "tail -n 10 job_error.log" | grep -oP 'http://10\.\d+\.\d+\.\d+:\d+[^\s]*' | head -n 1)
# Check if URL was found
if [[ -z "$URL" ]]; then
    echo "Error: Could not find URL in log file"
    exit 1
fi
echo "Found URL: $URL"
# Parse IP and PORT using regex
if [[ $URL =~ http://([0-9.]+):([0-9]+)(/.*) ]]; then
    IP="${BASH_REMATCH[1]}"
    PORT="${BASH_REMATCH[2]}"
    PATH_AND_TOKEN="${BASH_REMATCH[3]}"
else
    echo "Error: Invalid URL format"
    exit 1
fi
# Start SSH tunnel in background
ssh -NL "$PORT:$IP:$PORT" scitas &
SSH_PID=$!
# Wait a moment for tunnel to establish
sleep 2
# Open Firefox with localhost URL
firefox "http://127.0.0.1:$PORT$PATH_AND_TOKEN" &
# Optional: Keep script running and cleanup on exit
trap "kill $SSH_PID 2>/dev/null" EXIT
wait $SSH_PID
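To see what the parsing step in open_tunnel.sh does, the sketch below applies the same regular expression to a sample URL. The URL, IP address, port, and token are made up for illustration; a real one appears in job_error.log once the notebook server has started.

```shell
#!/bin/bash
# Made-up example URL in the format the tunnel script expects
url='http://10.91.27.5:8421/tree?token=abc123'

# Same pattern as in open_tunnel.sh: IP, port, and the rest of the URL
re='http://([0-9.]+):([0-9]+)(/.*)'
if [[ $url =~ $re ]]; then
    ip="${BASH_REMATCH[1]}"
    port="${BASH_REMATCH[2]}"
    path_and_token="${BASH_REMATCH[3]}"
    local_url="http://127.0.0.1:$port$path_and_token"
    echo "tunnel:  ssh -NL $port:$ip:$port scitas"
    echo "browser: $local_url"
else
    echo "URL did not match" >&2
fi
# prints:
# tunnel:  ssh -NL 8421:10.91.27.5:8421 scitas
# browser: http://127.0.0.1:8421/tree?token=abc123
```

The compute node's IP and port are forwarded to the same port on your local machine, which is why the browser URL keeps the port and token but swaps the host for 127.0.0.1.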