EPFL-LAPD/SCITAS_utils

SCITAS

First, make sure you are a member of the right SCITAS group: https://groups.epfl.ch

Also consult the SCITAS documentation, as things might change.

Kuma GPU cluster

Kuma has two GPU types: H100 and L40S. For Dr.TVAM, the L40S GPUs are faster and cheaper. Log in to the cluster (replace gaspar with your own Gaspar username):

ssh gaspar@kuma.hpc.epfl.ch

Build a container

Create a file container.def:

Bootstrap: docker
From: nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04

%post -c /bin/bash
    apt-get -y update
    apt-get -y install libpython3-dev python3-setuptools python3-pip python3-venv git
    python3 -m venv /opt/venv
    source /opt/venv/bin/activate
    pip install --upgrade pip
    pip install numpy scipy matplotlib jax jaxopt tqdm "jax[cuda12_pip]" jaxlib notebook
    pip install drtvam

%environment
    export VIRTUAL_ENV=/opt/venv
    export PATH="/opt/venv/bin:$PATH"

%runscript
    #!/bin/bash
    exec "$@"

Then build the container with:

srun --pty -p l40s -n 1 --cpus-per-task=8 --gpus-per-task=1 --qos=debug --time=00:10:00 apptainer build --force container.sif container.def
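
After the build finishes, it can be worth confirming that JAX inside the container actually sees a GPU. The following is a minimal sketch, assuming the image is called container.sif and reusing the srun options from the build command. The function name check_container_gpu is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: ask JAX inside the container which devices it sees.
# Assumes the image name from the build step; pass another path as $1 if needed.
check_container_gpu() {
    local image="${1:-container.sif}"
    if ! command -v srun >/dev/null 2>&1; then
        echo "srun not found; run this on the Kuma login node" >&2
        return 2
    fi
    # --nv exposes the host NVIDIA driver to the container
    srun -p l40s -n 1 --cpus-per-task=8 --gpus-per-task=1 --qos=debug --time=00:05:00 \
        apptainer exec --nv "$image" python3 -c 'import jax; print(jax.devices())'
}

# Usage on the cluster:
#   check_container_gpu container.sif
```

If the printed device list only contains CPU devices, the CUDA-enabled JAX install inside the image is the first thing to check.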

Run a container

Create a file called jupyter.sh:

#!/bin/bash

# Default values
TIME="03:10:00"
GPU="h100"

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --time=*)
            TIME="${1#*=}"
            shift
            ;;
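        --time)
            TIME="$2"
            # Abort if no value follows; a failed "shift 2" would otherwise loop forever
            shift 2 || { echo "Error: --time requires a value"; exit 1; }
            ;;
        --gpu=*)
            GPU="${1#*=}"
            shift
            ;;
        --gpu)
            GPU="$2"
            # Abort if no value follows; a failed "shift 2" would otherwise loop forever
            shift 2 || { echo "Error: --gpu requires a value"; exit 1; }
            ;;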
        *)
            shift
            ;;
    esac
done

# Validate GPU option
if [[ "$GPU" != "h100" && "$GPU" != "l40s" ]]; then
    echo "Error: Invalid GPU option. Use --gpu=h100 or --gpu=l40s"
    exit 1
fi

# Create temporary SBATCH script
TEMP_FILE=$(mktemp temp_file_XXXXXX.sh)

cat > "$TEMP_FILE" << EOF
#!/bin/bash
#SBATCH --job-name=jax_jupyter
#SBATCH --output=job_output.log
#SBATCH --error=job_error.log
#SBATCH --partition=${GPU}
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem=30G
#SBATCH --time=${TIME}
#SBATCH --cpus-per-task=16

folder_name=\$(basename "\$(dirname "\$1")")
LOGFILE="\$(pwd)/logs/\$(date '+%Y-%m-%d_%H-%M-%S')_\$folder_name.log"

mkdir -p logs
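# Run Jupyter inside the container on a random port.
# Adjust the bind path and the .sif path to your own scratch directory and image.
ipnport=\$(shuf -i8000-9999 -n1)
apptainer run --bind /scratch/wechsler --nv /home/wechsler/container_jax.sif jupyter notebook --no-browser --port=\${ipnport} --ip=\$(hostname -i)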
EOF

# Submit the job
sbatch "$TEMP_FILE"

# Clean up temp file after submission
rm "$TEMP_FILE"
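
The script relies on an unquoted here-document: ${GPU} and ${TIME} are expanded when jupyter.sh itself runs, while anything written as \$ is left untouched so that the batch job expands it later on the compute node. A small sketch of that behaviour, with a made-up variable and no Slurm involved (PARTITION stands in for GPU, and generate is a hypothetical function name):

```shell
#!/bin/bash
# Illustration only: which parts of an unquoted heredoc expand immediately
# and which are deferred to the generated script.
PARTITION="l40s"

generate() {
cat << EOF
#SBATCH --partition=${PARTITION}
port=\$(shuf -i8000-9999 -n1)
EOF
}

generate
# prints:
#   #SBATCH --partition=l40s
#   port=$(shuf -i8000-9999 -n1)
```

Forgetting the backslash would run shuf on the login node at submission time and bake a fixed port number into the batch script.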

Make the script executable with chmod +x jupyter.sh.

Then, to run a job, call: ./jupyter.sh --time=02:00:00 --gpu=h100.
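
Both spellings, --time=02:00:00 and --time 02:00:00, are accepted because the parser handles each form in its own case branch. A stripped-down sketch of that loop, wrapped in a hypothetical parse_args function so it can be tried without submitting anything:

```shell
#!/bin/bash
# Sketch of the option parsing in jupyter.sh. parse_args is a hypothetical
# wrapper; the defaults mirror the ones in the script.
parse_args() {
    TIME="03:10:00"
    GPU="h100"
    while [[ $# -gt 0 ]]; do
        case $1 in
            --time=*) TIME="${1#*=}"; shift ;;          # strip everything up to the first '='
            --time)   TIME="$2"; shift 2 || return 1 ;; # value is the next argument; fail if missing
            --gpu=*)  GPU="${1#*=}"; shift ;;
            --gpu)    GPU="$2"; shift 2 || return 1 ;;
            *)        shift ;;                          # unknown options are silently ignored
        esac
    done
}

parse_args --time 02:00:00 --gpu=l40s
echo "TIME=$TIME GPU=$GPU"
# prints: TIME=02:00:00 GPU=l40s
```

Options that are not given keep their defaults, so a bare ./jupyter.sh requests an H100 for 03:10:00.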

Keep in mind that running a job costs roughly 0.5 CHF per hour.
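
Once the job has started, Jupyter writes its access URL (including the token) to stderr, which the batch script redirects to job_error.log. A minimal sketch for pulling that URL out of the log on the cluster follows. It uses a POSIX-class variant of the pattern from the local helper script, find_jupyter_url is a hypothetical helper name, and the sample log lines are invented:

```shell
#!/bin/bash
# Hypothetical helper: print the first Jupyter URL found in a log file.
# Uses grep -E with POSIX classes so it does not depend on GNU grep's -P.
find_jupyter_url() {
    local logfile="${1:-job_error.log}"
    grep -oE 'http://10\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+[^[:space:]]*' "$logfile" | head -n 1
}

# Demo on a made-up log excerpt; on Kuma you would call
#   find_jupyter_url job_error.log
sample=$(mktemp)
printf '%s\n' \
    '[I 12:00:00 NotebookApp] Jupyter Notebook is running at:' \
    '[I 12:00:00 NotebookApp] http://10.91.27.5:8421/?token=abc123' \
    > "$sample"
find_jupyter_url "$sample"
rm -f "$sample"
```

If nothing is printed, the job may still be pending; squeue -u "$USER" shows its current state.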

Local helper script

To create the tunnel from your local machine to SCITAS, you can use the following script. It assumes an SSH host alias named scitas for the cluster login node in your ~/.ssh/config. Create the script, make it executable, and then call ./open_tunnel.sh:

#!/bin/bash

# Fetch last 10 lines from remote log file and extract the first URL
URL=$(ssh scitas "tail -n 10 job_error.log" | grep -oP 'http://10\.\d+\.\d+\.\d+:\d+[^\s]*' | head -n 1)

# Check if URL was found
if [[ -z "$URL" ]]; then
    echo "Error: Could not find URL in log file"
    exit 1
fi

echo "Found URL: $URL"

# Parse IP and PORT using regex
if [[ $URL =~ http://([0-9.]+):([0-9]+)(/.*) ]]; then
    IP="${BASH_REMATCH[1]}"
    PORT="${BASH_REMATCH[2]}"
    PATH_AND_TOKEN="${BASH_REMATCH[3]}"
else
    echo "Error: Invalid URL format"
    exit 1
fi

# Start SSH tunnel in background
ssh -NL "$PORT:$IP:$PORT" scitas &
SSH_PID=$!

# Wait a moment for tunnel to establish
sleep 2

# Open Firefox with localhost URL
firefox "http://127.0.0.1:$PORT$PATH_AND_TOKEN" &

# Optional: Keep script running and cleanup on exit
trap "kill $SSH_PID 2>/dev/null" EXIT
wait $SSH_PID
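
To see what the regex in open_tunnel.sh captures, it can be applied to a made-up URL without opening any connection. This is only a sketch: parse_jupyter_url is a hypothetical function name, and the address and token are invented.

```shell
#!/bin/bash
# Illustration only: the same bash regex as in open_tunnel.sh, applied to
# an invented URL. Prints "IP PORT PATH_AND_TOKEN" or returns 1 on no match.
parse_jupyter_url() {
    local url="$1"
    if [[ $url =~ http://([0-9.]+):([0-9]+)(/.*) ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

parse_jupyter_url "http://10.91.27.5:8421/tree?token=abc123"
# prints: 10.91.27.5 8421 /tree?token=abc123
```

The first two fields are what the tunnel needs (ssh -NL PORT:IP:PORT scitas), and the third is appended to http://127.0.0.1:PORT for the browser.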
