Deploying and interacting with Tesseracts on HPC clusters using tesseract-runtime serve

Running Tesseracts on high-performance computing (HPC) clusters opens up many use cases, including:

  • Deployment of a single long-running component of a pipeline on a state-of-the-art GPU.

  • Running an entire optimization workflow on a dedicated compute node.

  • Large parameter scans distributed in parallel over a multitude of cores.

All of this is possible even in scenarios where containerisation options are either unavailable or incompatible, thanks to the alternative container-free tesseract-runtime CLI tool (we will cover containerised options in the future). In this tutorial, we will show how to launch uncontainerised Tesseracts using SLURM, either as batch jobs or for interactive use. For convenience we shall use our ready-built FEM shape-optimisation demo.

General set up

:warning: WARNING: These instructions assume that conda is available through module load. If this is not the case, you will need to install conda by another supported method (or use an alternative such as pyenv).

  1. Log in to your cluster (e.g. via SSH).

  2. Set up Python:

    $ module load conda
    $ conda create --name tsrctjax-3.13 python=3.13
    $ conda activate tsrctjax-3.13
    
  3. Install tesseract-core[runtime] and tesseract-jax via pip

    (tsrctjax-3.13)$ pip install "tesseract-core[runtime]" tesseract-jax
    
  4. Make sure your environment is ready each time you log in or switch nodes by adding the following to your .bashrc:

    $ echo "module load conda" >> ~/.bashrc
    $ echo "conda activate tsrctjax-3.13" >> ~/.bashrc
    
  5. Create a new directory in $SCRATCH (or an analogous directory, depending on your setup) in which to store run information.

    $ export RUNDIR=$SCRATCH/fem-demo
    $ mkdir $RUNDIR
    

Download the example Tesseract APIs and optimisation script

The design and FEM Tesseracts used in the FEM shape-optimisation demo are specified within the examples/fem-shapeopt folder of the Tesseract-JAX repo; to access them we simply clone the repo.

$ git clone git@github.com:pasteurlabs/tesseract-jax.git

We will also make use of a simple optimisation script opt.py.zip (1.6 KB) that runs the problem shown in demo.ipynb, but using a couple of iterations of L-BFGS (through scipy.optimize.minimize) instead of optax’s SGD. Here we will assume it is located in $HOME/fem-demo.
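
For reference, the L-BFGS driver in opt.py follows the standard scipy.optimize.minimize pattern. A minimal sketch, with a toy quadratic standing in for the design/FEM Tesseract pipeline (the objective, gradient, and option values here are illustrative, not taken from opt.py):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective standing in for the design -> FEM Tesseract pipeline.
def objective(x):
    return float(np.sum((x - 1.0) ** 2))

def gradient(x):
    return 2.0 * (x - 1.0)

x0 = np.zeros(4)
result = minimize(
    objective,
    x0,
    jac=gradient,
    method="L-BFGS-B",       # scipy's L-BFGS implementation
    options={"maxiter": 2},  # a couple of iterations, as in opt.py
)
print(result.message, result.nit, result.fun)
```

With maxiter=2 the optimiser stops early and reports success=False with the "TOTAL NO. OF ITERATIONS REACHED LIMIT" message, matching the output shown later in this tutorial.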

Create conda environments for each Tesseract

As the FEM shape-optimisation Tesseracts have incompatible dependencies, we need to install them in a separate conda environment for each Tesseract, being sure to additionally install tesseract-core[runtime] in each environment to enable serving through the tesseract-runtime CLI command.

  1. The design Tesseract specifies its dependencies through a tesseract_requirements.txt file; we create a new environment and install these dependencies into it:

    $ export DESIGN_TESS_PATH=tesseract-jax/examples/fem-shapeopt/design_tess
    $ conda create --name design-tess-env python=3.12 --file $DESIGN_TESS_PATH/tesseract_requirements.txt
    $ conda activate design-tess-env
    $ pip install "tesseract-core[runtime]"
    
  2. The FEM Tesseract already defines its requirements through tesseract_environment.yaml; we can set up the required jax-fem-env environment using:

    $ export FEM_TESS_PATH=tesseract-jax/examples/fem-shapeopt/fem_tess
    $ conda env create -f $FEM_TESS_PATH/tesseract_environment.yaml
    $ conda activate jax-fem-env
    $ pip install "tesseract-core[runtime]"
    

Method 1: Serving Tesseracts with SLURM and then querying from the login node or a home machine using an SSH tunnel

Interacting with live Tesseracts is useful for multiple purposes, such as experimentation from a Jupyter notebook on your home device, or building heterogeneous (i.e. multi-device) pipelines. Serving Tesseracts with a SLURM script is as simple as activating the relevant conda environment and running TESSERACT_API_PATH=${TESS_PATH}/tesseract_api.py tesseract-runtime serve. The main consideration is that the URL the Tesseract will be served at is not known until a node is assigned, so we need to ensure this information is written to our output file. We provide three example configurations below:

Background processes on a single node—most appropriate for the standard case of synchronous, sequential pipelines.
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Get compute node info
NODE_HOSTNAME=$(hostname)

# Ensure early exit on first error
set -e

# Activate conda environments and serve as background processes
conda activate design-tess-env
TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000 & DESIGN_JOB_PID=$!
conda activate jax-fem-env
TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001 & FEM_JOB_PID=$!

echo "=== Serving Tesseracts on ${NODE_HOSTNAME} ==="
echo "Serving Design Tesseract at http://${NODE_HOSTNAME}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${NODE_HOSTNAME}:8001, PID: $FEM_JOB_PID"

DESIGN_JOB_EXIT_CODE=0
wait $DESIGN_JOB_PID || DESIGN_JOB_EXIT_CODE=$?
FEM_JOB_EXIT_CODE=0
wait $FEM_JOB_PID || FEM_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
echo "Design tess exit code: $DESIGN_JOB_EXIT_CODE"
echo "FEM tess exit code: $FEM_JOB_EXIT_CODE"
MPI tasks assigned to distinct cores of a single node—helpful if queueing a batch of queries so that the second query can begin as soon as the first Tesseract has completed the first query.
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=128
#SBATCH --time=00:30:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Get compute node info
NODE_HOSTNAME=$(hostname)

# Ensure early exit on first error
set -e

# Activate conda environments and serve as background processes
srun --ntasks=1 bash -c "conda activate design-tess-env && TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000" & DESIGN_JOB_PID=$!
srun --ntasks=1 bash -c "conda activate jax-fem-env && TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001" & FEM_JOB_PID=$!

echo "=== Serving Tesseracts on ${NODE_HOSTNAME} ==="
echo "Serving Design Tesseract at http://${NODE_HOSTNAME}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${NODE_HOSTNAME}:8001, PID: $FEM_JOB_PID"

DESIGN_JOB_EXIT_CODE=0
wait $DESIGN_JOB_PID || DESIGN_JOB_EXIT_CODE=$?
FEM_JOB_EXIT_CODE=0
wait $FEM_JOB_PID || FEM_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
echo "Design tess exit code: $DESIGN_JOB_EXIT_CODE"
echo "FEM tess exit code: $FEM_JOB_EXIT_CODE"
Serving each Tesseract on its own node—could be useful for supporting heterogeneous CPU/GPU pipelines (although this should also be possible on a single GPU node).
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=00:30:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Ensure early exit on first error
set -e

# Get node list
NODES=($(scontrol show hostname $SLURM_JOB_NODELIST))
if [ ${#NODES[@]} -ne 2 ]; then
    echo "ERROR: Expected 2 nodes, got ${#NODES[@]}"
    exit 1
fi

DESIGN_NODE=${NODES[0]}
FEM_NODE=${NODES[1]}

# Activate conda environments and serve as background processes
srun --nodes=1 --ntasks=1 --nodelist=$DESIGN_NODE bash -c "conda activate design-tess-env && TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000" & DESIGN_JOB_PID=$!
srun --nodes=1 --ntasks=1 --nodelist=$FEM_NODE bash -c "conda activate jax-fem-env && TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001" & FEM_JOB_PID=$!

echo "=== Serving Tesseracts on separate nodes ==="
echo "Serving Design Tesseract at http://${DESIGN_NODE}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${FEM_NODE}:8001, PID: $FEM_JOB_PID"

DESIGN_JOB_EXIT_CODE=0
wait $DESIGN_JOB_PID || DESIGN_JOB_EXIT_CODE=$?
FEM_JOB_EXIT_CODE=0
wait $FEM_JOB_PID || FEM_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
echo "Design tess exit code: $DESIGN_JOB_EXIT_CODE"
echo "FEM tess exit code: $FEM_JOB_EXIT_CODE"

Upon running any of the above scripts with sbatch method1-script.sh, the uncontainerised design and FEM Tesseracts will be served at the URLs printed in the output file (which defaults to slurm-%j.out, where %j denotes the job number). They can be queried directly from a login/interactive node through curl on the command line, or accessed interactively through the tesseract-core Python API using Tesseract.from_url(f"http://{node_id}:{port}"). For example (assuming a single node ID nid001234), we can query the status of the processes and then run the optimisation script, which passes its command-line arguments directly to Tesseract.from_url:

$ # Ensure tsrctjax-3.13 environment is re-activated
$ conda activate tsrctjax-3.13
(tsrctjax-3.13) $  curl http://nid001234:8000/health
{"status":"ok"}(tsrctjax-3.13) $ curl http://nid001234:8001/health
{"status":"ok"}(tsrctjax-3.13) $ python fem-demo/opt.py http://nid001234:8000 http://nid001234:8001
WARNING:2025-07-16 07:47:48,636:jax._src.xla_bridge:794: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
opt.py script completed
  message: STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT
  success: False
   status: 1
      fun: 2016.088134765625
        x: [-1.400e+01 -1.396e+01 ... -4.314e+00 -1.880e+00]
      nit: 2
      jac: [ 2.082e+02 -1.539e+00 ...  5.909e+01  5.163e+01]
     nfev: 3
     njev: 3
 hess_inv: <24x24 LbfgsInvHessProduct with dtype=float64>
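
The /health round trip above can also be exercised without a cluster. A minimal sketch, with a stub HTTP server standing in for a served Tesseract (the stub handler and the use of port 0 to pick a free port are our own illustration, not part of tesseract-core):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Mimic the {"status": "ok"} payload returned by a served Tesseract.
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/health"
with urllib.request.urlopen(url, timeout=5) as resp:
    status = json.loads(resp.read())["status"]
print(status)  # prints: ok
server.shutdown()
```

The same urllib (or requests) call works unchanged against a real served Tesseract at http://nid001234:8000/health.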

The endpoints of the served Tesseracts can even be accessed from your local machine by setting up SSH tunnels :shovel:

$ ssh -f -N -L 8000:nid001234:8000 username@cluster.extension
$ curl http://localhost:8000/health
{"status":"ok"}$ ssh -f -N -L 8001:nid001234:8001 username@cluster.extension
$ curl http://localhost:8001/health
{"status":"ok"}$

This way you can use your local workflow setup (e.g. Jupyter notebook with your favourite VSCode plugins and colour schemes) to run and analyse pipelines of Tesseracts served on HPC clusters! :rocket:

Method 2: Serving Tesseracts and running pipeline optimisation script with SLURM

We can also run the optimisation script as part of the batch job. However, this is a bit more involved, as we need to ensure the servers have started before running the script, and then exit cleanly when the script completes. While there may be more sophisticated ways to achieve this, in this example we achieve the former by repeatedly searching the output file for confirmation that the uvicorn process is running on the expected port, and the latter by removing the wait commands for the Tesseracts and waiting only for the script to complete. Again, we provide three example configurations. The final results will be written to ${RUNDIR}/results-${SLURM_JOB_ID}.txt.

:warning: WARNING: The scripts below make the explicit assumption that the default output file name is used; they will need to be edited if this is overridden.
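
The grep-and-sleep wait loop used in the scripts below can be factored into a reusable helper. A sketch (the wait_for_pattern function and the demo log line are our own illustrations, not part of tesseract-runtime or SLURM):

```shell
#!/bin/bash
# wait_for_pattern FILE PATTERN TIMEOUT_SECONDS
# Poll FILE once per second until PATTERN appears (return 0)
# or the timeout elapses (return 1).
wait_for_pattern() {
    local file=$1 pattern=$2 timeout=$3 count=0
    while ! grep -q "$pattern" "$file" 2>/dev/null; do
        if [ "$count" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        count=$((count + 1))
    done
    return 0
}

# Demo: a background process writes the expected line after a short
# delay, standing in for a Tesseract server coming up.
logfile=$(mktemp)
( sleep 0.5; echo "Uvicorn running on http://0.0.0.0:8000" >> "$logfile" ) &
if wait_for_pattern "$logfile" "0.0.0.0:8000" 10; then
    echo "server ready"
else
    echo "server did not start in time" >&2
    exit 1
fi
```

In the batch scripts below, the equivalent call would be wait_for_pattern slurm-${SLURM_JOB_ID}.out 0.0.0.0:8000 120, once per served Tesseract.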

Background processes on single node
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Get compute node info
NODE_HOSTNAME=$(hostname)

# Ensure early exit on first error
set -e

# Activate conda environments and serve as background processes
conda activate design-tess-env
TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000 & DESIGN_JOB_PID=$!
conda activate jax-fem-env
TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001 & FEM_JOB_PID=$!

# Wait for server with timeout
timeout=120
count=0
while { ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out || ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; } && [ $count -lt $timeout ]; do
    sleep 1
    count=$((count + 1))
done

if ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: Design Tesseract did not start in time"
    exit 1
fi

if ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: FEM Tesseract did not start in time"
    exit 1
fi

echo "=== Tesseracts served successfully on node ${NODE_HOSTNAME} ==="
echo "Serving Design Tesseract at http://${NODE_HOSTNAME}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${NODE_HOSTNAME}:8001, PID: $FEM_JOB_PID"

echo "=== Starting optimisation script ==="
conda activate tsrctjax-3.13
CLIENT_JOB_EXIT_CODE=0
python ~/fem-demo/opt.py http://${NODE_HOSTNAME}:8000 http://${NODE_HOSTNAME}:8001 > ${RUNDIR}/results-${SLURM_JOB_ID}.txt || CLIENT_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
if [ $CLIENT_JOB_EXIT_CODE -eq 0 ]; then
    echo "Script completed successfully, results written to ${RUNDIR}/results-${SLURM_JOB_ID}.txt."
    exit 0
else
    echo "FAILURE: Optimisation script exited with code $CLIENT_JOB_EXIT_CODE."
    exit 1
fi
MPI tasks assigned to distinct cores of a single node
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --time=00:10:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Get compute node info
NODE_HOSTNAME=$(hostname)

# Ensure early exit on first error
set -e

# Activate conda environments and serve as background processes
srun --ntasks=1 --cpus-per-task=120 bash -c "conda activate design-tess-env && TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000" & DESIGN_JOB_PID=$!
srun --ntasks=1 --cpus-per-task=120 bash -c "conda activate jax-fem-env && TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001" & FEM_JOB_PID=$!

# Wait for server with timeout
timeout=120
count=0
while { ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out || ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; } && [ $count -lt $timeout ]; do
    sleep 1
    count=$((count + 1))
done

if ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: Design Tesseract did not start in time"
    exit 1
fi

if ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: FEM Tesseract did not start in time"
    exit 1
fi

echo "=== Tesseracts served successfully on node ${NODE_HOSTNAME} ==="
echo "Serving Design Tesseract at http://${NODE_HOSTNAME}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${NODE_HOSTNAME}:8001, PID: $FEM_JOB_PID"

echo "=== Starting optimisation script ==="
CLIENT_JOB_EXIT_CODE=0
srun --ntasks=1 --cpus-per-task=8 bash -c "python ~/fem-demo/opt.py http://${NODE_HOSTNAME}:8000 http://${NODE_HOSTNAME}:8001 > ${RUNDIR}/results-${SLURM_JOB_ID}.txt" || CLIENT_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
if [ $CLIENT_JOB_EXIT_CODE -eq 0 ]; then
    echo "Script completed successfully, results written to ${RUNDIR}/results-${SLURM_JOB_ID}.txt."
    exit 0
else
    echo "FAILURE: Optimisation script exited with code $CLIENT_JOB_EXIT_CODE."
    exit 1
fi
Multi-node
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --time=00:10:00
#SBATCH --licenses=scratch
#SBATCH --constraint=cpu
#SBATCH --account=your-account

# Ensure early exit on first error
set -e

# Get node list
NODES=($(scontrol show hostname $SLURM_JOB_NODELIST))
if [ ${#NODES[@]} -ne 3 ]; then
    echo "ERROR: Expected 3 nodes, got ${#NODES[@]}"
    exit 1
fi

DESIGN_NODE=${NODES[0]}
FEM_NODE=${NODES[1]}
CLIENT_NODE=${NODES[2]}

# Activate conda environments and serve as background processes
srun --nodes=1 --ntasks=1 --nodelist=$DESIGN_NODE bash -c "conda activate design-tess-env && TESSERACT_API_PATH=${DESIGN_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8000" & DESIGN_JOB_PID=$!
srun --nodes=1 --ntasks=1 --nodelist=$FEM_NODE bash -c "conda activate jax-fem-env && TESSERACT_API_PATH=${FEM_TESS_PATH}/tesseract_api.py tesseract-runtime serve --host 0.0.0.0 --port 8001" & FEM_JOB_PID=$!

# Wait for server with timeout
timeout=120
count=0
while { ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out || ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; } && [ $count -lt $timeout ]; do
    sleep 1
    count=$((count + 1))
done

if ! grep -q 0.0.0.0:8000 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: Design Tesseract did not start in time"
    exit 1
fi

if ! grep -q 0.0.0.0:8001 slurm-${SLURM_JOB_ID}.out; then
    echo "ERROR: FEM Tesseract did not start in time"
    exit 1
fi

echo "=== Tesseracts served successfully on separate nodes ==="
echo "Serving Design Tesseract at http://${DESIGN_NODE}:8000, PID: $DESIGN_JOB_PID"
echo "Serving FEM Tesseract at http://${FEM_NODE}:8001, PID: $FEM_JOB_PID"

echo "=== Starting optimisation script ==="
CLIENT_JOB_EXIT_CODE=0
srun --nodes=1 --ntasks=1 --nodelist=$CLIENT_NODE bash -c "python ~/fem-demo/opt.py http://${DESIGN_NODE}:8000 http://${FEM_NODE}:8001 > ${RUNDIR}/results-${SLURM_JOB_ID}.txt" || CLIENT_JOB_EXIT_CODE=$?

echo "=== Exit status ==="
if [ $CLIENT_JOB_EXIT_CODE -eq 0 ]; then
    echo "Script completed successfully, results written to ${RUNDIR}/results-${SLURM_JOB_ID}.txt."
    exit 0
else
    echo "FAILURE: Optimisation script exited with code $CLIENT_JOB_EXIT_CODE."
    exit 1
fi