Hosting LLMs with vLLM
This guide explains how to run a vLLM inference server on Sol and connect to it from your local machine. vLLM serves large language models as an OpenAI-compatible API endpoint on our A100 and Gaudi2 accelerators, so you can integrate them into your code. If you only need an API key (for popular open source models), please follow this guide to request a free key and avoid spending your fairshare on Sol.
Use our shared mamba environments (see below). Do not use uv, conda, or pip.
Quick reference: which accelerator do I need?
All recommendations assume bf16 precision (no quantization). Note that A100s on Sol are always in high demand, and using them will lead to a large deduction from your fairshare. A100 MIG instances do not support the peer-to-peer GPU communication needed for tensor parallelism and are too small for practical LLM serving.
NVIDIA A100
| Model | Parameters | GPU requirement | Max context | Notes |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | 1x A100 40G | 32,768 | Good starting point |
| Llama-3.1-8B-Instruct | 8B | 1x A100 40G | 8,192 | 128k native, but limited by 40G VRAM |
| Mistral-7B-Instruct-v0.3 | 7B | 1x A100 40G | 32,768 | Fast, good for general tasks |
| Phi-4 | 14B | 1x A100 80G | 16,384 | Needs more than 40G GPU memory |
| Qwen2.5-32B-Instruct | 32B | 2x A100 40G or 1x A100 80G | 32,768 | tp=2 on 40G cards, tp=1 on 80G |
| Llama-3.1-70B-Instruct | 70B | 4x A100 80G | 8,192 | ~140 GB weights, tight fit |
| Qwen2.5-72B-Instruct | 72B | 4x A100 80G | 32,768 | Similar to Llama-3.1-70B |
| Qwen3.5-27B | 27B | 2x A100 80G | 131,072 | Requires advanced JIT setup (see below) |
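The GPU requirements in the table follow from a back-of-envelope rule: at bf16 each parameter takes 2 bytes, plus headroom for the KV cache and activations. A minimal sketch (the 1.2x headroom factor is an illustrative assumption, not a measured value):

```python
import math

def bf16_weight_gb(params_billion: float) -> float:
    """Approximate bf16 weight size: 2 bytes per parameter."""
    return params_billion * 2.0

def gpus_needed(params_billion: float, gpu_gb: int, headroom: float = 1.2) -> int:
    """Rough GPU count to hold weights plus headroom for KV cache
    and activations (the headroom factor is a guess, not measured)."""
    return math.ceil(bf16_weight_gb(params_billion) * headroom / gpu_gb)

print(bf16_weight_gb(70))   # 140.0, the ~140 GB noted for Llama-3.1-70B
print(gpus_needed(70, 80))  # 3
```

In practice, round up to a tensor-parallel size that evenly divides the model's attention heads (usually a power of two), which is why the table lists 4x A100 80G for the 70B-class models.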
Intel Gaudi2 (96 GB HBM2e per card)
The models listed below are officially validated by Intel. Other models (Qwen2.5, Phi-4, etc.) may work but are untested on Gaudi2.
| Model | Parameters | HPU requirement | Max context | Notes |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 8B | 1x Gaudi2 | 8,192 | Validated, good starting point |
| Mistral-7B-Instruct-v0.3 | 7B | 1x Gaudi2 | 8,192 | Validated on single HPU |
| Llama-3.1-70B-Instruct | 70B | 8x Gaudi2 | 8,192 | Validated with tp=8 |
| Mixtral-8x7B-Instruct-v0.1 | 47B (MoE) | 2x Gaudi2 | 8,192 | Validated, sparse MoE |
A100 Examples
Each example below starts a vLLM server, writes connection info to a file, and keeps running until the job ends or is cancelled.
Small model on 1x A100 40G
#!/bin/bash
#SBATCH --job-name=vllm-qwen7b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:1
#SBATCH -c 16
#SBATCH --mem=64G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
# start vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port $PORT
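The script above writes hostname:port to ~/vllm_endpoint.txt. A small helper to parse that file and probe the server from the login node (the probe is a sketch; vLLM's /v1/models endpoint answers once Uvicorn is up):

```python
import os
import urllib.request

def parse_endpoint(line: str) -> tuple:
    """Split a 'hostname:port' line as written to ~/vllm_endpoint.txt."""
    host, _, port = line.strip().rpartition(":")
    return host, int(port)

def server_ready(host: str, port: int, timeout: float = 5.0) -> bool:
    """True once vLLM answers /v1/models, i.e. the server has started."""
    try:
        urllib.request.urlopen(f"http://{host}:{port}/v1/models", timeout=timeout)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    with open(os.path.expanduser("~/vllm_endpoint.txt")) as f:
        host, port = parse_endpoint(f.read())
    print(server_ready(host, port))
```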
Large model on 2x A100 80G (tensor parallelism)
#!/bin/bash
#SBATCH --job-name=vllm-qwen32b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-32B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port $PORT
Advanced: Qwen3.5-27B on A100 (GDN architecture)
Qwen3.5-27B uses a newer attention mechanism called GDN (Gated Delta Networks) that is not pre-compiled in the vLLM package. Instead, a library called flashinfer compiles GPU kernels at runtime using nvcc and g++. This requires additional setup compared to the standard models listed above.
#!/bin/bash
#SBATCH --job-name=vllm-qwen3.5-27b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
module load gcc-13.2.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export CUDA_HOME=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.9.0-iypyscneizzfss2s6w6ul3c4wefggvfg
export CXX=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/g++
export CC=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/gcc
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--enforce-eager \
--reasoning-parser qwen3 \
--language-model-only \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.95 \
--port $PORT
CUDA 12.9 must be used instead of CUDA 13.0 because CUDA 13.0's stricter C++17 dialect requirements break flashinfer's JIT compilation.
First request will be slow. The GDN kernel takes 1–3 minutes to compile on the first inference request. Subsequent requests will be fast: the compiled kernels are cached in ~/.cache/flashinfer/.
Thinking mode is enabled by default for Qwen3.5-27B with --reasoning-parser qwen3. The model's chain-of-thought reasoning is returned in the reasoning_content field of the API response. To disable thinking for simpler tasks:
response = client.chat.completions.create(
model="Qwen/Qwen3.5-27B",
messages=[{"role": "user", "content": "Say hello"}],
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
Gaudi2 Examples
Intel Gaudi2 accelerators ("HPUs") have 96 GB of HBM2e memory per card and are a good alternative to A100s for supported models. In our benchmarks, the HPUs outperform A100s in many use cases.
Llama-3.1-8B on 1x Gaudi2
#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log
module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi
export HF_HOME="/scratch/$USER"
# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--block-size 128 \
--gpu-memory-utilization 0.90 \
--port $PORT
Llama-3.1-70B on 8x Gaudi2
#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 8
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log
module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi
export HF_HOME="/scratch/$USER"
# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--block-size 128 \
--gpu-memory-utilization 0.90 \
--port $PORT
Submitting and connecting (both A100 and Gaudi2)
The workflow for submitting jobs and connecting to the vLLM server is the same regardless of which accelerator you use.
Submit and monitor
sbatch your_job_script.sh
Watch the Slurm log for startup progress (slurm-<jobid>.out by default for the A100 scripts, vllm-gaudi-<jobid>.log for the Gaudi2 scripts):
tail -f slurm-<jobid>.out
The server is ready when you see:
INFO: Uvicorn running on http://0.0.0.0:XXXX
Check the connection info:
cat ~/vllm_endpoint.txt
# Example output: gpu-node-042:8347
Connect from your local machine
Use SSH port forwarding to create a tunnel from your laptop through the login node to the compute node running vLLM.
To open the tunnel, run the following on your local machine:
ssh -N -L 8000:<compute-node>:<port> <your-username>@<login-node>
Replace <compute-node> and <port> with the values from ~/vllm_endpoint.txt. For example:
ssh -N -L 8000:gpu-node-042:8347 jsmith@sol-login01.rc.asu.edu
Leave this terminal open. It maps localhost:8000 on your laptop to the vLLM server.
Verify the connection in a new terminal on your local machine:
curl http://localhost:8000/v1/models
You should see a JSON response listing the model name.
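Since the endpoint is OpenAI-compatible, the /v1/models payload follows the standard list format ({"object": "list", "data": [{"id": ...}, ...]}), so you can also check it programmatically; a standard-library sketch:

```python
import json
import urllib.request

def extract_model_ids(payload: dict) -> list:
    """Pull model ids out of an OpenAI-style /v1/models payload."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=10) as resp:
        print(extract_model_ids(json.load(resp)))
```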
Send a chat request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Explain what a GPU is in two sentences."}
]
}'
Replace the model name with whichever model you are serving.
Use a chat UI (optional) — any OpenAI-compatible frontend works. For example, Open WebUI via Docker on your local machine:
docker run -d -p 3000:8080 \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=unused \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser.
Using vLLM via Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="unused", # vLLM doesn't require a real API key
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "What is vLLM?"}
],
)
print(response.choices[0].message.content)
When you're done
scancel <jobid>
Close the SSH tunnel on your local machine with Ctrl+C.
Troubleshooting
- "Connection refused" when curling the server
Check that your SSH tunnel is still open, the hostname and port match ~/vllm_endpoint.txt, and the vLLM process has finished starting (check the Slurm log for the Uvicorn message).
- Model download is slow or fails
Model weights are downloaded from Hugging Face on first run to $HF_HOME. Make sure this points to your scratch directory, as models can be 15–140 GB. If a download is interrupted, delete the partial cache directory under $HF_HOME/hub/ and resubmit.
- Some models require a Hugging Face token
Llama and some other models are gated behind a license agreement. Go to huggingface.co to create an access token, then add this to your job script before the vLLM command:
export HF_TOKEN="hf_your_token_here"
- A100-specific: OOM (Out of Memory) during startup
Add --enforce-eager to skip CUDA graph capture. Reduce --max-model-len or add --max-num-seqs 32. Make sure PYTORCH_ALLOC_CONF=expandable_segments:True is set.
Gaudi2-specific
- OOM
Increase --gpu-memory-utilization (default 0.9). If that's not enough, increase --tensor-parallel-size to spread weights across more cards. You can also disable HPU Graphs to free memory used by graph capture, though this reduces performance.
- Slow first requests
Gaudi2 compiles execution graphs during warmup. The first several requests may be slow as the bucketing mechanism compiles graphs for different tensor shapes. This is normal — subsequent requests at similar lengths will be fast.
- hl-smi not found or shows no devices
The Habana driver module is not loaded. Add module load habanalabs/latest to your job script. If the module is loaded but no devices appear, the node may not have Gaudi2 hardware — check your --partition and --gres settings.
- Model not supported
If vLLM crashes or produces garbled output, the model may not be compatible with the Gaudi2 backend. Check the validated models table above and fall back to an A100 job.