Hosting LLMs with vLLM

This guide explains how to run a vLLM inference server on Sol and connect to it from your local machine. vLLM serves large language models as an OpenAI-compatible API endpoint on our A100 and Gaudi2 accelerators, so you can integrate them into your code. If you only need an API key (for popular open source models), please follow this guide to request a free key and avoid spending your fairshare on Sol.

caution

Use our shared mamba environments (see below). Do not use uv, conda, or pip.


Quick reference: which accelerator do I need?

All recommendations assume bf16 precision (no quantization). Note that A100s on Sol are always in high demand, and jobs that use them incur a large fairshare deduction. A100 MIG instances do not support the peer-to-peer GPU communication needed for tensor parallelism and are too small for practical LLM serving.

NVIDIA A100

| Model | Parameters | GPU requirement | Max context | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 7B | 1x A100 40G | 32,768 | Good starting point |
| Llama-3.1-8B-Instruct | 8B | 1x A100 40G | 8,192 | 128k native, but limited by 40G VRAM |
| Mistral-7B-Instruct-v0.3 | 7B | 1x A100 40G | 32,768 | Fast, good for general tasks |
| Phi-4 | 14B | 1x A100 80G | 16,384 | Needs more than 40G GPU memory |
| Qwen2.5-32B-Instruct | 32B | 2x A100 40G or 1x A100 80G | 32,768 | tp=2 on 40G cards, tp=1 on 80G |
| Llama-3.1-70B-Instruct | 70B | 4x A100 80G | 8,192 | ~140 GiB weights, tight fit |
| Qwen2.5-72B-Instruct | 72B | 4x A100 80G | 32,768 | Similar to Llama-3.1-70B |
| Qwen3.5-27B | 27B | 2x A100 80G | 131,072 | Requires advanced JIT setup (see below) |
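The sizing in the table follows from a simple rule of thumb: bf16 stores 2 bytes per parameter, and you need extra headroom on top for KV cache and activations. A quick back-of-the-envelope check (illustrative only):

```python
def bf16_weight_gb(num_params: float) -> float:
    """Approximate weight memory in GB at bf16: 2 bytes per parameter."""
    return num_params * 2 / 1e9

# 7B model: ~14 GB of weights, leaving room for KV cache on a 40G card.
print(bf16_weight_gb(7e9))    # 14.0
# 70B model: ~140 GB (about 130 GiB) of weights, hence 4x A100 80G.
print(bf16_weight_gb(70e9))   # 140.0
```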

Intel Gaudi2 (96 GB HBM2e per card)

The models listed below are officially validated by Intel. Other models (Qwen2.5, Phi-4, etc.) may work but are untested on Gaudi2.

| Model | Parameters | HPU requirement | Max context | Notes |
| --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 8B | 1x Gaudi2 | 8,192 | Validated, good starting point |
| Mistral-7B-Instruct-v0.3 | 7B | 1x Gaudi2 | 8,192 | Validated on single HPU |
| Llama-3.1-70B-Instruct | 70B | 8x Gaudi2 | 8,192 | Validated with tp=8 |
| Mixtral-8x7B-Instruct-v0.1 | 47B (MoE) | 2x Gaudi2 | 8,192 | Validated, sparse MoE |

A100 Examples

Each example below starts a vLLM server, writes connection info to a file, and keeps running until the job ends or is cancelled.

Small model on 1x A100 40G

#!/bin/bash
#SBATCH --job-name=vllm-qwen7b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:1
#SBATCH -c 16
#SBATCH --mem=64G
#SBATCH -t 04:00:00
#SBATCH --export=NONE

module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm

export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"

# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"

# start vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --port $PORT

Large model on 2x A100 80G (tensor parallelism)

#!/bin/bash
#SBATCH --job-name=vllm-qwen32b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE

module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm

export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"

# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --port $PORT

Advanced: Qwen3.5-27B on A100 (GDN architecture)

Qwen3.5-27B uses a newer attention mechanism called GDN (Gated Delta Networks) that is not pre-compiled in the vLLM package. Instead, a library called flashinfer compiles GPU kernels at runtime using nvcc and g++. This requires additional setup compared to the standard models listed above.
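Because flashinfer shells out to nvcc and g++ during the first request, it can help to fail fast at job start if the compilers are not visible. A minimal sketch (a hypothetical preflight check, not part of vLLM or flashinfer):

```python
import shutil

def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]

# flashinfer's JIT needs both a CUDA compiler and a host C++ compiler.
missing = missing_tools(["nvcc", "g++"])
if missing:
    print(f"warning: JIT prerequisites not on PATH: {missing}")
```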

#!/bin/bash
#SBATCH --job-name=vllm-qwen35-27b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE

module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
module load gcc-13.2.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm

export CUDA_HOME=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.9.0-iypyscneizzfss2s6w6ul3c4wefggvfg
export CXX=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/g++
export CC=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/gcc
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"

# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3.5-27B \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --enforce-eager \
    --reasoning-parser qwen3 \
    --language-model-only \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.95 \
    --port $PORT
note

CUDA 12.9 must be used instead of CUDA 13.0 because CUDA 13.0's stricter C++17 dialect requirements break flashinfer's JIT compilation.

note

The first request will be slow: the GDN kernel takes 1–3 minutes to compile on the first inference request. Subsequent requests will be fast, since compiled kernels are cached in ~/.cache/flashinfer/.

Qwen3.5-27B Thinking mode is enabled by default with --reasoning-parser qwen3. The model's chain-of-thought reasoning is returned in the reasoning_content field of the API response. To disable thinking for simpler tasks:

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Say hello"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

Gaudi2 Examples

Intel Gaudi2 accelerators ("HPUs") have 96 GB of HBM2e memory per card and are a good alternative to A100s for supported models. In our benchmarks, HPUs outperform A100s in many use cases.

Llama-3.1-8B on 1x Gaudi2

#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log

module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi

export HF_HOME="/scratch/$USER"

# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 8192 \
    --block-size 128 \
    --gpu-memory-utilization 0.90 \
    --port $PORT

Llama-3.1-70B on 8x Gaudi2

#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 8
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log

module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi

export HF_HOME="/scratch/$USER"

# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --block-size 128 \
    --gpu-memory-utilization 0.90 \
    --port $PORT

Submitting and connecting (both A100 and Gaudi2)

The workflow for submitting jobs and connecting to the vLLM server is the same regardless of which accelerator you use.

Submit and monitor

sbatch your_job_script.sh

Watch the log for startup progress:

tail -f vllm-<jobid>.log

The server is ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:XXXX

Check the connection info:

cat ~/vllm_endpoint.txt
# Example output: gpu-node-042:8347

Connect from your local machine

Use SSH port forwarding to create a tunnel from your laptop through the login node to the compute node running vLLM.

To open the tunnel, run the following on your local machine:

ssh -N -L 8000:<compute-node>:<port> <your-username>@<login-node>

Replace <compute-node> and <port> with the values from ~/vllm_endpoint.txt. For example:

ssh -N -L 8000:gpu-node-042:8347 jsmith@sol-login01.rc.asu.edu

Leave this terminal open. It maps localhost:8000 on your laptop to the vLLM server.
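If you connect often, the two values in ~/vllm_endpoint.txt can be turned into the ssh command programmatically. A small sketch (tunnel_command is a hypothetical helper, not part of vLLM or Slurm):

```python
def tunnel_command(endpoint: str, username: str, login_node: str,
                   local_port: int = 8000) -> str:
    """Build the ssh port-forwarding command from a 'host:port' endpoint."""
    host, port = endpoint.strip().rsplit(":", 1)
    return f"ssh -N -L {local_port}:{host}:{port} {username}@{login_node}"

# e.g. with the example contents of ~/vllm_endpoint.txt:
print(tunnel_command("gpu-node-042:8347", "jsmith", "sol-login01.rc.asu.edu"))
# ssh -N -L 8000:gpu-node-042:8347 jsmith@sol-login01.rc.asu.edu
```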

Verify the connection in a new terminal on your local machine:

curl http://localhost:8000/v1/models

You should see a JSON response listing the model name.
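Since the server can spend several minutes loading weights before Uvicorn starts, scripts may want to poll the forwarded port instead of retrying curl by hand. A standard-library sketch (wait_for_port is a hypothetical helper):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 300.0) -> bool:
    """Poll until a TCP connection to host:port succeeds; False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(2)
    return False

# e.g. call wait_for_port("localhost", 8000) before sending requests
```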

Send a chat request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain what a GPU is in two sentences."}
    ]
  }'

Replace the model name with whichever model you are serving.

Use a chat UI (optional) — any OpenAI-compatible frontend works. For example, Open WebUI via Docker on your local machine:

docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=unused \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser.


Using vLLM via Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",  # vLLM doesn't require a real API key
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What is vLLM?"}
    ],
)
print(response.choices[0].message.content)

When you're done

scancel <jobid>

Close the SSH tunnel on your local machine with Ctrl+C.


Troubleshooting

  1. "Connection refused" when curling the server

Check that your SSH tunnel is still open, the hostname and port match ~/vllm_endpoint.txt, and the vLLM process has finished starting (check the Slurm log for the Uvicorn message).

  2. Model download is slow or fails

Model weights are downloaded from Hugging Face on first run to $HF_HOME. Make sure this points to your scratch directory, as models can be 15–140 GB. If a download is interrupted, delete the partial cache directory under $HF_HOME/hub/ and resubmit.

  3. Some models require a Hugging Face token

Llama and some other models are gated behind a license agreement. Go to huggingface.co to create an access token, then add this to your job script before the vLLM command:

export HF_TOKEN="hf_your_token_here"

  4. A100-specific: OOM (Out of Memory) during startup

Add --enforce-eager to skip CUDA graph capture. Reduce --max-model-len or add --max-num-seqs 32. Make sure PYTORCH_ALLOC_CONF=expandable_segments:True is set.
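To see why these flags help, you can estimate KV-cache pressure by hand: each token stores K and V per layer per KV head, at 2 bytes each in bf16. The numbers below assume Qwen2.5-7B's configuration (28 layers, 4 KV heads, head dim 128); double-check against the model's config.json:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: 2 tensors (K, V) x layers x heads x dim."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(28, 4, 128)
print(per_token)  # 57344 bytes, i.e. 56 KiB per token

# One sequence at the full 32,768-token context:
print(per_token * 32768 / 2**30)  # 1.75 GiB
# Capping concurrency with --max-num-seqs therefore caps worst-case
# KV-cache demand; shrinking --max-model-len shrinks the per-sequence bound.
```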

Gaudi2-specific

  1. OOM

Lower --gpu-memory-utilization from the default 0.9 to leave more free HBM. If that's not enough, increase --tensor-parallel-size to spread weights across more cards. You can also disable HPU Graphs to free memory used by graph capture, though this reduces performance.

  2. Slow first requests

Gaudi2 compiles execution graphs during warmup. The first several requests may be slow as the bucketing mechanism compiles graphs for different tensor shapes. This is normal — subsequent requests at similar lengths will be fast.

  3. hl-smi not found or shows no devices

The Habana driver module is not loaded. Add module load habanalabs/latest to your job script. If the module is loaded but no devices appear, the node may not have Gaudi2 hardware — check your --partition and --gres settings.

  4. Model not supported

If vLLM crashes or produces garbled output, the model may not be compatible with the Gaudi2 backend. Check the validated models table above and fall back to an A100 job.