Hosting LLMs with vLLM
This guide explains how to run a vLLM inference server on Sol and connect to it from your local machine. vLLM serves large language models as an OpenAI-compatible API endpoint on our A100 and Gaudi2 accelerators, so you can integrate them into your code. If you only need an API key (for popular open source models), please follow this guide to request a free key and avoid spending your fairshare on Sol.
Use our shared mamba environments (see below). Do not use uv, conda, or pip.
Quick reference: which accelerator do I need?
All recommendations assume bf16 precision (no quantization). Note that A100s on Sol are always in high demand, and using them will lead to a large deduction from your fairshare. A100 MIG instances do not support the peer-to-peer GPU communication needed for tensor parallelism and are too small for practical LLM serving.
NVIDIA A100
| Model | Parameters | GPU requirement | Max context | Notes |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | 1x A100 40G | 32,768 | Good starting point |
| Llama-3.1-8B-Instruct | 8B | 1x A100 40G | 8,192 | 128k native, but limited by 40G VRAM |
| Mistral-7B-Instruct-v0.3 | 7B | 1x A100 40G | 32,768 | Fast, good for general tasks |
| Phi-4 | 14B | 1x A100 80G | 16,384 | Needs more than 40G GPU memory |
| Qwen2.5-32B-Instruct | 32B | 2x A100 40G or 1x A100 80G | 32,768 | tp=2 on 40G cards, tp=1 on 80G |
| Llama-3.1-70B-Instruct | 70B | 4x A100 80G | 8,192 | ~140 GB weights, tight fit |
| Qwen2.5-72B-Instruct | 72B | 4x A100 80G | 32,768 | Similar to Llama-3.1-70B |
| Qwen3.5-27B | 27B | 2x A100 80G | 131,072 | Requires advanced JIT setup (see below) |
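The GPU requirements in the table follow from a back-of-envelope rule: at bf16 each parameter takes 2 bytes, plus headroom for the KV cache and activations. A minimal sketch (the 1.2x headroom factor is an illustrative assumption, not a measured value):

```python
import math

def bf16_weight_gb(params_billion: float) -> float:
    """Approximate bf16 weight size: 2 bytes per parameter."""
    return params_billion * 2.0

def gpus_needed(params_billion: float, gpu_gb: int, headroom: float = 1.2) -> int:
    """Rough GPU count to hold weights plus headroom for KV cache
    and activations (the headroom factor is a guess, not measured)."""
    return math.ceil(bf16_weight_gb(params_billion) * headroom / gpu_gb)

print(bf16_weight_gb(70))   # 140.0, the ~140 GB noted for Llama-3.1-70B
print(gpus_needed(70, 80))  # 3
```

In practice, round up to a tensor-parallel size that evenly divides the model's attention heads (usually a power of two), which is why the table lists 4x A100 80G for the 70B-class models.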
Intel Gaudi2 (96 GB HBM2e per card)
The models listed below are officially validated by Intel. Other models (Qwen2.5, Phi-4, etc.) may work but are untested on Gaudi2.
| Model | Parameters | HPU requirement | Max context | Notes |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 8B | 1x Gaudi2 | 8,192 | Validated, good starting point |
| Mistral-7B-Instruct-v0.3 | 7B | 1x Gaudi2 | 8,192 | Validated on single HPU |
| Llama-3.1-70B-Instruct | 70B | 8x Gaudi2 | 8,192 | Validated with tp=8 |
| Mixtral-8x7B-Instruct-v0.1 | 47B (MoE) | 2x Gaudi2 | 8,192 | Validated, sparse MoE |
A100 Examples
Each example below starts a vLLM server, writes connection info to a file, and keeps running until the job ends or is cancelled.
Small model on 1x A100 40G
#!/bin/bash
#SBATCH --job-name=vllm-qwen7b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:1
#SBATCH -c 16
#SBATCH --mem=64G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
# start vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port $PORT
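The script above writes hostname:port to ~/vllm_endpoint.txt. A small helper to parse that file and probe the server from the login node (the probe is a sketch; vLLM's /v1/models endpoint answers once Uvicorn is up):

```python
import os
import urllib.request

def parse_endpoint(line: str) -> tuple:
    """Split a 'hostname:port' line as written to ~/vllm_endpoint.txt."""
    host, _, port = line.strip().rpartition(":")
    return host, int(port)

def server_ready(host: str, port: int, timeout: float = 5.0) -> bool:
    """True once vLLM answers /v1/models, i.e. the server has started."""
    try:
        urllib.request.urlopen(f"http://{host}:{port}/v1/models", timeout=timeout)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    with open(os.path.expanduser("~/vllm_endpoint.txt")) as f:
        host, port = parse_endpoint(f.read())
    print(server_ready(host, port))
```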
Large model on 2x A100 80G (tensor parallelism)
#!/bin/bash
#SBATCH --job-name=vllm-qwen32b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-32B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port $PORT
Advanced: Qwen3.5-27B on A100 (GDN architecture)
Qwen3.5-27B uses a newer attention mechanism called GDN (Gated Delta Networks) that is not pre-compiled in the vLLM package. Instead, a library called flashinfer compiles GPU kernels at runtime using nvcc and g++. This requires additional setup compared to the standard models listed above.
#!/bin/bash
#SBATCH --job-name=vllm-qwen3.5-27b
#SBATCH -p htc
#SBATCH -N 1
#SBATCH -G a100:2
#SBATCH -c 16
#SBATCH --mem=128G
#SBATCH -t 04:00:00
#SBATCH --export=NONE
module load mamba/latest
module load cuda-12.9.0-gcc-12.1.0
module load gcc-13.2.0-gcc-12.1.0
source activate /path/to/shared/envs/vllm
export CUDA_HOME=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/cuda-12.9.0-iypyscneizzfss2s6w6ul3c4wefggvfg
export CXX=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/g++
export CC=/packages/apps/spack/21/opt/spack/linux-rocky8-zen3/gcc-12.1.0/gcc-13.2.0-3axqolu2r5t7p3j5yphhzjfku4rbga2y/bin/gcc
export PYTORCH_ALLOC_CONF=expandable_segments:True
export HF_HOME="/scratch/$USER"
# pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--enforce-eager \
--reasoning-parser qwen3 \
--language-model-only \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.95 \
--port $PORT
CUDA 12.9 must be used instead of CUDA 13.0 because CUDA 13.0's stricter C++17 dialect requirements break flashinfer's JIT compilation.
First request will be slow. The GDN kernel takes 1–3 minutes to compile on the first inference request. Subsequent requests will be fast: the compiled kernels are cached in ~/.cache/flashinfer/.
Thinking mode is enabled by default for Qwen3.5-27B with --reasoning-parser qwen3. The model's chain-of-thought reasoning is returned in the reasoning_content field of the API response. To disable thinking for simpler tasks:
response = client.chat.completions.create(
model="Qwen/Qwen3.5-27B",
messages=[{"role": "user", "content": "Say hello"}],
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
Gaudi2 Examples
Intel Gaudi2 accelerators ("HPUs") have 96 GB of HBM2e memory per card and are a good alternative to A100s for supported models. In our benchmarks, the HPUs outperform A100s in many use cases.
Llama-3.1-8B on 1x Gaudi2
#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log
module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi
export HF_HOME="/scratch/$USER"
# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--block-size 128 \
--gpu-memory-utilization 0.90 \
--port $PORT
Llama-3.1-70B on 8x Gaudi2
#!/bin/bash
#SBATCH -p gaudi
#SBATCH -q public
#SBATCH -N 1
#SBATCH -G 8
#SBATCH -t 0-4
#SBATCH -c 18
#SBATCH --output=vllm-gaudi-%j.log
module load mamba/latest
module load habanalabs/latest
source activate /path/to/shared/envs/vllm-gaudi
export HF_HOME="/scratch/$USER"
# Pick a random port and save connection info
PORT=$(shuf -i 8000-9000 -n 1)
echo "$(hostname):${PORT}" > ~/vllm_endpoint.txt
echo "vLLM (Gaudi2) starting on $(hostname):${PORT} at $(date)"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--block-size 128 \
--gpu-memory-utilization 0.90 \
--port $PORT
Submitting and connecting (both A100 and Gaudi2)
The workflow for submitting jobs and connecting to the vLLM server is the same regardless of which accelerator you use.
Submit and monitor
sbatch your_job_script.sh
Watch the Slurm log for startup progress (slurm-<jobid>.out by default for the A100 scripts, vllm-gaudi-<jobid>.log for the Gaudi2 scripts):
tail -f slurm-<jobid>.out
The server is ready when you see:
INFO: Uvicorn running on http://0.0.0.0:XXXX
Check the connection info:
cat ~/vllm_endpoint.txt
# Example output: gpu-node-042:8347
Connect from your local machine
Use SSH port forwarding to create a tunnel from your laptop through the login node to the compute node running vLLM.
To open the tunnel, run the following on your local machine:
ssh -N -L 8000:<compute-node>:<port> <your-username>@<login-node>
Replace <compute-node> and <port> with the values from ~/vllm_endpoint.txt. For example:
ssh -N -L 8000:gpu-node-042:8347 jsmith@sol-login01.rc.asu.edu
Leave this terminal open. It maps localhost:8000 on your laptop to the vLLM server.
Verify the connection in a new terminal on your local machine:
curl http://localhost:8000/v1/models
You should see a JSON response listing the model name.
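Since the endpoint is OpenAI-compatible, the /v1/models payload follows the standard list format ({"object": "list", "data": [{"id": ...}, ...]}), so you can also check it programmatically; a standard-library sketch:

```python
import json
import urllib.request

def extract_model_ids(payload: dict) -> list:
    """Pull model ids out of an OpenAI-style /v1/models payload."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=10) as resp:
        print(extract_model_ids(json.load(resp)))
```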
Send a chat request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Explain what a GPU is in two sentences."}
]
}'
Replace the model name with whichever model you are serving.
Use a chat UI (optional) — any OpenAI-compatible frontend works. For example, Open WebUI via Docker on your local machine:
docker run -d -p 3000:8080 \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=unused \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser.
Using vLLM via Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="unused", # vLLM doesn't require a real API key
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "What is vLLM?"}
],
)
print(response.choices[0].message.content)
When you're done
scancel <jobid>
Close the SSH tunnel on your local machine with Ctrl+C.
Troubleshooting
- "Connection refused" when curling the server
Check that your SSH tunnel is still open, the hostname and port match ~/vllm_endpoint.txt, and the vLLM process has finished starting (check the Slurm log for the Uvicorn message).
- Model download is slow or fails
Model weights are downloaded from Hugging Face on first run to $HF_HOME. Make sure this points to your scratch directory, as models can be 15–140 GB. If a download is interrupted, delete the partial cache directory under $HF_HOME/hub/ and resubmit.
- Some models require a Hugging Face token
Llama and some other models are gated behind a license agreement. Go to huggingface.co to create an access token, then add this to your job script before the vLLM command:
export HF_TOKEN="hf_your_token_here"
- A100-specific: OOM (Out of Memory) during startup
Add --enforce-eager to skip CUDA graph capture. Reduce --max-model-len or add --max-num-seqs 32. Make sure PYTORCH_ALLOC_CONF=expandable_segments:True is set.
Gaudi2-specific
- OOM
Increase --gpu-memory-utilization (default 0.9). If that's not enough, increase --tensor-parallel-size to spread weights across more cards. You can also disable HPU Graphs to free memory used by graph capture, though this reduces performance.
- Slow first requests
Gaudi2 compiles execution graphs during warmup. The first several requests may be slow as the bucketing mechanism compiles graphs for different tensor shapes. This is normal — subsequent requests at similar lengths will be fast.
- hl-smi not found or shows no devices
The Habana driver module is not loaded. Add module load habanalabs/latest to your job script. If the module is loaded but no devices appear, the node may not have Gaudi2 hardware — check your --partition and --gres settings.
- Model not supported
If vLLM crashes or produces garbled output, the model may not be compatible with the Gaudi2 backend. Check the validated models table above and fall back to an A100 job.