Ollama
Ollama is a lightweight model inference platform that simplifies deploying large language models (LLMs). It streamlines interaction with LLMs by eliminating the need to write or run scripts, provides access to a wide variety of models, and reduces the resources required to run them by downloading quantized versions of the model weights.
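To see why quantization matters on shared GPU nodes, here is a rough back-of-the-envelope estimate of the memory needed just to hold the model weights. The numbers are illustrative only; an 8-billion-parameter model is assumed as an example, not taken from any specific model card.

```shell
# Approximate memory to hold model weights: parameters x bytes per weight.
# params = 8e9 is an example figure, not a property of a specific model.
awk 'BEGIN {
  params = 8e9
  printf "fp16 weights: %.1f GB\n", params * 2.0 / 1e9   # 16 bits per weight
  printf "q4 weights:   %.1f GB\n", params * 0.5 / 1e9   # 4 bits per weight
}'
```

A 4-bit quantized model needs roughly a quarter of the memory of the fp16 original, which is what allows larger models to fit on a single GPU.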
Running Ollama
- Ollama will not run on a login node. Request an interactive session on a compute node with a GPU:
$ interactive -t 30 -G 1 -p htc
- Load the ollama module (check available versions with module avail ollama):
$ module load ollama/0.3.12
- Start the Ollama server in the background:
$ ollama-start
- Run the model. You can find a list of available models in the Ollama model library.
The first time a model is run, Ollama automatically performs an ollama pull and downloads it. If the model is already downloaded, Ollama loads it into memory and starts the chat.
$ ollama run llama3.2
- To stop the model, type /bye at the prompt:
>>> /bye
- To stop the Ollama server:
$ ollama-stop
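While the server started by ollama-start is running, it also listens for HTTP requests, so prompts can be scripted rather than typed interactively. The sketch below uses Ollama's documented /api/generate endpoint and assumes the server is listening on its default port, 11434; adjust the host and port if your site configures them differently.

```shell
# One-shot, non-interactive prompt against the local Ollama server.
# Assumes the server is listening on the default port 11434;
# "stream": false returns the full reply as a single JSON object.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize what a quantized model is in one sentence.",
  "stream": false
}'
```

This is convenient inside batch jobs, where there is no terminal to chat in: start the server, send the request, then stop the server as shown above.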