LLM execution with NVIDIA vGPU¶
Context:¶
An LLM (Large Language Model) is an artificial intelligence model based on deep neural networks, designed to process and generate text much as humans do. These models are trained on large amounts of textual data, which allows them to understand context, answer questions and create content of remarkable quality. They are powerful tools that can be adapted to various tasks, such as translation, writing or text analysis, and are the basis for many modern AI applications.
Ollama, on the other hand, is an open-source tool that makes it easy to run LLMs on local devices, allowing users to take advantage of advanced models without relying on cloud services. That means you can have models like DeepSeek, known for its efficiency and multilingual capability, working directly on your computer. This offers greater control over the model and the data it processes, in addition to making it usable in environments without an internet connection.
Using these models locally offers advantages such as greater privacy, since the data never leaves your device, and eliminates the costs associated with cloud services. For example, with DeepSeek running locally through Ollama, a user could customize it for specialized tasks, such as technical analysis or generating content in a specific language (such as English), all from their own device. This autonomy and flexibility make it a very valuable option for individual users or organizations with specific needs.
Use case¶
Objective: Run an LLM (Large Language Model) on a virtual desktop with an NVIDIA vGPU. Operating system: Ubuntu 22.04
Create desktop with GPU¶
We will follow these steps:

1. Create a desktop with the Ubuntu operating system (manual for creating an Ubuntu 22.04 desktop: https://isard.gitlab.io/isardvdi-docs/guests/ubuntu_22.04/desktop/installation/installation.ca/).
2. Assign the profile of the GPU available on the Isard infrastructure.
3. Leave the "Default" video card.
Operating System Update¶
sudo apt update
sudo apt upgrade -y
Preliminary steps to use NVIDIA with Docker for computation¶
Use the NVIDIA card for computation only¶
It is important that the NVIDIA vGPU card is not configured as the primary graphics device of the system. If the vGPU is responsible for rendering the graphical environment, it will consume resources that should be used for its primary function: data processing. This affects efficiency and performance in computationally intensive tasks. Therefore, it is advisable to assign a secondary GPU for graphics management and reserve the vGPU exclusively for data processing.
The "Default" video card is actually QXL; we tell Xorg to use this card:
isard@ubuntu22:~$ cat /etc/X11/xorg.conf.d/10-qxl.conf
Section "Device"
Identifier "QXL"
Driver "qxl"
BusID "PCI:0:1:0" # Adjust this based on the output of lspci for the QXL device
EndSection
Section "Screen"
Identifier "Screen0"
Device "QXL"
EndSection
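Note that the BusID in the Xorg snippet must match the PCI address that `lspci` reports for the QXL device: `lspci` prints a hexadecimal `bus:slot.function` address (e.g. `00:01.0`), while Xorg expects decimal `PCI:bus:slot:function`. A minimal conversion sketch (the helper name is ours, and it assumes a plain slot without the PCI domain prefix):

```python
def lspci_to_xorg_busid(slot: str) -> str:
    """Convert an lspci slot like '00:01.0' (hex) to Xorg 'PCI:0:1:0' (decimal)."""
    bus, rest = slot.split(":")
    device, function = rest.split(".")
    # lspci prints hex fields; Xorg's BusID uses decimal
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"

print(lspci_to_xorg_busid("00:01.0"))  # PCI:0:1:0
```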
Since we use QXL, we do not want the gdm3 login manager to start Wayland; verify that /etc/gdm3/custom.conf contains:
# GDM configuration storage
#
# See /usr/share/gdm/gdm.schemas for a list of available options.
[daemon]
AutomaticLoginEnable=true
AutomaticLogin=isard
WaylandEnable=false
Install correct NVIDIA drivers¶
The guest driver must match the version of the vGPU manager kernel module on the server. It is best to install the Long-Term Support branch (v16 - R535):
vGPU Software Release | Driver Branch | vGPU Branch Type | Latest Release in Branch | Release Date | EOL Date |
---|---|---|---|---|---|
NVIDIA vGPU 18 | R570 | Production | 18.0 | March 2025 | March 2026 |
NVIDIA vGPU 17 | R550 | Production | 17.5 | January 2025 | June 2025 |
NVIDIA vGPU 16 | R535 | Long-Term Support | 16.9 | January 2025 | July 2026 |
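To check which vGPU branch a given guest driver belongs to, you can map the driver's major version to the branches in the table above. A small sketch (the mapping is copied from the table; the function name is ours):

```python
# vGPU branch per driver major version, taken from the table above
VGPU_BRANCHES = {
    570: ("NVIDIA vGPU 18", "Production"),
    550: ("NVIDIA vGPU 17", "Production"),
    535: ("NVIDIA vGPU 16", "Long-Term Support"),
}

def branch_for_driver(version: str):
    """Look up the vGPU branch for a driver version string like '535.230.02'."""
    major = int(version.split(".")[0])
    return VGPU_BRANCHES.get(major)

print(branch_for_driver("535.230.02"))  # ('NVIDIA vGPU 16', 'Long-Term Support')
```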
Manuals for the correct installation of the drivers and the token:

- Driver installation: https://isard.gitlab.io/isardvdi-docs/user/gpu/#installation
- Token installation: https://isard.gitlab.io/isardvdi-docs/user/gpu/#token-management
sudo cp /media/isard/nvidia/nvidia_16/Guest_Drivers/nvidia-linux-grid-535_535.230.02_amd64.deb /opt/
sudo apt install /opt/nvidia-linux-grid-535_535.230.02_amd64.deb -y
sudo reboot
We verify with nvidia-smi that the driver and version of CUDA are detected:
isard@ubuntu22:~$ nvidia-smi
Mon Mar 10 17:23:13 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A16-4Q On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 57MiB / 4096MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1263 C+G ...libexec/gnome-remote-desktop-daemon 57MiB |
+---------------------------------------------------------------------------------------+
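For scripting, it is easier to ask nvidia-smi for machine-readable values with `--query-gpu` and `--format=csv` than to parse the full table. A sketch that parses such output (the sample string below stands in for a live `nvidia-smi --query-gpu=memory.used --format=csv` call):

```python
def parse_memory_used(csv_output: str) -> int:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output into MiB."""
    lines = [line.strip() for line in csv_output.strip().splitlines()]
    # First line is the header "memory.used [MiB]"; the next is a value like "57 MiB"
    value = lines[1].split()[0]
    return int(value)

sample = "memory.used [MiB]\n57 MiB\n"
print(parse_memory_used(sample))  # 57
```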
There is a "gnome-remote-desktop-daemon" process, used for RDP access, that should not be using the NVIDIA card. This is a known bug that we have not yet been able to avoid.
NVIDIA Licensing¶
You need to add an NVIDIA license by modifying the FeatureType in gridd.conf:
isard@ubuntu22:/opt/llm$ sudo grep FeatureType /etc/nvidia/gridd.conf
FeatureType=1
And add the token file to the ClientConfigToken directory:
isard@ubuntu22:/opt/llm$ sudo ls -lh /etc/nvidia/ClientConfigToken/
total 4,0K
-rw-r--r-- 1 root root 2,7K mar 10 18:30 token2.tok
Restart the licensing service:
sudo systemctl restart nvidia-gridd.service
Verify that you are licensed with nvidia-smi -q:
isard@ubuntu22:/opt/llm$ nvidia-smi -q |grep "License Status"
License Status : Licensed (Expiry: 2025-3-11 17:31:20 GMT)
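In a monitoring script, the same check can be automated by extracting the license state from `nvidia-smi -q` output. A sketch (the sample string stands in for the live command output; the function name is ours):

```python
import re

def license_status(nvidia_smi_q: str):
    """Extract the vGPU license state from `nvidia-smi -q` output."""
    match = re.search(r"License Status\s*:\s*(\w+)", nvidia_smi_q)
    return match.group(1) if match else None

sample = "    License Status : Licensed (Expiry: 2025-3-11 17:31:20 GMT)\n"
print(license_status(sample))  # Licensed
```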
We reboot and verify that the system uses QXL with these commands:
sudo apt install mesa-utils -y
glxinfo |grep -i nvidia
No line mentioning NVIDIA should appear.
Install docker¶
https://docs.docker.com/engine/install/ubuntu/
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
sudo docker run hello-world
Allow Docker to use GPUs¶
We will need the NVIDIA Container Toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
We verify that the configuration file has been modified:
isard@ubuntu22:~$ cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
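A quick sanity check that `nvidia-ctk` registered the runtime is to confirm that `/etc/docker/daemon.json` declares `nvidia` under `runtimes`. A sketch operating on the file's contents (the JSON string below mirrors the output shown above; the function name is ours):

```python
import json

def has_nvidia_runtime(daemon_json: str) -> bool:
    """Check whether a Docker daemon.json declares the nvidia runtime."""
    config = json.loads(daemon_json)
    return "nvidia" in config.get("runtimes", {})

sample = '{"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}'
print(has_nvidia_runtime(sample))  # True
```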
Now restart the Docker service and verify that nvidia-smi works and recognizes the GPUs:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
Docker compose¶
mkdir /opt/llm
cd /opt/llm
Create the file docker-compose.yml:
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- /opt/ollama/volume_ollama:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
environment:
- OLLAMA_HOST=0.0.0.0:11434
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-WebUI
ports:
- "3000:8080"
volumes:
- /opt/ollama/openwebui_backend_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
extra_hosts:
- "host.docker.internal:host-gateway"
restart: always
Start up docker compose:
sudo docker compose up -d
Running the models:
Model | Requirements | Command |
---|---|---|
1.5B Parameters | 1.1 GB approx. | ollama run deepseek-r1:1.5b |
7B Parameters | 4.7 GB approx. | ollama run deepseek-r1 |
70B Parameters | +20 GB of vRAM | ollama run deepseek-r1:70b |
671B Parameters | +300 GB of vRAM | ollama run deepseek-r1:671b |
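The table above can be turned into a small helper that picks the largest deepseek-r1 variant fitting the available memory (thresholds and tags are taken from the table; the function name is ours, and the values are approximate):

```python
# (approx. GiB required, ollama tag), taken from the table above
DEEPSEEK_VARIANTS = [
    (1.1, "deepseek-r1:1.5b"),
    (4.7, "deepseek-r1"),       # 7B parameters
    (20, "deepseek-r1:70b"),
    (300, "deepseek-r1:671b"),
]

def largest_fitting_model(vram_gib: float):
    """Return the biggest deepseek-r1 tag whose requirement fits in vram_gib."""
    fitting = [tag for need, tag in DEEPSEEK_VARIANTS if need <= vram_gib]
    return fitting[-1] if fitting else None

print(largest_fitting_model(4))  # deepseek-r1:1.5b (e.g. an A16-4Q profile)
```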
sudo docker exec -ti ollama ollama run deepseek-r1:1.5b
Open the following link in a browser: http://localhost:3000
Verify that it responds with Python¶
sudo apt install python3-virtualenv -y
mkdir -p dev/llm_python
cd dev/llm_python
virtualenv venv
source venv/bin/activate
Inside venv:
pip3 install ipython openai
ipython3
Python code; copy and paste it:
import openai

# Connect to the local Ollama server
client = openai.Client(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any value works; Ollama does not check the key
)

response = client.chat.completions.create(
    model="deepseek-r1:1.5b",  # change to the model tag you pulled
    messages=[{"role": "user", "content": "Hello in Catalan"}],
    temperature=0.7
)

print(response.choices[0].message.content)
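deepseek-r1 is a reasoning model: through Ollama its replies typically start with a `<think>…</think>` block containing the chain of thought, followed by the final answer. If you only want the answer, that block can be stripped; a minimal sketch (the helper name and sample reply are ours):

```python
import re

def strip_think(reply: str) -> str:
    """Remove the <think>...</think> reasoning block deepseek-r1 prepends."""
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()

sample = "<think>\nThe user wants a greeting in Catalan.\n</think>\n\nHola!"
print(strip_think(sample))  # Hola!
```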