LLM execution with NVIDIA vGPU

Context:

An LLM (Large Language Model) is an artificial intelligence model based on deep neural networks, designed to process and generate text in a way similar to how humans do. These models are trained on large amounts of textual data, which allows them to understand context, answer questions and create content of remarkable quality. They are powerful tools that can be adapted to various tasks, such as translation, writing or text analysis, and they are the basis for many modern AI applications.

Ollama, on the other hand, is an open-source tool that makes it easy to run LLMs on local devices, allowing users to take advantage of advanced models without having to rely on cloud services. This means you can have models like DeepSeek, known for its efficiency and multilingual capabilities, working directly on your computer. This offers greater control over the model and the data it processes, in addition to making it usable in environments without an internet connection.

Using these models locally offers advantages such as greater privacy, since the data never leaves your device, and it eliminates the costs associated with cloud services. For example, with DeepSeek running locally through Ollama, a user could customize it for specialized tasks, such as technical analysis or the generation of content in a specific language, such as English, all from their own device. This autonomy and flexibility make it a very valuable option for individual users and organizations with specific needs.

Use case

Objective: Run an LLM (Large Language Model) on a virtual desktop with an NVIDIA vGPU.

Operating system: Ubuntu 22.04

Create a desktop with a GPU

We will follow these steps:

1. We will create a desktop with the Ubuntu operating system. (Manual for creating an Ubuntu 22.04 desktop: https://isard.gitlab.io/isardvdi-docs/guests/ubuntu_22.04/desktop/installation/installation.ca/)

  • We'll assign the profile of the GPU we have available on the Isard infrastructure.

  • We will leave the video card set to "Default"

Operating System Update

sudo apt update 
sudo apt upgrade -y 

Preliminary steps to use NVIDIA with Docker for compute

Use the NVIDIA card for compute only

It is important that the NVIDIA vGPU card is not configured as the primary graphics device of the system. If the vGPU is responsible for rendering the graphical environment, it will consume resources that should be used for its primary function: data processing. This affects efficiency and performance in computationally intensive tasks. Therefore, it is advisable to assign a secondary GPU for graphics management and reserve the vGPU exclusively for data processing.

The "default" video card is actually the QXL, we tell xorg to use this card.

isard@ubuntu22:~$ cat /etc/X11/xorg.conf.d/10-qxl.conf

Section "Device"
    Identifier "QXL"
    Driver "qxl"
    BusID "PCI:0:1:0" # Adjust this based on the output of lspci for the QXL device
EndSection
Section "Screen"
    Identifier "Screen0"
    Device "QXL"
EndSection
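
If the QXL device sits at a different PCI address on your desktop, you can look it up before writing the BusID; a quick check (the example address 00:01.0 corresponds to the PCI:0:1:0 above):

# lspci prints the address as bus:device.function in hexadecimal,
# e.g. "00:01.0 VGA compatible controller: Red Hat, Inc. QXL ...",
# which Xorg expects in decimal form as "PCI:0:1:0"
lspci | grep -i qxl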

Since we use QXL, we do not want Wayland to start with the gdm3 login manager; verify that /etc/gdm3/custom.conf contains:

# GDM configuration storage
#
# See /usr/share/gdm/gdm.schemas for a list of available options.

[daemon]
AutomaticLoginEnable=true
AutomaticLogin=isard
WaylandEnable=false
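
After a reboot you can confirm that GDM started an Xorg session instead of Wayland; a quick check from a terminal inside the graphical session:

# Should print "x11"; if it prints "wayland", WaylandEnable=false did not take effect
echo $XDG_SESSION_TYPE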

Install the correct NVIDIA drivers

The drivers must match the version of the kernel module on the server. It is best to install the Long-Term Support version (v16, R535):

vGPU Software Release   Driver Branch   vGPU Branch Type    Latest Release in Branch   Release Date   EOL Date
NVIDIA vGPU 18          R570            Production          18.0                       March 2025     March 2026
NVIDIA vGPU 17          R550            Production          17.5                       January 2025   June 2025
NVIDIA vGPU 16          R535            Long-Term Support   16.9                       January 2025   July 2026
  • Manuals for the correct installation of the drivers and the token:

    • Drivers installation: https://isard.gitlab.io/isardvdi-docs/user/gpu/#installation

    • Token installation: https://isard.gitlab.io/isardvdi-docs/user/gpu/#token-management

sudo cp /media/isard/nvidia/nvidia_16/Guest_Drivers/nvidia-linux-grid-535_535.230.02_amd64.deb /opt/
sudo apt install /opt/nvidia-linux-grid-535_535.230.02_amd64.deb -y
reboot

We verify with nvidia-smi that the driver and the CUDA version are detected:

isard@ubuntu22:~$ nvidia-smi
Mon Mar 10 17:23:13 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A16-4Q                  On  | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |     57MiB /  4096MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1263    C+G   ...libexec/gnome-remote-desktop-daemon       57MiB |
+---------------------------------------------------------------------------------------+

There is a gnome-remote-desktop-daemon process, used for RDP access, that should not be using the NVIDIA card. This is a bug, and for the moment we have not been able to avoid it.

NVIDIA Licensing

You need to activate the NVIDIA license by setting FeatureType in gridd.conf:

isard@ubuntu22:/opt/llm$ sudo grep FeatureType /etc/nvidia/gridd.conf 
FeatureType=1
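
If FeatureType has another value, it can be set in place; a minimal sketch using sed (FeatureType=1 selects the vGPU license type):

# Set FeatureType=1 in gridd.conf, uncommenting the line if needed
sudo sed -i 's/^#\?FeatureType=.*/FeatureType=1/' /etc/nvidia/gridd.conf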

And add the token file to the ClientConfigToken directory:

isard@ubuntu22:/opt/llm$ sudo ls -lh /etc/nvidia/ClientConfigToken/
total 4,0K
-rw-r--r-- 1 root root 2,7K mar 10 18:30 token2.tok
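
A sketch of how the token gets there, assuming it was downloaded to the home directory with the name token2.tok shown in the listing above:

# Copy the client configuration token and give it the expected permissions
sudo cp ~/token2.tok /etc/nvidia/ClientConfigToken/
sudo chmod 644 /etc/nvidia/ClientConfigToken/token2.tok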

Restart the licensing service:

sudo systemctl restart nvidia-gridd.service

Verify that you are licensed with nvidia-smi -q:

isard@ubuntu22:/opt/llm$ nvidia-smi  -q |grep "License Status"
        License Status                    : Licensed (Expiry: 2025-3-11 17:31:20 GMT)
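
If the status shows Unlicensed instead, the gridd service log usually explains why (unreachable license server, invalid token, etc.); a quick look:

# Inspect the most recent messages from the licensing service
sudo journalctl -u nvidia-gridd -n 50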

We reboot and verify that the system uses QXL with these commands:

sudo apt install mesa-utils -y
glxinfo | grep -i nvidia

No NVIDIA line should show up.
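
As a complementary check, you can inspect which OpenGL renderer is active; since QXL has no 3D acceleration, Mesa normally falls back to llvmpipe software rendering:

# The renderer string should not mention NVIDIA (typically llvmpipe with QXL)
glxinfo | grep "OpenGL renderer"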

Install Docker

https://docs.docker.com/engine/install/ubuntu/

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
sudo docker run hello-world

Configure Docker so that it can use GPUs:

We will need the nvidia-container-toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker

We verify that the configuration file has been modified:

isard@ubuntu22:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Now restart the Docker service and verify that nvidia-smi works inside a container and recognizes the GPU:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Docker Compose

sudo mkdir /opt/llm
cd /opt/llm

Create the file docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - /opt/ollama/volume_ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-WebUI
    ports:
      - "3000:8080"
    volumes:
      - /opt/ollama/openwebui_backend_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: always

Start up Docker Compose:

sudo docker compose up -d
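
Before pulling any model, you can verify that both containers came up and that Ollama detected the vGPU; a quick check:

# Both containers should appear with state "running"
sudo docker compose ps
# At startup Ollama logs the compute devices it found; look for the NVIDIA GPU
sudo docker logs ollama 2>&1 | grep -i "gpu\|cuda"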

Running the models:

Model             Requirements       Command
1.5B parameters   1.1 GB approx.     ollama run deepseek-r1:1.5b
7B parameters     4.7 GB approx.     ollama run deepseek-r1
70B parameters    +20 GB of vRAM     ollama run deepseek-r1:70b
671B parameters   +300 GB of vRAM    ollama run deepseek-r1:671b
sudo docker exec -ti ollama ollama run deepseek-r1:1.5b

In a browser, open the following link: http://localhost:3000
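
The Ollama API can also be exercised from the command line before moving to Python; a sketch against the REST endpoint published on port 11434:

# List the models installed locally
curl http://localhost:11434/api/tags
# Ask the model a single question (adjust the model name to the one you pulled)
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:1.5b", "prompt": "Hello", "stream": false}'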

Verify that it responds with Python

sudo apt install python3-virtualenv -y
mkdir -p dev/llm_python
cd dev/llm_python
virtualenv venv
source venv/bin/activate

Inside the venv:

pip3 install ipython openai
ipython3

Python code, which we copy and paste:

import openai

# Connect to Ollama through its OpenAI-compatible API
client = openai.Client(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="deepseek-r1:1.5b",  # change this to the version you installed
    messages=[{"role": "user", "content": "Hello in Catalan"}],
    temperature=0.7
)

print(response.choices[0].message.content)

Last update: March 25, 2025