GPU Server

Server Hardware Configuration

GPU: 4 x NVIDIA RTX 4090 (24 GB VRAM each), or 2 x NVIDIA RTX A6000 (48 GB VRAM each), or better
CPU: 2 x Intel Xeon Gold 6300 series (3rd Gen)
Memory: 8 x 16 GB DDR4-3200 (128 GB total)
Storage: 3 x 1 TB SSD (RAID 5)

Server Deployment

Operating system: Ubuntu 24.04 'Noble Numbat' (LTS)

Installing the NVIDIA Driver

Required: NVIDIA Driver >= 525.105.17

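Checking and Removing Old Drivers

Check the installed NVIDIA driver and CUDA versions, and uninstall old versions if necessary.

Run the following command to check the currently installed NVIDIA driver version:

nvidia-smi

The command prints information about the NVIDIA GPUs, including the installed driver version. Look for "Driver Version" and "CUDA Version" in the output. The "CUDA Version" reported here is the highest CUDA version the driver supports, not the version of an installed CUDA Toolkit.

Note: the driver version can also be read with cat /proc/driver/nvidia/version.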
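The version check above can be scripted. The following is a minimal sketch, assuming GNU sort (for `sort -V`) is available; the helper name `version_ge` and the variable `MIN_DRIVER` are illustrative and not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Minimum driver version required by this guide.
MIN_DRIVER="525.105.17"

# version_ge A B: succeeds when version A >= version B.
# Relies on GNU `sort -V` (version sort); B must be the smaller or equal one.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Ask nvidia-smi for the driver version only; use the first GPU's line.
    current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$current" "$MIN_DRIVER"; then
        echo "Driver $current meets the minimum $MIN_DRIVER"
    else
        echo "Driver '$current' is older than $MIN_DRIVER (or could not be read)"
    fi
else
    echo "nvidia-smi not found; no NVIDIA driver appears to be installed"
fi
```

A plain string comparison would rank 525.85.12 above 525.105.17, which is why the sketch uses a version-aware sort.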

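If the driver or CUDA version does not meet the requirement, uninstall the old version with the command below. The pattern is quoted so that the shell passes it to apt unchanged instead of expanding it against file names in the current directory:

sudo apt purge 'nvidia-*'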

Tip
To make sure nothing is left behind, search for installed NVIDIA packages:
apt list --installed | grep nvidia
Then purge whatever remains:
sudo apt purge 'nvidia-*'
sudo apt purge 'libnvidia-*'
sudo apt purge 'linux-objects-nvidia*'
sudo apt purge 'linux-signatures-nvidia*'
sudo apt purge ubuntu-drivers-common
sudo apt autoremove
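To confirm that the purge left nothing behind, the search can be turned into a small check. This is a sketch only: the helper name `filter_nvidia_pkgs` is hypothetical, and the pattern simply mirrors the package-name prefixes used in the purge commands above (so it would also match other packages whose names start with `nvidia-`).

```shell
#!/usr/bin/env bash
# filter_nvidia_pkgs: reads package names on stdin (one per line) and
# prints those that match the prefixes purged in this guide.
filter_nvidia_pkgs() {
    grep -E '^(lib)?nvidia-|^linux-(objects|signatures)-nvidia' || true
}

if command -v dpkg >/dev/null 2>&1; then
    # Column 1 of `dpkg -l` is the state ("ii" = installed), column 2 the name.
    leftover="$(dpkg -l | awk '$1 == "ii" {print $2}' | filter_nvidia_pkgs)"
    if [ -n "$leftover" ]; then
        echo "NVIDIA packages still installed:"
        echo "$leftover"
    else
        echo "No NVIDIA packages remain"
    fi
else
    echo "dpkg not found; this check only applies to Debian/Ubuntu systems"
fi
```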

Reboot the server after uninstalling:

sudo reboot

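Preparing for Installation

First update the system: refresh the package lists and upgrade the installed packages so that all dependencies are current.

sudo apt update
sudo apt upgrade

Install the basic tools needed to compile software:

sudo apt install build-essential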
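Before continuing, it may help to confirm that the build tools are actually on the PATH, since the driver installer compiles a kernel module. The sketch below is an illustration, not part of the original procedure: `missing_tools` is a hypothetical helper, and the kernel-headers check assumes the usual Ubuntu layout where `/lib/modules/<kernel>/build` points at the headers.

```shell
#!/usr/bin/env bash
# missing_tools NAME...: prints each named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing="$(missing_tools gcc g++ make)"
if [ -n "$missing" ]; then
    echo "Missing build tools:"
    echo "$missing"
else
    echo "gcc, g++ and make are available"
fi

# Assumption: on Ubuntu the headers for the running kernel are reachable here.
if [ -e "/lib/modules/$(uname -r)/build" ]; then
    echo "Kernel headers found for $(uname -r)"
else
    echo "Kernel headers for $(uname -r) not found; consider installing linux-headers-$(uname -r)"
fi
```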

Next, disable the Nouveau driver.

Reference: Disable Nouveau.

Ubuntu uses the open-source Nouveau driver for NVIDIA GPUs by default. It must be disabled before the official NVIDIA driver is installed.

Create the configuration file /etc/modprobe.d/blacklist-nouveau.conf:

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

Regenerate the initial RAM filesystem:

sudo update-initramfs -u

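Tip
sudo update-initramfs -u regenerates the initramfs (initial RAM filesystem), a temporary filesystem that plays an important role while Linux boots:
It is loaded into memory early in the boot process.
It contains the drivers and tools needed to bring the system up.
It helps mount the real root filesystem.
Situations in which the command may need to be run:
After installing a new kernel module.
After updating the graphics driver.
After changing the initramfs configuration.
After a system update that requires the initramfs to be rebuilt.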

Reboot the server:

sudo reboot
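After the reboot, it is worth confirming that Nouveau is no longer loaded before installing the NVIDIA driver. A minimal sketch, assuming `lsmod` is available; the helper name `has_module` is illustrative.

```shell
#!/usr/bin/env bash
# has_module NAME: reads `lsmod` output on stdin and succeeds when a module
# with exactly that name is listed in the first column.
has_module() {
    awk -v m="$1" '$1 == m {found = 1} END {exit !found}'
}

if command -v lsmod >/dev/null 2>&1; then
    if lsmod | has_module nouveau; then
        echo "nouveau is still loaded; check the blacklist file and the initramfs"
    else
        echo "nouveau is not loaded"
    fi
else
    echo "lsmod not found; cannot check loaded kernel modules"
fi
```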

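Installing the CUDA Toolkit

The CUDA Toolkit runfile installer bundles the NVIDIA display driver, so one installation covers both.

Consult the version support matrix for GPUs, the CUDA Toolkit, and the CUDA driver, and choose a suitable CUDA Toolkit version.

For example, download CUDA Toolkit 12.6 from the NVIDIA download page, selecting Linux as the operating system, x86_64 as the architecture, Ubuntu 24.04 as the distribution, and runfile (local) as the installer type.

The download and installation commands are:

wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
sudo sh cuda_12.6.3_560.35.05_linux.run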

After a short wait, the installer shows the license agreement screen:

┌──────────────────────────────────────────────────────────────────────────────┐
│  End User License Agreement                                                  │
│  --------------------------                                                  │
│                                                                              │
│  NVIDIA Software License Agreement and CUDA Supplement to                    │
│  Software License Agreement.                                                 │
│                                                                              │
│  The CUDA Toolkit End User License Agreement applies to the                  │
│  NVIDIA CUDA Toolkit, the NVIDIA CUDA Samples, the NVIDIA                    │
│  Display Driver, NVIDIA Nsight tools (Visual Studio Edition),                │
│  and the associated documentation on CUDA APIs, programming                  │
│  model and development tools. If you do not agree with the                   │
│  terms and conditions of the license agreement, then do not                  │
│  download or use the software.                                               │
│                                                                              │
│  Last updated: January 12, 2024.                                             │
│                                                                              │
│                                                                              │
│  Preface                                                                     │
│  -------                                                                     │
│                                                                              │
│──────────────────────────────────────────────────────────────────────────────│
│ Do you accept the above EULA? (accept/decline/quit):                         │
│ accept                                                                       │
└──────────────────────────────────────────────────────────────────────────────┘

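Type accept on the license agreement screen to reach the component selection screen:

┌──────────────────────────────────────────────────────────────────────────────┐
│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 560.28.03                                                           │
│ + [X] CUDA Toolkit 12.6                                                      │
│   [X] CUDA Demo Suite 12.6                                                   │
│   [X] CUDA Documentation 12.6                                                │
│ - [ ] Kernel Objects                                                         │
│      [ ] nvidia-fs                                                           │
│   Options                                                                    │
│   Install                                                                    │
│                                                                              │
│                                                                              │
│ Up/Down: Move | Left/Right: Expand | 'Enter': Select | 'A': Advanced options │
└──────────────────────────────────────────────────────────────────────────────┘

Move to Install and press Enter to start the installation with the default selection. The driver version listed in the menu depends on the runfile: this capture shows 560.28.03, whereas the runfile name used in the download command indicates 560.35.05.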
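Once the installer finishes, the toolkit binaries are usually not on the PATH yet. The sketch below shows one way to set the environment and confirm that `nvcc` is reachable; it assumes the runfile's default install prefix `/usr/local/cuda-12.6`, which may differ if another location was chosen under Options. The variable name `CUDA_HOME` is used here only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: default install prefix of the CUDA 12.6 runfile.
CUDA_HOME="/usr/local/cuda-12.6"

# Prepend the toolkit directories, keeping any existing values.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under ${CUDA_HOME}/bin; check the install prefix"
fi
```

To make the settings persistent, the two export lines can be appended to a shell startup file such as ~/.bashrc.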

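Installing Docker and the NVIDIA Container Toolkit

Installing Docker

The installation steps are omitted here; see the official documentation, Install Docker Engine on Ubuntu.

Installing the NVIDIA Container Toolkit

References
NVIDIA Container Toolkit architecture
NVIDIA Container Toolkit installation guide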

Configure the apt repository for the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package lists:

sudo apt update

Install the NVIDIA Container Toolkit:

sudo apt install -y nvidia-container-toolkit

Use the configuration command of the NVIDIA Container Toolkit CLI (nvidia-ctk) to set up the Docker runtime for NVIDIA GPUs. The command writes to /etc/docker/daemon.json, so it needs root privileges:

sudo nvidia-ctk runtime configure --runtime=docker

WARN[0000] Ignoring runtime-config-override flag for docker
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.

In effect, the command above adds the following top-level entry to /etc/docker/daemon.json, alongside any settings already present:

"runtimes": {
    "nvidia": {
        "args": [],
        "path": "nvidia-container-runtime"
    }
}
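A quick way to confirm that the entry was written is to look for it in the file. The sketch below is a coarse text check rather than a JSON parser, and the helper name `has_nvidia_runtime` is hypothetical.

```shell
#!/usr/bin/env bash
# has_nvidia_runtime FILE: succeeds when FILE contains both an "nvidia" key
# and the nvidia-container-runtime path. Text match only, not JSON validation.
has_nvidia_runtime() {
    grep -Eq '"nvidia"[[:space:]]*:' "$1" && grep -q 'nvidia-container-runtime' "$1"
}

DAEMON_JSON="/etc/docker/daemon.json"
if [ -r "$DAEMON_JSON" ]; then
    if has_nvidia_runtime "$DAEMON_JSON"; then
        echo "nvidia runtime entry found in $DAEMON_JSON"
    else
        echo "nvidia runtime entry missing from $DAEMON_JSON"
    fi
else
    echo "$DAEMON_JSON does not exist or is not readable"
fi
```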

Restart the Docker service:

sudo systemctl restart docker

Check the Docker runtime information; nvidia should appear in the list of runtimes:

docker info | grep nvidia
 Runtimes: runc io.containerd.runc.v2 nvidia
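As a final end-to-end check, a container can be started with GPU access and asked to run nvidia-smi. The sketch below only launches the container when Docker reports the nvidia runtime; the helper name `lists_nvidia_runtime` and the choice of the `ubuntu` image are assumptions for illustration, and the current user is assumed to be allowed to run docker without sudo.

```shell
#!/usr/bin/env bash
# lists_nvidia_runtime: reads `docker info` output on stdin and succeeds
# when the "Runtimes:" line contains the word nvidia.
lists_nvidia_runtime() {
    grep -E '^[[:space:]]*Runtimes:' | grep -qw 'nvidia'
}

if command -v docker >/dev/null 2>&1 && docker info 2>/dev/null | lists_nvidia_runtime; then
    # Run nvidia-smi inside a throwaway container with all GPUs attached.
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
else
    echo "Docker is unavailable or the nvidia runtime is not registered"
fi
```

If the setup is working, the container prints the same GPU table as nvidia-smi on the host.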