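📖 Preface
Using ESXi for GPU passthrough (PCIe Passthrough) on a server such as the Dell R730xd is a classic way to build a low-cost AI compute platform. However, when setting up a Debian 13 virtual machine to run large language models (LLMs), many users (myself included) get stuck at the very last step: installing the NVIDIA driver.
After repeated failures with various options and older driver versions, I finally found the "silver bullet" for this problem: -m=kernel-open. This post focuses on that key step so that others can avoid the same detours.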
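Part 1: ESXi Host Configuration
Before installing the guest system, make sure the ESXi base environment is configured as usual.
- BIOS configuration:
- Enable passthrough:
  - In the ESXi management UI, go to "Manage -> Hardware -> PCI Devices".
  - Find the NVIDIA GPU (2080Ti), enable passthrough for it, then reboot the host.
- Virtual machine settings:
  - Firmware: set to EFI.
  - Secure Boot: must be disabled.
  - Advanced parameters: under "VM Options -> Advanced -> Configuration Parameters", add the following key/value pairs: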
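| Key | Value | Description |
|---|---|---|
| hypervisor.cpuid.v0 | FALSE | Hides the hypervisor signature so the driver does not refuse to load |
| pciPassthru.use64bitMMIO | TRUE | Enables the 64-bit MMIO space |
| pciPassthru.64bitMMIOSizeGB | 32 | For the 2080Ti with 22 GB of VRAM, 32 is the recommended value |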
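For reference, the three settings in the table above are stored as plain key/value lines in the VM's .vmx file. The sketch below only prints those lines in .vmx syntax and does not modify anything. The function name `print_passthrough_vmx_entries` is a hypothetical helper, and the default of 32 is simply the value used in this article. The setup described here was done through the UI; editing the .vmx directly is an assumption about an alternative workflow.

```shell
#!/bin/sh
# Hypothetical helper: print the advanced parameters from the table
# in .vmx syntax (key = "value"). It does not touch any file.
print_passthrough_vmx_entries() {
    mmio_size_gb="${1:-32}"   # 32 is the value used in this article
    printf 'hypervisor.cpuid.v0 = "FALSE"\n'
    printf 'pciPassthru.use64bitMMIO = "TRUE"\n'
    printf 'pciPassthru.64bitMMIOSizeGB = "%s"\n' "$mmio_size_gb"
}
```

Calling `print_passthrough_vmx_entries 32` gives three lines that can be compared against the VM's configuration parameters after saving them in the UI.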
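Part 2: Debian 13 Guest Configuration
- Prepare the base environment
  Refresh the package index and install the build dependencies:
      sudo apt update
      sudo apt install linux-headers-$(uname -r) build-essential
- Download the driver
  A recent driver version is recommended for better CUDA support.
- Core step: install with the open kernel modules
  This is the key to the whole problem. In a virtualized environment like ESXi, if a recent driver (such as 595.80) is installed the default way, the driver typically fails to recognize the passed-through physical GPU, and the result is the message "No devices were found" (as printed by nvidia-smi).
  Based on hands-on experience (and the blog post referenced below), for driver versions newer than 470 the -m=kernel-open option is required to force the open kernel modules. Run the following commands:
      chmod +x NVIDIA-Linux-x86_64-595.80.run
      sudo ./NVIDIA-Linux-x86_64-595.80.run -m=kernel-open
  - What the option does: -m=kernel-open tells the installer to build and load the open kernel modules, which works around the device-detection limitation under virtualization.
  - Reference guide: https://blog.csdn.net/qq_18893835/article/details/145655530
- Handle the Nouveau conflict
  If the installer detects the open-source nouveau driver during installation, it reports the conflict.
  - What to do: follow the installer's prompts all the way through. The installer usually offers interactive options asking whether to generate the configuration file that disables nouveau (for example nvidia-installer-disable-nouveau.conf) and can then run update-initramfs for you.
  - Afterwards: reboot when prompted, then run the installer again after the reboot.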
- Installer messages, recorded for reference

  WARNING: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. [OK]

  Nouveau can usually be disabled by adding files to the modprobe configuration directories and rebuilding the initramfs. Would you like nvidia-installer to attempt to create these modprobe configuration files for you? [Yes]/No

  One or more modprobe configuration files to disable Nouveau have been written. You will need to reboot your system and possibly rebuild the initramfs before these changes can take effect. Note if you later wish to reenable Nouveau, you will need to delete these files: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf [OK]

  nvidia-installer is not able to perform some of the sanity checks which detect potential installation problems while Nouveau is loaded. Would you like to continue installation without these sanity checks, or abort installation, confirm that Nouveau has been properly disabled, and attempt installation again later? Continue installation / [Abort installation]

  The initramfs will likely need to be rebuilt due to the following condition(s): * nvidia-installer attempted to disable Nouveau. Would you like to rebuild the initramfs? Do not rebuild initramfs / [Rebuild initramfs]

  ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com. [OK]

  WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver. [OK]

  WARNING: Unable to find a suitable destination to install 32-bit compatibility libraries. Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option. [OK]

  WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path. [OK]

  Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. Yes / [No]

  Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 595.80) is now complete. Please update your xorg.conf file as appropriate; see the file /usr/share/doc/NVIDIA_GLX-1.0/README.txt for details. [OK]
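The installer messages ask you to confirm that Nouveau is really disabled before trying again, and the build also depends on the kernel headers installed earlier. A minimal pre-flight sketch for both checks might look like the following. All function names are hypothetical, and it assumes the usual /lib/modules/<release>/build location for headers and the /proc/modules listing of loaded modules.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; names and defaults are illustrative.

# Succeeds if kernel headers for the given release are installed.
# Assumes the usual /lib/modules/<release>/build location.
headers_present() {
    release="${1:-$(uname -r)}"
    modules_root="${2:-/lib/modules}"
    [ -d "$modules_root/$release/build" ]
}

# Succeeds if nouveau appears in a /proc/modules-style listing.
nouveau_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^nouveau ' "$modules_file"
}

# Combine both checks and print a short verdict.
preflight() {
    headers_present || { echo "kernel headers missing"; return 1; }
    if nouveau_loaded; then
        echo "nouveau still loaded: reboot before re-running the installer"
        return 1
    fi
    echo "ok to run the installer"
}
```

Running `preflight` after the reboot, and before the second installer run, gives a quick answer on whether the Nouveau blacklist actually took effect.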
Part 3: Verification and Summary
Once the driver is installed, open a terminal and run:
DEBIAN:~$ nvidia-smi
Wed Jun 17 21:40:50 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.80 Driver Version: 595.80 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:1B:00.0 Off | N/A |
| 32% 40C P8 9W / 250W | 9MiB / 22528MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
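If you see the familiar GPU information table, including the model, the driver version (595.80) and the CUDA version, then congratulations, the driver installation is done. You can now run LLMs in this clean Debian environment.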
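Beyond nvidia-smi, it can be useful to confirm that the open kernel modules are the ones actually loaded. The sketch below is one way to check, under the assumption that /proc/driver/nvidia/version contains the words "Open Kernel Module" for open builds; that wording is recalled from memory rather than from this article, so treat it as unverified. The function name `nvidia_module_flavor` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which kernel module flavor is loaded,
# based on a /proc/driver/nvidia/version-style file.
# Assumption: open builds contain the words "Open Kernel Module".
nvidia_module_flavor() {
    version_file="${1:-/proc/driver/nvidia/version}"
    if [ ! -r "$version_file" ]; then
        echo "not-loaded"
    elif grep -q 'Open Kernel Module' "$version_file"; then
        echo "open"
    else
        echo "proprietary"
    fi
}
```

For a scriptable version of the table itself, `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` prints the same fields as one CSV line.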
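Part 4: Installing the Container Tools
Because I run the models inside Docker containers, the container-toolkit components also need to be installed. Without them, Docker prints: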
Error response from daemon: failed to discover GPU vendor from CDI: no known GPU vendor found
There is nothing difficult here; just follow the official guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#with-apt-ubuntu-debian
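Once the toolkit packages are installed, the remaining steps are to register the NVIDIA runtime with Docker and run a quick GPU test container. The sketch below wraps those commands as I recall them from the official guide, so check them against the linked page before relying on them. The `run` wrapper, the function names and the DRY_RUN variable are hypothetical additions that let you preview the commands without executing them.

```shell
#!/bin/sh
# Sketch of the post-install steps; DRY_RUN=1 prints instead of executing.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Register the NVIDIA runtime with Docker and restart the daemon.
configure_docker_gpu() {
    run sudo nvidia-ctk runtime configure --runtime=docker &&
    run sudo systemctl restart docker
}

# Start a throwaway container that only runs nvidia-smi.
gpu_smoke_test() {
    image="${1:-ubuntu}"
    run sudo docker run --rm --runtime=nvidia --gpus all "$image" nvidia-smi
}
```

With `DRY_RUN=1` set, calling `configure_docker_gpu` and then `gpu_smoke_test` only echoes the commands. Without it, the test container should print the same nvidia-smi table as on the host.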
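Summary:
When setting up a deep learning environment under virtualization, the biggest enemy is often "compatibility". I hope this write-up helps anyone tearing their hair out in front of "No devices were found". Remember the key option, -m=kernel-open: it can save you at least 80% of your troubleshooting time.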