GPU服务器开局教程
本次实验环境:Ubuntu 22.04 LTS
CAUTION拥有一个良好的网络环境是此类折腾的基础
旁边要有个正常的AI随时可以问
网络代理
-
最优解决方案设置旁路由,将流量发到旁路由之上
-
使用xray(XTLS/Xray-install: Easiest way to install & upgrade Xray. (github.com),配置运行 | Project X (xtls.github.io))
-
变相实现透明代理
- 使用mihomo转发全局流量(利用Clash进行透明代理的抓包新姿势 - Is Yang’s Blog (isisy.com))
- 使用tun2socks转发全局流量(Examples · xjasonlyu/tun2socks Wiki (github.com))
驱动安装
-
若为Ubuntu系统,优先使用
sudo ubuntu-drivers
,在 8卡 V100中sudo ubuntu-drivers autoinstall
可自动安装显卡驱动以及cuda环境(好评!!!)-
在A100中使用上述安装方法(未成功安装),官方源下载奇慢—>换清华源—>驱动缺失cuda
-
可能原先环境未清除干净
- 所使用的卸载命令
sudo apt purge nvidia*
- 但cuda目录,以及环境变量未进行清理(猜测环境变量占主要原因)
- 也有可能是内核中所加载的驱动未进行卸载
- 所使用的卸载命令
-
-
-
使用run方式进行驱动安装
-
-
Nvidia驱动下载(下载 NVIDIA 官方驱动 | NVIDIA)
-
删除原有驱动,安装依赖环境(Options)
-
sudo apt-get remove --purge nvidia*
-
sudo apt install gcc
-
sudo apt install make #安装驱动需要依赖
-
-
禁用nouveau驱动
-
Terminal window sudo gedit /etc/modprobe.d/blacklist.conf -
编辑 /etc/modprobe.d/blacklist-nouveau.conf 文件,末行添加
Terminal window blacklist nouveau -
Terminal window sudo update-initramfs -u #更新reboot
-
-
验证nouveau是否已禁用(没有返回信息显示,说明nouveau已被禁用)
Terminal window lsmod | grep nouveau -
安装nvidia驱动
Terminal window sudo chmod a+x NVIDIA-Linux-x86\_64-xxx.xx.run #给文件赋予执行权限sudo ./NVIDIA-Linux-x86\_64-xxx.xx.run -
安装完毕后重启验证
Terminal window nvidia-smi
-
-
Miniconda3安装
- 下载安装包Index of /anaconda/miniconda/ | 清华大学开源软件镜像站 | Tsinghua Open Source Mirror
-
Terminal window bash Miniconda3-latest-Linux-x86_64.sh #替换为所下载的sh文件名 - 换源(Options)
cuda安装
- 若使用第一中安装方式,则不需要进行此步
- 若需要手动安装,则依据CUDA Toolkit 12.6 Downloads | NVIDIA Developer进行安装
Torch安装
- 以Start Locally | PyTorch此教程为基础进行安装
- 注意cuda版本与torch的匹配,高版本cuda可以兼容低版本,为避免问题最好与之相等
- 可以直接安装cuda然后torch在虚拟环境里面安装。我不好说跨环境引用情况
性能监控
sudo apt install btop
btop
pip3 install nvitop
nvitop -m
环境测试
-
cuda检测
Terminal window nvcc -version -
Nvidia 系统管理接口
Terminal window nvidia-smi -
Torch检测
import torchif torch.cuda.is\_available():print("GPU is available")else:print("GPU is not available")
GPU 测试
-
Multi-GPU CUDA stress test(wilicc/gpu-burn: Multi-GPU CUDA stress test (github.com))
Waiting for api.github.com...简单压测意义也就是压测。
可以参考这个操作 https://www.amaxchina.com/Support/TechDocument/Detail/349
-
一次性测试所有环境情况
conda create test
conda activate test
mkdir test
cd test
nano testspeed.py
import torchimport time# 这个是娱乐局测速,没什么实际意义,但是可以测试多卡互联情况。# 获取所有可用的GPU设备device_count = torch.cuda.device_count()# 目标显存占用量target_memory_usage_gb = 30target_memory_usage_bytes = target_memory_usage_gb * 1024**3# 计算每个张量的大小(假设float32,每个元素4字节)dtype_size_in_bytes = 4 # float32elements_per_tensor = target_memory_usage_bytes // dtype_size_in_bytes // device_count # 每个GPU分配部分# 确定张量的形状(假设为2D张量)tensor_shape = (int(elements_per_tensor ** 0.5), int(elements_per_tensor ** 0.5))# 在每个GPU上创建张量,达到目标显存占用tensors = [torch.randn(tensor_shape, dtype=torch.float32, device=f'cuda:{i}') for i in range(device_count)]# 开始计时start_time = time.time()duration = 60 # 秒try:while time.time() - start_time < duration:for i in range(device_count):# 记录开始时间calc_start_time = time.time()# 对每个GPU上的张量进行矩阵乘法,保持高计算负载result = torch.matmul(tensors[i], tensors[i])# 同步操作,确保计算完成torch.cuda.synchronize()# 计算时间和速度calc_duration = time.time() - calc_start_timenum_flops = 2 * tensor_shape[0] * tensor_shape[1] * tensor_shape[1] # 矩阵乘法 FLOPs 计算flops_per_sec = num_flops / calc_duration / 1e12 # 转换为 TFLOPsprint(f"GPU {i} 计算速度: {flops_per_sec:.2f} TFLOPs, 耗时: {calc_duration:.6f} 秒")except RuntimeError as e:print(f"计算时遇到错误: {e}")print(f"计算测试完成,已运行 {duration} 秒。")
python3 testspeed.py
nano testpower.py
import torchimport timeimport pynvml# 这个主要是为了持续性压力测试,至于显存我没想占满。# 初始化pynvml以监控GPU功率pynvml.nvmlInit()# 获取所有可用的GPU设备device_count = torch.cuda.device_count()# 目标显存占用量(记得修改这个)!target_memory_usage_gb = 30target_memory_usage_bytes = target_memory_usage_gb * 1024**3# 数据类型的字节数dtype_size_in_bytes = 4 # float32# 计算每个GPU上的张量数目num_tensors = 3 # 假设创建3个张量来达到目标显存占用elements_per_tensor = target_memory_usage_bytes // dtype_size_in_bytes // device_count // num_tensors# 确定张量的形状(假设为2D张量)tensor_shape = (int(elements_per_tensor ** 0.5), int(elements_per_tensor ** 0.5))# 在每个GPU上创建多个张量,达到目标显存占用tensors = []for i in range(device_count):gpu_tensors = [torch.randn(tensor_shape, dtype=torch.float32, device=f'cuda:{i}') for _ in range(num_tensors)]tensors.append(gpu_tensors)# 检查显存占用for i in range(device_count):torch.cuda.synchronize()print(f"GPU {i} 当前显存占用: {torch.cuda.memory_allocated(i) / (1024**3):.2f} GB")# 开始计时start_time = time.time()duration = 1600 # 秒,也就是测试时间try:while time.time() - start_time < duration:elapsed_time = time.time() - start_timeremaining_time = duration - elapsed_timeprint(f"剩余时间: {remaining_time:.1f} 秒")for i in range(device_count):# 获取GPU功率handle = pynvml.nvmlDeviceGetHandleByIndex(i)power_usage = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # 转换为瓦特print(f"GPU {i} 功率: {power_usage:.2f} W")for tensor in tensors[i]:# 对每个GPU上的张量进行矩阵乘法,保持高计算负载result = torch.matmul(tensor, tensor)# 确保计算完成torch.cuda.synchronize()time.sleep(0.001) # 每秒输出一次倒计时和功率except RuntimeError as e:print(f"计算时遇到错误: {e}")print(f"计算测试完成,已运行 {duration} 秒。")# 释放pynvml资源pynvml.nvmlShutdown()
python3 testpower.py
新遇到的问题
-
cuda runtime error (802) : system not yet initialized …/THCGeneral.cpp:50
TIP
(由题目自拟闯天涯实操过程中发现的问题)
Waiting for api.github.com...- 尝试编译并运行 NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit
Waiting for api.github.com...-
Terminal window git clone https://github.com/NVIDIA/cuda-samples.gitcd cuda-samples/Samples/bandwidthTestmake./bandwidthTest -
NOTE
nvcc
is going to be located in/usr/local/cuda/bin
注意: nvcc 将位于 /usr/local/cuda/bin -
运行后结果:
-
未安装 Data Center GPU 管理器
-
Terminal window > ./bandwidthTest[CUDA Bandwidth Test] - Starting...Running on...cudaGetDeviceProperties returned 802-> system not yet initializedCUDA error at bandwidthTest.cu:256 code=802(cudaErrorSystemNotReady) "cudaSetDevice(currentDevice)" -
意味着未安装 Data Center GPU 管理器
-
Terminal window wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pinsudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600sudo apt-key adv --keyserver-options http-proxy=http://proxy-chain.intel.com:911 --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub-
若出现 unable to fetch (
网络环境好应该不会出现-
Terminal window > sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pubExecuting: /tmp/apt-key-gpghome.qjhmgicscb/gpg.1.sh --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pubgpg: requesting key from 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub'gpg: WARNING: unable to fetch URI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub: Connection timed out -
那就设置代理
Terminal window sudo apt-key adv --keyserver-options http-proxy=<PROXY-ADDRESS:PORT> --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
-
-
-
然后开始安装 datacenter-gpu-manager
Terminal window sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86\_64/ /"sudo apt-get updatesudo apt-get install -y datacenter-gpu-manager -
终止主机引擎
Terminal window sudo nv-hostengine -t -
启动 fabricManager
Terminal window sudo service nvidia-fabricmanager start-
如果出现
Terminal window > sudo service nvidia-fabricmanager startFailed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found. -
安装 Fabric Manager并启动
Terminal window sudo apt-get install cuda-drivers-fabricmanager-<version>sudo service nvidia-fabricmanager start
-
-
-
-
再次运行
bandwidthTest
Terminal window > ./bandwidthTest[CUDA Bandwidth Test] - Starting...Running on...Device 0: NVIDIA A100-SXM4-40GBQuick ModeHost to Device Bandwidth, 1 Device(s)PINNED Memory TransfersTransfer Size (Bytes) Bandwidth(GB/s)32000000 26.1Device to Host Bandwidth, 1 Device(s)PINNED Memory TransfersTransfer Size (Bytes) Bandwidth(GB/s)32000000 25.6Device to Device Bandwidth, 1 Device(s)PINNED Memory TransfersTransfer Size (Bytes) Bandwidth(GB/s)32000000 1152.7Result = PASSNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.> ipython #torch检测Python 3.7.11 (default, Jul 27 2021, 14:32:16)Type 'copyright', 'credits' or 'license' for more informationIPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.In [1]: import torchtorIn [2]: torch.cuda.is_available()Out[2]: True
具体应用
- ollama(已实现,详见另一篇文章)
- vllm(待实现,据说对多卡推理优化极好)