Building a Deep Learning Environment on CentOS with a GTX 1080 Ti: The Complete Process

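Inside a case cobbled together from the parts of many utterly ordinary machines sits a top-tier core: the NVIDIA GeForce GTX 1080 Ti. Our journey to a deep learning environment starts with how this one-of-a-kind machine was pieced together, and goes step by step from basic PC repair and assembly to a working Deep Learning setup.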

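I. Installing the system: CentOS 7.3

The bootable USB drive was made with UltraISO from the image CentOS-7-x86_64-DVD-1611.iso.

The installation itself is omitted here. For reference, see http://www.linuxidc.com/Linux/2016-09/135593.htm and http://www.linuxidc.com/Linux/2014-10/108014.htm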

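II. Preparing the graphics card and drivers

Hardware preparation

This GeForce GTX 1080 Ti needs two 8-pin power connectors and is fairly power-hungry at 280 W. The company had no spare high-wattage power supply that could meet this, and even if it had, such supplies usually do not have enough 8-pin connectors (small motherboards typically use 4-pin). The final solution was to power the card separately. From several unused power supplies we pieced together one supply (rated 270 W) with two 8-pin connectors. Each of these 8-pin connectors actually carries 3 yellow wires and 3 black wires, and the remaining two black pins are bridged from the existing black wires (see the photo below). A second supply (also rated 270 W) powers the motherboard. Besides its own 24-pin connector and drive power connectors, we spliced on one more 8-pin connector (this one made of 4 yellow and 4 black wires) to make up for the insufficient supply. (PS: if your power supply is large enough, you can skip all of this tinkering.)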

[Photo: the graphics card's 8-pin connectors]

[Photo: the motherboard's auxiliary 8-pin power connector]

[Photo: the finished machine]

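Note: motherboard power connectors currently come in 24-pin and 20-pin variants. Mid-range and high-end boards generally use a 24-pin connector, while low-end products generally use 20-pin. Either way, they plug in the same way.

In addition, as the power draw of CPUs and graphics cards has grown, motherboards have gained 4-pin or 8-pin auxiliary power connectors.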
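The power arithmetic behind this setup can be written down as a rough sketch. The 75 W slot and 150 W per 8-pin connector figures are the nominal PCI Express limits, not measurements of this card, and the function names are made up for illustration. The 280 W and 270 W values are the ones quoted above.

```python
# Nominal PCI Express power limits in watts; these are specification ceilings,
# not measurements of this particular card.
SLOT_W = 75
CONNECTOR_W = {"6pin": 75, "8pin": 150}


def deliverable_watts(connectors):
    """Upper bound on what the slot plus the listed auxiliary connectors may supply."""
    return SLOT_W + sum(CONNECTOR_W[name] for name in connectors)


def psu_headroom(psu_rated_w, load_w):
    """Spare capacity in watts; a negative value means the load exceeds the rating."""
    return psu_rated_w - load_w


CARD_W = 280      # board power quoted in the article
CARD_PSU_W = 270  # rating of the supply dedicated to the card

print("connector budget:", deliverable_watts(["8pin", "8pin"]), "W")
# Worst case: the dedicated supply carries the whole card.
print("headroom, all power via aux connectors:", psu_headroom(CARD_PSU_W, CARD_W), "W")
# Best case: the slot (fed by the motherboard's supply) contributes its full 75 W.
print("headroom, slot supplies 75 W:", psu_headroom(CARD_PSU_W, CARD_W - SLOT_W), "W")
```

The worst-case figure is negative, which is why a larger single power supply is the safer choice when one is available.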

Software preparation

Required files at a glance:

-rwxr-xr-x 1 root root   97546170 Sep 20 10:24 cuda_8.0.61.2_linux-run            CUDA patch
-rwxr-xr-x 1 root root 1465528129 Sep 20 10:27 cuda_8.0.61_375.26_linux-run       CUDA 8.0 installer
-rw-r--r-- 1 root root  201134139 Sep 20 10:27 cudnn-8.0-linux-x64-v6.0.tgz       cuDNN 6.0 (for CUDA 8.0)
-rwxr-xr-x 1 root root   80803084 Sep 20 10:27 NVIDIA-Linux-x86_64-384.69.run     NVIDIA driver

1. Preparation before installation

Update the system (this takes a while, so be patient)
```
# yum update
# reboot
```

Install the kernel-devel package

# yum install kernel-devel-$(uname -r) gcc

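Disable the nouveau driver

First, what is Nouveau, and why do some systems report "ERROR: The Nouveau kernel driver is currently in use by your system. This ..." when installing the NVIDIA driver? Nouveau is an open-source 3D driver for NVIDIA cards developed by a third party, without NVIDIA's endorsement or support. Nouveau Gallium3D is still far behind NVIDIA's proprietary driver in gaming performance, but it does make it easier for Linux to cope with all kinds of NVIDIA hardware: users get a usable desktop with decent display quality right after installing the system. For that reason many Linux distributions bundle Nouveau and install it by default when they detect an NVIDIA card. This is even more true of enterprise Linux: nearly every enterprise distribution that supports a graphical interface ships Nouveau.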
```
# echo 'blacklist nouveau' >> /etc/modprobe.d/blacklist.conf
```
Verify that nouveau has been disabled: after a reboot, if the following command prints nothing, nouveau is not loaded.
```
# lsmod | grep nouveau
```
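The same check can be scripted. The sketch below reads /proc/modules (the file that lsmod formats) and matches only the module-name column, so a line that merely lists nouveau as a dependency does not count. The function names are illustrative, not part of any tool.

```python
def module_loaded(name, modules_text):
    """Return True if `name` is the module-name field of any /proc/modules line."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's module list; return an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    if module_loaded("nouveau", read_proc_modules()):
        print("nouveau is still loaded")
    else:
        print("nouveau is not loaded")
```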
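If the method above does not work, the likely cause is a difference in how the system boots (I do not know the details). In that case nouveau can be disabled as follows: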
- Disable X Windows
```
# sudo systemctl set-default multi-user.target
```
- Remove Nouveau
```
# sudo rpm -e xorg-x11-drivers xorg-x11-drv-nouveau
```
- Blacklist Nouveau
  
  a) edit /etc/modprobe.d/blacklist.conf and add line: blacklist nouveau
  b) edit /etc/default/grub and append to GRUB_CMDLINE_LINUX: rdblacklist=nouveau
  (After step 5, if the nouveau driver is still running, try rd.driver.blacklist=nouveau here instead. Sometimes rdblacklist=nouveau fails.)
  **- If you have an encrypted root drive, remove "rhgb" from GRUB_CMDLINE_LINUX.** This will allow you to interact with the encryption passphrase prompt, since Plymouth doesn't seem to run without a framebuffer-friendly video driver loaded.
  c) The two boot options nowadays are BIOS and EFI. (Make sure that /boot and/or /boot/efi are mounted before executing these commands.)
  -If you chose BIOS boot run this command:
  grub2-mkconfig -o /boot/grub2/grub.cfg
  -If EFI boot on CentOS:
  grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
  -If EFI boot on RHEL/Scientific Linux:
  grub2-mkconfig -o /boot/efi/EFI/RedHat/grub.cfg
- Reboot and install the NVIDIA driver
- Enable X Windows (if desired; run this after the driver is installed)
```
# sudo systemctl set-default graphical.target
# sudo reboot
```
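Choosing the right grub.cfg in step c) depends on whether the machine booted through BIOS or EFI. A minimal sketch of that decision, assuming that the /sys/firmware/efi directory exists only on EFI-booted systems; the helper name and the `distro` keys are hypothetical, and the paths are the ones listed above.

```python
import os

# grub.cfg locations as listed in the steps above.
BIOS_CFG = "/boot/grub2/grub.cfg"
EFI_CFG = {
    "centos": "/boot/efi/EFI/centos/grub.cfg",
    "rhel": "/boot/efi/EFI/RedHat/grub.cfg",
}


def grub_cfg_path(distro="centos", efi_dir="/sys/firmware/efi"):
    """Pick the grub.cfg to regenerate, based on whether the EFI directory exists."""
    if os.path.isdir(efi_dir):
        return EFI_CFG[distro]
    return BIOS_CFG


if __name__ == "__main__":
    # Print the command rather than running it, so the result can be reviewed first.
    print("grub2-mkconfig -o " + grub_cfg_path())
```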
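Back up the initramfs file (the initial RAM filesystem image used at boot)

A first look at initramfs

After the Linux kernel finishes initializing, it runs the init process, and init mounts the real root filesystem. But the init program itself lives on the root filesystem, which is a chicken-and-egg problem. Linux solves it in two stages. Before Linux 2.6, a separate initrd.img image file accompanied the kernel vmlinuz. It is simply a filesystem image: after initializing, the kernel mounts initrd.img as a temporary root filesystem and runs the init found inside it, and that init mounts the real root filesystem and then unmounts initrd.img. Linux 2.6 achieves the same result differently, using initramfs.

initramfs (init RAM filesystem) is an in-memory filesystem in cpio format. There are two ways to build one. The first is described at http://blog.csdn.net/htttw/article/details/7215858; the resulting initramfs is separate from the kernel vmlinuz, so its path has to be given in the GRUB configuration. The second builds the kernel and the initramfs into a single file: in the Linux source tree run make menuconfig, go to General setup, enable "Initial RAM filesystem and RAM disk (initramfs/initrd) support", enter the initramfs directory under "Initramfs source file(s)", and then run make bzImage. A kernel built this way is a single file and needs no separate initramfs path.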
```
# sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
```
Rebuild the initramfs file (optional: after a kernel update, a normal reboot automatically generates the new kernel image under /boot)
```
# sudo dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
```
Turn off the graphical interface (if a VNC service is running, stop it first), then run:
```
# sudo systemctl disable gdm
# sudo reboot 
```

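2. Installing the graphics driver

Download the NVIDIA driver for your card model. Here we use the latest driver that supports the GTX 1080 Ti, NVIDIA-Linux-x86_64-384.69.run, from NVIDIA's official download page.

After downloading, run it and follow the prompts to install: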

# ./NVIDIA-Linux-x86_64-384.69.run

Configure X11 (the driver installer offers to run nvidia-xconfig automatically; if you answered yes there, you can skip this step)

# nvidia-xconfig 

WARNING: Unable to locate/open X configuration file.

New X configuration file written to '/etc/X11/xorg.conf'

Start the graphical interface

# systemctl enable gdm
Created symlink from /etc/systemd/system/display-manager.service to /usr/lib/systemd/system/gdm.service.
# reboot

Verification

Run the following command to check the card's status
```
# nvidia-smi
```
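For scripted checks, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below wraps that call and parses the rows. The function names are illustrative, and an empty list simply means the command was missing or failed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text):
    """Turn 'name, driver, memory' lines into a list of dictionaries."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        name, driver, memory = [field.strip() for field in line.split(",")]
        rows.append({"name": name, "driver": driver, "memory": memory})
    return rows


def query_gpus():
    """Run nvidia-smi; return [] if it is not installed or exits with an error."""
    try:
        result = subprocess.run(QUERY, stdout=subprocess.PIPE, check=True,
                                universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu["name"], "driver", gpu["driver"], "memory", gpu["memory"])
```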
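Under Applications -> Other -> NVIDIA at the top-right of the desktop, a GPU 0 entry indicates that the installation succeeded (the original screenshot was a sample found online, so its card model and driver version differ).

Note: if something goes wrong along the way, run ./NVIDIA-Linux-x86_64-384.69.run --uninstall to uninstall the driver.

3. Installing CUDA

The CUDA 8.0 installer, the CUDA 8.0 patch, and installers for other versions are available from NVIDIA's download pages.

A quick primer on CUDA: put simply, it is a platform and programming model for parallel computing on the GPU, and it can greatly improve computing performance.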

Q: What is CUDA?
CUDA® is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and supported by an installed base of hundreds of millions of CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers. Applications used in astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and other computationally intense fields are increasingly using CUDA to deliver the benefits of GPU acceleration.

Switch to text mode again
```
# init 3
```

Install the dependencies:

# yum -y install gcc-c++
# yum -y install epel-release
# yum -y install --enablerepo=epel dkms
# yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Install CUDA 8.0 (pay close attention during installation: when the installer asks whether to install the NVIDIA driver, enter n to skip that step):
```
# ./cuda_8.0.61_375.26_linux-run 
```
Install the CUDA patch
```
# ./cuda_8.0.61.2_linux-run  
```

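Set the environment variables

# vim /etc/profile

Append the following at the end of the file (LD_LIBRARY_PATH is extended rather than overwritten, so existing entries are kept)

export PATH="/usr/local/cuda-8.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export CUDA_HOME="/usr/local/cuda"

Apply the changes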

# source /etc/profile
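A quick way to confirm that the variables took effect is to inspect the environment from a fresh shell. The sketch below is only a sanity check under the paths used in this article; `cuda_env_problems` is a made-up helper name.

```python
import os


def cuda_env_problems(env, cuda_root="/usr/local/cuda-8.0"):
    """Return a list of human-readable problems found in the given environment."""
    problems = []
    bin_dir = os.path.join(cuda_root, "bin")
    lib_dir = os.path.join(cuda_root, "lib64")
    if bin_dir not in env.get("PATH", "").split(os.pathsep):
        problems.append("PATH is missing " + bin_dir)
    if lib_dir not in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        problems.append("LD_LIBRARY_PATH is missing " + lib_dir)
    if not env.get("CUDA_HOME"):
        problems.append("CUDA_HOME is not set")
    return problems


if __name__ == "__main__":
    issues = cuda_env_problems(os.environ)
    print("environment looks fine" if not issues else "\n".join(issues))
```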

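4. Installing cuDNN

A quick primer on cuDNN: put simply, cuDNN is NVIDIA's deep neural network library for deep learning. Together with CUDA, it lets us use the parallel computing power of NVIDIA GPUs directly for deep learning training.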

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK.

Download cudnn-8.0-linux-x64-v6.0.tgz

Downloading cuDNN requires an NVIDIA account; see the cuDNN home page.

If you would rather not register, you can fetch it this way:
```
wget http://developer.download.nvidia.com/compute/redist/cudnn/v6.0/cudnn-8.0-linux-x64-v6.0.tgz
```

Extract the archive and copy the files to complete the installation

# tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz 
# sudo cp cuda/include/cudnn.h /usr/local/cuda/include
# sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
# sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
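To confirm which cuDNN version was copied, the version macros in cudnn.h can be read back. This sketch assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers, which is how the cuDNN 6 header is laid out as far as I know. The function name is illustrative.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from cudnn.h text, or None if a macro is missing."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+" + key + r"\s+(\d+)", header_text,
                          re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    try:
        with open("/usr/local/cuda/include/cudnn.h") as handle:
            print(cudnn_version(handle.read()))
    except OSError:
        print("cudnn.h not found")
```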

III. Building, installing, and configuring Python 3.6.2

1. Install the dependencies

yum install gcc-c++ sqlite-devel gcc zlib zlib-devel bzip2-devel  openssl-devel ncurses-devel tcl tcl-devel tk tk-devel

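2. Install Python 3

The Python 3.6.2 source tarball can be downloaded from python.org.

Put the Python source in a harmless directory and run:

# sudo mkdir /usr/local/python3
# tar -xzvf Python-3.6.2.tgz
# cd Python-3.6.2 # enter the extracted directory
# sudo ./configure --prefix=/usr/local/python3  --enable-optimizations # install into the directory created above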

Edit the Setup file

vim Modules/Setup

The result should look like this (remove the leading comment marker "#" from these lines):

# Socket module helper for socket(2)
_socket socketmodule.c

# Socket module helper for SSL support; you must comment out the other
# socket line above, and possibly edit the SSL variable:
#SSL=/usr/local/ssl
_ssl _ssl.c \
-DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
-L$(SSL)/lib -lssl -lcrypto

Build and install (this takes a while, so be patient)

sudo make
sudo make install
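After the build, it is worth checking that the optional extension modules were actually compiled, since a missing -devel package only shows up as a module that fails to import. A minimal sketch, meant to be run with /usr/local/python3/bin/python3; the list of modules is my guess at what the dependencies above are for.

```python
import importlib

# Modules that depend on the -devel packages installed earlier (assumed mapping).
REQUIRED = ["ssl", "sqlite3", "zlib", "bz2", "curses", "tkinter"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported by this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    print("all modules built" if not absent else "missing: " + ", ".join(absent))
```

If ssl is reported missing, pip will not be able to reach PyPI over HTTPS, so that one matters most for the following steps.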

3. Coexisting with Python 2.7.5

Create a python3 symlink, so that the python command runs Python 2 and python3 runs Python 3.

sudo ln -s /usr/local/python3/bin/python3 /usr/bin/python3

Set up pip (link pip3)

sudo ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3

Upgrade pip

pip3 install --upgrade pip

IV. Installing Python machine learning and deep learning libraries

1. Install machine learning libraries

pip3 install -U wheel

pip3 install -U numpy 

pip3 install -U scipy

pip3 install -U matplotlib

pip3 install -U pandas

pip3 install -U scikit-learn

pip3 install -U ipython jupyter notebook

pip3 install -U spyder

pip3 install -U dill

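2. Install deep learning libraries

yum install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig

Note: several of these names (python-dev, zlib1g-dev, libjpeg-dev, xvfb, libav-tools, xorg-dev, libboost-all-dev, libsdl2-dev) are Debian/Ubuntu package names. On CentOS, yum will most likely report them as unavailable, so look up the CentOS equivalents (for example python-devel and zlib-devel) if you need them.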

pip3 install -U h5py

pip3 install -U tensorflow-gpu

pip3 install -U scikit-image

pip3 install -U keras

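pip3 install -U keras-rl  # reinforcement learning (optional)

pip3 install -U keras-vis # visualization (optional)

pip3 install -U gym[all]  # optional; somewhat troublesome to install and not needed yet, so it was not installed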
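Once the packages are in, a short check shows whether TensorFlow can see the GPU. The sketch below uses `device_lib.list_local_devices()` from TensorFlow 1.x, which was current when this was written. Note that, to my knowledge, TensorFlow releases after 1.4 were built against newer CUDA/cuDNN versions than the CUDA 8.0 and cuDNN 6.0 installed here, so an unpinned install may not match. The helper names are illustrative.

```python
def gpu_devices(devices):
    """Keep only the names of devices whose device_type is 'GPU'."""
    return [device.name for device in devices if device.device_type == "GPU"]


def list_tensorflow_gpus():
    """Return GPU device names seen by TensorFlow, or None if it is not installed."""
    try:
        from tensorflow.python.client import device_lib  # TensorFlow 1.x location
    except ImportError:
        return None
    return gpu_devices(device_lib.list_local_devices())


if __name__ == "__main__":
    found = list_tensorflow_gpus()
    if found is None:
        print("TensorFlow is not installed")
    elif not found:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", ", ".join(found))
```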

V. Configuring Jupyter

1. Generate the default configuration

# cd /usr/local/python3/bin
# ./jupyter notebook --generate-config
Writing default config to: /home/user/.jupyter/jupyter_notebook_config.py

2. Edit the configuration file

# vim /home/user/.jupyter/jupyter_notebook_config.py  

c.NotebookApp.ip = '*'               # listen on every IP address bound to the server
c.NotebookApp.port = 8899            # pick any port you like; the default is 8888
c.NotebookApp.open_browser = False   # we do not want a browser opened on the server itself
c.NotebookApp.notebook_dir = '/data/bigdata/notebooks/'   # notebook working directory

3. Set the Jupyter Notebook password

In a Python console (the prompts below are from IPython), enter

In [1]: from notebook.auth import passwd
In [2]: passwd()

Enter password: *******
Verify password:*******
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'   # the hashed password

Put the hashed password into the configuration

# vim /home/user/.jupyter/jupyter_notebook_config.py  

c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'  # note the leading u
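The stored value has three colon-separated parts: algorithm, salt, and hex digest. As far as I understand the classic notebook's `notebook.auth` module, the digest is computed over the passphrase followed by the salt. The sketch below re-implements that check with hashlib under this assumption; it is not the library's own code, and `check_notebook_hash` is a hypothetical name.

```python
import hashlib


def check_notebook_hash(hashed, passphrase):
    """Check a passphrase against an 'algorithm:salt:hexdigest' string.

    Assumes digest = H(passphrase + salt), which is my reading of notebook.auth.
    """
    try:
        algorithm, salt, digest = hashed.split(":", 2)
    except ValueError:
        return False  # not in the expected three-part format
    try:
        hasher = hashlib.new(algorithm)
    except ValueError:
        return False  # unknown hash algorithm
    hasher.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return hasher.hexdigest() == digest


if __name__ == "__main__":
    # Build a sample hash locally so the example does not depend on a real password.
    sample = "sha1:abc123:" + hashlib.sha1(b"secret" + b"abc123").hexdigest()
    print(check_notebook_hash(sample, "secret"))
    print(check_notebook_hash(sample, "wrong"))
```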

4. Starting Jupyter

Set it up as a service that starts at boot

# cd ~
# mkdir services
# vim jupyter.service

Put the following in jupyter.service

[Unit]
Description=Jupyter notebook env

[Service]
User=root
Group=root
# Set this variable so TensorFlow can find the CUDA libraries at startup
Environment=LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64
ExecStart=/usr/local/python3/bin/jupyter notebook --allow-root

[Install]
WantedBy=multi-user.target

Save and exit

# cp ./jupyter.service /usr/lib/systemd/system/jupyter.service
# ln -s /usr/lib/systemd/system/jupyter.service jupyter.service  # so that both files are changed together

Enable at boot

# systemctl enable jupyter.service

Start the service

# systemctl start jupyter.service
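Whether the service is really listening can be checked with a plain TCP connection attempt. A minimal sketch: the port 8899 is the one set in the configuration above, while the host and the helper name are assumptions.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1 assumes the check runs on the server itself.
    state = "reachable" if port_open("127.0.0.1", 8899) else "not reachable"
    print("Jupyter on port 8899 is " + state)
```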

PS: to start it once by hand, use the following (it must be run from the user's home directory, for example /root)

nohup /usr/local/python3/bin/jupyter notebook --allow-root  >/dev/null 2>&1 &

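VI. Testing the environment

At this point the environment needed for deep learning is complete. As a test, we trained a face recognition model on it.

Bingo. The day's work was worth it.

Problems

1. Error when running init 3

Note: if you see the following error, /boot may have run out of space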

[root@localhost GPU]# init 3

Broadcast message from systemd-journald@localhost.localdomain (Wed 2017-09-20 11:26:40 CST):

dracut[11820]: dracut: creation of /boot/initramfs-3.10.0-693.2.2.el7.x86_64kdump.img failed

Message from syslogd@localhost at Sep 20 11:26:40 ...
 dracut:dracut: creation of /boot/initramfs-3.10.0-693.2.2.el7.x86_64kdump.img failed

Check the space on /boot

[root@localhost boot]# df /boot
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/sdb1         255724 253488      2236 100% /boot

List the installed kernel packages to find the old versions

[root@localhost boot]# rpm -qa|grep kernel
kernel-3.10.0-514.26.2.el7.x86_64
kernel-tools-libs-3.10.0-693.2.2.el7.x86_64
kernel-headers-3.10.0-693.2.2.el7.x86_64
kernel-tools-3.10.0-693.2.2.el7.x86_64
kernel-devel-3.10.0-693.2.2.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64
kernel-3.10.0-514.el7.x86_64
abrt-addon-kerneloops-2.1.11-48.el7.centos.x86_64

Remove the old kernels

[root@localhost boot]# rpm -e kernel-3.10.0-514.el7.x86_64 kernel-3.10.0-514.26.2.el7.x86_64
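Picking the packages to remove can be sketched as a filter over the rpm listing: keep only entries named kernel-<version> and drop the one matching the running kernel (the output of uname -r). The function name is illustrative, and the result should still be reviewed before passing it to rpm -e.

```python
import re


def removable_kernels(packages, running_release):
    """Return kernel-<version> packages whose version differs from the running kernel."""
    pattern = re.compile(r"^kernel-(\d.*)$")  # skips kernel-tools, kernel-devel, etc.
    old = []
    for package in packages:
        match = pattern.match(package)
        if match and match.group(1) != running_release:
            old.append(package)
    return old


if __name__ == "__main__":
    installed = [
        "kernel-3.10.0-514.26.2.el7.x86_64",
        "kernel-tools-libs-3.10.0-693.2.2.el7.x86_64",
        "kernel-headers-3.10.0-693.2.2.el7.x86_64",
        "kernel-tools-3.10.0-693.2.2.el7.x86_64",
        "kernel-devel-3.10.0-693.2.2.el7.x86_64",
        "kernel-3.10.0-693.2.2.el7.x86_64",
        "kernel-3.10.0-514.el7.x86_64",
        "abrt-addon-kerneloops-2.1.11-48.el7.centos.x86_64",
    ]
    for name in removable_kernels(installed, "3.10.0-693.2.2.el7.x86_64"):
        print(name)
```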

Permanent link to this article: http://www.linuxidc.com/Linux/2017-12/149577.htm
