【技術分享】重灌後,安裝 GPU 驅動以及 Docker

Posted by Hsiangj-Jen Li on 2023-09-27
Estimated Reading Time 4 Minutes
Words 951 In Total
Viewed Times

hackmd-github-sync-badge

重灌後,安裝 GPU 驅動以及 Docker

設備規格

  • CPU : i7-10700
  • GPU : RTX-3060
  • 系統 : Ubuntu 20.04

NIVIDIA GPU 驅動

確認有沒有安裝到 GPU 驅動程式

1
nvidia-smi
1
>>> Command 'nvidia-smi' not found, but can be installed with:

安裝

安裝的教學可以參考 [Linux] Ubuntu 安裝、移除 NVIDIA 顯示卡驅動程式(Driver)教學

1
sudo add-apt-repository ppa:graphics-drivers

查詢可以安裝哪個版本的驅動

1
2
sudo apt-get update
sudo apt-cache search nvidia-driver-*

我是安裝 nvidia-driver-535,理論 install 完要自己下指令去 reboot,但是不知道為何螢幕直接黑掉,但是重開後就裝好了…

1
2
sudo apt-get install nvidia-driver-535
# sudo reboot

安裝成功後的畫面

要出現這個才算安裝成功

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 On | N/A |
| 0% 51C P8 19W / 170W | 504MiB / 12288MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1130 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 24470 G /usr/lib/xorg/Xorg 149MiB |
| 0 N/A N/A 24633 G /usr/bin/gnome-shell 34MiB |
| 0 N/A N/A 26275 G /usr/lib/firefox/firefox 258MiB |
| 0 N/A N/A 42730 G gnome-control-center 2MiB |
+---------------------------------------------------------------------------------------+

安裝 Docker + 可以執行 GPU 的 Container

要在 docker 裡面執行 GPU 的話要安裝:

  1. docker-ce(community edition)
  2. NVIDIA Container Toolkit

原則上按照 docker 官方的安裝即可,如果有出現任何 ERROR 請到下面的 ERROR 區查看解法。這邊就不提供安裝 Docker Desktop 的教學,因為好像安裝了 Docker Desktop 就沒辦法在 Container 裡面使用 GPU,參考自 nvidia-docker

確認電腦都沒有舊的 docker

1
for pkg in docker.io docker-doc docker-compose podman-docker containerd runc; do sudo apt-get remove $pkg; done

開始安裝 Docker

1
2
3
4
5
6
7
8
9
10
11
12
13
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
1
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

安裝完後在終端機打上

1
docker

應該會出現一大串,像是下面這樣,這樣就算安裝成功

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Usage:  docker [OPTIONS] COMMAND

A self-sufficient runtime for containers

Common Commands:
run Create and run a new container from an image
exec Execute a command in a running container
ps List containers
build Build an image from a Dockerfile
pull Download an image from a registry
push Upload an image to a registry
images List images
login Log in to a registry
logout Log out from a registry
search Search Docker Hub for images
version Show the Docker version information
info Display system-wide information

開始安裝 NVIDIA Container Toolkit

1
2
3
4
5
6
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
1
sudo apt-get install -y nvidia-container-toolkit
1
2
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

測試是否可以執行 GPU

執行 docker image

1
sudo docker run -it --gpus all pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime bash

進入 docker container 的終端機,類似下面的畫面後輸入 python 後就會進入 python 的 console

依序輸入下面,如果都有正常顯示就是裝成功了

1
2
3
4
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name())

ERROR

docker.socket: Failed with result ‘service-start-limit-hit’

如果打這個指令會出現上面的 ERROR 的解法

1
sudo systemctl restart docker

解決方案參考自網路上的方法,將檔案名稱更改即可 daemon.json --> daemon.conf

1
sudo mv /etc/docker/daemon.json /etc/docker/daemon.conf