After a reinstall: installing the GPU driver and Docker
CPU : i7-10700
GPU : RTX-3060
OS : Ubuntu 20.04
Check whether the GPU driver is installed
    >>> Command 'nvidia-smi' not found, but can be installed with:
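The same check can be scripted. This is only a minimal sketch: `driver_status` is a name I made up, and treating a non-zero exit code from `nvidia-smi` as "broken" is my assumption, not something the driver documents in these terms.

```python
import shutil
import subprocess


def driver_status(tool="nvidia-smi"):
    """Report whether the NVIDIA driver CLI is missing, broken, or ok."""
    path = shutil.which(tool)  # None when the tool is not on PATH
    if path is None:
        return "missing"
    try:
        result = subprocess.run([path], capture_output=True, text=True)
    except OSError:
        return "broken"
    # Assumption: a non-zero exit code means the driver is not usable.
    return "ok" if result.returncode == 0 else "broken"


if __name__ == "__main__":
    print(driver_status())
```

On a fresh install like the one above, this should print `missing` until the driver package is installed.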
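For the installation steps, see the tutorial "[Linux] Ubuntu 安裝、移除 NVIDIA 顯示卡驅動程式(Driver)教學" (a guide to installing and removing the NVIDIA graphics driver on Ubuntu).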
    sudo add-apt-repository ppa:graphics-drivers
    sudo apt-get update
    sudo apt-cache search 'nvidia-driver-*'
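I installed nvidia-driver-535. In theory you have to run the reboot yourself once the install finishes, but for some reason my screen simply went black. After restarting, the driver was installed...
    sudo apt-get install nvidia-driver-535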
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0  On |                  N/A |
    |  0%   51C    P8              19W / 170W |    504MiB / 12288MiB |      5%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0   N/A  N/A      1130      G   /usr/lib/xorg/Xorg                           35MiB |
    |    0   N/A  N/A     24470      G   /usr/lib/xorg/Xorg                          149MiB |
    |    0   N/A  N/A     24633      G   /usr/bin/gnome-shell                         34MiB |
    |    0   N/A  N/A     26275      G   /usr/lib/firefox/firefox                    258MiB |
    |    0   N/A  N/A     42730      G   gnome-control-center                          2MiB |
    +---------------------------------------------------------------------------------------+
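If you only want the GPU name, driver version, and memory rather than the whole table, `nvidia-smi` also has a CSV query mode. Here is a rough sketch: `parse_gpu_rows` and `query_gpus` are hypothetical helper names, and the sample line is what I would expect this machine to print, not captured output.

```python
import subprocess

# nvidia-smi's CSV query mode; these three field names are standard query fields.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Turn 'name, driver, memory' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        rows.append({"name": name, "driver": driver, "memory": memory})
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (needs a working driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)


# Assumed sample line for the card in this post, used instead of a live query.
SAMPLE = "NVIDIA GeForce RTX 3060, 535.104.05, 12288 MiB\n"
print(parse_gpu_rows(SAMPLE))
```

Call `query_gpus()` on the real machine. The parser is kept separate so it can be tried without a GPU.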
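Install Docker and GPU-capable containers
To use the GPU inside Docker, you need to install:
docker-ce (Community Edition)
NVIDIA Container Toolkit
In general, just follow the official Docker instructions. If any error comes up, check the ERROR section below for a fix. I won't cover installing Docker Desktop here, because it seems that once Docker Desktop is installed you can't use the GPU inside a container (according to nvidia-docker).
Make sure no old Docker packages are left on the machine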
    for pkg in docker.io docker-doc docker-compose podman-docker containerd runc; do sudo apt-get remove $pkg; done
Start installing Docker
    # Add Docker's official GPG key:
    sudo apt-get update
    sudo apt-get install ca-certificates curl gnupg
    sudo install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    sudo chmod a+r /etc/apt/keyrings/docker.gpg

    # Add the repository to Apt sources:
    echo \
      "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
      "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
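The `echo` step is the least readable part, so here is a small sketch of what it ends up writing to /etc/apt/sources.list.d/docker.list. `docker_apt_source` is a made-up helper. I'm assuming `amd64` and `focal`, which is what `dpkg --print-architecture` and `VERSION_CODENAME` give on this Ubuntu 20.04 box, and the real line may differ in whitespace.

```python
def docker_apt_source(arch, codename, keyring="/etc/apt/keyrings/docker.gpg"):
    """Build the apt source line that the echo command generates."""
    return (
        f"deb [arch={arch} signed-by={keyring}] "
        f"https://download.docker.com/linux/ubuntu {codename} stable"
    )


# Assumed values for an x86_64 machine running Ubuntu 20.04.
print(docker_apt_source("amd64", "focal"))
# deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu focal stable
```

If `apt-get update` complains about the Docker repository, comparing the file against this line is a quick first check.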
    sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
    Usage:  docker [OPTIONS] COMMAND

    A self-sufficient runtime for containers

    Common Commands:
      run         Create and run a new container from an image
      exec        Execute a command in a running container
      ps          List containers
      build       Build an image from a Dockerfile
      pull        Download an image from a registry
      push        Upload an image to a registry
      images      List images
      login       Log in to a registry
      logout      Log out from a registry
      search      Search Docker Hub for images
      version     Show the Docker version information
      info        Display system-wide information
    curl -fsSL https:
      && curl -s -L https:
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
      && \
        sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
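As far as I know, `nvidia-ctk runtime configure --runtime=docker` works by registering an `nvidia` runtime in /etc/docker/daemon.json. A quick way to check that, as a sketch only: `has_nvidia_runtime` is a hypothetical helper, and the sample JSON is my assumption of what the file looks like afterwards, not a copy from my machine.

```python
import json


def has_nvidia_runtime(config_text):
    """Return True if a daemon.json document registers an 'nvidia' runtime."""
    config = json.loads(config_text)
    return "nvidia" in config.get("runtimes", {})


# Assumed shape of /etc/docker/daemon.json after nvidia-ctk has run.
SAMPLE = '{"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}'
print(has_nvidia_runtime(SAMPLE))
```

On the real machine you would pass the file contents instead of `SAMPLE`, for example `has_nvidia_runtime(open("/etc/docker/daemon.json").read())`.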
Test whether the GPU works
Run the Docker image
    sudo docker run -it --gpus all pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime bash
Once you are in the container's terminal, type python to enter the Python console.
    import torch

    print(torch.cuda.is_available())
    print(torch.cuda.get_device_name())
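The same test can probably be done in one shot, without an interactive shell, by passing the Python snippet straight to `docker run`. This sketch only builds and prints the command. `gpu_check_command` is a name I made up, and adding `--rm` (delete the container afterwards) is my choice, not part of the original steps.

```python
import shlex

IMAGE = "pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime"
# The same checks as above, squeezed into a single python -c argument.
CHECK = (
    "import torch; "
    "print(torch.cuda.is_available()); "
    "print(torch.cuda.get_device_name())"
)


def gpu_check_command(image=IMAGE):
    """Build a non-interactive docker command that runs the torch GPU check."""
    return ["sudo", "docker", "run", "--rm", "--gpus", "all",
            image, "python", "-c", CHECK]


# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(gpu_check_command()))
```

If everything is set up, running the printed command should show `True` followed by the GPU name.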
docker.socket: Failed with result 'service-start-limit-hit'
If running the command below gives the error above, here is the fix
    sudo systemctl restart docker
The fix comes from a solution I found online: just rename the file daemon.json
--> daemon.conf
    sudo mv /etc/docker/daemon.json /etc/docker/daemon.conf