Since it's been a while, I decided to upgrade my ML box to CUDA 9.0. Man, that was fun: lots of googling, multiple visits to the Ubuntu and NVIDIA forums, and reading up on several blog posts and Stack Overflow answers. Almost at the end of a long day, I am running CUDA 9.0, cuDNN 7 and GPU-enabled TensorFlow 1.5, with models built in Keras 2.1.x.
The short version: almost 80% of the problems came from lingering packages and changes made to the machine during the last install. So the key is to roll back and remove the old packages cleanly before proceeding. The final step is actually very simple, good job NVIDIA!
First we need to remove all the old packages installed.
sudo apt-get purge 'nvidia-*' -y
sudo apt-get purge 'cuda-*' -y
sudo apt-get purge 'libcuda*' -y
sudo apt-get purge 'libcudnn*' -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y
sudo apt-get update
Then remove any repos that you have added.
sudo rm /etc/apt/sources.list.d/nvidia-diag-driver-local-384.66.list
sudo rm /etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-xenial.list
Then make sure there is nothing left over.
sudo dpkg --list | grep nvidia
sudo dpkg --list | grep cuda
sudo dpkg --list | grep libcudnn
If you find any packages, use dpkg to remove them, for example:
sudo dpkg --purge libcudnn5
sudo dpkg --purge cuda-repo-ubuntu1604
sudo dpkg --purge cuda-cudart-8-0 cuda-cudart-dev-8-0 cuda-cufft-8-0 cuda-curand-8-0 cuda-cusolver-8-0 cuda-cusparse-8-0 cuda-npp-8-0 cuda-nvgraph-8-0 cuda-nvrtc-8-0 cuda-toolkit-9-0
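The three grep checks above can be folded into one filter. This is only a minimal sketch: it assumes the usual dpkg --list layout (status flags in the first column, package name in the second), and the function name leftover_packages is made up for illustration.

```shell
# leftover_packages: hypothetical helper, not part of dpkg.
# Reads `dpkg --list` style text on stdin and prints the names of packages
# whose name contains nvidia, cuda, or libcudnn.
leftover_packages() {
    # Package rows start with lowercase status flags such as "ii" or "rc";
    # the header rows start with "Desired=", "|" or "+" and are skipped.
    awk '$1 ~ /^[a-z][a-zA-Z]/ && $2 ~ /nvidia|cuda|libcudnn/ { print $2 }'
}

# Usage (prints nothing when the machine is clean):
#   dpkg --list | leftover_packages
```

Because it only prints package names, the output can be reviewed first and then handed to dpkg --purge by hand.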
Revert gcc and g++ to version 5, as the latest Theano and TF have been updated.
sudo ln -sf /usr/bin/gcc-5 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-5 /usr/bin/g++
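To confirm the symlinks took effect, the compiler's major version can be checked. A small sketch: major_version is a hypothetical helper name, and gcc -dumpversion is assumed to print a dotted version such as 5.4.0.

```shell
# major_version: hypothetical helper that keeps only the part of a
# version string before the first dot.
major_version() {
    printf '%s\n' "${1%%.*}"
}

# Usage, once the symlinks are in place:
#   major_version "$(gcc -dumpversion)"
#   major_version "$(g++ -dumpversion)"
```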
Now reboot the machine and, once it loads, make sure there are no old packages left and the nvidia kernel module is not loaded.
lsmod | grep nvidia
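A plain grep also matches related modules such as nvidia_drm or nvidia_uvm. If you want a yes/no answer for one exact module name, something like the following may help. It is a sketch only, and module_loaded is a hypothetical helper name.

```shell
# module_loaded: hypothetical helper. Reads `lsmod` output on stdin and
# succeeds (exit status 0) only if a module with exactly the given name is listed.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage:
#   if lsmod | module_loaded nvidia; then echo "nvidia is loaded"; fi
```

The same check works in both directions: it should fail before the install and succeed after the final reboot.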
Now install the CUDA repo package and add the CUDA GPG keys before installing the cuda meta package.
sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda-9-0 -y
This option seems to have been significantly improved: it automatically installed the correct NVIDIA drivers (390.30) via the cuda-drivers package and the BLAS package (cuda-cublas-9-0) without any mucking around from the user. It does take a while, though.
Once it's complete, go ahead and reboot the machine. Once it's back up, you should have the nvidia module loaded.
lsmod | grep nvidia
nvidia_uvm            761856  4
nvidia_drm             40960  0
nvidia_modeset       1093632  1 nvidia_drm
drm_kms_helper        155648  1 nvidia_drm
drm                   364544  3 drm_kms_helper,nvidia_drm
nvidia              14327808  494 nvidia_modeset,nvidia_uvm
ipmi_msghandler        49152  3 ipmi_ssif,nvidia,ipmi_si
Also run nvidia-smi and nvcc -V to confirm the driver and toolkit versions.
nvidia-smi
Fri Mar  2 00:20:04 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   37C    P0    33W / 166W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   39C    P5    15W / 166W |      0MiB /  8119MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
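If you want to check the toolkit version from a script rather than by eye, the release number can be pulled out of the nvcc output. A minimal sketch, assuming the "release X.Y," wording shown above; nvcc_release is a hypothetical helper name.

```shell
# nvcc_release: hypothetical helper. Reads `nvcc -V` output on stdin and
# prints the toolkit release number from the "release X.Y," line.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage:
#   nvcc -V | nvcc_release
```

Fed the nvcc output above, this prints 9.0.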
If any of the above does not work, remember to update the PATH variables in .bashrc to point to the CUDA 9.0 folder.
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/include${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME="/usr/local/cuda"
export MKL_THREADING_LAYER=GNU
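The ${PATH:+:${PATH}} form in the exports above appends a colon and the old value only when the variable is already set and non-empty, which avoids leaving a stray trailing colon. Here is a small sketch of the same expansion in isolation; prepend_path is a hypothetical name.

```shell
# prepend_path: hypothetical helper demonstrating the ${var:+...} expansion.
prepend_path() {
    # $1: directory to put first; $2: existing colon-separated list (may be empty).
    # ${2:+:$2} expands to ":<old value>" only when $2 is non-empty.
    printf '%s%s\n' "$1" "${2:+:$2}"
}

# Usage:
#   export PATH="$(prepend_path /usr/local/cuda-9.0/bin "$PATH")"
```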
If you have come this far, installing Theano or TensorFlow is pretty trivial these days thanks to the Anaconda Python distribution. In my case I use the Miniconda installer and then install the required packages and dependencies.
wget --quiet http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
/bin/bash ~/miniconda.sh -b -p /opt/conda
export PATH="/opt/conda/bin:$PATH"
conda install --quiet --yes keras tensorflow theano ipython pandas scipy scikit-learn mkl-service
export MKL_THREADING_LAYER=GNU
MKL_THREADING_LAYER is only needed for Theano.
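A quick way to confirm that TensorFlow actually sees the GPUs is to list its devices and count the GPU entries. This is a sketch that assumes TensorFlow 1.x, where list_local_devices() lives in tensorflow.python.client.device_lib; count_gpus is a hypothetical helper name.

```shell
# count_gpus: hypothetical helper. Reads one device type per line on stdin
# and prints how many lines are exactly "GPU".
count_gpus() {
    grep -cx 'GPU' || true   # grep exits 1 on zero matches; still print the 0
}

# Usage (not run here; needs the conda environment installed above):
#   python -c 'from tensorflow.python.client import device_lib
#   for d in device_lib.list_local_devices(): print(d.device_type)' | count_gpus
```

If the count is 0 while nvidia-smi lists the cards, the CUDA library paths in .bashrc are the first thing to recheck.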