Sometime over the weekend my lab vCenter Server Appliance stopped working. Attempting to log in produced the infamous 503 error:

"503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x7f0ef806c180] _serverNamespace = / _isRedirect = false _pipeName =/var/run/vmware/vpxd-webserver-pipe)"

Issue 1: Disk full
Log in to the vCenter Server Appliance through SSH and enable bash shell with the following command

shell.set --enabled true
shell

Once in the shell, I ran df and found that a couple of the /var/log partitions were full. After a few rounds of du -sh * I traced the problem to a 3.6 GB audit.log file under /var/log/audit.
After deleting that and a few other large log files, I thought I had resolved the issue.
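
Repeating du -sh * one directory at a time gets tedious, so here is a minimal sketch of a helper that lists the biggest files in one go. The function name largest_files and its defaults are my own for illustration, and it assumes find, du and sort behave as they do on a typical Linux system.

```shell
# Print the N largest regular files below a directory, sizes in KiB, largest first.
# -xdev keeps find on a single filesystem, so a full log partition is examined on its own.
largest_files() {
    local dir="${1:-/var/log}"
    local count="${2:-10}"
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}
```

Calling largest_files /var/log/audit 5 would show the five largest files there, which in my case would have pointed straight at audit.log.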

Issue 2: root user password
On reboot the system was still failing. I checked the audit.log file again and saw authentication failures; based on this VMware KB, it looked like I needed to change the password for the root user. chage -l root shows the password aging details for root (passwd is the usual command for setting a new password). Once this is done, reboot again.
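
To check the expiry date without reading the whole chage listing, something like the following sketch works. The function name password_expiry is made up for illustration, and it assumes the usual "Password expires : <value>" line that chage -l prints.

```shell
# Extract the "Password expires" value from `chage -l <user>` output read on stdin.
password_expiry() {
    awk -F': ' '/^Password expires/ { print $2 }'
}
```

Usage would be chage -l root | password_expiry, which prints either a date or "never".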

Issue 3: vmware-vpxd service still fails; its dependency vmware-invsvc fails because its own dependency vmware-eam fails.

Stderr =
2018-06-05T22:31:09.816Z   {
    "resolution": null,
    "detail": [
        {
            "args": [
                "Command: ['/sbin/service', u'vmware-eam', 'start']\nStderr: "
            ],
            "id": "install.ciscommon.command.errinvoke",
            "localized": "An error occurred while invoking external command : 'Command: ['/sbin/service', u'vmware-eam', 'start']\nStderr: '",
            "translatable": "An error occurred while invoking external command : '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
ERROR:root:Unable to start service vmware-eam, Exception: {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vmware-eam"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-eam'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Unable to start service vmware-eam, Exception: {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vmware-eam"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-eam'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}

After more mucking around I found /var/log/vmware/eam/eam.log complaining about empty fields, and when I looked at /etc/vmware-eam/eam.properties the file was empty.
Thanks to some kind soul on the VMware forums I found a template file and punched in my details.
Unfortunately the template has an error on line 34: the key should start with eam, not eameam.

The four lines you need to change are:

eameam.resourcebundle.filename=eam-resourcebundle.jar --> eam.resourcebundle.filename=eam-resourcebundle.jar
cm.url=http://localhost:18090/cm/sdk/?hostid=fa3bd34a-4b3c-4ff0-ad4f-1c97ad6f93b5  (from cat /etc/vmware/install-defaults/sca.hostid)
vc.tunnelSdkUri.template=https://##{VC_HOST_NAME}##:8089/sdk/vimService  (from cat hostname)
vc.tunnelSdkUri=https://vcenter:8089/sdk/vimService  (same as above)

After this the eam service started:
service-control --start vmware-eam
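
The eameam typo can also be fixed without opening an editor. This is a minimal sketch, assuming a sed that accepts -i with a backup suffix; the function name fix_eam_prefix is hypothetical.

```shell
# Rewrite keys that start with the doubled "eameam." prefix so they start with "eam.".
# A copy of the original file is kept next to it with a .bak suffix.
fix_eam_prefix() {
    local file="$1"
    sed -i.bak 's/^eameam\./eam./' "$file"
}
```

On the appliance this would be run as fix_eam_prefix /etc/vmware-eam/eam.properties before starting the service.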

Issue 4: Moving on to vmware-invsvc, which was also stuck.
The log file had the following error:


[WrapperListener_start_runner  ERROR com.vmware.vim.vcauthenticate.servlets.AuthenticationHelper  opId=] Hit ServiceFaultException while fetching admin group for the SSO Admin user : [email protected] and
[WrapperListener_start_runner WARN org.springframework.context.support.ClassPathXmlApplicationContext opId=] Exception encountered during context initialization - cancelling refresh attempt
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'vlsi-server' defined in class path resource [server/config/server-config.xml]: Cannot create inner bean 'com.vmware.vim.vmomi.server.http.impl.FilterImpl#2ad6d4be' of type [com.vmware.vim.vmomi.server.http.impl.FilterImpl] while setting bean property 'filters' with key [0]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.vmware.vim.vmomi.server.http.impl.FilterImpl#2ad6d4be' defined in class path resource[server/config/server-config.xml]: Cannot resolve reference to bean 'authFilter' while setting bean property 'filter'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'authFilter' defined in class path resource [server/config/server-config.xml]: Cannot resolve reference to bean 'authChecker' while setting bean property 'authChecker'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'authChecker' defined in class path resource [server/config/security-config.xml]: Cannot resolve reference to bean 'userSessionManager' while setting bean property 'userSessionManager'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'userSessionManager' defined in class path resource [server/config/security-config.xml]: Cannot resolve reference to bean 'authorizationManager' while setting bean property 'authorizationManager'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'authorizationManager' defined in class path resource [server/config/security-config.xml]: Cannot resolve reference to bean 'authProvider' while setting bean property 'dataProvider'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'authProvider' defined in class path resource 
[server/config/security-config.xml]: Cannot resolve reference to bean 'memCache' while setting bean property 'parentChainCache'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'memCache' defined in class path resource [server/config/security-config.xml]: Cannot resolve reference to bean 'globalAclLotusCache' while setting bean property 'globalAclLotusCache'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'aclLotusInitializer' defined in class path resource [server/config/authorization-config.xml]: Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [com.vmware.vim.query.server.accesscontrol.impl.LotusInitializer]: Constructor threw exception; nested exception is java.lang.RuntimeException: com.vmware.identity.interop.ldap.InvalidCredentialsLdapException: Invalid credentials LDAP error [code: 49] at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveInnerBean(BeanDefinitionValueResolver.java:287)

I followed the VMware KB and reset the password with /usr/lib/vmware-vmdir/bin/vdcadmintool (option 3), then set that new password in the registry:


/opt/likewise/bin/lwregshell
cd HKEY_THIS_MACHINE\services\vmdir\
set_value dcAccountPassword "new password"
quit

One final reboot, the login page came up, and I was able to log in. It's all working! It only took 4+ hours. Time to upgrade to 6.5 or 6.7 🙂

If you are behind a proxy and want to proxy the Docker registry, or you have multiple machines pulling the same images over and over (CI/CD, ML/DL, etc.) and just want to cache them locally, the following is a good choice.

Create a folder named docker-registry-local-cache, create a docker-compose.yml file in it as follows, and customize it with your environment variables.

vi docker-compose.yml

version: "2"
services:
  registry2:
    image: registry:2
    ports:
      - 5000:5000
    environment:
      - REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io
      - HTTP_PROXY=example.com:80
      - HTTPS_PROXY=example.com:80
      - NO_PROXY=localhost,127.0.0.1,10.0.0.7
      - no_proxy=localhost,127.0.0.1,10.0.0.7
    volumes:
      - /opt/registry:/var/lib/registry

run the container with

docker-compose up -d

run
docker logs dockerregistrylocalcache_registry2_1
and you should see the following

time="2018-04-04T23:18:28Z" level=info msg="Registry configured as a proxy cache to https://registry-1.docker.io" go.version=go1.7.6 instance.id=76e861f3-cd4c-463e-880f-847d152cb565 version=v2.6.2
time="2018-04-04T23:18:28Z" level=info msg="listening on [::]:5000" go.version=go1.7.6 instance.id=76e861f3-cd4c-463e-880f-847d152cb565 version=v2.6.2

run
curl http://10.0.0.7:5000/v2/_catalog
which should output something similar to this
{"repositories":[]}

Next configure your docker client to use this mirror. See this previous post on how to do that.

Once the client side is configured, you can pull an image from the remote Docker Hub via your local mirror. For example, run
docker pull ubuntu:17.10
like you normally would, then run
curl http://10.0.0.7:5000/v2/_catalog
again to see the following
{"repositories":["library/ubuntu"]}

This should significantly improve the speed of any subsequent pull from the local clients. Hope someone finds this useful.
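
If you want to script the "did it get cached?" check, here is a rough sketch that looks for a repository name in the catalog JSON. The function name catalog_has_repo is my own, and it does a plain string match on the quoted name rather than real JSON parsing, so treat it as a convenience check only.

```shell
# Succeed if the /v2/_catalog JSON on stdin lists the given repository name.
# -F matches the quoted name as a fixed string, -q reports only via the exit status.
catalog_has_repo() {
    local repo="$1"
    grep -qF "\"$repo\""
}
```

For example: curl -s http://10.0.0.7:5000/v2/_catalog | catalog_has_repo library/ubuntu && echo cached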

Create the folder /etc/systemd/system/docker.service.d/ if it does not exist yet.

To add a local registry mirror, add an override.conf file to the folder with the following config


sudo mkdir -p /etc/systemd/system/docker.service.d/
sudo vi /etc/systemd/system/docker.service.d/override.conf


[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --registry-mirror=http://10.0.0.7:5000
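
The same drop-in can be generated from a script, which is handy when several clients need it. This is a sketch only: the function name write_mirror_dropin is hypothetical, and the dockerd path and -H fd:// flag are simply copied from the config above. The empty ExecStart= line matters, because systemd needs the inherited command cleared before a new one is set.

```shell
# Write an override.conf drop-in that points dockerd at a registry mirror.
# $1 = drop-in directory, $2 = mirror URL.
write_mirror_dropin() {
    local dropin_dir="$1"
    local mirror_url="$2"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --registry-mirror=$mirror_url
EOF
}
```

Run as root, write_mirror_dropin /etc/systemd/system/docker.service.d http://10.0.0.7:5000 would produce the file shown above.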

To add a proxy to Docker, add http-proxy.conf with the following config


sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf


[Service]
Environment="HTTP_PROXY=http://proxy.example.com:80/" "HTTPS_PROXY=http://proxy.example.com:80/" "NO_PROXY=localhost,127.0.0.0,10.0.0.7"

I am excluding 10.0.0.7 because that is my local registry mirror 😉 In both cases you need to reload systemd and restart the Docker daemon for the changes to take effect.


sudo systemctl daemon-reload
sudo service docker restart

After the restart, run
sudo docker system info
and check the output to confirm the changes have taken effect


..
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://proxy.example.com:80/
Https Proxy: http://proxy.example.com:80/
No Proxy: localhost,127.0.0.0,10.0.0.7
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
http://10.0.0.7:5000/
Live Restore Enabled: false
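
Rather than eyeballing the whole listing, the mirror entries can be pulled out with a small awk filter. A sketch, assuming the mirrors are printed as URL lines directly under a "Registry Mirrors:" heading as in the output above; the function name list_registry_mirrors is made up.

```shell
# Print the URLs listed under "Registry Mirrors:" in `docker system info` output (stdin).
list_registry_mirrors() {
    awk '
        /Registry Mirrors:/ { in_list = 1; next }            # heading found, start collecting
        in_list && /^[ \t]*https?:\/\// {                     # URL line inside the list
            gsub(/^[ \t]+/, ""); print; next
        }
        { in_list = 0 }                                       # any other line ends the list
    '
}
```

Usage: sudo docker system info | list_registry_mirrors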

Since it's been a while, I decided to upgrade my ML box to CUDA 9.0. Man, that was fun: lots of googling, multiple visits to the Ubuntu and NVIDIA forums, and reading several blog posts and Stack Overflow answers. Almost at the end of a long day I am running CUDA 9.0, cuDNN 7 and GPU-enabled TensorFlow 1.5, with models in Keras 2.1.x.

The short version: almost 80% of the problems came from lingering packages and changes made to the machine during the last install. So the key is to roll back and remove the old packages cleanly before proceeding. The final step is actually very simple; good job, NVIDIA!

First we need to remove all the old packages installed.


sudo apt-get purge 'nvidia-*' -y
sudo apt-get purge 'cuda-*' -y
sudo apt-get purge 'libcuda*' -y
sudo apt-get purge 'libcudnn*' -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y
sudo apt-get update

Then remove any repos that you have added


sudo rm /etc/apt/sources.list.d/nvidia-diag-driver-local-384.66.list
sudo rm /etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-xenial.list

Then make sure there is nothing left over.


sudo dpkg --list | grep nvidia
sudo dpkg --list | grep cuda
sudo dpkg --list | grep libcudnn
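
The three greps can be folded into one filter that prints just the package names, ready to hand to dpkg --purge. This is a sketch: the function name leftover_gpu_packages is my own, and it assumes the standard dpkg --list layout with a two-letter status in the first column and the package name in the second.

```shell
# From `dpkg --list` output on stdin, print names of leftover NVIDIA/CUDA/cuDNN packages.
# Header lines are skipped because their first field is not a two-letter status code.
leftover_gpu_packages() {
    awk '$1 ~ /^[a-z][a-zA-Z]$/ && $2 ~ /^(nvidia|cuda|libcuda|libcudnn)/ { print $2 }'
}
```

Usage: dpkg --list | leftover_gpu_packages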

If you find any packages, use dpkg to remove them, e.g.:


sudo dpkg --purge libcudnn5
sudo dpkg --purge cuda-repo-ubuntu1604
sudo dpkg --purge cuda-cudart-8-0 cuda-cudart-dev-8-0 cuda-cufft-8-0 cuda-curand-8-0 cuda-cusolver-8-0 cuda-cusparse-8-0 cuda-npp-8-0 cuda-nvgraph-8-0 cuda-nvrtc-8-0 cuda-toolkit-9-0

Revert gcc and g++ to version 5, since the latest Theano and TensorFlow have been updated to work with it.


sudo ln -sf /usr/bin/gcc-5 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-5 /usr/bin/g++

Now reboot the machine, and once it loads make sure there are no old packages and the nvidia kernel module is not loaded


lsmod | grep nvidia

Now install the CUDA repo package and add the CUDA GPG keys before installing the CUDA meta package.


sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda-9-0 -y

This route seems to have been significantly improved: it automatically installed the correct NVIDIA drivers (390.30) via the cuda-drivers package and the BLAS package (cuda-cublas-9-0) without any mucking around from the user. It does take a while, though.

Once it's complete, go ahead and reboot the machine; once it's back up you should have the nvidia module loaded


lsmod | grep nvidia
nvidia_uvm            761856  4
nvidia_drm             40960  0
nvidia_modeset       1093632  1 nvidia_drm
drm_kms_helper        155648  1 nvidia_drm
drm                   364544  3 drm_kms_helper,nvidia_drm
nvidia              14327808  494 nvidia_modeset,nvidia_uvm
ipmi_msghandler        49152  3 ipmi_ssif,nvidia,ipmi_si
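
A plain grep for nvidia also matches nvidia_drm, nvidia_uvm and even ipmi_msghandler (through its "used by" column), so for a scripted before/after check I would match the module name exactly. A minimal sketch; the function name nvidia_module_loaded is hypothetical.

```shell
# Succeed only if the core "nvidia" module appears in lsmod output read on stdin.
# lsmod prints the module name in the first column, so compare that field exactly.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}
```

Usage: lsmod | nvidia_module_loaded && echo "nvidia module is loaded"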

Also run nvidia-smi and nvcc -V


 nvidia-smi
Fri Mar  2 00:20:04 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   37C    P0    33W / 166W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   39C    P5    15W / 166W |      0MiB /  8119MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
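
To confirm from a script that the toolkit on the PATH really is 9.0, the release number can be pulled out of that last line. A sketch, assuming the "release X.Y," wording shown above; the function name cuda_release is made up.

```shell
# Print the CUDA release number (e.g. 9.0) found in `nvcc -V` output read on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}
```

Usage: nvcc -V | cuda_release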

If any of the above does not work, remember to update the PATH variables in .bashrc to point to the CUDA 9.0 folder


export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/include${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME="/usr/local/cuda"
export MKL_THREADING_LAYER=GNU
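
The ${PATH:+:${PATH}} part of those export lines looks odd but is deliberate: it adds the colon and the old value only when the variable is already set and non-empty, so you never end up with a trailing colon (an empty entry in a search path is treated as the current directory). Here is a small sketch of the same expansion wrapped in a function; the name prepend_dir is mine.

```shell
# Join a new directory and an existing search path the same way the export lines do.
# $1 = directory to put first, $2 = current value of the variable (may be empty).
prepend_dir() {
    local new_dir="$1"
    local current="$2"
    printf '%s\n' "${new_dir}${current:+:${current}}"
}
```

With an empty second argument it prints just the new directory; with an existing value it prints the two joined by a single colon.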

If you have come this far, installing Theano or TensorFlow is pretty trivial these days thanks to the Anaconda Python distribution. In my case I use the Miniconda installer and then install the required packages and dependencies.


wget --quiet http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
/bin/bash ~/miniconda.sh -b -p /opt/conda
export PATH="/opt/conda/bin:$PATH"
conda install --quiet --yes keras tensorflow theano ipython pandas scipy scikit-learn mkl-service
export MKL_THREADING_LAYER=GNU

MKL_THREADING_LAYER is only needed for Theano.

I finally upgraded from my previous GTX 980 Ti to a GTX 1070 last week. Unfortunately that meant revisiting some of my previous issues with Ubuntu and various incompatibilities among the graphics drivers and CUDA components. In any case, I decided that this time I would document this stuff more cleanly so I can refer to it later.

My Setup:

My previous primary desktop, currently re-purposed for machine learning, Docker experiments, etc.

Hardware:

  • Intel(R) Core(TM) i7-3770K CPU
  • ASUS MAXIMUS IV GENE-Z Motherboard
  • Nvidia GTX 1070

Software:

  • Ubuntu 16.04
  • Nvidia driver 367.35
  • CUDA 8.0 RC
  • Anaconda/Theano/Keras native and as docker containers

Steps:

After replacing the 980 Ti card with the 1070 I rebooted the machine and it just went into a crash and backtrace loop, as the previous NVIDIA driver (nvidia-352) did not support the GTX 1070.

Step one was recovering the system: boot into recovery mode, enable networking (optional), and drop into a root shell.

sudo apt-get purge 'nvidia-*'
sudo apt-get autoremove
sudo reboot

This will remove the previous nvidia drivers and dependencies and allow you to do a fresh install of the drivers.

If you haven't done so already, make sure you are running gcc 4.9 to avoid compile errors with Theano

sudo apt-get install gcc-4.9 g++-4.9
sudo ln -s  /usr/bin/gcc-4.9 /usr/bin/gcc -f
sudo ln -s  /usr/bin/g++-4.9 /usr/bin/g++ -f

Download CUDA 8.0 RC using the runfile (local) installer. When installing CUDA 8.0, decline the offer to install the NVIDIA drivers, then reboot.

Install NVIDIA 367.35 drivers

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-367
sudo reboot

At this point you should be able to run nvidia-smi and get results like this

Thu Aug  4 01:19:40 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0     Off |                  N/A |
|  0%   40C    P8    11W / 166W |    103MiB /  8112MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4921     C   ...y/anaconda3/envs/keras104_py27/bin/python  101MiB |
+-----------------------------------------------------------------------------+
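
Since the whole mess started with a driver that was too old for the card, it can be worth checking the driver version in a script. A rough sketch: both function names are hypothetical, the minimum version you pass in is your own choice (I use the 367.35 from this post purely as an example), and version_at_least relies on sort -V being available, as it is in GNU coreutils.

```shell
# Print the driver version from the nvidia-smi banner read on stdin.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# sort -V orders version strings, so the smaller of the two comes out first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: version_at_least "$(nvidia-smi | driver_version)" 367.35 && echo "driver is new enough"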

At this point make sure you have the binary and library paths set up correctly and that nvcc is working fine. Adding the following to your .bashrc should do the trick.

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

nvcc -V should give you

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26

and you should be able to run the example here and get “Used the gpu” as output.

My several hours of research in a 10min post 🙂