Selective Intellect

How to set up your own in-house HPC Cluster (and the many ways to use it)

High Performance Computing (HPC) and Systems Engineering are part of our company DNA (two of our five core areas of expertise, to be exact), so needless to say, we really enjoy setting up our own hardware clusters. Recently we came across an opportunity to purchase some really great systems from an IT liquidator, and we couldn't pass it up. We ended up buying two NVIDIA Tesla S1070 1U servers, each with 4 NVIDIA Tesla C1060 GPUs, and 8 Dell 1950 III 1U servers, each with 2 quad-core Xeon 5400 CPUs and 16 GB RAM. The specification manuals are available here for the NVIDIA system and here for the Dell system.

Even as cloud computing grows more popular, it is still very useful to have your own in-house development HPC cluster. Here's how we went about setting one up for ourselves.

We set up an effective cluster with 64 processing cores and 8 powerful GPUs. We plan to use it for prototyping our HPC products, password cracking with Hashcat, experimenting with Google's TensorFlow deep learning library, or even Litecoin mining. But building your own cluster can be useful for any purpose (need some extra heating for your office in the winter?).

The Server Rack

We need the following parts in addition to the servers to set up the cluster:

  • 1 42U Dell server rack ($250 off of Craigslist)
  • 2 1U server rails ($20 each on eBay) for the NVIDIA Tesla S1070
  • 8 1U Dell Rails for the Dell 1950 III servers (came with the servers)
  • 1 rackmount Power Distribution Unit (PDU). We used a CyberPower CPS-1220RMS, which has 12 outlets ($67 on Amazon)
  • 1 or 2 2200VA rack-mounted Uninterruptible Power Supplies (UPS). We used one CyberPower OR2200PFCRT2U ($420 on Newegg).
  • 4 NVIDIA PCI-e x16 host cards ($25 on eBay)
  • 4 Dell-NVIDIA H6GPT Molex PCI-e extension cables for the PCI-e x16 host cards ($35 each on eBay)
  • 4 PCI-e x8 to PCI-e x16 flexible riser cables for attaching the PCI-e x16 host cards to the Dell servers' PCI-e x8 slots

Each Dell server cost us $85 and each NVIDIA server cost us $165, including shipping. To connect the NVIDIA servers to the Dell servers, we needed to purchase 4 PCI-e x16 host cards and cables described above. Except for our servers and the server rack, every other item was purchased new.

Each NVIDIA server required a 110V, 16A input, and the PDU and UPS required 20A electrical sockets. So we ended up adding two extra 20A electrical lines from the mains so that they could take the load of the full rack with all the servers running at once. This should be enough for our purposes; if we add more such servers, we will need more 20A lines to handle the load at 100% utilization. The NVIDIA servers are powered separately from the Dell servers, which is a good thing since it lets us distribute the power draw across multiple UPS systems.

As far as the UPS goes, we purchased only one since we didn't plan on running anything sensitive on this cluster. Each Dell server draws about 450 W at maximum load, so supporting half the cluster (4 Dell servers) requires at least 1800 VA. We purchased one 2200VA pure sine wave UPS, which leaves headroom for other necessary peripherals like monitors, switches, and routers. Later we might add a second unit to handle the rest of the servers. A UPS with a 20A input is better for heavy loads.

In all, the total cost was about $2100, including the electrical work, which is much cheaper than a single powerful server you could buy today. But hey, this is a development cluster for ourselves, and setting it up was great fun!

Connecting the NVIDIA and Dell Servers

Each NVIDIA server exposes its 4 GPUs to a host machine through 2 PCI-E x16 host cards, which need to be installed in the host, in our case a Dell server. Each host PCI-E card is then connected to the NVIDIA server using the Dell-NVIDIA H6GPT Molex connector cable, as described in the S1070 documentation.

The Dell servers only have two PCI-E x8 connectors, but our host PCI-E cards are x16. Hence, we use a PCI-E x8-to-x16 flexible riser cable to connect each host card to the Dell server. The cable doesn't fit snugly inside the 1U chassis, so we let the PCI-E x16 end of the cable come out of the slot at the rear of the Dell server and attach the host card to it there, as shown in the figure below.
 

Connecting the PCI-e Riser cable to the host cards

Connecting the Host cards to the NVIDIA Server using the PCI-E Dell H6GPT cable

Despite these hanging cards, the server rack doors close without any problems.

Using a PCI-E x8 connector instead of a PCI-E x16 leads to slightly slower data transfer speeds, but that may not be an issue for a development cluster. The impact also depends on the programs being run: data-transfer-intensive programs may benefit from a direct PCI-E x16 connection, while compute-intensive programs may not see any difference.
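
If you want to quantify the difference, the bandwidthTest program that ships with the CUDA samples (built later in this guide) measures host-to-device and device-to-host transfer rates over the PCI-E link. A quick check might look like this once the samples are on your PATH:

## measure PCI-E transfer rates through the riser cable
$ bandwidthTest --memory=pinned

## test a specific GPU, e.g. device 1
$ bandwidthTest --device=1 --memory=pinned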

Instead of attaching one host card per Dell server, we attached two host cards per Dell server, giving us 4 GPUs per Dell server. This allows us to run CUDA or OpenCL programs that can take advantage of multiple GPUs on a single machine, as sketched below.
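
As a simple illustration (my_cuda_app is just a placeholder program name), the CUDA runtime's CUDA_VISIBLE_DEVICES environment variable controls which of the 4 GPUs a program sees once the drivers and toolkit from the next section are installed:

## list the GPUs visible to the driver (some fields may be limited on these older Tesla cards)
$ nvidia-smi -L

## run a CUDA program (placeholder name) across all 4 GPUs
$ ./my_cuda_app

## restrict the same program to GPUs 0 and 1 only
$ CUDA_VISIBLE_DEVICES=0,1 ./my_cuda_app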

Installing the Required Software

The Dell 1950 III Server is supported out of the box by Ubuntu Server 16.04 LTS (Xenial) 64-bit Linux, so we manually installed Ubuntu Server on each of the eight Dell servers.

For the two servers that had the NVIDIA host cards connected to them, we installed the NVIDIA drivers and CUDA toolkit using the following procedure.

  • First we check which version of the drivers and CUDA toolkit we need for the NVIDIA Tesla S1070. As shown in the figure, the driver version needed is 340.93 and the CUDA toolkit version is 6.5.

The NVIDIA Driver selection page for the Tesla S1070 System

  • On Ubuntu Server 16.04, the nvidia-340-dev package installs the drivers for us, so we don't have to do it manually. Hence we run the following commands as root:
$ apt-get install nvidia-340-dev libxmu-dev libglu1-mesa-dev \
    g++-4.8 gcc-4.8 libxi-dev freeglut3-dev build-essential \
    cmake gcc g++ wget pkg-config make automake autoconf \
    libtool curl

NOTE: We had to install g++-4.8 because CUDA toolkit 6.5 does not support the newer gcc/g++ 5 that ships with Ubuntu 16.04.

  • We then download the CUDA toolkit 6.5 from NVIDIA directly which has an MD5 sum of 90b1b8f77313600cc294d9271741f4da as given on the NVIDIA website. (Yes, we know MD5 is not secure, but NVIDIA still uses it.)
# download the driver binary blob
$ wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

# run the MD5 sum program to verify the checksum
$ md5sum cuda_6.5.14_linux_64.run
90b1b8f77313600cc294d9271741f4da  cuda_6.5.14_linux_64.run
  • We now extract the file so that we can selectively install only the items we want, such as the CUDA SDK and the samples. We do not want to install the NVIDIA drivers that come with this binary blob, but we do want to build all the samples so that we can test the GPUs.
## extract the cuda installer into the cuda_installer directory
$ sh ./cuda_6.5.14_linux_64.run -extract=cuda_installer

## change directory
$ cd cuda_installer

## run the SDK install binary
$ ./cuda-linux64-rel-6.5.14-18749181.run -noprompt

## setup the linker for use by everyone
$ echo /usr/local/cuda-6.5/lib64 > /etc/ld.so.conf.d/cuda_ld.conf
$ ldconfig
$ ln -s /usr/local/cuda-6.5 /usr/local/cuda

## run the sample install binary
$ ./cuda-samples-linux-6.5.14-18745345.run -noprompt -prefix=/usr/local/cuda-6.5/samples -cudaprefix=/usr/local/cuda-6.5/

## build the samples
$ cd /usr/local/cuda-6.5/samples
$ GCC=g++-4.8 EXTRA_NVCCFLAGS="-D_FORCE_INLINES" make -j4
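
The GCC=g++-4.8 setting above only affects the sample makefiles. When compiling your own CUDA code against toolkit 6.5, nvcc can be pointed at the older host compiler with its -ccbin flag (my_kernel.cu is just a placeholder file name; sm_13 is the compute capability of the Tesla C1060):

## compile your own CUDA source with the older host compiler
$ nvcc -ccbin g++-4.8 -arch=sm_13 -o my_kernel my_kernel.cu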
  • At this stage we have successfully installed the NVIDIA drivers and built the CUDA software, so let's reboot the system.

  • Once the system has booted up, let's log in again, either as root or as a regular user, and run lspci to make sure that the NVIDIA host cards and the GPUs are detected.

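The exact lspci output depends on the driver and hardware, but filtering for NVIDIA entries is enough to confirm that both host cards and all 4 attached GPUs are visible on the PCI-E bus:

## each attached GPU should appear as an NVIDIA 3D controller entry
$ lspci | grep -i nvidia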
  • Let's run lsmod to check that the nvidia driver has been loaded. You will see that the radeon driver is also loaded, but that's because the Dell server has an on-board AMD/ATI VGA unit.
$ lsmod | grep nvidia
nvidia_uvm             36864  0
nvidia              10567680  1 nvidia_uvm
drm                   360448  5 ttm,drm_kms_helper,nvidia,radeon
  • Let's set up some environment variables needed to run the CUDA samples.
## setup your .bashrc with the following
export CUDA_PATH=/usr/local/cuda
export CUDA_SAMPLES_PATH=${CUDA_PATH}/samples/bin/x86_64/linux/release
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=$CUDA_PATH/bin:$PATH:$CUDA_SAMPLES_PATH
  • Start a new bash shell and verify that the paths have been set.
## check environment
$ env | grep CUDA
CUDA_PATH=/usr/local/cuda
CUDA_SAMPLES_PATH=/usr/local/cuda/samples/bin/x86_64/linux/release

## check for CUDA SDK binary
$ which nvcc
/usr/local/cuda/bin/nvcc

## check for compiled CUDA sample application
$ which deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
  • Now let's run the deviceQuery program that is part of the CUDA SDK and make sure it is detecting all the GPUs.
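The full deviceQuery output is long, so a quick grep for the interesting lines is enough; on a Dell server with both host cards attached it should report 4 CUDA-capable devices and end with Result = PASS:

## summary check: device count, device names, and the final PASS/FAIL line
$ deviceQuery | grep -E "Detected|^Device|Result"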

With this, the setup of our development HPC cluster is complete, and we are now ready to start using it.