Tensorflow 2.0.1

Default environment for Tensorflow w/ Keras

This notebook builds a reusable environment for Tensorflow, based on the Python 3 environment. Tensorflow is compiled from source here to make use of SIMD instruction sets and the cuDNN, NCCL, and TensorRT CUDA libraries.

If the end state of the runtime in which Tensorflow was compiled is needed, the Build Py3 TF environment is also exported. In addition, the wheel installation file of this compiled Tensorflow is available for download here:

Showcase

Plain Tensorflow

We'll follow the deep convolutional generative adversarial networks (DCGAN) example by Aymeric Damien, from the Tensorflow Examples project, to generate digit images from a noise distribution.

Reference paper: Unsupervised representation learning with deep convolutional generative adversarial networks. A Radford, L Metz, S Chintala. arXiv:1511.06434.

First, parameters.

# MNIST Dataset parameters.
num_features = 784 # data features (img shape: 28*28).
# Training parameters.
lr_generator = 0.0002
lr_discriminator = 0.0002
training_steps = 20000
batch_size = 128
report_step = 1000
display_step = 2500
# Network parameters.
noise_dim = 100 # Noise data points.
0.1s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Setup Data

from __future__ import absolute_import, division, print_function
import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np
# Prepare MNIST data.
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Convert to float32.
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
# Normalize images value from [0, 255] to [0, 1].
x_train, x_test = x_train / 255., x_test / 255.
3.4s
Tensorflow Test (Python)
Python Tensorflow 2.0.1
# Use tf.data API to shuffle and batch data.
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(10000).batch(batch_size).prefetch(1)
1.3s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Define networks.

# Create TF Model.
class Generator(Model):
    # Set layers.
    def __init__(self):
        super(Generator, self).__init__()
        self.fc1 = layers.Dense(7 * 7 * 128)
        self.bn1 = layers.BatchNormalization()
        self.conv2tr1 = layers.Conv2DTranspose(64, 5, strides=2, padding='SAME')
        self.bn2 = layers.BatchNormalization()
        self.conv2tr2 = layers.Conv2DTranspose(1, 5, strides=2, padding='SAME')
    # Set forward pass.
    def call(self, x, is_training=False):
        x = self.fc1(x)
        x = self.bn1(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # Reshape to a 4-D array of images: (batch, height, width, channels)
        # New shape: (batch, 7, 7, 128)
        x = tf.reshape(x, shape=[-1, 7, 7, 128])
        # Deconvolution, image shape: (batch, 14, 14, 64)
        x = self.conv2tr1(x)
        x = self.bn2(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # Deconvolution, image shape: (batch, 28, 28, 1)
        x = self.conv2tr2(x)
        x = tf.nn.tanh(x)
        return x
# Discriminator Network
# Input: Image, Output: Prediction (real or fake).
# Note that batch normalization behaves differently at training and inference time,
# so we pass a training flag to tell the layers whether we are training or not.
class Discriminator(Model):
    # Set layers.
    def __init__(self):
        super(Discriminator, self).__init__()
        self.conv1 = layers.Conv2D(64, 5, strides=2, padding='SAME')
        self.bn1 = layers.BatchNormalization()
        self.conv2 = layers.Conv2D(128, 5, strides=2, padding='SAME')
        self.bn2 = layers.BatchNormalization()
        self.flatten = layers.Flatten()
        self.fc1 = layers.Dense(1024)
        self.bn3 = layers.BatchNormalization()
        self.fc2 = layers.Dense(2)
    # Set forward pass.
    def call(self, x, is_training=False):
        x = tf.reshape(x, [-1, 28, 28, 1])
        x = self.conv1(x)
        x = self.bn1(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        x = self.conv2(x)
        x = self.bn2(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.bn3(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        return self.fc2(x)
# Build neural network model.
generator = Generator()
discriminator = Discriminator()
0.2s
Tensorflow Test (Python)
Python Tensorflow 2.0.1
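
The training flag above matters because BatchNormalization normalizes with the current batch's statistics (and updates its moving averages) when called with training=True, but uses the stored moving averages when training=False. A minimal standalone sketch of the difference (illustrative only, not part of the DCGAN code):

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = np.random.normal(5., 2., size=(4, 3)).astype(np.float32)

# training=True: normalize with this batch's mean/variance and update
# the layer's moving statistics.
y_train = bn(x, training=True)

# training=False: normalize with the stored moving statistics, which start
# at mean 0 / variance 1, so the result differs from the training-mode output.
y_infer = bn(x, training=False)

print(y_train.numpy().mean(), y_infer.numpy().mean())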

Network setup: losses and optimizers. The discriminator ends in two logits (fake vs. real), so both losses use sparse softmax cross-entropy.

# Losses. The discriminator outputs two logits per image: class 0 = fake, class 1 = real.
def generator_loss(disc_fake):
    # The generator wants the discriminator to label its fake images as real (class 1).
    gen_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=disc_fake, labels=tf.ones([batch_size], dtype=tf.int32)))
    return gen_loss
def discriminator_loss(disc_fake, disc_real):
    # The discriminator should label real images 1 and generated images 0.
    disc_loss_real = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=disc_real, labels=tf.ones([batch_size], dtype=tf.int32)))
    disc_loss_fake = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=disc_fake, labels=tf.zeros([batch_size], dtype=tf.int32)))
    return disc_loss_real + disc_loss_fake
# Optimizers.
optimizer_gen = tf.optimizers.Adam(learning_rate=lr_generator)#, beta_1=0.5, beta_2=0.999)
optimizer_disc = tf.optimizers.Adam(learning_rate=lr_discriminator)#, beta_1=0.5, beta_2=0.999)
0.1s
Tensorflow Test (Python)
Python Tensorflow 2.0.1
# Optimization process. Inputs: real image and noise.
def run_optimization(real_images):
    
    # Rescale to [-1, 1], the input range of the discriminator
    real_images = real_images * 2. - 1.
    # Generate noise.
    noise = np.random.normal(-1., 1., size=[batch_size, noise_dim]).astype(np.float32)
    
    with tf.GradientTape() as g:
            
        fake_images = generator(noise, is_training=True)
        disc_fake = discriminator(fake_images, is_training=True)
        disc_real = discriminator(real_images, is_training=True)
        disc_loss = discriminator_loss(disc_fake, disc_real)
            
    # Training Variables for each optimizer
    gradients_disc = g.gradient(disc_loss,  discriminator.trainable_variables)
    optimizer_disc.apply_gradients(zip(gradients_disc,  discriminator.trainable_variables))
    
    # Generate noise.
    noise = np.random.normal(-1., 1., size=[batch_size, noise_dim]).astype(np.float32)
    
    with tf.GradientTape() as g:
            
        fake_images = generator(noise, is_training=True)
        disc_fake = discriminator(fake_images, is_training=True)
        gen_loss = generator_loss(disc_fake)
            
    gradients_gen = g.gradient(gen_loss, generator.trainable_variables)
    optimizer_gen.apply_gradients(zip(gradients_gen, generator.trainable_variables))
    
    return gen_loss, disc_loss
0.2s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Finally, training. Display results every display_step.

import matplotlib.pyplot as plt
n_x, n_y = 10, 4
canvas = np.empty((28*n_y, 28*n_x))
# Run training for the given number of steps.
for step, (batch_x, _) in enumerate(train_data.take(training_steps + 1)):
  if step == 0:
    # Generate noise.
    noise = np.random.normal(-1., 1., 
                             size=[batch_size, noise_dim]).astype(np.float32)
    gen_loss = generator_loss(discriminator(generator(noise)))
    disc_loss = discriminator_loss(discriminator(batch_x), 
                                   discriminator(generator(noise)))
    print("initial: gen_loss: %f, disc_loss: %f" % (gen_loss, disc_loss))
    continue
    
  # Run the optimization.
  gen_loss, disc_loss = run_optimization(batch_x)
  
  # report
  if step % report_step == 0:
    print("step: %i, gen_loss: %f, disc_loss: %f" % 
          (step, gen_loss, disc_loss))
  
  # show a sample of generated images
  if step % display_step == 0 or step == 1:
    for i_y in range(n_y):
      # Noise input.
      z = np.random.normal(-1., 1., size=[n_x, noise_dim]).astype(np.float32)
      # Generate image from noise.
      g = generator(z).numpy()
      # Rescale to original [0, 1]
      g = (g + 1.) / 2
      # Reverse colours for better display
      g = -1 * (g - 1)
       
      for i_x in range(n_x):
        canvas[i_y * 28:(i_y + 1) * 28, 
               i_x * 28:(i_x + 1) * 28] = g[i_x].reshape([28, 28])
    plt.figure(figsize=(n_x, n_y))
    plt.imshow(canvas, origin="upper", cmap="gray")
    plt.suptitle("Step {}".format(step))
    plt.xticks([])
    plt.yticks([])
    plt.savefig("/results/step-{}.svg".format(step))
    plt.close()
3667.8s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Keras

Adapted from mnist_mlp.py in the Keras examples collection. It can be run on CPU or GPU; it just depends on what the runtime's Machine Type is set to.

Trains a simple deep NN on the MNIST dataset. Gets to 98.40% test accuracy after 20 epochs (there is *a lot* of margin for parameter tuning), at about 2 seconds per epoch on a K520 GPU.

Imports and settings.

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
batch_size = 128
num_classes = 10
epochs = 20
0.6s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Data.

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
0.7s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Define the model.

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
0.5s
Tensorflow Test (Python)
Python Tensorflow 2.0.1

Training. We can save our result to a file at the end.

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
model.save("/results/mnist.kerasave")
54.2s
Tensorflow Test (Python)
Python Tensorflow 2.0.1
mnist.kerasave

In a new runtime, load the test data and the saved, trained model.

import keras
from keras.datasets import mnist
from keras.models import load_model
num_classes = 10
(_,_), (x_test, y_test) = mnist.load_data()
x_test = x_test.reshape(10000, 784)
x_test = x_test.astype('float32')
x_test /= 255
y_test = keras.utils.to_categorical(y_test, num_classes)
model = load_model("mnist.kerasave")
9.4s
Tensorflow Test Eval (Python)
Python Tensorflow 2.0.1

Evaluate.

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
1.3s
Tensorflow Test Eval (Python)
Python Tensorflow 2.0.1
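
As a small usage example (not in the original notebook), the reloaded model can also classify a single test image directly:

import numpy as np
# Predict class probabilities for the first test image and compare to the label.
probs = model.predict(x_test[:1])
print("predicted digit:", int(np.argmax(probs, axis=1)[0]))
print("actual digit:", int(np.argmax(y_test[0])))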

Setup

Build Tensorflow

Building Tensorflow allows use of SIMD CPU enhancements like AVX. The CUDA toolkit only supports host compilers up to a certain GCC version, so GCC 7 is pinned as the default compiler below. To get the Nvidia CUDA libraries we must set the environment variable NEXTJOURNAL_MOUNT_CUDA in the runtime configuration. Tensorflow can also see some speedups if we give it libjemalloc.

apt-get -qq update
apt-get install --no-install-recommends \
  xutils-dev zlib1g-dev libjemalloc-dev
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 25
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 25
echo "/usr/local/cuda/extras/CUPTI/lib64" > /etc/ld.so.conf.d/cupti.conf
ldconfig
9.8s
Build Py3 TF (Bash)

Install dependencies for the pip package build, listed here.

conda install \
  absl-py astor gast google-pasta opt_einsum protobuf termcolor wrapt \
  tensorboard tensorflow-estimator keras-applications keras-preprocessing
76.0s
Build Py3 TF (Bash)

Download TensorRT. The link needs to be pulled from the console when downloading off the Nvidia website.

Install TensorRT from the tarfile above. The Python install has to glob for the wheel because the wheels are built per CPython minor version (hence the cp37 pattern below).
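
For reference, the cp37 glob works because a wheel's Python tag must match the running interpreter's minor version. A short illustrative Python sketch of picking the matching wheel (the /usr/local/tensorrt path assumes the symlink created in the cell below):

import sys, glob

# CPython ABI tag for the current interpreter, e.g. "cp37" for Python 3.7.
tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)

# Find the TensorRT wheel built for this interpreter version.
wheels = glob.glob("/usr/local/tensorrt/python/tensorrt-*{}*.whl".format(tag))
print(wheels)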

cd /usr/local
tar -zxf TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz
ln -sf TensorRT* tensorrt
echo '/usr/local/tensorrt/lib' > /etc/ld.so.conf.d/tensorrt.conf
ldconfig
cd /usr/local/tensorrt
pip install python/tensorrt*cp37*.whl \
  uff/uff*.whl graphsurgeon/graphsurgeon*.whl
20.2s
Build Py3 TF (Bash)

The Tensorflow compilation configure script is hardcoded to look for libnccl.so in <nccl_install_dir>/lib, but we have /lib64, so we need to set up some links to redirect it.

mkdir -p /usr/local/nccl_redir
cd /usr/local/nccl_redir
for i in `ls /usr/local/cuda`; do ln -s /usr/local/cuda/$i ./; done
ln -s lib64 lib
0.1s
Build Py3 TF (Bash)

Install Bazel. Tensorflow 2.0.1 works with Bazel 0.26.1.

export BAZEL_VERSION=0.26.1
export BAZEL_FILE=bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
wget --progress=dot:giga \
  https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/$BAZEL_FILE
chmod +x $BAZEL_FILE
./$BAZEL_FILE
3.4s
Build Py3 TF (Bash)

Clone the source and checkout the release.

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout v2.0.1
55.8s
Build Py3 TF (Bash)

This configure script uses environment variables to do a non-interactive config. The -march flag set through CC_OPT_FLAGS is of particular interest for CPU-only computation, as it controls which SIMD instruction sets Tensorflow will use, which can have large performance impacts. Some important flag values:

  • nehalem: Core i family (circa 2008); supports MMX, SSE1-4.2, and POPCNT. Equivalent to the corei7 -march value pre-GCC5.

  • sandybridge: Adds AVX (large potential speedups), AES, and PCLMUL, and is the oldest family that Google Cloud runs (2011). Requires GCC5+.

  • skylake: Adds a wide variety of SIMD instructions, including AVX2, and is currently the newest family Google Cloud offers. Requires GCC6+.

Also of interest for CPU computation is TF_NEED_MKL. Enabling this compiles Tensorflow to use the Intel Math Kernel Library, which is highly optimized for any CPU the Google Cloud will provide. In Tensorflow the MKL and CUDA are mutually exclusive—MKL is reserved for CPU-optimized builds.
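
As a rough sanity check of which of these -march targets a given build machine supports, here is a purely illustrative Python sketch that inspects the CPU flags in /proc/cpuinfo (the flag-to-family mapping is a simplification of the bullet points above):

# Read the CPU feature flags on Linux.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

# Rough mapping; e.g. Haswell/Broadwell also expose avx2 but are not Skylake.
if "avx2" in flags:
    march = "skylake"
elif "avx" in flags:
    march = "sandybridge"
else:
    march = "nehalem"
print("suggested CC_OPT_FLAGS: -march=" + march)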

cd /tensorflow
export TF_ROOT="/opt/tensorflow"
export PYTHON_BIN_PATH="/opt/conda/bin/python"
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export PYTHONPATH=${TF_ROOT}/lib
export PYTHON_ARG=${TF_ROOT}/lib
export TF_NEED_GCP=1   # Google Cloud
export TF_NEED_HDFS=1  # Hadoop Filesystem access
export TF_NEED_S3=1    # Amazon S3
export TF_NEED_AWS=0   # Amazon AWS
export TF_NEED_IGNITE=1
export TF_NEED_KAFKA=1 # Apache KAFKA
export TF_NEED_JEMALLOC=1 # Alternative malloc
export TF_NEED_GDR=0   # GPU Direct RDMA
export TF_NEED_VERBS=0 # VERBS RDMA
export TF_NEED_CUDA=1
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export TF_CUDA_VERSION="$($CUDA_TOOLKIT_PATH/bin/nvcc --version | sed -n 's/^.*release \(.*\),.*/\1/p')"
export TF_CUDA_COMPUTE_CAPABILITIES=7.0,6.1,6.0,3.7 # V100, P100, P4, K80
export CUDNN_INSTALL_PATH=/usr/local/cuda
export TF_CUDNN_VERSION="$(sed -n 's/^#define CUDNN_MAJOR\s*\(.*\).*/\1/p' $CUDNN_INSTALL_PATH/include/cudnn.h)"
export TF_NEED_TENSORRT=1  # Nvidia TensorRT
export TENSORRT_INSTALL_PATH=/usr/local/tensorrt
export NCCL_INSTALL_PATH=/usr/local/nccl_redir # Nvidia NCCL
export TF_NCCL_VERSION="$(sed -n 's/^#define NCCL_MAJOR\s*\(.*\).*/\1/p' $NCCL_INSTALL_PATH/include/nccl.h)"
export TF_CUDA_CLANG=0    # Use clang compiler instead of nvcc
export TF_NEED_OPENCL=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
export TF_ENABLE_XLA=0    # Accelerated Linear Algebra JIT compiler
export TF_NEED_MKL=0       # Intel Math Kernel Library
export TF_DOWNLOAD_MKL=0
export TF_NEED_MPI=0       # Message Passing Interface
export TF_SET_ANDROID_WORKSPACE=0
export GCC_HOST_COMPILER_PATH=$(which gcc)
export CC_OPT_FLAGS="-march=sandybridge"
./configure
8.5s
Build Py3 TF (Bash)

Finally, the build—this takes about 11 hours.

export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/nvidia/lib64"
export CUDNN_INCLUDE_DIR="/usr/local/cuda/include"
export CUDNN_LIBRARY="/usr/local/cuda/lib64/libcudnn.so"
export TMP="/tmp"
cd /tensorflow
bazel build --experimental_ui_limit_console_output=0 \
  --config=v2 --config=avx_linux \
  --config=opt --config=cuda --verbose_failures --jobs="auto" \
  --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" \
  --action_env="CUDNN_INCLUDE_DIR=${CUDNN_INCLUDE_DIR}" \
  --action_env="CUDNN_LIBRARY=${CUDNN_LIBRARY}" \
  //tensorflow/tools/pip_package:build_pip_package 
37299.9s
Build Py3 TF (Bash)

We'll export this environment just in case anyone wants to play with the compiled result, but the important part here is the creation of a .whl wheel file which can be installed via pip.

cd /tensorflow
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
cp /tmp/tensorflow_pkg/tensorflow*.whl /results/
72.8s
Build Py3 TF (Bash)
tensorflow-2.0.1-cp37-cp37m-linux_x86_64.whl

Install Tensorflow and Frontends to Environment

Finally, we'll install the package we created in a clean environment, plus the standalone Keras frontend.

conda install \
  absl-py astor google-pasta opt_einsum protobuf termcolor wrapt \
  mock pbr h5py grpcio markdown werkzeug cython jemalloc \
  pyyaml graphviz pydot # for use with Keras
conda clean -qtipy
echo "/usr/local/cuda/extras/CUPTI/lib64" > /etc/ld.so.conf.d/cupti.conf
51.3s
Python Tensorflow 2.0.1 (Bash)
cd /usr/local
tar -zxf TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz
ln -sf TensorRT* tensorrt
echo '/usr/local/tensorrt/lib' > /etc/ld.so.conf.d/tensorrt.conf
ldconfig
25.6s
Python Tensorflow 2.0.1 (Bash)

Sometimes pip fails here, saying the Python interpreter is bad. I don't know why. Occasionally (rarely) running python -V beforehand prevents it. I don't know why. More often, running python -V after the error shows up...fixes it. I don't know why.

python -V
python -c 'print("blah")'
pip list > /dev/null
conda list > /dev/null
3.2s
Python Tensorflow 2.0.1 (Bash)
cd /usr/local/tensorrt
pip install python/tensorrt*cp37*.whl \
            uff/uff*.whl graphsurgeon/graphsurgeon*.whl
2.6s
Python Tensorflow 2.0.1 (Bash)

Need to cap the tensor* add-on versions here; otherwise pip will pull in 2.1+ releases that don't match this 2.0.1 build.

pip install tensorflow-2.0.1-cp37-cp37m-linux_x86_64.whl \
  'tensorboard>=2,<2.1' 'tensorflow-estimator>=2,<2.1' \
  'tensorflow-datasets>=2' \
  keras keras-applications keras-preprocessing
62.6s
Python Tensorflow 2.0.1 (Bash)
du -hsx /
10.4s
Python Tensorflow 2.0.1 (Bash)
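
Once the wheel and add-ons are installed, a quick Python sanity check confirms the version, that the build was compiled against CUDA, and whether a GPU is visible (the experimental API is used here because the non-experimental tf.config.list_physical_devices is newer than 2.0):

import tensorflow as tf

print(tf.__version__)                 # expect 2.0.1
print(tf.test.is_built_with_cuda())   # True for this CUDA-enabled build
# Empty list on a CPU-only Machine Type, one or more devices on a GPU runtime.
print(tf.config.experimental.list_physical_devices("GPU"))
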
Runtimes (4)