# Dive Into TensorFlow, Part V: Deep MNIST

This is the fifth article in the series “Dive Into TensorFlow”. Here is an index of all the articles published in the series to date:

Part I: Getting Started with TensorFlow

Part II: Basic Concepts

Part III: Ubuntu16.04+GTX 1080+CUDA8.0+cuDNN5.0+TensorFlow

Part IV: Hello MNIST

Part V: Deep MNIST (this article)

**Convolutional Neural Network**

Convolutional Neural Networks (CNNs, also called ConvNets) are biologically inspired variants of multilayer perceptrons (MLPs), made up of neurons that have learnable weights and biases. According to Wikipedia, a convolutional neural network is described as follows:

In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex. Individual neurons of the animal cortex are arranged in such a way that they respond to overlapping regions tiling the visual field, which can mathematically be described by a convolution operation. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. They have wide applications in image and video recognition, recommender systems and natural language processing.

This chapter uses TensorFlow to build a multilayer convolutional network for the MNIST task. First, load the MNIST data from the existing directory:

```python
ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from tensorflow.examples.tutorials.mnist import input_data

In [2]: mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
```

We will follow the official guide to explain how to build a multilayer convolutional network for this MNIST task.

**Convolution**

The purpose of convolution is to extract features from the input data. Consider a simple example: a 5 x 5 image whose pixel values are only 0 and 1:

Next, consider a 3 x 3 matrix as shown below. In a CNN, this 3 x 3 matrix is called a ‘filter’, ‘kernel’ or ‘feature detector’:

The convolution of the 5 x 5 image with the 3 x 3 matrix is computed as shown in the animation in the figure below. The output matrix is called the Convolved Feature or Feature Map:
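The sliding-window computation can also be sketched in plain Python, without TensorFlow. The pixel and filter values below are the ones used in the TensorFlow session later in this article; `convolve2d_valid` is a hypothetical helper name, and the sketch assumes a square filter and no padding.

```python
# Image and filter values taken from the TensorFlow session later in this article.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def convolve2d_valid(image, kernel, stride=1):
    """Slide a square kernel over the image without padding (hypothetical helper)."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the kernel with the patch under it and sum the products.
            total = sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Each output value is the sum of the pixels that sit under the 1s of the filter at that position, which is why the top-left entry is 4.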

Before diving into TensorFlow, you should know two more concepts:

**Stride**: Stride is the number of pixels by which we slide the filter matrix over the input matrix. Here the stride is 1, which means we slide the filter one pixel at a time.

**Zero-padding**: Pad the input matrix with zeros around the border, as in the following example:
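As a rough illustration, the sketch below zero-pads a small matrix and computes the output length along one dimension for a given stride. The helper names `zero_pad` and `conv_output_size` are hypothetical; the two size formulas are the ones TensorFlow documents for its 'VALID' (no padding) and 'SAME' (zero padding) modes, written with integer arithmetic.

```python
def zero_pad(matrix, pad):
    """Return a copy of matrix with `pad` rows/columns of zeros on every side."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def conv_output_size(input_size, filter_size, stride, padding):
    """Output length along one dimension for the two padding modes."""
    if padding == "VALID":
        # ceil((input_size - filter_size + 1) / stride)
        return (input_size - filter_size + stride) // stride
    if padding == "SAME":
        # ceil(input_size / stride)
        return (input_size + stride - 1) // stride
    raise ValueError("padding must be 'SAME' or 'VALID'")


for row in zero_pad([[1, 2], [3, 4]], 1):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]

print(conv_output_size(5, 3, 1, "VALID"))  # 3
print(conv_output_size(5, 3, 1, "SAME"))   # 5
print(conv_output_size(5, 3, 2, "VALID"))  # 2
```

With a stride of 1, 'VALID' shrinks a 5 x 5 input to 3 x 3 for a 3 x 3 filter, while 'SAME' keeps it at 5 x 5.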

The official TensorFlow Deep MNIST guide uses `tf.nn.conv2d` for the convolution operation:

`tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)`

Computes a 2-D convolution given 4-D input and filter tensors.

Given an input tensor of shape [batch, in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:

1. Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
2. Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
3. For each patch, right-multiplies the filter matrix and the image patch vector.

In detail, with the default NHWC format,

$$\mathrm{output}[b, i, j, k] = \sum_{di,\, dj,\, q} \mathrm{input}[b,\ \mathrm{strides}[1] \cdot i + di,\ \mathrm{strides}[2] \cdot j + dj,\ q] \cdot \mathrm{filter}[di, dj, q, k]$$

Must have strides[0] = strides[3] = 1. For the most common case of the same horizontal and vertical strides, strides = [1, stride, stride, 1].

Args:

input: A Tensor. Must be one of the following types: half, float32, float64.

filter: A Tensor. Must have the same type as input.

strides: A list of ints. 1-D of length 4. The stride of the sliding window for each dimension of input. Must be in the same order as the dimension specified with format.

padding: A string from: “SAME”, “VALID”. The type of padding algorithm to use.

use_cudnn_on_gpu: An optional bool. Defaults to True.

data_format: An optional string from: “NHWC”, “NCHW”. Defaults to “NHWC”. Specify the data format of the input and output data. With the default format “NHWC”, the data is stored in the order of: [batch, in_height, in_width, in_channels]. Alternatively, the format could be “NCHW”, the data storage order of: [batch, in_channels, in_height, in_width].

name: A name for the operation (optional).

Returns: A Tensor. Has the same type as input.
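The three steps in the description above (flatten the filter, extract image patches, multiply) can be sketched in plain Python for the single-image, single-channel case. This is only an illustration of the idea, not TensorFlow's implementation; `extract_patches` is a hypothetical helper, and the values are those of the 5 x 5 example used in this article.

```python
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def extract_patches(image, k):
    """Flatten every k x k patch of a square image (stride 1, no padding)."""
    size = len(image) - k + 1
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(size)
        for j in range(size)
    ]


# Step 1: flatten the filter. With one input and one output channel this is
# a vector of length filter_height * filter_width.
flat_kernel = [w for row in kernel for w in row]

# Step 2: one flattened patch per output position.
patches = extract_patches(image, len(kernel))

# Step 3: one dot product per patch.
flat_output = [sum(p * w for p, w in zip(patch, flat_kernel)) for patch in patches]

# Fold the results back into a square feature map.
out_size = len(image) - len(kernel) + 1
feature_map = [flat_output[r * out_size:(r + 1) * out_size] for r in range(out_size)]
print(feature_map)
# [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Expressing the convolution as patches times a flattened filter turns it into a matrix multiplication, which is the form the description above refers to.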

For the example shown earlier, we can compute the convolution with `tf.nn.conv2d` like this:

```python
In [3]: import tensorflow as tf

In [4]: input = tf.constant([[1., 1., 1., 0., 0.], [0., 1., 1., 1., 0.], [0., 0., 1., 1., 1.], [0., 0., 1., 1., 0.], [0., 1., 1., 0., 0.]])

In [5]: input = tf.reshape(input, [1, 5, 5, 1])

In [6]: filter = tf.constant([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])

In [7]: filter = tf.reshape(filter, [3, 3, 1, 1])

In [8]: op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')

In [9]: with tf.Session() as sess:
   ...:     print("input:")
   ...:     print(input.eval())
   ...:     print("filter:")
   ...:     print(filter.eval())
   ...:     result = sess.run(op)
   ...:     print("result:")
   ...:     print(result)
   ...:
input:
[[[[ 1.] [ 1.] [ 1.] [ 0.] [ 0.]]
  [[ 0.] [ 1.] [ 1.] [ 1.] [ 0.]]
  [[ 0.] [ 0.] [ 1.] [ 1.] [ 1.]]
  [[ 0.] [ 0.] [ 1.] [ 1.] [ 0.]]
  [[ 0.] [ 1.] [ 1.] [ 0.] [ 0.]]]]
filter:
[[[[ 1.]] [[ 0.]] [[ 1.]]]
 [[[ 0.]] [[ 1.]] [[ 0.]]]
 [[[ 1.]] [[ 0.]] [[ 1.]]]]
result:
[[[[ 4.] [ 3.] [ 4.]]
  [[ 2.] [ 4.] [ 3.]]
  [[ 2.] [ 3.] [ 4.]]]]
```

Note that you can instead pass the ‘SAME’ padding argument, which zero-pads the input matrix. Because the stride is 1, the output matrix then has the same size as the input:

```python
In [10]: op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='SAME')

In [11]: with tf.Session() as sess:
   ....:     print("input:")
   ....:     print(input.eval())
   ....:     print("filter:")
   ....:     print(filter.eval())
   ....:     result = sess.run(op)
   ....:     print("result:")
   ....:     print(result)
   ....:
input:
[[[[ 1.] [ 1.] [ 1.] [ 0.] [ 0.]]
  [[ 0.] [ 1.] [ 1.] [ 1.] [ 0.]]
  [[ 0.] [ 0.] [ 1.] [ 1.] [ 1.]]
  [[ 0.] [ 0.] [ 1.] [ 1.] [ 0.]]
  [[ 0.] [ 1.] [ 1.] [ 0.] [ 0.]]]]
filter:
[[[[ 1.]] [[ 0.]] [[ 1.]]]
 [[[ 0.]] [[ 1.]] [[ 0.]]]
 [[[ 1.]] [[ 0.]] [[ 1.]]]]
result:
[[[[ 2.] [ 2.] [ 3.] [ 1.] [ 1.]]
  [[ 1.] [ 4.] [ 3.] [ 4.] [ 1.]]
  [[ 1.] [ 2.] [ 4.] [ 3.] [ 3.]]
  [[ 1.] [ 2.] [ 3.] [ 4.] [ 1.]]
  [[ 0.] [ 2.] [ 2.] [ 1.] [ 1.]]]]
```
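To see where the ‘SAME’ result comes from, here is a minimal plain-Python sketch that treats every position outside the image as zero. It assumes a stride of 1 and an odd-sized square filter (so the padding is symmetric); `conv2d_same` is a hypothetical helper, not a TensorFlow function.

```python
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def conv2d_same(image, kernel):
    """Stride-1 convolution whose output has the same size as the input."""
    k = len(kernel)
    pad = (k - 1) // 2  # one row/column of zeros per side for a 3 x 3 kernel
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Positions outside the image behave like zero padding.
        if 0 <= r < rows and 0 <= c < cols:
            return image[r][c]
        return 0

    return [
        [
            sum(
                pixel(i + di - pad, j + dj - pad) * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


result = conv2d_same(image, kernel)
for row in result:
    print(row)
# [2, 2, 3, 1, 1]
# [1, 4, 3, 4, 1]
# [1, 2, 4, 3, 3]
# [1, 2, 3, 4, 1]
# [0, 2, 2, 1, 1]
```

The inner 3 x 3 block of this result is exactly the ‘VALID’ output, and the border values come from windows that overlap the zero padding.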

Now we can define a `conv2d` function in TensorFlow that uses a stride of one and zero padding, so that the output is the same size as the input:

```python
In [15]: def conv2d(x, W):
   ....:     return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
   ....:
```

**Pooling**

Pooling, also called subsampling or downsampling, reduces the dimensions of the feature map while keeping the most important information. There are several types of pooling, such as max, average and sum. For example, max pooling takes the largest element of the feature map within each window. The following shows an example of max pooling with a 2×2 window:
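In plain Python, the pooling variants mentioned above can be sketched with one function that applies a reducer to each window. `pool2d` is a hypothetical helper that assumes no padding; the 4 x 4 values are the ones used in the TensorFlow session later in this article.

```python
def pool2d(feature_map, size=2, stride=2, reduce_fn=max):
    """Apply reduce_fn to each size x size window (no padding)."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            window = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

max_pooled = pool2d(feature_map)                  # max pooling
sum_pooled = pool2d(feature_map, reduce_fn=sum)   # sum pooling
avg_pooled = pool2d(feature_map, reduce_fn=lambda w: sum(w) / float(len(w)))

print(max_pooled)  # [[6, 8], [3, 4]]
print(sum_pooled)  # [[13, 21], [8, 8]]
print(avg_pooled)  # [[3.25, 5.25], [2.0, 2.0]]
```

With a 2×2 window and a stride of 2 the windows do not overlap, so the 4 x 4 input is reduced to 2 x 2.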

The official TensorFlow Deep MNIST guide uses `tf.nn.max_pool` for the max pooling operation:

`tf.nn.max_pool(value, ksize, strides, padding, data_format='NHWC', name=None)`

Performs the max pooling on the input.

Args:

value: A 4-D Tensor with shape [batch, height, width, channels] and type tf.float32.

ksize: A list of ints that has length >= 4. The size of the window for each dimension of the input tensor.

strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.

padding: A string, either ‘VALID’ or ‘SAME’. The padding algorithm.

data_format: A string. ‘NHWC’ and ‘NCHW’ are supported.

name: Optional name for the operation.

Returns: A Tensor with type tf.float32. The max pooled output tensor.

For the example shown above, we can compute the max pooling result with `tf.nn.max_pool` like this:

```python
In [16]: input = tf.constant([[1., 1., 2., 4.], [5., 6., 7., 8.], [3., 2., 1., 0.], [1., 2., 3., 4.]])

In [17]: input = tf.reshape(input, [1, 4, 4, 1])

In [18]: op = tf.nn.max_pool(input, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

In [19]: with tf.Session() as sess:
   ....:     print("input:")
   ....:     print(input.eval())
   ....:     result = sess.run(op)
   ....:     print("result:")
   ....:     print(result)
   ....:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
input:
[[[[ 1.] [ 1.] [ 2.] [ 4.]]
  [[ 5.] [ 6.] [ 7.] [ 8.]]
  [[ 3.] [ 2.] [ 1.] [ 0.]]
  [[ 1.] [ 2.] [ 3.] [ 4.]]]]
result:
[[[[ 6.] [ 8.]]
  [[ 3.] [ 4.]]]]
```

Now we can wrap `tf.nn.max_pool` in a simpler max pooling function, where the pooling is plain old max pooling over 2×2 blocks:

```python
In [23]: def max_pool_2x2(x):
   ....:     return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
   ....:
```

**ConvNet Architecture**

Now we can describe the architecture of the Convolutional Neural Network used in this chapter:

`INPUT -> [[CONV -> RELU] -> POOL] -> [[CONV -> RELU] -> POOL] -> FC`

Before diving into the ConvNet, we should first define two initialization functions for the weights and biases:

```python
In [26]: def weight_variable(shape):
   ....:     initial = tf.truncated_normal(shape, stddev=0.1)
   ....:     return tf.Variable(initial)
   ....:

In [27]: def bias_variable(shape):
   ....:     initial = tf.constant(0.1, shape=shape)
   ....:     return tf.Variable(initial)
   ....:
```

where `tf.truncated_normal` is defined like this:

`tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None, name=None)`

Outputs random values from a truncated normal distribution.

The generated values follow a normal distribution with specified mean and standard deviation, except that values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked.

Args:

shape: A 1-D integer Tensor or Python array. The shape of the output tensor.

mean: A 0-D Tensor or Python value of type dtype. The mean of the truncated normal distribution.

stddev: A 0-D Tensor or Python value of type dtype. The standard deviation of the truncated normal distribution.

dtype: The type of the output.

seed: A Python integer. Used to create a random seed for the distribution. See set_random_seed for behavior.

name: A name for the operation (optional).

Returns: A tensor of the specified shape filled with random truncated normal values.
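The "dropped and re-picked" rule in the description above can be illustrated with a small rejection-sampling sketch in plain Python. This is an assumption about how one might mimic the behaviour, not TensorFlow's actual implementation; `truncated_normal` here is a hypothetical stand-in that returns a flat list rather than a tensor.

```python
import random


def truncated_normal(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n samples, re-picking any that fall more than 2 stddev from the mean."""
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
weights = truncated_normal(1000, stddev=0.1, rng=rng)
print(min(weights), max(weights))  # both lie within [-0.2, 0.2]
```

With `stddev=0.1`, as used by `weight_variable`, every initial weight therefore lies between -0.2 and 0.2.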

Following the usual TensorFlow workflow (build the graph, initialize the variables, then evaluate them in a session), we can test the two initialization functions as follows:

```python
In [29]: weight_one = weight_variable([1])

In [30]: init = tf.initialize_all_variables()

In [31]: with tf.Session() as sess:
   ....:     sess.run(init)
   ....:     sess.run(weight_one)
   ....:     print(weight_one.eval())
   ....:
[ 0.08119224]

In [32]: with tf.Session() as sess:
   ....:     sess.run(init)
   ....:     sess.run(weight_one)
   ....:     print(weight_one.eval())
   ....:
[ 0.07671448]

In [33]: weight_two = weight_variable([2, 3])

In [34]: init = tf.initialize_all_variables()

In [35]: with tf.Session() as sess:
   ....:     sess.run(init)
   ....:     sess.run(weight_two)
   ....:     print(weight_two.eval())
   ....:
[[ 0.07384183 -0.10221136 -0.06982277]
 [ 0.02848824 -0.04227843 -0.02429868]]

In [36]: weight_three = weight_variable([2, 3, 2])

In [37]: init = tf.initialize_all_variables()

In [38]: with tf.Session() as sess:
   ....:     sess.run(init)
   ....:     sess.run(weight_three)
   ....:     print(weight_three.eval())
   ....:
[[[-0.11059111  0.11133649]
  [-0.13025205  0.04462459]
  [ 0.06428232 -0.10575314]]

 [[ 0.15977345  0.07958484]
  [-0.1151159  -0.09550249]
  [ 0.17477499 -0.01628241]]]

In [39]: bias = bias_variable([3])

In [40]: init = tf.initialize_all_variables()

In [41]: with tf.Session() as sess:
   ....:     sess.run(init)
   ....:     sess.run(bias)
   ....:     print(bias.eval())
   ....:
[ 0.1  0.1  0.1]
```

**First Convolutional Layer**

Now we can define the convolutional filter and the bias variable. The convolution will compute 32 features for each 5×5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.

```python
In [51]: W_conv1 = weight_variable([5, 5, 1, 32])

In [52]: b_conv1 = bias_variable([32])
```

Recall the MNIST input placeholder we defined before:

```python
In [53]: x = tf.placeholder(tf.float32, shape=[None, 784])
```

To use it as the input of the first convolutional layer, we reshape x to a 4-D tensor, with the second and third dimensions corresponding to image height and width, and the final dimension corresponding to the number of color channels:

```python
In [54]: x_image = tf.reshape(x, [-1, 28, 28, 1])
```
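For a single image, the reshape simply cuts the flat vector of 784 values into 28 rows of 28 pixels, since `tf.reshape` fills the new shape in row-major order. A minimal plain-Python sketch of that mapping follows; `reshape_to_image` is a hypothetical helper and the pixel values are stand-ins, not real MNIST data.

```python
def reshape_to_image(flat_pixels, height=28, width=28):
    """Split a flat, row-major pixel list into `height` rows of `width` pixels."""
    if len(flat_pixels) != height * width:
        raise ValueError("expected %d pixels" % (height * width))
    return [flat_pixels[r * width:(r + 1) * width] for r in range(height)]


flat = list(range(784))  # stand-in for one flattened MNIST image
image = reshape_to_image(flat)

print(len(image), len(image[0]))  # 28 28
print(image[1][0])                # 28: the first pixel of the second row
```

The `-1` in the TensorFlow call plays the role of the batch dimension, which is inferred from the number of images fed in.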

Then convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool:

```python
In [55]: h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

In [56]: h_pool1 = max_pool_2x2(h_conv1)
```

This corresponds to: `INPUT -> [[CONV -> RELU] -> POOL]`

**Second Convolutional Layer**

`-> [[CONV -> RELU] -> POOL]`

The second layer is designed with 64 features for each 5×5 patch:

```python
In [58]: W_conv2 = weight_variable([5, 5, 32, 64])

In [59]: b_conv2 = bias_variable([64])

In [60]: h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

In [61]: h_pool2 = max_pool_2x2(h_conv2)
```

**Densely Connected Layer**

Now that the image size has been reduced to 7×7 (each 2×2 max pooling step halves it: 28 → 14 → 7), we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU:

```python
In [62]: W_fc1 = weight_variable([7 * 7 * 64, 1024])

In [63]: b_fc1 = bias_variable([1024])

In [64]: h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])

In [65]: h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
```
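The `7 * 7 * 64` in the weight shape follows from how each layer changes the tensor shape. The sketch below traces the per-image shape (height, width, channels) through the two convolution and pooling stages; the helper names are hypothetical, and the rules assume stride-1 ‘SAME’ convolutions and 2×2 ‘SAME’ pooling with stride 2, as defined by `conv2d` and `max_pool_2x2` in this article.

```python
def same_conv_shape(shape, out_channels):
    """A stride-1 'SAME' convolution keeps height and width, changes channels."""
    height, width, _ = shape
    return (height, width, out_channels)


def max_pool_2x2_shape(shape):
    """2x2 pooling with stride 2 and 'SAME' padding halves height and width, rounding up."""
    height, width, channels = shape
    return ((height + 1) // 2, (width + 1) // 2, channels)


shape = (28, 28, 1)
trace = [("input", shape)]
for name, out_channels in [("conv1", 32), ("conv2", 64)]:
    shape = same_conv_shape(shape, out_channels)
    trace.append((name, shape))
    shape = max_pool_2x2_shape(shape)
    trace.append((name.replace("conv", "pool"), shape))

flat_size = shape[0] * shape[1] * shape[2]

for name, layer_shape in trace:
    print(name, layer_shape)
print("flattened size:", flat_size)  # 3136 == 7 * 7 * 64
```

The flattened size of 3136 is the number of rows in `W_fc1`, and the 1024 columns are the neurons of the fully-connected layer.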

**Dropout**

To reduce overfitting, apply dropout with TensorFlow’s `tf.nn.dropout` op before the readout layer:

```python
In [66]: keep_prob = tf.placeholder(tf.float32)

In [67]: h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
```
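Conceptually, dropout keeps each activation with probability `keep_prob` and scales the kept ones by `1 / keep_prob`, so the expected sum stays the same. The following plain-Python sketch illustrates that behaviour on a small list; the `dropout` function here is a hypothetical stand-in, not the TensorFlow op.

```python
import random


def dropout(values, keep_prob, rng=random):
    """Keep each value with probability keep_prob and scale it by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, 0.5, rng)  # each entry is either 0 or doubled
eval_out = dropout(activations, 1.0, rng)   # keep_prob = 1.0 leaves values unchanged

print(train_out)
print(eval_out)  # [1.0, 2.0, 3.0, 4.0]
```

Making `keep_prob` a placeholder is what allows dropout to be switched on during training and off during evaluation.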

**Readout Layer or softmax layer**

Finally, we add a softmax layer:

```python
In [68]: W_fc2 = weight_variable([1024, 10])

In [69]: b_fc2 = bias_variable([10])

In [70]: y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
```
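The softmax turns the 10 raw scores of the readout layer into probabilities that sum to 1. A minimal plain-Python sketch for a single row of scores is shown below; the input values are made up for illustration, and subtracting the maximum is a common trick to avoid overflow rather than something the article's code does explicitly.

```python
import math


def softmax(logits):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # approximately 1.0
```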

**Train and Evaluate the Model**

Now it’s time to train and evaluate the CNN Model:

```python
In [83]: y_ = tf.placeholder(tf.float32, shape=[None, 10])

In [84]: cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))

In [85]: train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

In [86]: correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))

In [87]: accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [89]: sess = tf.InteractiveSession()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

In [90]: sess.run(tf.initialize_all_variables())

In [92]: for i in range(20000):
   ....:     batch = mnist.train.next_batch(50)
   ....:     if i % 100 == 0:
   ....:         train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
   ....:         print("step %d, training accuracy %g" % (i, train_accuracy))
   ....:     train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
   ....:
step 0, training accuracy 0.16
step 100, training accuracy 0.84
step 200, training accuracy 0.94
step 300, training accuracy 0.92
step 400, training accuracy 0.94
step 500, training accuracy 0.96
step 600, training accuracy 0.98
step 700, training accuracy 0.96
step 800, training accuracy 0.96
step 900, training accuracy 0.92
step 1000, training accuracy 1
...
...
...
step 19000, training accuracy 1
step 19100, training accuracy 1
step 19200, training accuracy 1
step 19300, training accuracy 1
step 19400, training accuracy 1
step 19500, training accuracy 1
step 19600, training accuracy 1
step 19700, training accuracy 1
step 19800, training accuracy 1
step 19900, training accuracy 1

In [93]: print("test accuracy %g" % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
test accuracy 0.9925
```
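The loss and accuracy expressions above can be checked by hand on a tiny batch. The plain-Python sketch below mirrors `cross_entropy` and `accuracy` for one-hot labels; the three example rows are invented for illustration, and skipping zero-label terms is a simplification that avoids evaluating `log` on classes that do not contribute to the sum.

```python
import math


def cross_entropy(labels, probs):
    """Mean over the batch of -sum(y * log(p)) for one-hot labels."""
    per_example = [
        # Terms with y == 0 contribute nothing, so they are skipped.
        -sum(y * math.log(p) for y, p in zip(label, prob) if y > 0)
        for label, prob in zip(labels, probs)
    ]
    return sum(per_example) / len(per_example)


def accuracy(labels, probs):
    """Fraction of rows where argmax(prediction) equals argmax(label)."""
    def argmax(row):
        return max(range(len(row)), key=lambda i: row[i])

    hits = [argmax(label) == argmax(prob) for label, prob in zip(labels, probs)]
    return sum(hits) / float(len(hits))


labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]

loss = cross_entropy(labels, probs)
acc = accuracy(labels, probs)
print(loss)  # mean of -log(0.7), -log(0.8) and -log(0.2)
print(acc)   # 2 of the 3 rows are classified correctly
```

The third row is misclassified (the largest probability is on class 0 while the label is class 2), which lowers the accuracy and dominates the loss.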

With TensorFlow and this CNN model, we reach 99.25% accuracy on the test set, which is great.

References:

Deep MNIST for Experts

An Intuitive Explanation of Convolutional Neural Networks

Understanding Convolutional Neural Networks for NLP

CS231n Convolutional Neural Networks for Visual Recognition

TensorFlow Tutorial Resources

What does tf.nn.conv2d do in tensorflow?

Posted by TextMiner
