A simple guide from linear regression to convolutional neural networks in Tensorflow

If you've been following the machine learning community over the last year, in particular the deep learning community, you've probably heard of Tensorflow. Tensorflow is a library for structuring and running numerical computations, developed in-house by the Google Brain team. One can think of it as an extension of NumPy that scales to larger architectures and adds algorithms and methods specific to machine learning. Tensorflow joins libraries such as Theano and cuDNN as a tool for building and designing neural networks.

This article delves into Tensorflow through case studies of neural network implementations. As such, it requires prior knowledge of neural networks (the subject is too expansive to cover in a single article). For those who are new, and for those who need a refresher, here are some good reference materials:

- http://neuralnetworksanddeeplearning.com/ (Basic)
- http://www.deeplearningbook.org/ (More Advanced)

Tensorflow is available on PyPI, so we can simply install it with pip:

`pip install tensorflow`

Or, if you have a GPU:

`pip install tensorflow-gpu`

More extensive installation details can be found on the Tensorflow installation page.

We follow the Theano tutorials, which build up from basic addition and multiplication all the way to convolutional neural networks:

- Multiplication
- Linear Regression
- Logistic Regression

- Fully-connected Feed-Forward Neural Network (FC NN)
- "Deep" Fully Connected Neural Network
- Convolutional Neural Network

Given two floats $x$ and $y$, find $xy$

We define placeholders `x` and `y` (initializing them as floats). Placeholders can be thought of as *inputs*; when doing computations, we'll plug in values for `x` and `y`. We symbolize the result that we are looking for as `xy`.
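The placeholder idea can be illustrated without Tensorflow at all. The following is a minimal plain-Python sketch of deferred evaluation; the `Placeholder` and `Multiply` classes are hypothetical stand-ins invented for this illustration and are not Tensorflow's API.

```python
# Plain-Python sketch of deferred evaluation. These classes are hypothetical
# stand-ins for illustration only; they are not part of Tensorflow.

class Placeholder:
    """A named input whose value is supplied only at evaluation time."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return float(feed[self.name])


class Multiply:
    """A node that multiplies the results of two other nodes."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)


x = Placeholder("x")
y = Placeholder("y")
xy = Multiply(x, y)  # nothing is computed yet; this only describes x * y

# Values are plugged in when the computation is actually run.
result = xy.evaluate({"x": 3.0, "y": 4.0})
print(result)
```

The same description of `xy` can be evaluated again with different inputs, which is the property that makes placeholders useful for feeding in training data later on.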

Given $\{(x_1,y_1), \dots, (x_n,y_n)\}$, find $w$ and $b$ that minimize $$\sum_{i=1}^{n} (wx_i + b - y_i)^2$$

First, let's create some sample data to work with:

We model $y = 2x + \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$, that is, a line of slope 2 plus some random noise.
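As a rough sketch, data of this form could be generated with the standard library alone. The sample count, the input range $[-1, 1]$, and the seed below are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed (an arbitrary choice) so the sketch is repeatable

n_samples = 101  # hypothetical sample count

# Evenly spaced inputs on [-1, 1].
xs = [-1.0 + 2.0 * i / (n_samples - 1) for i in range(n_samples)]

# y = 2x + noise, with the noise drawn from N(0, 1).
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]
```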

We define placeholders `x` and `y` again. We define a *variable* `w` which stores the weight; variables are objects in Tensorflow that represent internal state and can be updated. Again, `y_hat` is simply our prediction.

Let's now define our cost model and the underlying optimizer. Here, we opt for the squared loss objective (there are many similar alternatives).

In order to optimize the function over $w$ and $b$, we create a gradient descent optimizer and minimize the given cost function. Here we set the learning rate $\alpha = 0.01$.

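We then run `train_operation` in a session, passing in our input data (this is full gradient descent, not SGD). Since we created variables (`w` and `b`), we need to initialize them in the session with `tf.initialize_all_variables().run()` (later Tensorflow 1.x releases deprecate this call in favor of `tf.global_variables_initializer()`).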
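To make the update rule concrete, here is a plain-Python sketch of full gradient descent on the squared loss with $\alpha = 0.01$. It is not the Tensorflow code; `fit_line`, the step count, and the sample data are assumptions. It also averages the squared error rather than summing it, which only rescales the gradient by $1/n$ and keeps the step size stable for this data.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable
xs = [-1.0 + 0.02 * i for i in range(101)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]


def fit_line(xs, ys, alpha=0.01, steps=5000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Every step uses the whole data set (gradient descent, not SGD).
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


w, b = fit_line(xs, ys)
print(w, b)
```

With enough steps, the result should approach the ordinary least-squares fit of the same data.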

Let's try to expand this to the multivariable case, where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times m}$, and $y$ is modelled with Gaussian noise as

$$ y = W^Tx + \mathcal{N}(0,I_m)$$

We define the `x` and `y` placeholder inputs similarly; however, this time we explicitly add a shape parameter. The first entry, `None`, is the batch dimension (its size is left variable), and the second is the actual dimension of the data.

The rest remains the same.
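The role of the variable batch dimension can be sketched in plain Python. Stacking inputs as rows, the prediction $W^Tx$ for each row becomes the matrix product of the batch with $W$, and the same weights work for any number of rows. The `predict` helper and the example values below are hypothetical.

```python
def predict(batch, W):
    """Multiply a batch of shape (batch_size, n) by weights W of shape (n, m)."""
    n, m = len(W), len(W[0])
    return [[sum(row[i] * W[i][j] for i in range(n)) for j in range(m)]
            for row in batch]


# n = 3 input dimensions, m = 2 output dimensions.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]

# The batch size is free to vary; only the second dimension (n) is fixed.
small_batch = [[1.0, 2.0, 3.0]]
large_batch = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

small_out = predict(small_batch, W)
large_out = predict(large_batch, W)
```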

We model the *logit* as a linear transformation of $x$, and perform MLE over $W$. Notice that maximizing the likelihood is the same as minimizing the cross entropy computed on the logit of the linear model. Using this idea, we express our logistic model.
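The link between the logits and the cross entropy can be written out directly. The sketch below is plain Python rather than Tensorflow, and `softmax_cross_entropy` is a hypothetical helper name. It computes the cross entropy against a one-hot label from the raw logits using the log-sum-exp form, so the softmax probabilities are never formed explicitly.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot label index.

    Uses -log softmax(z)[label] = logsumexp(z) - z[label], shifting by the
    largest logit so that the exponentials cannot overflow.
    """
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


loss = softmax_cross_entropy([2.0, 1.0, 0.1], label=0)
print(loss)
```

Minimizing this quantity over the training set is the negative log-likelihood view of the same model, which is why the cost function can take logits as its input.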

Let's train! We'll use mini-batch gradient descent here to speed up training.
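Splitting the data into batches might look like the following plain-Python sketch. The `minibatches` helper and the batch size used in the example are assumptions; the source does not state the batch size it trains with.

```python
def minibatches(inputs, targets, batch_size):
    """Yield consecutive (inputs, targets) slices of at most batch_size items."""
    for start in range(0, len(inputs), batch_size):
        end = start + batch_size
        yield inputs[start:end], targets[start:end]


# Toy data: ten examples split into batches of four (the last batch is smaller).
inputs = list(range(10))
targets = [2 * v for v in inputs]
batches = list(minibatches(inputs, targets, batch_size=4))
```

Each training step then updates the weights using one batch instead of the full data set.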


Here we build the *"classic"* starting neural network, which consists of an input layer, a hidden layer coupled with the *sigmoid* activation function, and finally an output layer, on which we run softmax (paired with the cross-entropy loss). As in the previous example on logistic regression, the softmax isn't computed directly; it is instead factored in implicitly through the cost function. We again train on the **MNIST** dataset.
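The forward pass of such a network can be sketched in a few lines of plain Python. This is an illustration only: the `forward` helper, the tiny weight matrices, and the omission of bias terms are all assumptions. Note that the function returns raw logits, leaving the softmax to the cost function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_hidden, W_out):
    """Input -> sigmoid hidden layer -> output logits (no softmax applied)."""
    hidden = [sigmoid(sum(xi * W_hidden[i][j] for i, xi in enumerate(x)))
              for j in range(len(W_hidden[0]))]
    logits = [sum(hj * W_out[j][k] for j, hj in enumerate(hidden))
              for k in range(len(W_out[0]))]
    return logits


# Toy sizes: 3 inputs, 2 hidden units, 2 outputs.
W_hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
W_out = [[1.0, 2.0], [3.0, 4.0]]

# With zero hidden weights every hidden unit outputs sigmoid(0) = 0.5.
logits = forward([1.0, 2.0, 3.0], W_hidden, W_out)
```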


Here we make the following changes to the previous neural network to increase accuracy on MNIST:

- ReLU activations
- Dropout
- RMSProp optimization
- Another hidden layer, because *why not?*
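For readers who want the mechanics behind the first three items, here is a plain-Python sketch. It is not the Tensorflow implementation: the function names, the "inverted" dropout scaling, and the RMSProp hyperparameters (`lr`, `decay`, `eps`) are assumptions, and RMSProp has several slightly different formulations in practice.

```python
import math
import random


def relu(z):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob and
    rescale the survivors so the expected activation is unchanged."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter.

    cache holds a running average of squared gradients, which scales the step.
    """
    cache = decay * cache + (1.0 - decay) * grad * grad
    param = param - lr * grad / (math.sqrt(cache) + eps)
    return param, cache


rng = random.Random(0)  # arbitrary seed for repeatability
dropped = dropout([1.0, 2.0, 3.0, 4.0], keep_prob=0.5, rng=rng)
param, cache = rmsprop_step(param=1.0, grad=2.0, cache=0.0)
```

Dropout is applied only during training; at evaluation time the usual choice is to keep every unit, which corresponds to `keep_prob=1.0` here.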

We also reorganize the code by abstracting out the model, so it is easier to read. As our models and networks get more complicated, this becomes a good habit that makes debugging easier.

Convolutional neural networks perform *very* well on MNIST and other image datasets, making them the de facto algorithm for image-based ML in industry.
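The core operation of a convolutional layer can be sketched in plain Python. The `conv2d_valid` helper below is hypothetical; it assumes a single channel, stride 1, and no padding, and it slides the kernel without flipping it (cross-correlation), which is the convention most deep learning libraries use for "convolution".

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # adds each pixel to its lower-right neighbour

feature_map = conv2d_valid(image, kernel)
```

Because the same small kernel is reused at every position, the layer needs far fewer weights than a fully connected layer over the same image.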
