Introduction to pyTorch
Deep-Learning has gone from breakthrough but mysterious field to a well known and widely applied technology. In recent years (or months) several frameworks based mainly on Python were created to simplify Deep-Learning and to make it available to the general public of software engineer. In this battle field to be the future framework of reference, some stand out such a Theano, Keras and especially Google’s TensorFlow and Facebook’s pyTorch. This article is the first of a series of tutorial on pyTorch that will start with the basic gradient descend algorithm to very advanced concept and complex models. The goal of this article is to give you a general but useful view of the gradient descent algorithm used in all the Deep-Learning frameworks.
Do not miss the two next articles :
- Introduction to pyTorch #2 : The linear regression;
- Introduction to pyTorch #3 : Image classification with CNN;
The gradient descent algorithm
The purpose of the gradient descent algorithm is to find the minimum of a function. As you will see, all the training algorithms in machine learning consist in finding the minimum of a function which represents the difference between what we have (the output of a mathematical model) and what we want (the target output to be learned). We can start with a simple function, with two variables, and try to find the minimum with an iterative algorithm. The main idea behind GDA is to evaluate the gradient of a function at a certain point, and to move in the direction of the gradient, just like a stone would roll along a slope. More complex algorithm are more efficience but there are all based on this principle.
We will then start with a simple quadratic equation with two variables. We want to find the global minimum of this equation, global meaning here the smallest overall value of a function over its entire range. Of course this is not always possible as some functions are non-convex. In mathematics, a function is called convex if the line segment between any two points on the graph of the function lies above or on the graph. Non-convex functions have local minima, also called a relative minimum, which is a minimum within some neighbourhood that need not be a global minimum. You can see below an example of a convex and non-convex function
We want here to find the minimum of the following function.
To find a minimum, the gradient algorithm works as follow,
- We take an initial point ;
- We compute the gradient at the current position : ;
- We compute the next point : , where is the learning rate;
- Repeat 2 and 3 for a specified number of iteration or until a stop condition fullfilled;
We start by importing some packages for argument parsing, plots, math and obviously numpy.
import argparse import matplotlib.pyplot as plt import math import numpy as np from mpl_toolkits.mplot3d import axes3d, Axes3D from matplotlib import cm
We implement the function we want to minimise.
# Function def func(xv, yv): """ Function :param xv: X :param yv: Y :return: Z """ return 0.5*math.pow(xv, 2) + xv + 0.25*math.pow(yv, 2) - 2 # end func
We add some arguments to our program :
- The learning rate;
- The number of iteration;
- The x position of the starting point;
- The y position of the starting point;
# Arguments parser = argparse.ArgumentParser(prog=u"function_gradient") parser.add_argument(u"--learning-rate", type=float, required=True) parser.add_argument(u"--iterations", type=int, required=True) parser.add_argument(u"--x", type=float, default=1.0) parser.add_argument(u"--y", type=float, default=1.0) args = parser.parse_args() # Create two variables x = args.x y = args.y
The learning rate set the size of the step at each point, with a learning rate of 0.001, we will move in the direction of the gradient, multiplied by this learning rate. If the learning rate is to small, the convergence to the mimum will be to slow, in the contrary, if the learning rate is too big, an iteration could miss the minimum point and terminate in a point higher that before. Leading the algorithm to diverge from the minimum point.
We create three lists to start the results to display at the end.
# List of positions x_values = list([x]) y_values = list([y]) z_values = list([func(x, y)])
The main loop will iterate the number of time we specified in our command parameters.
# Do the iterations for i in range(args.iterations):
The compute the gradient we must find the partial derivative for each arguments. If you do not know how to do a derivative, find some course on the internet as it is mandatory to understand the underlying concepts of deep-learning and machine learning. To compute the partial derivate for a variable, we consider the other variable as a constant. For the variable, we compute the derivative for parts containing and consider as a constant, we then get the following result.
We do the same thing for the variable and we consider this time the variable as a constant. We have our second partial derivative.
With this, we can compute the gradient of each variable in our Python code.
# Compute gradient x_grad = x + 1 y_grad = 0.5 * y
The main idea of the gradient descent algorithm is to update to position in de direction of the gradient. The learning rate is here to module the importance of the deplacement.
# Update each parameters x -= x_grad * args.learning_rate y -= y_grad * args.learning_rate
We add the new position to our list and the corresponding z value of the function at this point.
# Print gradients and value x_values.append(x) y_values.append(y) z_values.append(func(x, y))
That’s all for the iteration loop. When the iterations are finised, we add a subplot to a matplotlib figure.
# Plot fig = plt.figure() ax = fig.add_subplot(111, projection='3d')
We want to display the function surface and the path followed by the gradient descend algorithm. To create a 3D surface with matplotlib, we need each point on the , and axis. We will set the range as -2.5 to 2.5 with a step of 0.1. We can use the Numpy’s function arange for X positions.
# X position X = np.zeros((51, 51)) X[:, ] = np.arange(-2.5, 2.6, 0.1)
We have a X matrix with size 51 by 51, with values going from -2.5 to 2.5 on one axis. We now want an Y matrix of the same size with values varying on the other axis.
# Y position Y = np.zeros((51, 51)) for i in range(51): Y[i, :] = X[0, i] # end for
We then compute corresponding value for each position.
# Compute Z Z = np.zeros((51, 51)) for j in range(51): for i in range(51): x_pos = X[j, i] y_pos = Y[j, i] Z[j, i] = func(x_pos,y_pos) # end for # end for
We add labels to the figure with axis names.
# Labels ax.set_xlabel('X') ax.set_ylabel('Y') ax.set_zlabel('Z')
We can now plot the 3D surface with our three matrices , and and the plot_surface() function.
# Plot a basic surface. ax.plot_surface(X, Y, Z, cmap=cm.hot, linewidth=0, antialiased=True, alpha=1.0)
And we use the function plot() function with the , , values saved during the iterations.
# Scatter points ax.plot(x_values, y_values, z_values, label='Learning curve', color='lightblue')
And we simply display the resulting figure with show(). The following plot show the path followed by the gradient descent algorithm with a learning rate of 0.1 and 10 iterations.
We cannot reach the minimum point, let’s try with a learning rate of 1.
The algorithm jumps directly near the minimum point, and slowly reach the minimum point. Let’s see what appends with a learning rate of 2.
The algorithm at the first step goes way to far a miss the minimum point, and is going higher than the preceding point.