# Deep Learning

## Samy Bengio, Tom Dean and Andrew Ng

**Exercise: Regularization**

In this exercise, you will implement regularized linear regression and regularized logistic regression.

**Data**

To begin, download ex5Data.zip and extract the files from the zip file. This data bundle contains two sets of data, one for linear regression and the other for logistic regression. It also includes a helper function named 'map_feature.m' which will be used for logistic regression. Make sure that this function's m-file is placed in the same working directory where you plan to write your code.

**Regularized linear regression**

The first part of this exercise focuses on regularized linear regression and the normal equations.

**Plot the data**

Load the data files "ex5Linx.dat" and "ex5Liny.dat" into your program. These correspond to the "x" and "y" variables that you will start out with.

Notice that in this data, the input "x" is a single feature, so you can plot y as a function of x on a 2-dimensional graph (try it yourself):

From looking at this plot, it seems that fitting a straight line might be too simple of an approximation. Instead, we will try fitting a higher-order polynomial to the data to capture more of the variations in the points.

Let's try a fifth-order polynomial. Our hypothesis will be

This means that we have a hypothesis of six features, because are now all features of our regression. Notice that even though we are producing a polynomial fit, we still have a linear regression problem because the hypothesis is linear in each feature.

Since we are fitting a 5th-order polynomial to a data set of only 7 points, over-fitting is likely to occur. To guard against this, we will use regularization in our model.

Recall that in regularization problems, the goal is to minimize the following cost function with respect to :

The regularization parameter is a control on your fitting parameters. As the magnitues of the fitting parameters increase, there will be an increasing penalty on the cost function. This penalty is dependent on the squares of the parameters as well as the magnitude of . Also, notice that the summation after does not include

**Normal equations**

Now we will find the best parameters of our model using the normal equations.
Recall that the normal equations solution to regularized linear regression is

The matrix following is an diagonal matrix with a zero in the upper left and ones down the other diagonal entries. (Remember that is the number of features, not counting the intecept term). The vector and the matrix have the same definition they had for unregularized regression:

Using this equation, find values for using the three regularization parameters below:

**a. ** (this is the same case as non-regularized linear
regression)

**b. **

**c. **

As you are implementing your program, keep in mind that is an matrix, because there are training examples and features, plus an intercept term. In the data provided for this exercise, you were only give the first power of . You will need to include the other powers of in your feature vector , which means that the first column will contain all ones, the next column will contain the first powers, the next column will contain the second powers, and so on. You can do this in Matlab/Octave with the command

x = [ones(m, 1), x, x.^2, x.^3, x.^4, x.^5];

When you have found the answers for , verify them with
the values in the solutions. In addition to listing the values
for each element of the vector, we will
also provide the L2-norm of so you can quickly
check if your answer is correct. In Matlab/Octave, you can calculate the
L2-norm of a vector `x` using the command `norm(x)`.

Also, plot the polyomial fit for each value of . You will get plots similar to these:

From looking at these graphs, what conclusions can you make about how the regularization parameter affects your model?

**Regularized logistic regression**

In this 2nd part of the exercise, you will implement regularized logistic regression using Newton's Method. To begin, load the files 'ex5Logx.dat' and ex5Logy.dat' into your program. This dataset represents the training set of a logistic regression problem with two features. To avoid confusion later, we will refer to the two input features contained in 'ex5Logx.dat' as and . So in the 'ex5Logx.dat' file, the first column of numbers represents the feature , which you will plot on the horizontal axis, and the second feature represents , which you will plot on the vertical axis.

After loading the data, plot the points using different markers to distinguish between the two classifications. The commands in Matlab/Octave will be:

x = load('ex5Logx.dat'); y = load('ex5Logy.dat'); figure % Find the indices for the 2 classes pos = find(y); neg = find(y == 0); plot(x(pos, 1), x(pos, 2), '+') hold on plot(x(neg, 1), x(neg, 2), 'o')

After plotting your image, it should look something like this:

We will now fit a regularized regression model to this data.

Recall that in logistic regression, the hypothesis function is

Let's look at the parameter in the sigmoid function .

In this exercise, we will assign to be all monomials (meaning polynomial terms) of and up to
the sixth power:

To clarify this notation: we have made a 28-feature vector where . Remember that was the first column of numbers in your 'ex5Logx.dat' file and was the second column. From now on, we will just refer to the entries of as , , and so on instead of their values in terms of and .

To save you the trouble of enumerating all the terms of , we've included
a Matlab/Octave helper function named `'map_feature'`
that maps the original inputs to the feature vector. This function works for
a single training example as well as for an entire training. To
use this function, place 'map_feature.m' in your working directory and call

x = map_feature(u, v)

This assumes that the two original features were stored in column vectors named 'u' and 'v.' (If you had only one training example, each column vector would be a scalar.) The function will output a new feature array stored in the variable 'x.' Of course, you can use any names you'd like for the arguments and the output. Just make sure your two arguments are column vectors of the same size.

Before building this model, recall that our objective is to minimize the cost function in regularized logistic regression:

Notice that this looks like the cost function for unregularized logistic regression, except that there is a regularization term at the end. We will now minimize this function using Newton's method.

**Newton's method**

Recall that the Newton's Method update rule is

This is the same rule that you used for unregularized logistic regression
in Exercise 4. But because you are now implementing regularization,
the gradient
and the Hessian have different
forms:

Notice that if you substitute into these expressions, you will see the same formulas as unregularized logistic regression. Also, remember that in these formulas,

**1.** is your feature vector, which is a 28x1 vector in this exercise.

**2.**
is a 28x1 vector.

**3.**
and are 28x28 matrices.

**4.** and
are scalars.

**5.** The matrix following
in the Hessian formula
is a 28x28 diagonal matrix with a zero in the upper left and ones on every
other diagonal entry.

**Run Newton's Method **

Now run Newton's Method on this dataset using the three values of lambda below:

**a.** (this is the same case as non-regularized linear
regression)

**b.**

**c.**

To determine whether Newton's Method has converged, it may help to print out the value of during each iteration. should not be decreasing at any point during Newton's Method. If it is, check that you have defined correctly. Also check your definitions of the gradient and Hessian to make sure there are no mistakes in the regularization parts.

After convergence,
use your values of theta to find the decision boundary in
the classification problem. The decision boundary is defined as the line where

Plotting the decision boundary here will be trickier than plotting the best-fit curve in linear regression. You will need to plot the line implicity, by plotting a contour. This can be done by evaluating over a grid of points representing the original and inputs, and then plotting the line where evaluates to zero.

The plot implementation for Matlab/Octave is given below. To get the best viewing results, use the same plotting ranges that we use here.

% Define the ranges of the grid u = linspace(-1, 1.5, 200); v = linspace(-1, 1.5, 200); % Initialize space for the values to be plotted z = zeros(length(u), length(v)); % Evaluate z = theta*x over the grid for i = 1:length(u) for j = 1:length(v) % Notice the order of j, i here! z(j,i) = map_feature(u(i), v(j))*theta; end end % Because of the way that contour plotting works % in Matlab, we need to transpose z, or % else the axis orientation will be flipped! z = z' % Plot z = 0 by specifying the range [0, 0] contour(u,v,z, [0, 0], 'LineWidth', 2)

When you are finished, your plots for the three values of should look similar to the ones below.

Finally, because there are 28 elements , we will not provide an
element-by-element comparison in the solutions. Instead, use `norm(theta)`
to calculate the L2-norm of , and check it against the norm in the
solutions.