Exercise 1: Logistic Regression
This course consists of videos and programming exercises to teach you about
unsupervised feature learning and deep learning. The exercises are designed to
give you hands-on, practical experience for getting these algorithms to work.
To get the most out of this course, you should watch the videos and complete
the exercises in the order in which they are listed.
This first exercise will give you practice with logistic regression.
These exercises have been extensively tested with Matlab, but they should also work in Octave, which has been called a "free version of Matlab." If you are using Octave, be sure to install the Image package as well (available for Windows as an option in the installer, and available for Linux from Octave-Forge).
Download ex1Data.zip, and extract the files from the zip file. If you're using Matlab or Octave, the data you'll need for this exercise is contained in x.dat and y.dat. Each row of the matrix $x$ corresponds to one training example ($x^{(i)}$), and the corresponding row of $y$ is the class label ($y^{(i)} \in \{0, 1\}$). This dataset has 2,000 training examples.
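If you prefer to work in Python rather than Matlab/Octave, the loading step can be sketched as follows. This is a NumPy sketch, not part of the exercise itself; it uses small in-memory stand-ins for x.dat and y.dat so the snippet is self-contained, and it assumes the files are whitespace-delimited text.

```python
import io
import numpy as np

# Stand-ins for x.dat / y.dat; with the real files you would call
# np.loadtxt("x.dat") and np.loadtxt("y.dat") instead.
x_dat = io.StringIO("1.0 2.0\n3.0 4.0\n5.0 6.0\n")
y_dat = io.StringIO("0\n0\n1\n")

X = np.loadtxt(x_dat)    # one training example x^(i) per row
y = np.loadtxt(y_dat)    # corresponding class labels y^(i) in {0, 1}

# Prepend a column of ones so that theta[0] acts as the intercept term.
X = np.hstack([np.ones((X.shape[0], 1)), X])

print(X.shape, y.shape)
```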
1. Batch gradient descent
In this first problem, you'll implement logistic regression using batch gradient descent.
(a) Recall that the logistic regression model is
$$h_\theta(x) = P(y = 1 \mid x; \theta) = \frac{1}{1 + \exp(-\theta^T x)},$$
and the batch gradient descent update rule is
$$\theta := \theta + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}.$$
Here, we are not using weight decay.
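The model and one batch update can be sketched in code as follows. This is a Python/NumPy sketch (the exercise itself targets Matlab/Octave), and the tiny dataset and learning rate are illustrative assumptions rather than the exercise's data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_step(theta, X, y, alpha):
    """One batch gradient descent step on the logistic regression cost.

    X holds one example per row (with a leading 1 for the intercept),
    y holds the 0/1 labels, and alpha is the learning rate."""
    m = X.shape[0]
    h = sigmoid(X @ theta)                       # h_theta(x^(i)) for every i
    return theta + (alpha / m) * (X.T @ (y - h))

# Tiny illustrative dataset (an assumption, not the exercise's data):
# positive examples have a larger second feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(1000):
    theta = batch_step(theta, X, y, alpha=0.5)

print(sigmoid(X @ theta))    # predictions should move toward y
```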
Implement batch gradient descent, and use it to find a good setting of the parameters $\theta$. Choose a reasonable value of the learning rate $\alpha$ yourself, and note down the values of $\theta$ and $J(\theta)$ found by your algorithm. (To verify that your implementation is correct, later we'll ask you to check your values of $\theta$ and $J(\theta)$ against ours.)
(b) Plot $J(\theta)$ as a function of the number of iterations, and verify convergence of your algorithm. The objective function is given by
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right].$$
In addition to generating this plot using the value of $\alpha$ that you had chosen, also repeat this exercise (re-initializing gradient descent to $\theta = \vec{0}$ each time) using a much smaller and a much larger value of $\alpha$.
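Tracking the objective during training can be sketched as below. Again this is a Python/NumPy sketch rather than the exercise's own Matlab code; the toy dataset and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average logistic regression cost J(theta) over the training set."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy dataset (an assumption, not the exercise's 2,000-example data).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
alpha = 0.5                          # illustrative learning rate
J_history = []                       # J(theta) before each update
for _ in range(200):
    J_history.append(cost(theta, X, y))
    h = sigmoid(X @ theta)
    theta = theta + (alpha / len(y)) * (X.T @ (y - h))

# Plot J_history against the iteration number to verify convergence;
# with a well-chosen alpha it should decrease monotonically.
print(J_history[0], J_history[-1])
```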
2. Practice with stochastic gradient descent
(a) Implement stochastic gradient descent for the same logistic regression model as in Question 1. For now, leave the data in the original ordering, and do not shuffle the data. Recall that the stochastic gradient descent algorithm repeatedly sweeps through the training set, performing the update
$$\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$
for each training example $i$ in turn.
Initialize $\theta = \vec{0}$, pick a reasonable learning rate $\alpha$, and run stochastic gradient descent so that it loops through your entire training set 5 times (i.e., execute the outer loop above 5 times; since you have 2,000 training examples, this corresponds to 10,000 iterations of stochastic gradient descent).
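The nested loop above can be sketched as follows. This Python/NumPy sketch uses a synthetic stand-in for the exercise's dataset (1,000 examples of each class, unshuffled, mimicking the ordering the solutions describe), and the learning rate is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Synthetic stand-in for the exercise's 2,000-example dataset:
# 1,000 positive examples followed by 1,000 negative ones (unshuffled).
m = 2000
feats = np.concatenate([rng.normal(+1.0, 1.0, 1000),
                        rng.normal(-1.0, 1.0, 1000)])
X = np.column_stack([np.ones(m), feats])
y = np.concatenate([np.ones(1000), np.zeros(1000)])

theta = np.zeros(2)          # initialize theta = 0
alpha = 0.01                 # illustrative learning rate (an assumption)
history = []                 # value of theta after every single update
for epoch in range(5):       # outer loop: 5 passes over the training set
    for i in range(m):       # inner loop: one update per training example
        h = sigmoid(X[i] @ theta)
        theta = theta + alpha * (y[i] - h) * X[i]
        history.append(theta.copy())

print(len(history))          # 10,000 iterations in total
```

Plotting the entries of `history` against the iteration number gives the parameter trajectory asked for below.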
Run stochastic gradient descent, and plot the parameter $\theta$ as a function of the number of iterations taken. In other words, draw a plot with 10,000 points, where the horizontal axis is the number of iterations of stochastic gradient descent taken, and the vertical axis is the value of your parameter $\theta$ after that many iterations. (Note that one "iteration" means performing a single update using a single training example, not taking 5 updates using all of your training set.) Do you see this plot having a "wavy" appearance? What do you think causes this?
(b) Repeat the previous problem, but now shuffle the training set first. If you are using Matlab/Octave, you can use the randperm command. Specifically, randperm(2000) generates a random permutation of the integers from 1 to 2000. Thus, this series of commands will shuffle your data:

    rp = randperm(2000);
    x = x(rp, :);
    y = y(rp, :);
Re-run stochastic gradient descent using the shuffled data, and replot $\theta$ as a function of the number of iterations. How is this plot different from the previous one? Note down the values of $\theta$ and $J(\theta)$ that you get.
(c) Use an exponentially-weighted average to keep track of an online estimate of $J(\theta)$ while stochastic gradient descent is running, and plot this online estimate as a function of the number of iterations. Recall that the online estimate is obtained by initializing $J_{\rm est} = 0$, and then updating it (on each iteration of stochastic gradient descent) using a formula such as
$$J_{\rm est} := 0.999 \cdot J_{\rm est} + 0.001 \cdot J^{(i)}(\theta),$$
where $J^{(i)}(\theta)$ is the cost on the single training example used in that iteration.
Feel free also to play with the parameters "0.999" and "0.001."
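The running estimate can be sketched as follows. This is a Python/NumPy sketch with a synthetic, linearly separable stand-in dataset (an assumption); the learning rate is likewise illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_cost(theta, x_i, y_i):
    """Logistic regression cost on a single training example."""
    h = sigmoid(x_i @ theta)
    return -(y_i * np.log(h) + (1 - y_i) * np.log(1 - h))

rng = np.random.default_rng(0)
m = 2000
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = (X[:, 1] > 0).astype(float)     # synthetic separable labels

theta = np.zeros(2)
alpha = 0.01                        # illustrative learning rate
J_est = 0.0                         # online estimate, initialized to 0
estimates = []
for epoch in range(5):
    for i in range(m):
        # Exponentially-weighted average of the per-example cost.
        J_est = 0.999 * J_est + 0.001 * example_cost(theta, X[i], y[i])
        estimates.append(J_est)
        h = sigmoid(X[i] @ theta)
        theta = theta + alpha * (y[i] - h) * X[i]

print(estimates[-1])
```

Plotting `estimates` against the iteration number gives a smoothed view of how the cost evolves, without recomputing the full-batch objective at every step.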
Solutions
After you have completed the exercises above, please refer to the solutions below and check that your implementation and your
answers are correct. In a case where your implementation does not result in the same parameters/phenomena as described
below, please debug your solution until you manage to replicate the same effect as our implementation.
1. Batch gradient descent
(a) Verify the values of $\theta$ that you get against ours. If your values are off by 0.05 or so, that's fine, but if the values you get are significantly further off, you may need to play with $\alpha$ or run it for a larger number of iterations. Plotting $J(\theta)$ as a function of the number of iterations (as in part (b)) may also help. There's a large range of learning rates that we consider reasonable.
(b) Verify that your plot of $J(\theta)$ is strictly decreasing as a function of the number of iterations, and that your cost function converges. If you don't get $J(\theta)$ converging, try changing $\alpha$ or running it for more iterations. You should find that a value of $\alpha$ that is too small gives very slow convergence, and one that is too large fails to converge, as shown in the graph below.
2. Practice with stochastic gradient descent
(a) You should see that this plot has a "wavy" appearance. This is because the training set has 1,000 negative examples following 1,000 positive examples. Thus, the algorithm alternates between thinking that there are a lot of negative examples and thinking that there are a lot of positive training examples.
(b) By shuffling the data, the "wavy" appearance should be reduced, and you should obtain better convergence of the parameters, as shown in the graph below. Verify again the values of $\theta$ that you get; if your values are off by 0.1 or so, that's fine. Even after shuffling the data, with stochastic gradient descent the parameters will oscillate slightly in the vicinity of the global minimum. Note that while we asked you to run only 5 full passes through the entire training set, we show the graphs for 10 passes. As can be seen from the graphs, 5 passes were enough for convergence.
(c) You should see your running estimate of $J(\theta)$ decrease toward the value found by batch gradient descent. As with the value of $\theta$, with stochastic gradient descent the online estimate of $J(\theta)$ will also oscillate slightly about its true value, as shown below.