Exercise 1: Logistic Regression

This course consists of videos and programming exercises to teach you about unsupervised feature learning and deep learning. The exercises are designed to give you hands-on, practical experience for getting these algorithms to work. To get the most out of this course, you should watch the videos and complete the exercises in the order in which they are listed.

This first exercise will give you practice with logistic regression. These exercises have been extensively tested with Matlab, but they should also work in Octave, which has been called a "free version of Matlab." If you are using Octave, be sure to install the Image package as well (available for Windows as an option in the installer, and available for Linux from Octave-Forge ).

Download ex1Data.zip, and extract the files from the zip file. If you're using Matlab or Octave, the data you'll need for this exercise is contained in x.dat and y.dat. Each row of the $x$ matrix corresponds to one training example ( $x^{(i)} \in {\mathbb R}^2$), and the corresponding row of $y$ is the class label. This dataset has $m=2,000$ training examples.


  1. 1. Batch gradient descent

    In this first problem, you'll implement logistic regression using batch gradient descent.

    1. (a) Recall that the logistic regression model is
      \begin{displaymath}
h_{w,b}(x) = \frac{1}{1+\exp(-w^Tx - b)}, \nonumber
\end{displaymath}  

      and the batch gradient descent update rule is

      \begin{eqnarray*}
w_j &:=& w_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{w,b}(x^{(i)...
... - \alpha \frac{1}{m} \sum_{i=1}^m (h_{w,b}(x^{(i)}) - y^{(i)}).
\end{eqnarray*}


      Here, we are not using weight decay. Implement batch gradient descent, and use it to find a good setting of $w \in {\mathbb R}^2$ and $b \in {\mathbb R}$. Choose a reasonable value of $\alpha$ yourself, and note down the values of $w$ and $b$ found by your algorithm. (To verify that your implementation is correct, later we'll ask you to check your values of $w$ and $b$ against ours.)

    2. (b) Plot $J(w,b)$ as a function of the number of iterations, and verify convergence of your algorithm. The objective function is given by
      \begin{displaymath}
J(w,b) =
\frac{1}{m} \left(
\sum_{i=1}^m -y^{(i)} \log h_{w...
...
-(1-y^{(i)}) \log (1 - h_{w,b}(x^{(i)}) ) \right). \nonumber
\end{displaymath}  

      In addition to generating this plot using the value of $\alpha$ that you had chosen, also repeat this exercise (re-initializaing gradient descent to $w = \vec{0}, b=0$ each time) using $\alpha = 0.01, 1$ and $20$.

    1. 2. Practice with stochastic gradient descent

    2. (a) Implement stochastic gradient descent for the same logistic regression model as Question 1. For now, leave the data in the original ordering, and do not shuffle the data. Recall that the stochastic descent learning algorithm is:

      $\mbox{Repeat } \{$
                $\mbox{for i }=1 \mbox{ to m, }\{$
                          $w_j := w_j - \alpha (h_{w,b}(x^{(i)})-y^{(i)}) x_j^{(i)} \;\;\;\;\;\mbox{(for all $j$)}$
                          $b := b - \alpha (h_{w,b}(x^{(i)})-y^{(i)})$
                $\}$
      $\}$

      Initialize $w = \vec{0}, b=0$, use a learning rate of $\alpha=0.01$, and run stochastic gradient descent so that it loops through your entire training set 5 times (i.e., execute the outerloop above 5 times; since you have 2,000 training examples, this corresponds to 10,000 iterations of stochastic gradient descent).

      Run stochastic gradient descent, and plot the parameter $b$ as a function of the number of iterations taken. In other words, draw a plot with 10,000 points, where the horizontal axis is the number of iterations of stochastic gradient descent taken, and the vertical axis is the value of your parameter $b$ after that many iterations. (Note that one "iteration" means performing a single update using a single training example, not taking 5 updates using all of your training set.) Do you see this plot having a ``wavy'' appearance? What do you think causes this?

    3. (b) Repeat the previous problem, but now shuffle the training set first. If you are using Matlab/Octave, you can use the randperm command. Specifically, randperm(2000) generates a random permutation of the integers from 1 to 2000. Thus, this series of commands will shuffle your data:

      myperm = randperm(2000);
      shuffledX = x(myperm, :);
      shuffledY = y(myperm);
      

      Re-run stochastic gradient descent using the shuffled data, and replot $b$ as a function of the number of iterations. How is this plot different from the previous one? Note down the values of $w$ and $b$ that you get.

    4. (c) Use an exponentially-weighted average to keep track of an online estimate of $J(w,b)$ while stochastic gradient descent is running, and plot this online estimate as a function of the number of iterations. Recall that the online estimate is obtained by initializing $\hat{J} = 0$, and then updating it (on each iteration of stochastic gradient descent) using a formula such as
      \begin{displaymath}
\hat{J} := 0.999\hat{J} + 0.001 \left( -y^{(i)}\log h_{w,b}(...
...i)}) - (1-y^{(i)}) \log(1-h_{w,b}(x^{(i)})) \right). \nonumber
\end{displaymath}  

      Feel free also to play with the parameters ``0.999'' and ``0.001.''