
Linear Regression

Notation

| General Notation | Description | Python (if applicable) |
|---|---|---|
| $\mathbf{x}$ | Training feature vector | `x_train` |
| $\mathbf{y}$ | Training target values | `y_train` |
| $x^{(i)}$, $y^{(i)}$ | $i^{th}$ training example | `x_i`, `y_i` |
| $\hat{y}^{(i)}$ | Predicted value for $x^{(i)}$ | `y_hat_i` |
| $m$ | Number of training examples | `m` |
| $w$ | Model parameter: weight | `w` |
| $b$ | Model parameter: bias | `b` |
| $f_{w,b}(x^{(i)})$ | $f_{w,b}(x^{(i)}) = wx^{(i)}+b$ | `f_wb` |
| $\mathcal{J}(w,b)$ | Cost/loss function evaluated at $w,b$ | `J_wb` |

Linear Regression is, at its core, curve fitting with a straight line. Given training data tuples $(x^{(i)}, y^{(i)})$, we want to find a straight line that best fits the data. The line is represented by the equation $y = wx + b$, where $w$ is the slope and $b$ is the y-intercept.

\[\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)}+b\]

The goal is to find the best values for $w$ and $b$ such that the line fits the data as closely as possible. We will see how this is done.

Cost/Loss Function

\[\mathcal{J}(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\]
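
As a quick illustrative sketch (not code from the original post; `compute_cost` and the argument names are assumptions), this cost can be evaluated in NumPy as follows:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) over the m training examples (illustrative sketch)."""
    m = x.shape[0]
    f_wb = w * x + b                          # predictions f_wb(x^(i)) for every example
    return np.sum((f_wb - y) ** 2) / (2 * m)
```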

Aim:

\[\min\limits_{w,b} \mathcal{J}(w,b)\]

For this, we use the gradient descent algorithm.

Gradient Descent

Repeat until convergence:

\[\begin{equation*}\begin{split} w \leftarrow w - \alpha \frac{\partial}{\partial w} \mathcal{J}(w,b) \\ b \leftarrow b - \alpha \frac{\partial}{\partial b} \mathcal{J}(w,b) \end{split}\end{equation*}\]

where $\alpha$ is the learning rate.

Simultaneously updating $w$ and $b$ is important: before applying either update, store the values of $\frac{\partial}{\partial w} \mathcal{J}(w,b)$ and $\frac{\partial}{\partial b} \mathcal{J}(w,b)$ in temporary variables.
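
A minimal sketch of such a simultaneous update step (the function and argument names are illustrative, not from the post):

```python
def gradient_step(w, b, alpha, compute_gradient):
    """One simultaneous gradient descent step.

    compute_gradient(w, b) is assumed to return the pair
    (dJ/dw, dJ/db) evaluated at the current (w, b).
    """
    tmp_dj_dw, tmp_dj_db = compute_gradient(w, b)  # gradients at the current point
    w = w - alpha * tmp_dj_dw                      # only now update both parameters
    b = b - alpha * tmp_dj_db
    return w, b
```

Updating $w$ first and then evaluating $\frac{\partial}{\partial b} \mathcal{J}(w,b)$ with the already-updated $w$ would not be a simultaneous update.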

Some important points to note:

  • If $\alpha$ is too small, gradient descent can be slow.
  • If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
  • If $w$ reaches a minimum, then $\frac{\partial}{\partial w} \mathcal{J}(w,b) = 0$ and similarly for $b$, so no change in $w$ and $b$ will occur.

Can we adjust $\alpha$ automatically as training progresses? Yes: one common technique is learning rate decay, which we will cover in a future post.

Derivation of $\frac{\partial}{\partial w} \mathcal{J}(w,b)$ and $\frac{\partial}{\partial b} \mathcal{J}(w,b)$

\[\begin{equation*}\begin{split} \frac{\partial}{\partial w} \mathcal{J}(w,b) &= \frac{\partial}{\partial w} \frac{1}{2m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \\ &= \frac{1}{2m} \sum_{i=0}^{m-1} 2(f_{w,b}(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial w} (f_{w,b}(x^{(i)}) - y^{(i)}) \\ &= \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)} \end{split}\end{equation*}\]

A similar derivation applies to $\frac{\partial}{\partial b} \mathcal{J}(w,b)$.
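
For completeness, carrying out the same steps for $b$, where the inner derivative $\frac{\partial}{\partial b} (f_{w,b}(x^{(i)}) - y^{(i)})$ is simply $1$:

\[\begin{equation*}\begin{split} \frac{\partial}{\partial b} \mathcal{J}(w,b) &= \frac{1}{2m} \sum_{i=0}^{m-1} 2(f_{w,b}(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial b} (f_{w,b}(x^{(i)}) - y^{(i)}) \\ &= \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \end{split}\end{equation*}\]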

Gradient Descent for Linear Regression

Repeat until convergence:

\[\begin{equation*}\begin{split} w \leftarrow w - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)} \\ b \leftarrow b - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \end{split}\end{equation*}\]
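
Putting the pieces together, a sketch of batch gradient descent for univariate linear regression might look like this (the function names, initial values, and hyperparameters are illustrative, not from the original post):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of J(w, b) over all m training examples."""
    m = x.shape[0]
    err = (w * x + b) - y              # f_wb(x^(i)) - y^(i) for every example
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Repeat the simultaneous update num_iters times."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw          # simultaneous update of w and b
        b = b - alpha * dj_db
    return w, b

# Example usage with made-up data:
# x_train = np.array([1.0, 2.0, 3.0]); y_train = np.array([300.0, 500.0, 700.0])
# w, b = gradient_descent(x_train, y_train, w=0.0, b=0.0, alpha=1e-2, num_iters=10000)
```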

Important points:

  • The cost function $\mathcal{J}(w,b)$ is a convex function, so there is only one minimum. Gradient descent will always converge to the global minimum (if the learning rate is not too large), irrespective of the initial values of $w$ and $b$.
  • The reason for this comes from statistics; I will try to include it in a future post.
  • If the cost function is not convex, then gradient descent may converge to a local minimum, which may not be the global minimum.
  • This is called Batch Gradient Descent, because we use all the training examples to update $w$ and $b$ in each iteration.

Higher Dimensions & Vectorization

| General Notation | Description | Python (if applicable) |
|---|---|---|
| $a$ | scalar | |
| $\mathbf{a}$ | vector | |
| $\mathbf{A}$ | matrix | |
| $\mathbf{X}_{n\times d}$ | training example matrix | `X_train` |
| $\mathbf{y}_{n\times 1}$ | training example targets | `y_train` |
| $\mathbf{x}^{(i)}_{d\times 1}$, $y^{(i)}$ | $i^{th}$ training example | `X[i]`, `y[i]` |
| $n$ | number of training examples | `n` |
| $d$ | number of features in each example | `d` |
| $\mathbf{w}$ | parameter: weight vector | `w` |
| $b$ | parameter: bias | `b` |
| $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ | $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)}+b$ | `f_wb` |

The feature matrix looks like this:

\[\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{d-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{d-1} \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n-1)}_0 & x^{(n-1)}_1 & \cdots & x^{(n-1)}_{d-1} \end{pmatrix}_{n\times d} = \begin{pmatrix} (\mathbf{x}^{(0)})^T \\ (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(n-1)})^T \end{pmatrix}\]
  • $\mathbf{x}^{(i)}_{d\times 1}$ is a vector containing example $i$, where $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots , x^{(i)}_{d-1})^T$.
  • $x^{(i)}_j$ is element $j$ in example $i$. The superscript in parentheses indicates the training example number, while the subscript represents a feature index (see the short NumPy snippet below).
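
In NumPy terms, these indexing conventions look roughly like this (the tiny dataset here is made up purely for illustration):

```python
import numpy as np

# A tiny made-up dataset: n = 3 examples, d = 2 features each.
X_train = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
y_train = np.array([5.0, 11.0, 17.0])

print(X_train.shape)   # (3, 2), i.e. (n, d)
print(X_train[0])      # [1. 2.], the first training example x^(0)
print(X_train[0, 1])   # 2.0, feature 1 of example 0, i.e. x^(0)_1
print(y_train.shape)   # (3,), one target per example
```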

For a single training example, the model evaluation can be written compactly as a dot product:

\[f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w}^T \mathbf{x}^{(i)} + b = w_0x^{(i)}_0 + w_1x^{(i)}_1 + \cdots + w_{d-1}x^{(i)}_{d-1} + b\]

Stacking all $n$ examples into the matrix $\mathbf{X}$, the predictions for the entire training set can be computed at once as $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$.

The special thing about NumPy is that vectorized calculations run in optimized, compiled code that takes advantage of the underlying hardware (e.g. SIMD instructions and, for some operations, multiple CPU cores), so they are very fast.

Difference between a non-vectorized and a vectorized implementation:

```python
import numpy as np

# Non-vectorized: loop over every example and every feature
for i in range(n):
    y_hat[i] = sum(w[j] * X[i, j] for j in range(d)) + b

# Vectorized: a single matrix-vector product over all n examples
y_hat = np.dot(X, w) + b
# Here NumPy broadcasting is used (to add the scalar b), which we will see a bit later.
```

In the above code, np.dot computes the dot product (here a matrix-vector product) of the two arrays. It is implemented in compiled C code and backed by optimized linear algebra (BLAS) routines that can use multiple CPU cores in parallel, so it is very fast.
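
As a rough, machine-dependent illustration of the speed difference (the array sizes here are arbitrary, and the exact timings will vary):

```python
import time
import numpy as np

n, d = 5000, 500
X = np.random.rand(n, d)
w = np.random.rand(d)
b = 1.5

# Loop-based prediction
start = time.perf_counter()
y_hat_loop = np.zeros(n)
for i in range(n):
    total = 0.0
    for j in range(d):
        total += w[j] * X[i, j]
    y_hat_loop[i] = total + b
print("loop:      ", time.perf_counter() - start, "seconds")

# Vectorized prediction
start = time.perf_counter()
y_hat_vec = np.dot(X, w) + b
print("vectorized:", time.perf_counter() - start, "seconds")

print("max difference:", np.max(np.abs(y_hat_loop - y_hat_vec)))
```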
