Machine Learning

Thursday, May 22, 2025 · 37 min read

Page 1 of 53

Machine Learning

A computer program is said to learn from experience ( E )
with respect to some task ( T ) and some performance measure ( P )
if its performance on ( T ), measured by ( P ), improves with experience ( E ).

T = \text{Classifying emails as spam/not spam} E = \text{Watching me classify email as spam/not spam} P = \frac{\text{Correct classifications}}{\text{Total classifications}}

Types of Machine Learning

Supervised vs Unsupervised

Supervised
- Each training example has the associated input & the desired output.
- Examples:
  - Classification algorithms (predict discrete values).
  - Regression algorithms (predict continuous values).
Unsupervised
- Each training example has only the input.
- Used to uncover hidden structure & relations between training examples.
- Examples:
  - Clustering algorithms.

There are other types of learning, e.g., Reinforcement Learning (Recommendation systems).

Supervised Learning

Linear Regression

Training Set
( \text{Input} \rightarrow \text{Learning Algorithm} \rightarrow \text{Output (Hypothesis)} )

Single Input ( x )

Hypothesis:

h_{\theta}(x) = \theta_0 + \theta_1x

(Univariate linear regression)

Page 2 of 53

Parameters of the Model

Need to choose ( \theta_0, \theta_1 ).
Minimize the cost function (penalty):

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2

Where:

( m ) = Number of training examples.
( h_{\theta}(x) ) = Predicted value.
( y ) = Actual output.

Intuition

( \theta_0 = 0 ), ( \theta_1 = 1 ):

( h_{\theta}(x) ) is linear.

( \theta_1 < 1 ):

( h_{\theta}(x) ) slope reduces.

( \theta_1 > 1 ):

( h_{\theta}(x) ) slope increases.

3D Plot:

( J(\theta_0, \theta_1) ) forms a convex quadratic function.
Sweet spot at minimum ( J(\theta_0, \theta_1) ).

Page 3 of 53

Contour Plots

Two-dimensional input giving one-dimensional output.
Representing 3D figure into 2D.

Steps:

Slice the 3D figure into multiple lines.
Squash all the contour lines on the X-Y plane (giving the same function).

Gradient Descent Algorithm

To find the "sweet spot," use Gradient Descent algorithm:

Start with some ( \theta_0, \theta_1, \dots ).
Keep changing ( \theta_0, \theta_1, \dots ) to reduce ( J(\theta_0, \theta_1) ).
End up at a minimum.

Algorithm:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

Where:

( \alpha ) = Learning rate (step to take in some direction).

It is simultaneous update of ( \theta_0, \dots \theta_n ).

Intuition

Positive Slope:

( \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) )
Move down the slope.

Negative Slope:

( \theta_1 := \theta_1 + \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) )
Move up the slope.

Page 4 of 53

Linear Regression Model

Cost Function:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2

Partial Derivatives:

For ( \theta_0 ):

\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = -\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})

For ( \theta_1 ):

\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = -\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}

Gradient Descent always converges to global optimum for linear regression as ( J(\theta) ) is always a bowl-shaped function.

Linear Regression with Multiple Variables

Hypothesis:

h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n

Let ( x_0 = 1 ):

h_{\theta}(x) = \theta^T X

Where:

( X ) = [( x_0, x_1, \dots x_n )]
( \theta ) = [( \theta_0, \theta_1, \dots \theta_n )]

Page 5 of 53

Parameters and Cost Function

Parameters:

$\theta_0, \theta_1, \theta_2, \dots, \theta_n \quad \rightarrow \quad \text{n-dimensional vector}$

Cost Function:

$J(\theta_0, \theta_1, \theta_2, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta^{(i)} - y^{(i)}\right)^2$

Gradient Descent

Formula:

Repeat: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

Simultaneously for (j = 0, 1, \dots, n).

Practical Implementation of Gradient Descent

a) Feature Scaling:

If we have multiple features, making sure that all features are on the same scale makes convergence faster and better.

Example:

( x_1 ): ( (0-2000) ), ( x_2 ): ( (1-5) ) → Skewed
( x_1 ): ( (0-1) ), ( x_2 ): ( (0-1) ) → Less Skewed

Diagram:

graph LR
    WithoutFeatureScaling --> Skewed
    WithFeatureScaling --> LessSkewed

Method:

To get every feature into approximately a ( -1 \leq x \leq 1 ) range: $x_i := \frac{x_i - \mu_i}{s_i}$ Where:

( \mu_i ) → Mean of ( x_i )
( s_i ) → Range of values (( \text{max} - \text{min} )) or Standard Deviation

b) Learning Rate:

( \alpha ) should not be too small or too large.
Convergence Test: ( J(\theta) ) should decrease after some iterations.

Diagram:

graph LR
    JTheta --> NumberOfIterations
    SmallAlpha --> SlowConvergence
    LargeAlpha --> Divergence

Page 6 of 53

Polynomial Regression

Example:

Let’s take an example of housing price for linear regression: $h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}$

Where:

( x_3 = \text{frontage} \times \text{depth} )

For quadratic polynomial: $h_\theta(x) = \theta_0 + \theta_1 x_3$

This idea is called Polynomial Regression.

Feature scaling becomes very important in Polynomial Regression.

Alternative to Gradient Descent: Normal Equations

Minimize ( J(\theta) ):

$\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad (\text{for every } j)$

After deriving all, we get: $\theta = \left(X^T X\right)^{-1} X^T y$

Complexity:

( O(n^{12}) ) for ( n = 10^6 )

Why not Normal Equations?

Doesn’t work fast if there are large number of features (( n > 10^6 )).
( X^T X ) may (or may not) be invertible/singular: $|X^T X| = 0$

Causes:

Redundant Features (( x_1 = x_2 )).
Too many features (( m \leq n )) → Requires Regularization.

Page 7 of 53

Classification Problems

Definition:

Predict ( y ) for discrete values.

Examples:

Email: Spam / Not Spam
Transactions: Fraud (Yes / No)?

$y \in \{0, 1\}$ Where:

( 0 ): "Negative Class"
( 1 ): "Positive Class"

Binary Classification Algorithm

Logistic Regression:

$0 \leq h_\theta(x) \leq 1$

Classification:

$h_\theta(x) = g(\theta^T x)$

Where ( g(z) ) is the Sigmoid Logistic Function: $g(z) = \frac{1}{1 + e^{-z}}$

Summary:

This section covers gradient descent, feature scaling, polynomial regression, normal equations, and binary classification with logistic regression.

References & Related Topics:

Gradient Descent Optimization
Polynomial Regression Techniques
Regularization in High-Dimensional Spaces
Logistic Regression for Binary Classification

Page 9 of 53

Hypothesis Function

![Graph of g(z)](INSERT IMAGE)

We aim to fit parameters ( \theta ) into this function ( g(z) ).

What is ( h_\theta(x) ) and its significance?

It represents the estimated probability that ( y = 1 ) given the input ( x ), i.e.,

h_\theta(x) = P(y = 1 | x; \theta)

The probabilities satisfy:

P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1

Prediction Rule:

Predict ( y = 1 ) if ( h_\theta(x) \geq 0.5 )
( \Rightarrow \theta^Tx \geq 0 )
Predict ( y = 0 ) if ( h_\theta(x) < 0.5 )
( \Rightarrow \theta^Tx < 0 )

Property of Hypothesis Function:

This hypothesis function creates decision boundaries that separate ( x ) and ( y ), enabling classification.

Page 10 of 53

Cost Function

Linear Regression Cost Function:

For logistic regression, if we use the linear regression cost function:

J(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

However, this results in a non-convex cost function ( J(\theta) ), which is undesirable:

graph TD
    A[Non-convex Cost Function] --> B[Convex Cost Function]

Logistic Regression Cost Function:

To avoid this, we use the following cost function:

\text{Cost}(h_\theta(x), y) = \begin{cases} - \log(h_\theta(x)) & \text{if } y = 1 \\ - \log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

Graphical Representation:

graph TD
    C[Case 1 (y=1)] --> D[Cost reduces sharply as h_theta(x) approaches 1]
    E[Case 2 (y=0)] --> F[Cost reduces sharply as h_theta(x) approaches 0]

Page 11 of 53

Combining Both Cases

To generalize the cost function for both cases:

\text{Cost}(h_\theta(x), y) = y \log(h_\theta(x)) + (1 - y) \log(1 - h_\theta(x))

This function has the property of convexity and uses maximum likelihood estimation.

Optimization:

To minimize the cost function ( J(\theta) ), we need to fit parameters ( \theta ):

J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

We achieve this using gradient descent.

Page 12 of 53

Gradient Descent

Update Rule:

The parameters ( \theta_j ) are updated iteratively:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Partial Derivatives:

The derivative of ( J(\theta) ) with respect to ( \theta_j ) is:

\frac{\partial J(\theta)}{\partial \theta_j} = - \frac{1}{m} \sum_{i=1}^m \left[ x_j^{(i)} \left( y^{(i)} - h_\theta(x^{(i)}) \right) \right]

Substituting ( h_\theta(x) = \frac{1}{1 + e^{-z}} ):

\frac{\partial J(\theta)}{\partial \theta_j} = - \frac{1}{m} \sum_{i=1}^m \left[ x_j^{(i)} \left( y^{(i)} - \frac{1}{1 + e^{-z}} \right) \right]

Further simplifications yield:

\frac{\partial J(\theta)}{\partial \theta_j} = - \frac{1}{m} \sum_{i=1}^m \left[ x_j^{(i)} \left( y^{(i)} \cdot (1 + e^{-z} - 1) - \frac{1}{1 + e^{-z}} \right) \right]

The iterative update ensures convergence to the optimal parameters.

Page 13 of 53

Mathematical Derivations

Derivative of Cost Function:

\frac{\partial J(\theta)}{\partial \theta_0} = \frac{-1}{m} \sum_{i=1}^{m} \Big[ x^{(i)} \Big( \frac{1}{1+e^{-z}} - y^{(i)} \Big) \Big]

\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \Big[ h_\theta^{(i)}(x^{(i)}) - y^{(i)} \Big] x^{(i)}

Other Optimization Algorithms

Gradient Descent
Conjugate Gradient
BFGS
L-BFGS

Advantages:

No need to manually pick a learning rate.
Often faster than gradient descent.

Disadvantages:

More complex.

Multi-Class Classification Algorithms

Examples:
- Email categorization: Tagging work, friends, family.
- Medical diagnosis: Not ill, cold, flu.

Binary Classification Diagram:

graph TD
    A[Binary Classification] --> B[Decision Boundary]
    B --> C[x1]
    B --> D[x2]
    D --> E[Points classified as class A and class B]

Page 14 of 53

Multi-Class Classification: Breaking into Sub-Problems

For multi-class classification, create new problems by separating different sets of data:

Diagrams:

Example 1:

graph TD
    A[x1] --> B[Class A]
    A --> C[Class B]
    A --> D[Class C]

Example 2:

graph TD
    A[x1] --> B[Class 1]
    A --> C[Class 2]
    A --> D[Class 3]

Summarized Approach:

Train a logistic regression classifier ( h_\theta^{(i)}(x) ) for each class ( i = 1, 2, ..., k ). Predict the probability that ( y = i ).
On a new input ( x ), make a prediction by picking the class:

\text{class} = \arg \max_{i} h_\theta^{(i)}(x)

Page 15 of 53

Regularization

I. Problem of Over-Fitting

For linear regression:

Underfit (High Bias): $\theta_0 + \theta_1 x$
Overfit (High Variance): $θ_{0} + θ_{1} x + θ_{2} x^{2} + θ_{3} x^{3} + . .$

For logistic regression:

Underfit (High Bias): $\theta_0 + \theta_1 x_1 + \theta_2 x_2$

II. Addressing Over-Fitting

Reduce Number of Features
- Manually select which features to keep.
- Use model selection algorithms.
Regularization
- Keep all the features, but reduce the magnitude of parameters ( \theta_j ).

Page 16 of 53

Regularization: Cost Function

For linear regression:

J(\theta) = \frac{1}{2m} \Bigg[ \sum_{i=1}^{m} \Big( h_\theta(x^{(i)}) - y^{(i)} \Big)^2 \Bigg] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

Where ( \lambda ) is the regularization parameter.

Gradient Descent:

Repeat:

\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) x_0^{(i)}

\theta_j = \theta_j - \alpha \Bigg[ \frac{1}{m} \sum_{i=1}^{m} \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) x_j^{(i)} + \frac{\lambda}{m} \theta_j \Bigg]

Where ( \alpha ) is the learning rate.

Normal Equation:

\theta = \Big( X^T X + \lambda \begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix} \Big)^{-1} X^T y

If ( \lambda > 0 ), then the above matrix is invertible.

Page 17 of 53

For Logistic Regression

The cost function ( J(\theta) ) for logistic regression:

J(\theta) = \frac{-1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \left( h_{\theta}(x^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - h_{\theta}(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \quad (\text{if } \theta_0)

Gradient Descent

Gradient descent is used for optimization, and is same as regularized linear regression.

Neural Networks

Why?

Consider a regression/classification problem with a large number of features ( n > 1000 ).
This might blow feature space as total number of possible combinations for all cubic terms is ( \binom{n}{3}^3 ).

Neural Networks: Algorithms that attempt to mimic the brain's structure.

Diagram:

Illustration of a neuron in the brain:

%% Placeholder for neuron structure diagram
%% Diagram type: Architecture Diagram
(INSERT IMAGE)

Page 18 of 53

Representation

Neuron Model

A logistic unit is represented as:

Bias Unit ( x_0 = 1 )

Diagram of neuron model:

%% Placeholder for neuron model diagram
%% Diagram type: Block Diagram
(INSERT IMAGE)

Sigmoid (Logistic) Activation Function

The activation function is used for logistic regression and neural networks.

Neural Network (Artificial)

Illustration of a neural network:

%% Placeholder for artificial neural network diagram
%% Diagram type: Architecture Diagram
(INSERT IMAGE)

Layers

Input Layer
Middle Layer
Output Layer

Notation

( a_j^{(i)} ): "activation" of unit ( i ) in layer ( j ).
( \theta^{(i)} ): matrix of weights controlling function mapping from layer ( j ) to layer ( j+1 ).

Page 19 of 53

Formulas and Dimensions

Activation formulas across layers:

a_1^{(2)} = g\left(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3\right)

a_3^{(2)} = g\left(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \theta_{33}^{(1)} x_3\right)

h_\theta(x) = a_3^{(3)} = g\left(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)}\right)

Dimensions of ( \theta^{(i)} ):

( \theta^{(i)} ) has dimension ( s_{j+1} \times (s_j + 1) ).

Activation Calculation:

a_j^{(i)} = g(z_j^{(i)})

Where:

z_j^{(i)} = \theta^{(i)} \cdot a^{(i-1)} \quad \text{(weights × activation)}

Page 20 of 53

Algorithm

For Classification:

For ( m ) training examples with ( l )-layers:
( s_j ): number of units (not counting bias unit) in layer ( j ).

1. Binary Classification

Output Units:
- ( s_{l} = 1 )
- ( k = 1 )

2. Multi-Class Classification

Output Units:
- ( s_{l} = k \quad (k \geq 3) )

Cost Function

For multi-class classification:

J (θ) = \frac{- 1}{m} \sum_{i = 1}^{m} \sum_{k = 1}^{K} [y_{k}^{(i)} \log (h_{θ}^{(i)} (x)) + (1 - y_{k}^{(i)}) \log (1 - h_{θ}^{(i)} (x))] + \frac{λ}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l + 1}} \sum_{j = 1}^{s_{l} + 1} {(θ_{}^{}}^{}

Goal:

Minimize ( J(\theta) ):

\theta = \min J(\theta)

Page 21 of 53

Backpropagation Algorithm

(Calculates gradient efficiently with SGD uses)

Definitions

( S_j^{(l)} ): "Error" of node ( j ) in layer ( l ).

For layer ( l ):
[ S_j^{(l)} = (a_j^{(l)} - y_j) \cdot h\theta(x_j) ]

Diagram

graph LR
    Layer1 --> Layer2
    Layer2 --> Layer3
    Layer3 --> Output

For Earlier Layers

[ S^{(3)} = (\theta^{(4)})^T \cdot S^{(4)} \cdot g'(z^{(3)}) \quad \text{where} \quad g'(z^{(3)}) = a^{(3)} \cdot (1 - a^{(3)}) ]

[ S^{(2)} = (\theta^{(3)})^T \cdot S^{(3)} \cdot g'(z^{(2)}) ]

Additionally: [ S_j^{(2)} \cdot \delta_j^{(2)} = \frac{\partial z_j^{(2)}}{\partial \theta_j^{(2)}} \quad \text{(at ( i ))}. ]

If ( \lambda = 0 ): [ \frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = a_j^{(l)} \cdot S_j^{(l+1)}. ]

Activation Function Derivative

[ g'(z^{(2)}) = g(z^{(2)}) \cdot (1 - g(z^{(2)})) = a^{(2)} \cdot (1 - a^{(2)}). ]

Page 22 of 53

Backpropagation (Again)

Definition

It is an algorithm for calculating the gradient for one training example.

Example

Consider one training example.
Recognize a 2 in the last layer.

Diagram

graph LR
    Layer1 --> Output
    Output[Recognizes 0, 2]

What Can We Do?

Increase ( b ):
Increase ( w_i ): In proportion to ( a_i ).
Change ( a_i ): ~~Can't directly influence this one~~ → Corrected: Cannot directly influence this one.

Now, all desires of output neurons are added to previous weights & biases.
Apply it recursively → This is called backpropagation.

Practical Note:

In practice, it takes a long time to add up influences for every training example for every gradient.

Page 23 of 53

Stochastic Gradient Descent (SGD)

Key Point

SGD takes mini-batches and computes each step using batch gradient descent.

Gradient Descent

Gradient descent: Will average all gradients for every training example.

Onto Calculus

Cost Function:
[ C_0 = (a^{(L)} - y)^{2} ]

Equations:
[ z^{(L)} = \theta^{(L)} a^{(L-1)} + b^{(L)}. ]

[ a^{(L)} = \sigma(z^{(L)}). ]

Diagram

graph TD
    Input --> HiddenLayer1
    HiddenLayer1 --> HiddenLayer2
    HiddenLayer2 --> Output

Change in Cost w.r.t. Weights: [ \frac{\partial C_0}{\partial \theta^{(L)}} = \sum_{k=1}^{K} a^{(L-1)} \cdot \sigma'(z^{(L)}) \cdot [2(a^{(L)} - y)]. ]

Change in Cost w.r.t. Bias: [ \frac{\partial C_0}{\partial b} = 1 \cdot \sigma'(z^{(L)}) \cdot [2(a^{(L)} - y)]. ]

Page 24 of 53

New: Propagating Backwards

Backpropagation Algorithm (Summarized)

Training Set:
[ { (x^{(m)}, y^{(m)}), \ldots (x^{(m)}, y^{(m)}) }. ]
Initialize: [ \Delta_{ij}^{(l)} = 0 \quad \text{(for all ( l, i, j ))}. ]
For ( i = 1 ) to ( m ):
- Set ( a^{(1)} = x^{(i)} ).
- Perform forward propagation to compute ( a^{(L)} ) for ( l = 2, 3, \ldots, L ).
- Compute ( C_0 = a^{(L)} - y^{(i)} ).
- Compute ( S^{(l)} ): [ S^{(l)} = a^{(l)} \cdot \sigma'(z^{(l)}). ]
Update: [ \Delta_{ij}^{(l)} = \Delta_{ij}^{(l)} + a_i^{(l)} \cdot S_j^{(l+1)}. ]
Compute Gradients:
- If ( j \neq 0 ): [ \frac{\partial J}{\partial \theta_{ij}^{(l)}} = \frac{1}{m} \Delta_{ij}^{(l)} + \lambda \theta_{ij}^{(l)}. ]
- If ( j = 0 ) (Bias unit): [ \frac{\partial J}{\partial \theta_{ij}^{(l)}} = \frac{1}{m} \Delta_{ij}^{(l)}. ]

Page 25 of 53

Implementation Notes

Matrix to Vector Conversion
- You can convert matrices to vectors and small streams as suited for needs.
Gradient Checking
- You can do gradient checking which can confirm if backpropagation is working or not: $\text{gradApprox} = \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$ for ( i = 1 ) to ( n ).

Putting It All Together

Select a network architecture:
- Number of input units → Dimension of features ( n^{(i)} )
- Number of output units → Number of classes ( k )
- Number of hidden units → ( l ) (or more), and have same number of hidden units in every layer.
Randomly initialize weights
Implement forward propagation to get ( h_\Theta^{(i)} ) for any ( i ).
Implement code to compute cost function ( J(\Theta) ).
Implement backpropagation to compute partial derivatives ( \frac{\partial J}{\partial \Theta_{ij}^{(l)}} )
- For ( i = 1 ) to ( m ):
  Perform forward and backpropagation using ( x^{(i)}, y^{(i)} ).
  Get activations ( a^{(l)}, z^{(l)} ), and delta terms ( \delta^{(l)} ) for ( l = 2, 3, \dots, L ).
Use gradient checking to compare gradients, then disable gradient checking.
Use gradient descent or advanced optimization method with backpropagation to try to minimize ( J(\Theta) ) as a function of parameters ( \Theta ).

Page 26 of 53

Advice on How to Apply ML Algorithms

Debugging a Learning Algorithm

Example:
Suppose for a problem (housing prices), we have implemented regularized linear regression.
However, when we test our hypothesis on a new set of houses, it makes unacceptably large errors in its predictions.

Options to Fix Them:

Solution	Answer
Get more training examples	Fix high variance
Try smaller set of features	Fix high variance
Try getting additional features	Fix high bias
Try adding polynomial features	Fix high bias
Try decreasing ( \lambda )	Fix high bias
Try increasing ( \lambda )	Fix high variance

These may take a long time, so we use something called Machine Learning Diagnostic.

Evaluating a Hypothesis

Divide dataset into examples:
Randomized split:
- Training set: ( 70% )
- Test set: ( 30% )
Learn parameters ( \Theta ) from training data
Compute test set error:
$\Theta_{test} = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\Theta^{(i)}(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$

Page 27 of 53

Dividing Polynomial Parameters and Lambda (Model Selection)

Consider 10 Hypotheses:

( h_\Theta(x) = \Theta_0 + \Theta_1x )
( h_\Theta(x) = \Theta_0 + \Theta_1x + \Theta_2x^2 )
...
( h_\Theta(x) = \Theta_0 + \Theta_1x + \Theta_2x^2 + \dots + \Theta_{10}x^{10} )

If 5th hypothesis is selected, this will not generalize our model as it is practical evaluation only on test set.

Alternative Method:

Divide dataset into 3 parts:
- Training
- Cross-validation
- Test
Find all three errors.
Run all hypotheses (( h_\Theta )):
Pick lowest ( J(\Theta) ) for cross-validation set.
Estimate generalization error for test set ( J_{test}(\Theta^{(l)}) ).

Page 28 of 53

IV. Diagnosing Bias vs Variance

Training Error:

$J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$

Cross Validation Error:

$J_{\text{CV}}(\theta) = \frac{1}{2m_{\text{CV}}} \sum_{i=1}^{m_{\text{CV}}} \left(h_{\theta}(x_{\text{CV}}^{(i)}) - y_{\text{CV}}^{(i)}\right)^2$

Graphs:

flowchart TD
    A[Bias (High Bias)] -->|Underfit| B[Error ↑]
    C[Variance (High Variance)] -->|Overfit| D[Error ↑]

Error vs Degree of Polynomial (d):
- High Bias: Both training error ( J_{\text{train}}(\theta) ) and cross-validation error ( J_{\text{CV}}(\theta) ) are high.
- High Variance: ( J_{\text{train}}(\theta) ) is low, but ( J_{\text{CV}}(\theta) ) is high.

If Model Suffers From Bias:

( J_{\text{train}}(\theta) ) ↑
( J_{\text{CV}}(\theta) ) ↑
( J_{\text{CV}}(\theta) \approx J_{\text{train}}(\theta) )

Else If Model Suffers From Variance:

( J_{\text{train}}(\theta) ) ↓
( J_{\text{CV}}(\theta) ) ↑
( J_{\text{CV}}(\theta) \gg J_{\text{train}}(\theta) )

Page 29 of 53

V. Regularization & Bias-Variance

Model:

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots + \theta_n x^n$

Cost Function:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

Bias-Variance Tradeoff:

Large (\lambda): High Bias (Underfit)
Intermediate (\lambda): Just Right
Small (\lambda): High Variance (Overfit)

Example:

(\lambda = 1000): ( \theta_1 \approx 0, \theta_2 \approx 0 )
( h_\theta(x) \approx \theta_0 )

Choosing Good Value for (\lambda):

Experiment with values: (0, 0.01, 0.02, 0.04, \dots)
Multiply by 2 → Calculate ( J_{\text{CV}}(\theta) )

Graph:

flowchart TD
    A[High Bias] -->|Small Lambda| B[High Variance]
    B -->|Optimal Lambda| C[Minimized Error]

Page 30 of 53

VI. Learning Curves

Training Error:

$J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$

Cross Validation Error:

$J_{\text{CV}}(\theta) = \frac{1}{2m_{\text{CV}}} \sum_{i=1}^{m_{\text{CV}}} \left(h_{\theta}(x_{\text{CV}}^{(i)}) - y_{\text{CV}}^{(i)}\right)^2$

Graphs:

flowchart TD
    A[High Bias] -->|No improvement| B[Error stays high]
    C[High Variance] -->|Training data helps| D[Error reduces]

Learning Algorithm Suffers From High Bias:
- Will not perform well even with more training data.
Learning Algorithm Suffers From High Variance:
- Getting more training data is likely to help.

Page 31 of 53

VII. Selecting Neural Networks

Small Neural Networks:

Fewer parameters
Computationally cheaper
More prone to underfitting

Large Neural Networks:

More parameters
Computationally expensive
More prone to overfitting
Use regularization to mitigate overfitting.

Page 32 of 53

Notes:

Other than these algorithms, there are many more.
Project will depend on data & capabilities.

VIII. Support Vector Machines (SVM)

Alternate View of Logistic Regression:

$h_\theta(x) = \frac{1}{1 + e^{-z}}$

Cases:

If ( y = 1 ), we want ( h_\theta(x) \approx 1 ), ( \theta^Tx \geq 0 )
If ( y = 0 ), we want ( h_\theta(x) \approx 0 ), ( \theta^Tx \leq 0 )

Cost Function:

$J = -\left[y \log(h_\theta(x)) + (1-y) \log(1-h_\theta(x))\right]$

SVM Cost Function:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \text{cost}_1(z^{(i)}) + (1-y^{(i)}) \text{cost}_0(z^{(i)})\right] + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_j^2$

Graphs:

flowchart TD
    A[Cost1(z)] -->|Case 1| B[Cost0(z)]
    B -->|Regularization| C[New Cost Function]

Page 33 of 53

SVM Cost Function and Parameters

Cost Function

The cost function ( J(\theta) ) is defined as:

J(\theta) = A + 2B

In another trade-off scenario:

J(\theta) = C A + B

Adding regularization parameters:

\min_C \sum_{i=1}^m \left[ y^{(i)} \cdot \text{cost}(\theta^Tx) + (1-y^{(i)}) \cdot \text{cost}(\beta^Tx) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2

Hypothesis

The hypothesis ( h_\theta(x) ) is defined as:

h_\theta(x) = \begin{cases} 1 & \text{if } \theta^Tx \geq 0 \\ 0 & \text{otherwise} \end{cases}

Notes:

If ( y=1 ), we want ( \theta^Tx \geq 1 ) (not just ( > 0 )).
- Or some cutoff value.
If ( y=0 ), we want ( \theta^Tx \leq -1 ) (not just ( < 0 )).

Safety Factor in SVM

This allows SVM to incorporate a safety factor into the model.
Mathematically, it enables SVM to achieve a better decision boundary (large margin).
Hence, it's called Large Margin Classifier.

Important Note:

In cases of outliers, ( C ) must not be too large for the large margin classifier.

Page 34 of 53

Kernels in SVM

Introduction to Kernels

Given ( x_1 ), compute new features depending on proximity to landmarks ( l^{(1)}, l^{(2)}, \dots, l^{(m)} ).

Gaussian Kernel

For a given ( x ):

( f_1 = \text{similarity}(x, l^{(1)}) = \exp\left(-\frac{|x - l^{(1)}|^2}{2\sigma^2}\right) )
( f_2 = \text{similarity}(x, l^{(2)}) )
( f_3 = \text{similarity}(x, l^{(3)}) )

Kernels (Gaussian Kernel):

K(x, l) = \exp\left(-\frac{\|x - l\|^2}{2\sigma^2}\right)

SVM with Kernel

Given: ( x^{(1)}, x^{(2)}, \dots, x^{(m)} ) and ( y^{(1)}, y^{(2)}, \dots, y^{(m)} ).
Choose landmarks: ( l^{(1)}, l^{(2)}, \dots, l^{(m)} ).
Compute:
- ( f_1 = \text{similarity}(x, l^{(1)}) )
- ( f_2 = \text{similarity}(x, l^{(2)}) ), etc.
Predict:
- ( y = 1 ) if ( \theta^T f > 0 ).
Training: $\min_\theta \sum_{i=1}^m \left[ y^{(i)} \cdot \text{cost}(\theta^T f^{(i)}) + (1-y^{(i)}) \cdot \text{cost}(\beta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$

Page 35 of 53

SVM Parameters and Implementation Notes

Kernel Limitations

Kernels don't perform well with logistic regression due to computational overhead (it's slow in logistic regression).

SVM Parameters:

C: Trade-off parameter $C = \frac{1}{\lambda}$

Large ( C ):
- Lower Bias, High Variance.
Small ( C ):
- Higher Bias, Low Variance.

Variance and Bias:

Large ( \sigma^2 ):
- Features are very smooth, higher bias, lower variance.
Small ( \sigma^2 ):
- Lower bias, higher variance.

Implementation Notes:

Case I: ( n ) large relative to ( m ):
- ( n = 10,000 ), ( m = 10-1000 ).
- Use logistic regression or SVM without kernel (efficient).
Case II: ( n ) small, ( m ) intermediate:
- ( n = 1-1000 ), ( m = 10-10,000 ).
- Use SVM with Gaussian kernel.
Case III: ( n ) small, ( m ) large:
- ( n = 1-1000 ), ( m = 50,000 ).
- Create/add more features, then use logistic regression or SVM without kernel.

Note:

Neural networks are likely to work well for most settings but may be slower to train.

Page 36 of 53

Unsupervised Learning

Clustering:

Training Set: ( x_1, x_2, \dots, x_m )

Algorithm: K-Means

Input:
- ( K ) (Number of Clusters) → To be decided.
- Training Set: ( x^{(1)}, x^{(2)}, \dots, x^{(m)} ).
Algorithm Steps:
- ( x^{(1)} \in \mathbb{R}^n ) (Assume ( x_0 = 1 ) convention).
- Randomly Initialize: ( K ) cluster centroids ( \mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n ).

Page 37 of 53

K-Means Algorithm

Algorithm Steps

Repeat:
- For (i = 1) to (m):
  - ( c^{(i)} ): Index (from 1 to (k)) of cluster centroid closest to (x^{(i)}).
- For (k = 1) to (K):
  - ( \mu_k ): Average (mean) of points assigned to cluster (k).

Other Applications of K-Means

Non-separated clusters:
- Example: T-shirt sizing based on height and weight.
- Clusters: Small (S), Medium (M), Large (L).
- Choose (k = 3).

graph TD
    A[Height] --> B[Small (S)]
    A --> C[Medium (M)]
    A --> D[Large (L)]

Optimization Objective

Cluster Assignment:
- ( c^{(i)} ): Index of cluster ((1, 2, \ldots, k)) to which example (x^{(i)}) is currently assigned.
- ( \mu_{c^{(i)}} ): Cluster centroid of cluster to which example (x^{(i)}) has been assigned.
Mathematical Formulation:
$J(c^{(1)}, c^{(2)}, \ldots, c^{(m)}, \mu_1, \mu_2, \ldots, \mu_k) = \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2$

Page 38 of 53

Random Initialization

Guidelines:
- (k < m).
- Randomly pick (k) training examples.
- Set (\mu_1, \ldots, \mu_k) equal to these (k) examples.
Initialization:
- Multiple runs to avoid bad luck.
- Example:
  1. Run 1 → 100 iterations.
  2. Randomly initialize (k)-means.
  3. Compute cost function.
  4. Choose initialization with lowest cost function.

Dimensionality Reduction

1. Why Dimensionality Reduction?

i) Data Compression:

Objective: Reduce data dimensions (e.g., 2D → 1D).
Example:
- (x_1) → height (cm).
- (x_2) → weight (kg).

graph TD
    A[Data (2D: Height & Weight)] --> B[Compressed Data (1D)]

ii) Data Visualization:

Problem:
- (x^{(i)} \in \mathbb{R}^n) is not helpful.
- Reduce data from (50D) to (2D).
- Result: (z^{(i)} \in \mathbb{R}^2).

Algorithm: Principal Component Analysis (PCA)

Problem:

Reduce (2D \to 1D).
Find a direction ((c)-vector) onto which to project the data to minimize projection error.

\text{Projection Error} = \sum \| x^{(i)} - \text{projection}(x^{(i)}) \|^2

graph TD
    A[Original Data (2D)] --> B[Projected Data (1D)]

Generalized Idea:
- Reduce (nD \to kD).
- Find (k) vectors onto which to project the data to minimize projection error.

Page 39 of 53

Data Preprocessing

Training Set:

(x^{(1)}, x^{(2)}, \ldots, x^{(m)}).
Feature Scaling & Mean Normalization:
- Mean: $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$

Note:

Different features on different scales (e.g., number of bedrooms vs. area).
Scale features to have comparable ranges.

PCA Steps

1. Compute Covariance Matrix:

\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T

2. Compute Eigen Vectors:

Use eigen decomposition to find vectors:

[U, S, V] = \text{eig}(\Sigma)

3. Reduce Dimensions:

Select top (k) eigenvectors:

U_{\text{reduce}} = U(:, 1:k)

4. Project Data:

z = U_{\text{reduce}}^T \cdot x^{(i)}

Choosing (k) (Number of Dimensions)

Metrics:

Average squared projection error:
$\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x^{(i)}_{\text{approx}} \|^2$

Page 41 of 53

Dimensionality Reduction (PCA)

Key Idea:

Typically, choose ( K ) to be the smallest value so that:

\frac{1}{m} \sum_{i=1}^m \frac{\| x^{(i)} - x_{\text{approx}}^{(i)} \|^2}{\| x^{(i)} \|^2} \leq 0.01 \tag{1}

99% of variance is retained.

Algorithm:

Try PCA with ( K = 1 ):
- Compute variance: $\frac{c_1^2}{\sigma^2}, \frac{c_2^2}{\sigma^2}, \dots, \frac{c_n^2}{\sigma^2}$

Efficient Code:

Given Singular Value Decomposition (SVD):

[u, s, v] = \text{svd}(\Sigma)

Where ( \Sigma ) is:

\begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & \cdots & \cdots & \cdots \\ \vdots & \vdots & \vdots & \vdots \\ s_{n1} & \cdots & \cdots & s_{nn} \end{bmatrix}

For given ( K ):

Check: $\frac{\sum_{i=1}^K s_{ii}}{\sum_{i=1}^n s_{ii}} \leq 0.01$

Select ( K ) accordingly.

Page 42 of 53

Reconstruction from Compressed Representation

The compressed representation ( Z ) is given by:

Z = U_{\text{reduce}}^T x

Reconstruction formula:

x_{\text{approx}} = U_{\text{reduce}} Z

Advice for Implementation:

For a supervised learning dataset:
$\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}$

Note:

Mapping ( x^{(i)} \to z^{(i)} ) should be derived by running PCA only on the training set.

Page 43 of 53

III. Anomaly Detection

Problem:

Given dataset:

\{ x^{(1)}, x^{(2)}, \dots, x^{(w)} \}

Is ( x_{\text{test}} ) anomalous?

Decision Rule:

Compute:
$p(x_{\text{test}})$
If:
$p(x_{\text{test}}) < \epsilon$

Gaussian Distribution:

Assume ( x \in \mathbb{R}^n ) is a distributed Gaussian with mean ( \mu ) and variance ( \sigma^2 ).

Probability density function:

p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

Where:

\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}

\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2

Page 44 of 53

Algorithm

Training Set:
$\{ x^{(1)}, x^{(2)}, \dots, x^{(m)} \}$

Evaluation

On a ~~carefully skewed~~ → labeled training set (carefully ensured), training is done and evaluated using:
- True Positives
- False Positives
- False Negatives
- True Negatives
Metrics:
- Precision/Recall
- F1-score

Additionally, use cross-validation to check parameter ( \epsilon ).

Why Not Use Supervised Learning?

Comparison:

Anomaly Detection	Supervised Learning
Very small no. of positive examples (( y = 1 ))	Large positive and negative examples
Large no. of negative examples (( y = 0 ))

Applications:

Anomaly Detection:
- Fraud Detection
- Manufacturing (e.g., aircraft engines)
- Data center monitoring machines
Supervised Learning:
- Email spam classification
- Weather prediction
- Cancer classification

Page 45 of 53

Choosing Good Features to Use

Note: Non-Gaussian features (if included in finding anomalies) will work just fine.
But some transformations can help to make the algorithm work a bit better:

Example:
- ( \text{new} = \log(\text{old}) )
- ( \text{new} = \text{old}^{\frac{1}{2}} )

Also, to detect anomalies, choose features that might take unusually large or small values in the event of an anomaly.

Multivariate Gaussian Distribution

( x \in \mathbb{R}^m ): Don't model ( p(x_1), p(x_2), \ldots ) etc. separately.
Model ( p(x) ) all at once.
Parameters:
- ( \mu \in \mathbb{R}^m ): Mean vector
- ( \Sigma \in \mathbb{R}^{m \times m} ): Covariance matrix

The probability density function is given by:

p(x; \mu, \Sigma) = \frac{1}{(2 \pi)^{\frac{m}{2}} |\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)

Visual representation:

Ellipses based on covariance matrix ( \Sigma = \begin{bmatrix} 0.6 & 0.9 \ 0.9 & 1.0 \end{bmatrix} )
Used to create different contours (curves) for anomalies.

Page 46 of 53

Special Applications

IV. Recommender Systems

These are algorithms aimed at suggesting relevant items to users.

Problem:

Given ( n ) users and ( m ) movies, out of which some are rated by users, develop an algorithm to generate ratings and recommend movies to users.

a) Content-Based Recommender Systems

Example Matrix:

Movie	( U_1 )	( U_2 )	( U_3 )	( U_4 )	Genre 1	Genre 2
( M_1 )	5	5	0	0	0.29	0.01
( M_2 )	5	?	?	?	0.99	0.0
( M_3 )	?	4	?	5	1.0	1.0
( M_4 )	0	0	5	?	0	0.9

For each user ( j ), learn a parameter ( \theta^{(j)} \in \mathbb{R}^{n+1} ).

User ( j ) assigns a rating to movie ( i ) with ( \theta^{(j)} \cdot x^{(i)} ), where ( x^{(i)} ) is the feature vector of movie ( i ).

Cost Function:

J(\theta) = \frac{1}{2m^{(i)}} \sum_{i=1}^{m^{(i)}} \left( \theta^{(j)^T} x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2

Page 47 of 53

Generalization:

To learn ( \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(m)} ):

\min_{\theta} \frac{1}{2} \sum_{i:j(i)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2

Gradient Descent Update:

\theta_k^{(i)} := \theta_k^{(i)} - \alpha \left( \sum_{i:j(i)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i)} \right)x_k^{(i)} \right) + \lambda \theta_k^{(i)}

For ( k \neq 0 ):

\theta_k^{(i)} := \theta_k^{(i)} - \alpha \left( \text{above term with regularization} \right)

b) Collaborative Filtering (Low-Rank Matrix Factorization)

Features in the content are difficult to find, e.g., YouTube.
If we happen to know ( \theta^{(i)} ) for every user ( j ), we can use that to find ( x^{(i)} ).

Objective:

Given ( \theta^{(i)}, \theta^{(j)} ), to learn ( x^{(i)} ):

\min_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \left( (\theta^{(j)})^T x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2

Page 48 of 53

Problem Formulation:

Formulate a problem to converge to the most efficient point:
Calls ( \theta \to x \to \theta \to x \to \theta \to x ).

Efficient way: Minimize ( x^{(i)}, \theta^{(j)} ) simultaneously rather than going back and forth.

Cost Function:

J(x^{(i)}, \theta^{(j)}) = \frac{1}{2} \sum_{i:j(i)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (x_k^{(i)})^2

Algorithm:

Initialize ( x^{(i)}, \theta^{(j)} ) to small random values.
Minimize ( J(x^{(i)}, \theta^{(j)}) ) using gradient descent (or advanced optimization algorithms): $x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{i:j(i)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i)} \right)\theta_k^{(j)} \right) + \lambda x_k^{(i)}$

Page 49 of 53

Large Scale Machine Learning

Let's say ( m = 10^8 ).

Now, every gradient descent update may be computationally expensive (Batch gradient descent).

Approach 1:

Let's try on ( m = 10^3 ) and plot error vs ( m ).

\text{Train}(\theta) \quad \text{vs} \quad m

Below are the observations:

High variance: Increase in ( m )
High bias: Increasing ( m ) might not help.

Approach 2: Stochastic Gradient Descent

Cost function:

\text{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2

Training cost:

\text{Train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{cost}(\theta, (x^{(i)}, y^{(i)}))

Steps:

Randomly shuffle the dataset.

Repeat:

for i = 1, ..., m {
    θ_j := θ_j - α * ( h_θ(x^(i)) - y^(i) ) * x^(i)_j
    for (j = 0, ..., n)
}

Page 50 of 53

It doesn't just get to the global minimum, but pretty close to the hypothesis, which actually is a good model in the practical world.

Approach 3: Mini-Batch Gradient Descent

Use ( b ) examples in each iteration:
( b = \text{mini-batch size} ) (usually ( b = 2 - 100 ))

Steps:

Repeat {
    for i = 1 to m/b {
        θ_j := θ_j - α * (1/b) * ∑_{k=1}^{b} ( h_θ(x^(k)) - y^(k) ) * x^(k)_j
        (for every j = 0, ..., n)
    }
}

Optimal mini-batch size: ( b = 10 )

Mini-batch works faster than stochastic gradient descent only when good vectorized implementation is there.

Ensuring Stochastic Gradient Descent Convergence:

During learning, compute ( \text{cost}(\theta, (x^{(i)}, y^{(i)})) ) before updating ( \theta ) using ( (x^{(i)}, y^{(i)}) ).
Plot ( \text{cost}(\theta, (x^{(i)}, y^{(i)})) ), averaged over the last 1000 examples processed by the algorithm.
Check if stochastic is converging.

Some options:

Decrease ( \alpha ) slowly w.r.t iterations.

Page 51 of 53

Applications

Online Learning: Learn through a single example or batch of examples and discard them.
- Helps in coping with changes in user preferences.
- Large data stream keeps running.

Approach 4: Map Reduce and Data Parallelism

What about training on multiple machines because the training dataset is too large for a single machine?

Map-Reduce Diagram:

graph TD
    A[Training Set] --> PC1
    A --> PC2
    A --> PC3
    A --> PC4
    PC1 --> B[Combine Results]
    PC2 --> B
    PC3 --> B
    PC4 --> B

Training formula:

\theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) \cdot x^{(i)}_j

Divide each PC by 100.

Page 52 of 53

Application Example:

Photo OCR Pipeline

Pipeline Diagram:

graph TD
    A[Photo OCR Pipeline]
    A --> B[Text Detection]
    A --> C[Character Segmentation]
    A --> D[Character Recognition]
    B --> E[Face Recognition Pipeline]
    E --> F[Remove Background]
    F --> G[Face Detection]
    G --> H[Segmentation]
    H --> I[Logistic Regression]
    I --> J[Label]

Page 53 of 53

Reinforcement Learning

Instead of (input, correct output),
we get (input, some output, grade for this output).

Application:

Playing games

References & Related Topics

Gradient Descent Algorithm
Multiple Linear & Polynomial Regression
Logistic Regression
Convex Optimization
Multi-Class Classification
Neural Network Architecture
SVMs
Anomaly Detection, Recommender Systems

Share via EmailShare on XShare on LinkedInShare on Reddit