Page 1 of 53
Machine Learning
A computer program is said to learn from experience ( E )
with respect to some task ( T ) and some performance measure ( P )
if its performance on ( T ), measured by ( P ), improves with experience ( E ).
Types of Machine Learning
Supervised vs Unsupervised
-
Supervised
- Each training example has the associated input & the desired output.
- Examples:
- Classification algorithms (predict discrete values).
- Regression algorithms (predict continuous values).
-
Unsupervised
- Each training example has only the input.
- Used to uncover hidden structure & relations between training examples.
- Examples:
- Clustering algorithms.
There are other types of learning, e.g., Reinforcement Learning (Recommendation systems).
Supervised Learning
Linear Regression
- Training Set
( \text{Input} \rightarrow \text{Learning Algorithm} \rightarrow \text{Output (Hypothesis)} )
Single Input ( x )
Hypothesis:
(Univariate linear regression)
Page 2 of 53
Parameters of the Model
- Need to choose ( \theta_0, \theta_1 ).
- Minimize the cost function (penalty):
Where:
- ( m ) = Number of training examples.
- ( h_{\theta}(x) ) = Predicted value.
- ( y ) = Actual output.
Intuition
( \theta_0 = 0 ), ( \theta_1 = 1 ):
- ( h_{\theta}(x) ) is linear.
( \theta_1 < 1 ):
- ( h_{\theta}(x) ) slope reduces.
( \theta_1 > 1 ):
- ( h_{\theta}(x) ) slope increases.
3D Plot:
- ( J(\theta_0, \theta_1) ) forms a convex quadratic function.
- Sweet spot at minimum ( J(\theta_0, \theta_1) ).
Page 3 of 53
Contour Plots
- Two-dimensional input giving one-dimensional output.
- Representing 3D figure into 2D.
Steps:
- Slice the 3D figure into multiple lines.
- Squash all the contour lines on the X-Y plane (giving the same function).
Gradient Descent Algorithm
To find the "sweet spot," use Gradient Descent algorithm:
- Start with some ( \theta_0, \theta_1, \dots ).
- Keep changing ( \theta_0, \theta_1, \dots ) to reduce ( J(\theta_0, \theta_1) ).
- End up at a minimum.
Algorithm:
Where:
- ( \alpha ) = Learning rate (step to take in some direction).
It is simultaneous update of ( \theta_0, \dots \theta_n ).
Intuition
Positive Slope:
- ( \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) )
- Move down the slope.
Negative Slope:
- ( \theta_1 := \theta_1 + \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) )
- Move up the slope.
Page 4 of 53
Linear Regression Model
Cost Function:
Partial Derivatives:
- For ( \theta_0 ):
- For ( \theta_1 ):
Gradient Descent always converges to global optimum for linear regression as ( J(\theta) ) is always a bowl-shaped function.
Linear Regression with Multiple Variables
Hypothesis:
Let ( x_0 = 1 ):
Where:
- ( X ) = [( x_0, x_1, \dots x_n )]
- ( \theta ) = [( \theta_0, \theta_1, \dots \theta_n )]
Page 5 of 53
Parameters and Cost Function
Parameters:
Cost Function:
Gradient Descent
Formula:
Repeat:
Simultaneously for (j = 0, 1, \dots, n).
Practical Implementation of Gradient Descent
a) Feature Scaling:
- If we have multiple features, making sure that all features are on the same scale makes convergence faster and better.
Example:
- ( x_1 ): ( (0-2000) ), ( x_2 ): ( (1-5) ) → Skewed
- ( x_1 ): ( (0-1) ), ( x_2 ): ( (0-1) ) → Less Skewed
Diagram:
graph LR
WithoutFeatureScaling --> Skewed
WithFeatureScaling --> LessSkewed
Method:
To get every feature into approximately a ( -1 \leq x \leq 1 ) range: Where:
- ( \mu_i ) → Mean of ( x_i )
- ( s_i ) → Range of values (( \text{max} - \text{min} )) or Standard Deviation
b) Learning Rate:
- ( \alpha ) should not be too small or too large.
- Convergence Test: ( J(\theta) ) should decrease after some iterations.
Diagram:
graph LR
JTheta --> NumberOfIterations
SmallAlpha --> SlowConvergence
LargeAlpha --> Divergence
Page 6 of 53
Polynomial Regression
Example:
Let’s take an example of housing price for linear regression:
Where:
- ( x_3 = \text{frontage} \times \text{depth} )
For quadratic polynomial:
This idea is called Polynomial Regression.
- Feature scaling becomes very important in Polynomial Regression.
Alternative to Gradient Descent: Normal Equations
Minimize ( J(\theta) ):
After deriving all, we get:
Complexity:
- ( O(n^{12}) ) for ( n = 10^6 )
Why not Normal Equations?
- Doesn’t work fast if there are large number of features (( n > 10^6 )).
- ( X^T X ) may (or may not) be invertible/singular:
Causes:
- Redundant Features (( x_1 = x_2 )).
- Too many features (( m \leq n )) → Requires Regularization.
Page 7 of 53
Classification Problems
Definition:
Predict ( y ) for discrete values.
Examples:
- Email: Spam / Not Spam
- Transactions: Fraud (Yes / No)?
Where:
- ( 0 ): "Negative Class"
- ( 1 ): "Positive Class"
Binary Classification Algorithm
Logistic Regression:
Classification:
Where ( g(z) ) is the Sigmoid Logistic Function:
Summary:
This section covers gradient descent, feature scaling, polynomial regression, normal equations, and binary classification with logistic regression.
References & Related Topics:
- Gradient Descent Optimization
- Polynomial Regression Techniques
- Regularization in High-Dimensional Spaces
- Logistic Regression for Binary Classification
Page 9 of 53
Hypothesis Function

We aim to fit parameters ( \theta ) into this function ( g(z) ).
What is ( h_\theta(x) ) and its significance?
It represents the estimated probability that ( y = 1 ) given the input ( x ), i.e.,
The probabilities satisfy:
Prediction Rule:
- Predict ( y = 1 ) if ( h_\theta(x) \geq 0.5 )
( \Rightarrow \theta^Tx \geq 0 ) - Predict ( y = 0 ) if ( h_\theta(x) < 0.5 )
( \Rightarrow \theta^Tx < 0 )
Property of Hypothesis Function:
This hypothesis function creates decision boundaries that separate ( x ) and ( y ), enabling classification.
Page 10 of 53
Cost Function
Linear Regression Cost Function:
For logistic regression, if we use the linear regression cost function:
However, this results in a non-convex cost function ( J(\theta) ), which is undesirable:
graph TD
A[Non-convex Cost Function] --> B[Convex Cost Function]
Logistic Regression Cost Function:
To avoid this, we use the following cost function:
Graphical Representation:
graph TD
C[Case 1 (y=1)] --> D[Cost reduces sharply as h_theta(x) approaches 1]
E[Case 2 (y=0)] --> F[Cost reduces sharply as h_theta(x) approaches 0]
Page 11 of 53
Combining Both Cases
To generalize the cost function for both cases:
This function has the property of convexity and uses maximum likelihood estimation.
Optimization:
To minimize the cost function ( J(\theta) ), we need to fit parameters ( \theta ):
We achieve this using gradient descent.
Page 12 of 53
Gradient Descent
Update Rule:
The parameters ( \theta_j ) are updated iteratively:
Partial Derivatives:
The derivative of ( J(\theta) ) with respect to ( \theta_j ) is:
Substituting ( h_\theta(x) = \frac{1}{1 + e^{-z}} ):
Further simplifications yield:
The iterative update ensures convergence to the optimal parameters.
Page 13 of 53
Mathematical Derivations
Derivative of Cost Function:
Other Optimization Algorithms
- Gradient Descent
- Conjugate Gradient
- BFGS
- L-BFGS
Advantages:
- No need to manually pick a learning rate.
- Often faster than gradient descent.
Disadvantages:
- More complex.
Multi-Class Classification Algorithms
- Examples:
- Email categorization: Tagging work, friends, family.
- Medical diagnosis: Not ill, cold, flu.
Binary Classification Diagram:
graph TD
A[Binary Classification] --> B[Decision Boundary]
B --> C[x1]
B --> D[x2]
D --> E[Points classified as class A and class B]
Page 14 of 53
Multi-Class Classification: Breaking into Sub-Problems
For multi-class classification, create new problems by separating different sets of data:
Diagrams:
Example 1:
graph TD
A[x1] --> B[Class A]
A --> C[Class B]
A --> D[Class C]
Example 2:
graph TD
A[x1] --> B[Class 1]
A --> C[Class 2]
A --> D[Class 3]
Summarized Approach:
Train a logistic regression classifier ( h_\theta^{(i)}(x) ) for each class ( i = 1, 2, ..., k ). Predict the probability that ( y = i ).
On a new input ( x ), make a prediction by picking the class:
Page 15 of 53
Regularization
I. Problem of Over-Fitting
For linear regression:
- Underfit (High Bias):
- Overfit (High Variance):
For logistic regression:
- Underfit (High Bias):
II. Addressing Over-Fitting
-
Reduce Number of Features
- Manually select which features to keep.
- Use model selection algorithms.
-
Regularization
- Keep all the features, but reduce the magnitude of parameters ( \theta_j ).
Page 16 of 53
Regularization: Cost Function
For linear regression:
Where ( \lambda ) is the regularization parameter.
Gradient Descent:
Repeat:
Where ( \alpha ) is the learning rate.
Normal Equation:
If ( \lambda > 0 ), then the above matrix is invertible.
Page 17 of 53
For Logistic Regression
The cost function ( J(\theta) ) for logistic regression:
Gradient Descent
Gradient descent is used for optimization, and is same as regularized linear regression.
Neural Networks
Why?
Consider a regression/classification problem with a large number of features ( n > 1000 ).
This might blow feature space as total number of possible combinations for all cubic terms is ( \binom{n}{3}^3 ).
Neural Networks: Algorithms that attempt to mimic the brain's structure.
Diagram:
Illustration of a neuron in the brain:
%% Placeholder for neuron structure diagram %% Diagram type: Architecture Diagram (INSERT IMAGE)
Page 18 of 53
Representation
Neuron Model
A logistic unit is represented as:
- Bias Unit ( x_0 = 1 )
Diagram of neuron model:
%% Placeholder for neuron model diagram %% Diagram type: Block Diagram (INSERT IMAGE)
Sigmoid (Logistic) Activation Function
The activation function is used for logistic regression and neural networks.
Neural Network (Artificial)
Illustration of a neural network:
%% Placeholder for artificial neural network diagram %% Diagram type: Architecture Diagram (INSERT IMAGE)
Layers
- Input Layer
- Middle Layer
- Output Layer
Notation
- ( a_j^{(i)} ): "activation" of unit ( i ) in layer ( j ).
- ( \theta^{(i)} ): matrix of weights controlling function mapping from layer ( j ) to layer ( j+1 ).
Page 19 of 53
Formulas and Dimensions
Activation formulas across layers:
Dimensions of ( \theta^{(i)} ):
- ( \theta^{(i)} ) has dimension ( s_{j+1} \times (s_j + 1) ).
Activation Calculation:
Where:
Page 20 of 53
Algorithm
For Classification:
For ( m ) training examples with ( l )-layers:
( s_j ): number of units (not counting bias unit) in layer ( j ).
1. Binary Classification
- Output Units:
- ( s_{l} = 1 )
- ( k = 1 )
2. Multi-Class Classification
- Output Units:
- ( s_{l} = k \quad (k \geq 3) )
Cost Function
For multi-class classification:
Goal:
Minimize ( J(\theta) ):
Page 21 of 53
Backpropagation Algorithm
(Calculates gradient efficiently with SGD uses)
Definitions
- ( S_j^{(l)} ): "Error" of node ( j ) in layer ( l ).
For layer ( l ):
[
S_j^{(l)} = (a_j^{(l)} - y_j) \cdot h\theta(x_j)
]
Diagram
graph LR
Layer1 --> Layer2
Layer2 --> Layer3
Layer3 --> Output
For Earlier Layers
[ S^{(3)} = (\theta^{(4)})^T \cdot S^{(4)} \cdot g'(z^{(3)}) \quad \text{where} \quad g'(z^{(3)}) = a^{(3)} \cdot (1 - a^{(3)}) ]
[ S^{(2)} = (\theta^{(3)})^T \cdot S^{(3)} \cdot g'(z^{(2)}) ]
Additionally: [ S_j^{(2)} \cdot \delta_j^{(2)} = \frac{\partial z_j^{(2)}}{\partial \theta_j^{(2)}} \quad \text{(at ( i ))}. ]
If ( \lambda = 0 ): [ \frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = a_j^{(l)} \cdot S_j^{(l+1)}. ]
Activation Function Derivative
[ g'(z^{(2)}) = g(z^{(2)}) \cdot (1 - g(z^{(2)})) = a^{(2)} \cdot (1 - a^{(2)}). ]
Page 22 of 53
Backpropagation (Again)
Definition
It is an algorithm for calculating the gradient for one training example.
Example
- Consider one training example.
- Recognize a 2 in the last layer.
Diagram
graph LR
Layer1 --> Output
Output[Recognizes 0, 2]
What Can We Do?
- Increase ( b ):
- Increase ( w_i ): In proportion to ( a_i ).
- Change ( a_i ):
Can't directly influence this one→ Corrected: Cannot directly influence this one.
Now, all desires of output neurons are added to previous weights & biases.
Apply it recursively → This is called backpropagation.
Practical Note:
In practice, it takes a long time to add up influences for every training example for every gradient.
Page 23 of 53
Stochastic Gradient Descent (SGD)
Key Point
SGD takes mini-batches and computes each step using batch gradient descent.
Gradient Descent
- Gradient descent: Will average all gradients for every training example.
Onto Calculus
Cost Function:
[
C_0 = (a^{(L)} - y)^{2}
]
Equations:
[
z^{(L)} = \theta^{(L)} a^{(L-1)} + b^{(L)}.
]
[ a^{(L)} = \sigma(z^{(L)}). ]
Diagram
graph TD
Input --> HiddenLayer1
HiddenLayer1 --> HiddenLayer2
HiddenLayer2 --> Output
Change in Cost w.r.t. Weights: [ \frac{\partial C_0}{\partial \theta^{(L)}} = \sum_{k=1}^{K} a^{(L-1)} \cdot \sigma'(z^{(L)}) \cdot [2(a^{(L)} - y)]. ]
Change in Cost w.r.t. Bias: [ \frac{\partial C_0}{\partial b} = 1 \cdot \sigma'(z^{(L)}) \cdot [2(a^{(L)} - y)]. ]
Page 24 of 53
New: Propagating Backwards
Backpropagation Algorithm (Summarized)
-
Training Set:
[ { (x^{(m)}, y^{(m)}), \ldots (x^{(m)}, y^{(m)}) }. ] -
Initialize: [ \Delta_{ij}^{(l)} = 0 \quad \text{(for all ( l, i, j ))}. ]
-
For ( i = 1 ) to ( m ):
- Set ( a^{(1)} = x^{(i)} ).
- Perform forward propagation to compute ( a^{(L)} ) for ( l = 2, 3, \ldots, L ).
- Compute ( C_0 = a^{(L)} - y^{(i)} ).
- Compute ( S^{(l)} ): [ S^{(l)} = a^{(l)} \cdot \sigma'(z^{(l)}). ]
-
Update: [ \Delta_{ij}^{(l)} = \Delta_{ij}^{(l)} + a_i^{(l)} \cdot S_j^{(l+1)}. ]
-
Compute Gradients:
- If ( j \neq 0 ): [ \frac{\partial J}{\partial \theta_{ij}^{(l)}} = \frac{1}{m} \Delta_{ij}^{(l)} + \lambda \theta_{ij}^{(l)}. ]
- If ( j = 0 ) (Bias unit): [ \frac{\partial J}{\partial \theta_{ij}^{(l)}} = \frac{1}{m} \Delta_{ij}^{(l)}. ]
Page 25 of 53
Implementation Notes
-
Matrix to Vector Conversion
- You can convert matrices to vectors and small streams as suited for needs.
-
Gradient Checking
- You can do gradient checking which can confirm if backpropagation is working or not: for ( i = 1 ) to ( n ).
Putting It All Together
-
Select a network architecture:
- Number of input units → Dimension of features ( n^{(i)} )
- Number of output units → Number of classes ( k )
- Number of hidden units → ( l ) (or more), and have same number of hidden units in every layer.
-
Randomly initialize weights
-
Implement forward propagation to get ( h_\Theta^{(i)} ) for any ( i ).
-
Implement code to compute cost function ( J(\Theta) ).
-
Implement backpropagation to compute partial derivatives ( \frac{\partial J}{\partial \Theta_{ij}^{(l)}} )
- For ( i = 1 ) to ( m ):
Perform forward and backpropagation using ( x^{(i)}, y^{(i)} ).
Get activations ( a^{(l)}, z^{(l)} ), and delta terms ( \delta^{(l)} ) for ( l = 2, 3, \dots, L ).
- For ( i = 1 ) to ( m ):
-
Use gradient checking to compare gradients, then disable gradient checking.
-
Use gradient descent or advanced optimization method with backpropagation to try to minimize ( J(\Theta) ) as a function of parameters ( \Theta ).
Page 26 of 53
Advice on How to Apply ML Algorithms
Debugging a Learning Algorithm
- Example:
Suppose for a problem (housing prices), we have implemented regularized linear regression.
However, when we test our hypothesis on a new set of houses, it makes unacceptably large errors in its predictions.
Options to Fix Them:
| Solution | Answer |
|---|---|
| Get more training examples | Fix high variance |
| Try smaller set of features | Fix high variance |
| Try getting additional features | Fix high bias |
| Try adding polynomial features | Fix high bias |
| Try decreasing ( \lambda ) | Fix high bias |
| Try increasing ( \lambda ) | Fix high variance |
- These may take a long time, so we use something called Machine Learning Diagnostic.
Evaluating a Hypothesis
-
Divide dataset into examples:
Randomized split:- Training set: ( 70% )
- Test set: ( 30% )
-
Learn parameters ( \Theta ) from training data
-
Compute test set error:
Page 27 of 53
Dividing Polynomial Parameters and Lambda (Model Selection)
Consider 10 Hypotheses:
- ( h_\Theta(x) = \Theta_0 + \Theta_1x )
- ( h_\Theta(x) = \Theta_0 + \Theta_1x + \Theta_2x^2 )
... - ( h_\Theta(x) = \Theta_0 + \Theta_1x + \Theta_2x^2 + \dots + \Theta_{10}x^{10} )
- If 5th hypothesis is selected, this will not generalize our model as it is practical evaluation only on test set.
Alternative Method:
-
Divide dataset into 3 parts:
- Training
- Cross-validation
- Test
-
Find all three errors.
-
Run all hypotheses (( h_\Theta )):
Pick lowest ( J(\Theta) ) for cross-validation set. -
Estimate generalization error for test set ( J_{test}(\Theta^{(l)}) ).
Page 28 of 53
IV. Diagnosing Bias vs Variance
Training Error:
Cross Validation Error:
Graphs:
flowchart TD
A[Bias (High Bias)] -->|Underfit| B[Error ↑]
C[Variance (High Variance)] -->|Overfit| D[Error ↑]
- Error vs Degree of Polynomial (d):
- High Bias: Both training error ( J_{\text{train}}(\theta) ) and cross-validation error ( J_{\text{CV}}(\theta) ) are high.
- High Variance: ( J_{\text{train}}(\theta) ) is low, but ( J_{\text{CV}}(\theta) ) is high.
If Model Suffers From Bias:
- ( J_{\text{train}}(\theta) ) ↑
- ( J_{\text{CV}}(\theta) ) ↑
- ( J_{\text{CV}}(\theta) \approx J_{\text{train}}(\theta) )
Else If Model Suffers From Variance:
- ( J_{\text{train}}(\theta) ) ↓
- ( J_{\text{CV}}(\theta) ) ↑
- ( J_{\text{CV}}(\theta) \gg J_{\text{train}}(\theta) )
Page 29 of 53
V. Regularization & Bias-Variance
Model:
Cost Function:
Bias-Variance Tradeoff:
- Large (\lambda): High Bias (Underfit)
- Intermediate (\lambda): Just Right
- Small (\lambda): High Variance (Overfit)
Example:
- (\lambda = 1000): ( \theta_1 \approx 0, \theta_2 \approx 0 )
- ( h_\theta(x) \approx \theta_0 )
Choosing Good Value for (\lambda):
- Experiment with values: (0, 0.01, 0.02, 0.04, \dots)
- Multiply by 2 → Calculate ( J_{\text{CV}}(\theta) )
Graph:
flowchart TD
A[High Bias] -->|Small Lambda| B[High Variance]
B -->|Optimal Lambda| C[Minimized Error]
Page 30 of 53
VI. Learning Curves
Training Error:
Cross Validation Error:
Graphs:
flowchart TD
A[High Bias] -->|No improvement| B[Error stays high]
C[High Variance] -->|Training data helps| D[Error reduces]
-
Learning Algorithm Suffers From High Bias:
- Will not perform well even with more training data.
-
Learning Algorithm Suffers From High Variance:
- Getting more training data is likely to help.
Page 31 of 53
VII. Selecting Neural Networks
Small Neural Networks:
- Fewer parameters
- Computationally cheaper
- More prone to underfitting
Large Neural Networks:
- More parameters
- Computationally expensive
- More prone to overfitting
- Use regularization to mitigate overfitting.
Page 32 of 53
Notes:
- Other than these algorithms, there are many more.
- Project will depend on data & capabilities.
VIII. Support Vector Machines (SVM)
Alternate View of Logistic Regression:
Cases:
- If ( y = 1 ), we want ( h_\theta(x) \approx 1 ), ( \theta^Tx \geq 0 )
- If ( y = 0 ), we want ( h_\theta(x) \approx 0 ), ( \theta^Tx \leq 0 )
Cost Function:
SVM Cost Function:
Graphs:
flowchart TD
A[Cost1(z)] -->|Case 1| B[Cost0(z)]
B -->|Regularization| C[New Cost Function]
Page 33 of 53
SVM Cost Function and Parameters
Cost Function
The cost function ( J(\theta) ) is defined as:
In another trade-off scenario:
Adding regularization parameters:
Hypothesis
The hypothesis ( h_\theta(x) ) is defined as:
Notes:
- If ( y=1 ), we want ( \theta^Tx \geq 1 ) (not just ( > 0 )).
- Or some cutoff value.
- If ( y=0 ), we want ( \theta^Tx \leq -1 ) (not just ( < 0 )).
Safety Factor in SVM
- This allows SVM to incorporate a safety factor into the model.
- Mathematically, it enables SVM to achieve a better decision boundary (large margin).
- Hence, it's called Large Margin Classifier.
Important Note:
- In cases of outliers, ( C ) must not be too large for the large margin classifier.
Page 34 of 53
Kernels in SVM
Introduction to Kernels
Given ( x_1 ), compute new features depending on proximity to landmarks ( l^{(1)}, l^{(2)}, \dots, l^{(m)} ).
Gaussian Kernel
For a given ( x ):
- ( f_1 = \text{similarity}(x, l^{(1)}) = \exp\left(-\frac{|x - l^{(1)}|^2}{2\sigma^2}\right) )
- ( f_2 = \text{similarity}(x, l^{(2)}) )
- ( f_3 = \text{similarity}(x, l^{(3)}) )
Kernels (Gaussian Kernel):
SVM with Kernel
- Given: ( x^{(1)}, x^{(2)}, \dots, x^{(m)} ) and ( y^{(1)}, y^{(2)}, \dots, y^{(m)} ).
- Choose landmarks: ( l^{(1)}, l^{(2)}, \dots, l^{(m)} ).
- Compute:
- ( f_1 = \text{similarity}(x, l^{(1)}) )
- ( f_2 = \text{similarity}(x, l^{(2)}) ), etc.
- Predict:
- ( y = 1 ) if ( \theta^T f > 0 ).
- Training:
Page 35 of 53
SVM Parameters and Implementation Notes
Kernel Limitations
- Kernels don't perform well with logistic regression due to computational overhead (it's slow in logistic regression).
SVM Parameters:
- C: Trade-off parameter
- Large ( C ):
- Lower Bias, High Variance.
- Small ( C ):
- Higher Bias, Low Variance.
Variance and Bias:
- Large ( \sigma^2 ):
- Features are very smooth, higher bias, lower variance.
- Small ( \sigma^2 ):
- Lower bias, higher variance.
Implementation Notes:
- Case I: ( n ) large relative to ( m ):
- ( n = 10,000 ), ( m = 10-1000 ).
- Use logistic regression or SVM without kernel (efficient).
- Case II: ( n ) small, ( m ) intermediate:
- ( n = 1-1000 ), ( m = 10-10,000 ).
- Use SVM with Gaussian kernel.
- Case III: ( n ) small, ( m ) large:
- ( n = 1-1000 ), ( m = 50,000 ).
- Create/add more features, then use logistic regression or SVM without kernel.
Note:
- Neural networks are likely to work well for most settings but may be slower to train.
Page 36 of 53
Unsupervised Learning
Clustering:
- Training Set: ( x_1, x_2, \dots, x_m )
Algorithm: K-Means
- Input:
- ( K ) (Number of Clusters) → To be decided.
- Training Set: ( x^{(1)}, x^{(2)}, \dots, x^{(m)} ).
- Algorithm Steps:
- ( x^{(1)} \in \mathbb{R}^n ) (Assume ( x_0 = 1 ) convention).
- Randomly Initialize: ( K ) cluster centroids ( \mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n ).
Page 37 of 53
K-Means Algorithm
Algorithm Steps
- Repeat:
-
For (i = 1) to (m):
- ( c^{(i)} ): Index (from 1 to (k)) of cluster centroid closest to (x^{(i)}).
-
For (k = 1) to (K):
- ( \mu_k ): Average (mean) of points assigned to cluster (k).
-
Other Applications of K-Means
- Non-separated clusters:
- Example: T-shirt sizing based on height and weight.
- Clusters: Small (S), Medium (M), Large (L).
- Choose (k = 3).
graph TD
A[Height] --> B[Small (S)]
A --> C[Medium (M)]
A --> D[Large (L)]
Optimization Objective
-
Cluster Assignment:
- ( c^{(i)} ): Index of cluster ((1, 2, \ldots, k)) to which example (x^{(i)}) is currently assigned.
- ( \mu_{c^{(i)}} ): Cluster centroid of cluster to which example (x^{(i)}) has been assigned.
-
Mathematical Formulation:
Page 38 of 53
Random Initialization
-
Guidelines:
- (k < m).
- Randomly pick (k) training examples.
- Set (\mu_1, \ldots, \mu_k) equal to these (k) examples.
-
Initialization:
- Multiple runs to avoid bad luck.
- Example:
- Run 1 → 100 iterations.
- Randomly initialize (k)-means.
- Compute cost function.
- Choose initialization with lowest cost function.
Dimensionality Reduction
1. Why Dimensionality Reduction?
i) Data Compression:
- Objective: Reduce data dimensions (e.g., 2D → 1D).
- Example:
- (x_1) → height (cm).
- (x_2) → weight (kg).
graph TD
A[Data (2D: Height & Weight)] --> B[Compressed Data (1D)]
ii) Data Visualization:
- Problem:
- (x^{(i)} \in \mathbb{R}^n) is not helpful.
- Reduce data from (50D) to (2D).
- Result: (z^{(i)} \in \mathbb{R}^2).
Algorithm: Principal Component Analysis (PCA)
Problem:
- Reduce (2D \to 1D).
- Find a direction ((c)-vector) onto which to project the data to minimize projection error.
graph TD
A[Original Data (2D)] --> B[Projected Data (1D)]
- Generalized Idea:
- Reduce (nD \to kD).
- Find (k) vectors onto which to project the data to minimize projection error.
Page 39 of 53
Data Preprocessing
Training Set:
- (x^{(1)}, x^{(2)}, \ldots, x^{(m)}).
- Feature Scaling & Mean Normalization:
- Mean:
Note:
- Different features on different scales (e.g., number of bedrooms vs. area).
- Scale features to have comparable ranges.
PCA Steps
1. Compute Covariance Matrix:
2. Compute Eigen Vectors:
- Use eigen decomposition to find vectors:
3. Reduce Dimensions:
- Select top (k) eigenvectors:
4. Project Data:
Choosing (k) (Number of Dimensions)
Metrics:
-
Average squared projection error:
Page 41 of 53
Dimensionality Reduction (PCA)
Key Idea:
Typically, choose ( K ) to be the smallest value so that:
99% of variance is retained.
Algorithm:
-
Try PCA with ( K = 1 ):
- Compute variance:
Efficient Code:
Given Singular Value Decomposition (SVD):
Where ( \Sigma ) is:
For given ( K ):
- Check:
Select ( K ) accordingly.
Page 42 of 53
Reconstruction from Compressed Representation
The compressed representation ( Z ) is given by:
Reconstruction formula:
Advice for Implementation:
-
For a supervised learning dataset:
Note:
Mapping ( x^{(i)} \to z^{(i)} ) should be derived by running PCA only on the training set.
Page 43 of 53
III. Anomaly Detection
Problem:
Given dataset:
Is ( x_{\text{test}} ) anomalous?
Decision Rule:
-
Compute:
-
If:
Gaussian Distribution:
Assume ( x \in \mathbb{R}^n ) is a distributed Gaussian with mean ( \mu ) and variance ( \sigma^2 ).
Probability density function:
Where:
Page 44 of 53
Algorithm
-
Training Set:
Evaluation
-
On a
carefully skewed→ labeled training set (carefully ensured), training is done and evaluated using:- True Positives
- False Positives
- False Negatives
- True Negatives
-
Metrics:
- Precision/Recall
- F1-score
Additionally, use cross-validation to check parameter ( \epsilon ).
Why Not Use Supervised Learning?
Comparison:
| Anomaly Detection | Supervised Learning |
|---|---|
| Very small no. of positive examples (( y = 1 )) | Large positive and negative examples |
| Large no. of negative examples (( y = 0 )) |
Applications:
- Anomaly Detection:
- Fraud Detection
- Manufacturing (e.g., aircraft engines)
- Data center monitoring machines
- Supervised Learning:
- Email spam classification
- Weather prediction
- Cancer classification
Page 45 of 53
Choosing Good Features to Use
Note: Non-Gaussian features (if included in finding anomalies) will work just fine.
But some transformations can help to make the algorithm work a bit better:
- Example:
- ( \text{new} = \log(\text{old}) )
- ( \text{new} = \text{old}^{\frac{1}{2}} )
Also, to detect anomalies, choose features that might take unusually large or small values in the event of an anomaly.
Multivariate Gaussian Distribution
- ( x \in \mathbb{R}^m ): Don't model ( p(x_1), p(x_2), \ldots ) etc. separately.
- Model ( p(x) ) all at once.
- Parameters:
- ( \mu \in \mathbb{R}^m ): Mean vector
- ( \Sigma \in \mathbb{R}^{m \times m} ): Covariance matrix
The probability density function is given by:
Visual representation:
- Ellipses based on covariance matrix ( \Sigma = \begin{bmatrix} 0.6 & 0.9 \ 0.9 & 1.0 \end{bmatrix} )
- Used to create different contours (curves) for anomalies.
Page 46 of 53
Special Applications
IV. Recommender Systems
These are algorithms aimed at suggesting relevant items to users.
Problem:
Given ( n ) users and ( m ) movies, out of which some are rated by users, develop an algorithm to generate ratings and recommend movies to users.
a) Content-Based Recommender Systems
Example Matrix:
| Movie | ( U_1 ) | ( U_2 ) | ( U_3 ) | ( U_4 ) | Genre 1 | Genre 2 |
|---|---|---|---|---|---|---|
| ( M_1 ) | 5 | 5 | 0 | 0 | 0.29 | 0.01 |
| ( M_2 ) | 5 | ? | ? | ? | 0.99 | 0.0 |
| ( M_3 ) | ? | 4 | ? | 5 | 1.0 | 1.0 |
| ( M_4 ) | 0 | 0 | 5 | ? | 0 | 0.9 |
For each user ( j ), learn a parameter ( \theta^{(j)} \in \mathbb{R}^{n+1} ).
User ( j ) assigns a rating to movie ( i ) with ( \theta^{(j)} \cdot x^{(i)} ), where ( x^{(i)} ) is the feature vector of movie ( i ).
Cost Function:
Page 47 of 53
Generalization:
To learn ( \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(m)} ):
Gradient Descent Update:
For ( k \neq 0 ):
b) Collaborative Filtering (Low-Rank Matrix Factorization)
Features in the content are difficult to find, e.g., YouTube.
If we happen to know ( \theta^{(i)} ) for every user ( j ), we can use that to find ( x^{(i)} ).
Objective:
Given ( \theta^{(i)}, \theta^{(j)} ), to learn ( x^{(i)} ):
Page 48 of 53
Problem Formulation:
Formulate a problem to converge to the most efficient point:
Calls ( \theta \to x \to \theta \to x \to \theta \to x ).
Efficient way: Minimize ( x^{(i)}, \theta^{(j)} ) simultaneously rather than going back and forth.
Cost Function:
Algorithm:
- Initialize ( x^{(i)}, \theta^{(j)} ) to small random values.
- Minimize ( J(x^{(i)}, \theta^{(j)}) ) using gradient descent (or advanced optimization algorithms):
Page 49 of 53
Large Scale Machine Learning
Let's say ( m = 10^8 ).
Now, every gradient descent update may be computationally expensive (Batch gradient descent).
Approach 1:
Let's try on ( m = 10^3 ) and plot error vs ( m ).
Below are the observations:
- High variance: Increase in ( m )
- High bias: Increasing ( m ) might not help.
Approach 2: Stochastic Gradient Descent
Cost function:
Training cost:
Steps:
- Randomly shuffle the dataset.
- Repeat:
for i = 1, ..., m { θ_j := θ_j - α * ( h_θ(x^(i)) - y^(i) ) * x^(i)_j for (j = 0, ..., n) }
Page 50 of 53
It doesn't just get to the global minimum, but pretty close to the hypothesis, which actually is a good model in the practical world.
Approach 3: Mini-Batch Gradient Descent
Use ( b ) examples in each iteration:
( b = \text{mini-batch size} ) (usually ( b = 2 - 100 ))
Steps:
Repeat {
for i = 1 to m/b {
θ_j := θ_j - α * (1/b) * ∑_{k=1}^{b} ( h_θ(x^(k)) - y^(k) ) * x^(k)_j
(for every j = 0, ..., n)
}
}
Optimal mini-batch size: ( b = 10 )
Mini-batch works faster than stochastic gradient descent only when good vectorized implementation is there.
Ensuring Stochastic Gradient Descent Convergence:
- During learning, compute ( \text{cost}(\theta, (x^{(i)}, y^{(i)})) ) before updating ( \theta ) using ( (x^{(i)}, y^{(i)}) ).
- Plot ( \text{cost}(\theta, (x^{(i)}, y^{(i)})) ), averaged over the last 1000 examples processed by the algorithm.
- Check if stochastic is converging.
Some options:
- Decrease ( \alpha ) slowly w.r.t iterations.
Page 51 of 53
Applications
- Online Learning: Learn through a single example or batch of examples and discard them.
- Helps in coping with changes in user preferences.
- Large data stream keeps running.
Approach 4: Map Reduce and Data Parallelism
What about training on multiple machines because the training dataset is too large for a single machine?
Map-Reduce Diagram:
graph TD
A[Training Set] --> PC1
A --> PC2
A --> PC3
A --> PC4
PC1 --> B[Combine Results]
PC2 --> B
PC3 --> B
PC4 --> B
Training formula:
Divide each PC by 100.
Page 52 of 53
Application Example:
Photo OCR Pipeline
Pipeline Diagram:
graph TD
A[Photo OCR Pipeline]
A --> B[Text Detection]
A --> C[Character Segmentation]
A --> D[Character Recognition]
B --> E[Face Recognition Pipeline]
E --> F[Remove Background]
F --> G[Face Detection]
G --> H[Segmentation]
H --> I[Logistic Regression]
I --> J[Label]
Page 53 of 53
Reinforcement Learning
Instead of (input, correct output),
we get (input, some output, grade for this output).
Application:
- Playing games
References & Related Topics
- Gradient Descent Algorithm
- Multiple Linear & Polynomial Regression
- Logistic Regression
- Convex Optimization
- Multi-Class Classification
- Neural Network Architecture
- SVMs
- Anomaly Detection, Recommender Systems