Page 1 of 91
1. Introduction: Computer Vision
- Computer programs
that can intercept images→ that can interpret images. - Understanding the representation of images.
Applications
- Images & movies
- Example: Surveillance
- Building 3D representations → Growth, Movies
- Motion capture → Pirates of Caribbean, CGI
- OCR and Face Recognition
- Scanner → Adobe, Syskonaut
- Object Recognition
- Special effects & 3D Modelling
- Smart Cars
- Sports (Cricket, Soccer)
- Vision-Based Interaction → MS Kinect, Nintendo Wii
- Security and Surveillance
Note:
Image Processing & Computer Vision are different.
Computational Models (Math)
graph TD
A[Computational Models (Math)] --> B[Algorithm]
A --> C[Real Images]
Page 2 of 91
2. Image Processing for CV
- Image is a function of I, from ( R^2 ) to ( R ):
- Usually, image = smooth function.
- ( f(x, y) ) gives the intensity or value at position ( (x, y) ).
- Practically, digital image data is a matrix ( f: [a, b] \times [c, d] \rightarrow [I_{min}, I_{max}] ) with a finite range.
Note:
- 0 to 255 is atomic measurement (0-1 should be better).
For Color Images
-
Image Blending:
( g(n) = (1 - \alpha) f_0(n) + \alpha f_1(n) )
dst = cv2.addWeighted(img1, 0.7, img2, 0.3, 0) -
Pixel = picture element.
Digital Images
- In computer vision, we typically operate on digital images.
- Sample the 2D space on a regular grid.
- Quantize each sample (rounded to the "nearest integer").
- Images are thus represented as a matrix of integers.
Noise in Images
where ( N(x, y) ) = noise.
Types of Noise
- Salt & Pepper: random occurrences of black & white pixels.
- Impulse: random occurrences of white pixels.
- Gaussian: variations in intensity drawn from a Gaussian (normal) distribution.
where (\sigma) depends on the range of image.
output = img + noise
Page 3 of 91
Image Filtering
Used to:
- Enhance images
- Extract information
- Detect patterns
Removing Noise from an Image (Moving Average, 1-D)
Approach:
Replace each pixel with an average of all the values in its neighbourhood.
Assumptions
- The "true" value of pixels are similar to the true values of pixels nearby.
- The noise added to each pixel is done independently.
Weighted Moving Average (1-D)
- Can add weights to our moving average.
- Usually use an odd number of weighted mask.
- Eg: [1, 4, 6, 4, 1] to all nearby neighbourhood.
For 2-D
Moving Average
- Just take some ( k \times k ) or ( 3 \times 3 ) matrix and average it.
Correlation Filtering – Uniform Weights
Say the averaging window size is ( (2k+1) \times (2k+1) ):
(Nested Loop, averaging)
Page 4 of 91
Correlation Filtering – Non-uniform Weights
This is called cross-correlation, denoted ( G = H \bigotimes F ).
- ( H[u, v] ) is called filter "kernel" or "mask", which is just the matrix of weights in the linear combination.
What makes a good kernel?
Approach 2: Gaussian Filter
This kernel is an approximation of the Gaussian function.
Note:
- Variance (( \sigma^2 )) or standard deviation (( \sigma )) determines extent of smoothing.
- Also, size of kernel (( u, v )) is not variance.
Impulse Noise
- In discrete world, impulse is just a value of I at a single location.
Page 5 of 91
While in continuous world, impulse is an idealized function that is very narrow and very tall so that it has a unit area.
graph TD
A[discrete pulse] --> B[continuous impulse]
Convolution
- ( G = H * F )
- Also, ( Img * Impulse = Img )
Properties of Convolution
a) Linear & shift-invariant
b) ( f * g = g * f ) (Commutative)
c) ( (g * f) * h = g * (f * h) ) (Associative)
d) ( f * e = f ) (Identity)
Differentiation:
A Big Note:
What would be the computational complexity?
( W \times N \times N ) to get the convolution (filter * image)
So, if some filter is separable, we can exact convolution by:
( W \times N + N \times N + W \times N + N = 2 W \times N + N \times N \ll W N^2 )
Page 6 of 91
It can be done because
Boundary Issues
What happens when filter falls off the edge?
- Case 1: Create big image
- Case 2: Create same image
- Case 3: Create smaller image
graph TD
A[Big Image] --> B[Filter]
C[Same Image] --> B
D[Small Image] --> B
Methods (in OpenCV) [for same image]:
cv2.BORDER_CONSTANT: just clips out extra & replace with some colorcv2.BORDER_WRAP: circular wrappingcv2.BORDER_REPLICATE: last element is replicated throughoutcv2.BORDER_REFLECT: border will be mirror reflection of the border elements
Page 7 of 91
Now, talking about filter for noises like salt & pepper noise, impulse noise:
Approach 3: Median Filter
- Select a kernel (odd × odd)
- Sort all the numbers
- Select mid element
- Replace it
It is also called edge preserving because:
graph TD
A[Input]
B[Median filter]
C[Mean filter]
A --> B
A --> C
(Mermaid graph approximates the input/median/mean comparison)
Page 8 of 91
Template Matching
Now, we move on to finding properties of pixels locally in the image.
- 1D (nx) Correlation:
%% Sequence Diagram for 1D Correlation
sequenceDiagram
participant Signal
participant Filter
Signal->>Filter: Correlation
Filter->>Signal: Normalized cross-correlation (maximum value)
-
Why maximum value is attained at the point they match completely?
- (w × w) = positive
- (−w × (−w)) = (+w)
- Well, the filters max out each signal when they match exactly.
-
We can now use this trick to find objects in the image (Template Matching).
Limitation:
- This doesn't work well for different scales & orientations. - So it's just an approach of using filters as templates.
Page 9 of 91
Edge Detection: Gradients
- Edges in an image are important; they can easily convey information with reduced images.
- Edges = reduced set of pixels or curves that are important elements of the picture.
- Edges look like steep cliffs/changes, so look for a neighborhood with strong signs of change.
But there is ambiguity:
- Neighborhood size
- How to detect change
Change is derivative (an edge is a place of rapid change in image intensity function):
graph TD
A[Image] --> B[Intensity function]
B --> C[First derivative]
C --> D[Edge corresponds to extreme derivative]
-
To find these peaks (extrema of derivative), we run it through some operators (differentiated operators to be exact).
- Differential operators: when applied to some image, returns some derivatives.
- Model these "operators" as masks/kernels that compute the image gradient function.
- Threshold the gradient function to select the edge pixels.
Page 10 of 91
But, what is gradient?
- Gradient is a vector
that is made up by the partial derivatives→ made up of the partial derivatives.
The gradient of an image:
- In other words, the gradient points in the direction of most rapid increase in intensity & magnitude of vector is how much it's changing as a function of a unit step in that direction.
For computers, we calculate discrete gradient:
Similarly,
Some Operators
- Sobel Operator:
- It is a joint Gaussian smoothing plus differentiation operator; it is more resistant to noise.
Page 11 of 91
Constraints to Edges
Primary Edge Detection Steps
- Smoothing derivatives to suppress noise and compute gradient.
- Threshold to find "regions" of "significant" gradient.
- "Thin" to get localized edge pixels.
- Link or connect edge pixels.
So, to do that, we use the edge detection algorithm. Its stages are:
-
Filter image with derivative of Gaussian.
-
Find magnitude & orientation of gradient.
- Edge gradient (( \alpha )):
Some Concerns
- What is the value of ( \sigma ) in Gaussian filter?
- Canny with ( \sigma = 1 ) (fine features/detailed edge maps)
- Canny with ( \sigma = 2 ) (coarse features/large scale edges)
Page 12 of 91
- What is the kernel size to be taken?
Mathematics
- ( f ) → image function
- ( h ) → filter (Gaussian)
For 1-D
graph TD
A[Noisy Image] --> B[Apply filter h]
- Apply derivative operator:
(noisy derivative signal)
So, we need to remove noise.
graph TD
A[h] --> B[f]
B --> C[Convolution]
C --> D[Peak detection]
Page 13 of 91
Also,
graph TD
A[dh/dn] --> B[f]
B --> C[Convolution]
To find peaks, we take more derivatives:
- Inverted Mexican hat
For 2-D
- Gaussian:
- ( \frac{\partial}{\partial n} ) becomes tedious.
- ( \frac{\partial^2 h}{\partial (n, y)^2} ) is Mexican hat.
Page 14 of 91
- So, we use something called Laplacian operator.
Laplacian Operator
- Centered difference operator (second derivative).
Parametric Models
- It can represent a class of instances where each is defined by a value of the parameters.
- e.g., lines, circles, parameterized template
Fitting a Parametric Model
- Choose a parametric model to represent a set of features.
- Membership criterion is not local.
- Computational complexity is important.
Line Fitting
- It's difficult to find a line just by looking at the points, so we use voting (.,.) (Hough Transform).
- It is a general technique where we let the features vote for all models that are compatible with H.
- Loop through features, each casting vote for model parameters.
- Look for model parameters that receive a lot of votes.
Page 15 of 91
Hough Space
-
Image Space: y = mx + b
-
Hough (parameter) space: b = -x₀m + y₀ (for a point (x₀, y₀))
- This point represents a line.
-
A point in image space is a line in Hough space.
-
A line in the image corresponds to a Hough space point.
graph TD
A[Image Space (points)] --> B[Hough Space (lines)]
C[Hough Space (point)] --> D[Image Space (line)]
Algorithm:
- Let each edge point in image space vote for a set of possible parameters in Hough space.
- Accumulate votes in discrete set of bins; parameters with the most votes indicate line in image space.
A little extra:
-
If line is vertical, m → ∞, so we use polar representation of a line to prevent any mathematical errors.
- d = perpendicular distance from line to origin
- θ: angle the line makes with x-axis
Page 16 of 91
The polar representation of image causes the Hough space to have sinusoidal waves for each point.
Basic Hough Transform Algorithm
- Initialize ( H[d, \theta] = 0 )
- For each edge point in image function ( F ):
- For θ = 0 to π (180):
- ( d = x \cos \theta + y \sin \theta )
- ( H[d, \theta]++ )
- For θ = 0 to π (180):
- Find the value(s) of ( (d, \theta) ) where ( H[d, \theta] ) is max.
- The detected line is given by ( x \cos \theta + y \sin \theta = d ).
Space: O(dθ) → Time: ( O(nθ) )
Space: ( O(k) ) (number of parameters)
There are some extensions to the above algorithm that help with real line segments etc.
I. Using the gradient (knowing the alignment of (x, y) in image space)
- Instead of using θ = 0 to π, use θ = grad. of α (−35 to 55)
II. Give more votes for stronger edges.
III. Vary the sampling of (d, θ) to give more/less resolution.
IV. The same procedure can be used with circles, squares or any other shape.
Page 17 of 91
Circles
Consider a circle with center ((a, b)) and radius (r):
Method 1: For a Fixed Radius
- If (r) is a fixed radius:
Image Space to Hough Space (Parameter Space)
flowchart LR
A[Image Space: (x, y)] --> B[Parameter Space: (a, b)]
- The left diagram shows image space, the middle shows parameter space (with dense votes), and the right shows the center found as the intersection for a fixed radius.
- The diagrams in the original note are not easily representable in Mermaid, so please refer to the original images for specifics.
- (INSERT IMAGE)
Methods (for Unknown Radius)
But in the real world, we don't know the radius, so:
flowchart LR
x_a[x_a] --- y_b[y_b]
x_a --> C[Parameter b]
y_b --> D[Parameter a]
- The parameter space increases drastically (RANSAC slows Hough).
- (INSERT IMAGE)
Method 3: Unknown (r), Use Gradient (Reduces Search Space)
- Vote only along a line or circle (when gradient is known).
flowchart LR
X(x, y) --> R[Radius known]
X --> S[Radius unknown: circle]
R --> O[Vote on line]
S --> P[Vote on circle]
-
Cases:
- If (r) is known: vote along a line
- If (r) is unknown: vote along a circle
-
(INSERT IMAGE for the gradient-based voting)
Page 18 of 91
Algorithm
For every edge pixel (x, y):
For each possible radius value r:
For each possible gradient direction θ:
// If θ is the estimated gradient
a = x + r * cos(θ)
b = y + r * sin(θ)
H(a, b, r) += 1
Generalized Hough Transform
Non-Analytic Models
- Parameters express variation in position or scale or fixed but arbitrary shape (that was then).
Visual Codeword-Based Features
- Not edges, but detected templates learned from models (this is "now").
Returning to Generalized Hough Transform:
- In this, voting is done via building a Hough table.
How to build Hough table?
Page 19 of 91
Training
- At each boundary point, compute displacement vector:
- Measure the gradient angle at the boundary point.
flowchart TD
C[Reference Point (c)]
P1[p1]
P2[p2]
C -- r1, θ1 --> P1
C -- r2, θ2 --> P2
- (INSERT IMAGE for the displacement diagram)
Recognition
- At each boundary point, measure the gradient angle .
- Look up all displacements in displacement table.
- Vote for a center at each displacement.
Algorithm
If orientation is known:
- do recognition point.
Else:
For each edge point:
For each possible master θ':
Compute gradient direction θ
Nearest θ': θ - θ'
For θ', retrieve displacement vectors to vote for reference point.
- Peak in this table (over ) is reference point with most supporting edges.
Page 20 of 91
Modern Approach
- Instead of using indexed displacements by gradient orientation, index by visual codeword.
flowchart LR
Type1[Type] --> P[•] --> Type2[Type]
P --> Recognize[Recognize: Car]
- (INSERT IMAGE for codeword diagram)
Training
- Build codebook of patches around extracted (interest points) using clustering.
- Map the patch around each interest point to closest codebook entry.
- For each codebook entry, store all displacements relative to object center.
Recognition
- Find all the "interest point" patches & identify visual codeword.
- Look up all the displacements in codebook.
- Vote for center at each displacement.
Page 21 of 91
Aliasing
Sampled Representation
-
Sampling: How to store & compute with continuous functions.
-
If we do
undersampling→ undersample,
information is lost. -
But, it was indistinguishable from lower frequency.
flowchart LR
A[Original Signal] --> B[Sampling] --> C[Reconstructed Signal]
-
(INSERT IMAGE for signal and sampling diagrams)
-
"Recall" guesses what the function did in between (reconstruction).
- Sampling in digital audio
- converter → [digital samples] → converter → [audio signal]
Page 22 of 91
Aliasing
- Happens when there are not enough samples.
- This makes
causes→ different signals to become indistinguishable when sampled.
How to Minimize This?
Anti-Aliasing
- Sample more often.
- But this can't go on forever.
- Make the signal less "wiggly"
- Cut off higher frequencies
- Will lose information
- Use low pass filters (if image is stored already).
Methods
Impulse Train
Comb function (in 1D):
- if , otherwise
graph LR
A[comb(x)] --> B[Impulse spikes at multiples of M]
- (INSERT IMAGE for comb diagrams)
Fourier Transform (scaling property)
- (INSERT IMAGE for FT of comb)
Page 23 of 91
Impulses in 2D
Fourier:
Sampling Low Frequency
- Diagram: function and its sampled & reconstructed forms.
graph LR
A[f(x)] --> B[sampled] --> C[reconstructed]
-
(INSERT IMAGE for function and sampling diagrams)
-
Multiply by :
- Convolution in frequency:
If there is no overlap in , then the original signal can be recovered from its samples by low-pass filtering.
Page 24 of 91
Sampling High Frequency
- Apply an anti-aliasing filter.
- Then do comb function multiplication.
(cuts)
Aliasing in Images
-
Resizing image
- Use Gaussian filter (anti-alias)
- Then do 1 of row & column (sub-sampling)
-
(INSERT IMAGE for sub-sampling process)
Page 25 of 91
Image Pyramids
-
There are two kinds of image pyramids:
- Gaussian Pyramid
- Laplacian Pyramid
-
We use these pyramids when we need to work with images of different resolutions of the same image.
Generating a Gaussian Pyramid
- Blur (convolve with Gaussian)
- Downsample (reduce image size by half)
- Upsample (double image size) - optional
Generating a Laplacian Pyramid
graph LR
L0[Gaussian L0] --> L1[Gaussian L1] --> L2[Gaussian L2] --> L3[Gaussian L3]
L1 --> Lap1[Laplacian L1]
L2 --> Lap2[Laplacian L2]
L3 --> Lap3[Laplacian L3]
- (INSERT IMAGE for pyramid diagrams)
Page 26 of 91
Reduce & Expand
- Reduce:
-
(5-tap filter)
-
Expand:
-
(3-tap filter)
-
(INSERT IMAGE for reduce/expand diagrams)
Page 27 of 91
Fourier Transform & Image Representation
Earlier times in 80s, working on image was studying frequency stuff and simpler. Still, some things need to be understood for CS in 2020.
Fourier Transform
Every image can be decomposed into a set of things that describe it.
- An important image processing tool used to decompose an image into its sine and cosine components.
Decomposing on Image Basis Sets
Let be a finite subset of a vector space over a field (such as real or complex numbers). Then is a basis if it satisfies:
Linear Independence
For all , if , then necessarily .
graph TD
v1[v1]
v2[v2]
a1[a1]
a2[a2]
a1 --> v1
a2 --> v2
v1 --linear combination--> v2
- (INSERT IMAGE for basis/independence diagram)
Spanning Property
For every in it is possible to write
- We need a good basis set. One is Fourier basis set.
Fourier Theory
Any periodic function can be written as a sum of cosines and sines of different frequencies.
Example: Square wave
Page 28 of 91
Decomposing a Signal
- To decompose a signal, we want to understand the frequency of our signal; phase is irrelevant for CV if we are not interested in reconstructing the image.
flowchart LR
f(x)[Image f(x)] --> FT[Fourier Transform] --> F(u)[F(u) = A \sin(\omega x + \phi)]
- (INSERT IMAGE for FT block diagram)
Fourier Transform Equations
- gives us the basis set.
How?
- (Independence)
- gives the winding machine exactly like signal beam.
- Integrating it (by dividing by ) gives center of mass, and dividing by gives...
Page 29 of 91
Discrete Fourier Transform (DFT)
- Where is cycles per period (periodic signal).
Thus, we get the basis vector.
Fourier Transform & Convolutions
Let :
- Convolution in spatial domain multiplication in frequency domain
Page 30 of 91
- So, it's not uncommon to calculate & , multiply them & then use IFT to get .
- FFT is Fast Fourier Transformation.
Smoothing & Blurring
- But, is multiplied instead of dividing.
Thus, attenuates higher frequency. (Use IFT)
Properties
- Linearity:
- Scaling:
- Differential:
Page 31 of 91
Camera Models & Views
New Image
A new image is a 2D representation of a 3D plane.
How to form an image?
a) Pinhole Perspective
flowchart TD
Object --> Aperture[Control Hole] --> Film
Film --> Inverted[Inverted Image]
- Aperture size → Crisper image
- If hole is too small, diffraction occurs.
- Aperture → Blurry image
b) Thin Lens
flowchart TD
P[p] --> Lens --> P'[p']
P --> Y[y]
P' --> Y'[y']
Page 32 of 91
Focus
flowchart TD
Object --> Lens --> Image
Depth of Field (DOF)
- Wide aperture lower DOF
- Small aperture higher DOF
flowchart TD
Object --> Lens[Wide Aperture] --> Image
Object --> Lens2[Small Aperture] --> Image2
- (INSERT IMAGE for DOF diagrams)
Field of View (Zoom)
- Focal length Field of view but magnification
- = sensor size or sensor width
flowchart TD
d[d] --> Lens --> phi[φ]
- (INSERT IMAGE for FOV diagram)
Page 33 of 91
Zooming & Moving
Zooming & moving are not the same.
Problems of Lenses
- Lenses are not perfectly thin
1. Geometric Distortion
flowchart LR
A[No distortion] --- B[Pin cushion] --- C[Barrel]
- No distortion
- Pin cushion
- Barrel
2. Radial Distortion
3. Chromatic Aberration
flowchart LR
Lens((Lens)) -- Red --> FocalPointR[Red Focus]
Lens -- Blue --> FocalPointB[Blue Focus]
4. Vignetting
- Dark edges around the corners in two lens systems.
Page 34 of 91
Perspective Imaging
Distorting → Perspective Imaging
Modeling Projection – Coordinate System
flowchart LR
Camera(Camera) -- x-axis -->
PF((Image Plane))
A((A)) -- (x,y,z) --> PF
B((B)) -- (x',y',-d) --> PF
Assuming +z to be image plane (virtual).
- Distant objects are smaller.
- Now, ( x' ) & ( y' ) is not a linear transformation.
So add one more coordinate:
- ((x, y) \rightarrow \begin{bmatrix} x \ y \end{bmatrix})
- ((x, y, z) \rightarrow \begin{bmatrix} x \ y \ z \end{bmatrix})
Converting from homogeneous coordinates:
- for ( z=1 ): ( x' = x/w, \quad y' = y/w )
- for ( z=0 ): ( x'=x/w, \quad y'=y/w, \quad z'=z/w )
Page 35 of 91
3D Points to 2D Image Form
flowchart LR
A((x, y, z, 1)) -- Matrix_H --> B((x', y', w'))
B -- DivideByW --> C((x'/w, y'/w))
Here now, H is homogeneous.
- In perspective imaging, all parallel lines meet at a vanishing point.
- e.g., train tracks
Mathematically, a line in 3-space is represented as:
As ( t \rightarrow \infty ), ( x'(t) \rightarrow \frac{fa}{c} ), ( y'(t) \rightarrow \frac{fb}{c} ).
So, parallel (do not) lines converge to the same point.
Other Projection Operators
3D point position:
- Perspective: ((x, y, z) \rightarrow \left(\frac{fx}{z}, \frac{fy}{z}\right))
- Weak Perspective: ((x, y, z) \rightarrow \left(\frac{fx}{z_0}, \frac{fy}{z_0}\right))
- Orthographic: ((x, y, z) \rightarrow (x, y))
Page 36 of 91
Stereo Geometry (Geometry of Multiple Views)
Visit: vision.middlebury.edu/stereo
The depth of a scene point along with structure is not directly accessible in a single image.
Because we have information when connecting an image from 3D to 2D.
flowchart LR
Eye1((P1))
Eye2((P2))
ScenePoint((Scene Point))
Eye1 -- line --> ScenePoint
Eye2 -- line --> ScenePoint
OC((Optical Center))
Eye1 -- OC
Eye2 -- OC
The trick to solving this would be to look at it through a different perspective, e.g., two eyes in animals (us).
But, can I find depth with a single eye? (Can I mean human?)
Hey dumb-head, there are other factors such as
- shading, shadow, texture, focus/defocus, motion
Also, we have two eyes:
Stereo:
- The image from one eye is different than the image from the other eye.
- Think of shape from motion or from multiple views.
- Infer 3D shape of scene.
Page 37 of 91
Basic Geometry
flowchart LR
Left((Optical Axis L)) -- line --> P((P))
Right((Optical Axis R)) -- line --> P
Left -- line --> xL((x_L))
Right -- line --> xR((x_R))
ImagePlane((Image Plane))
Left -- ImagePlane
Right -- ImagePlane
Let ( h_2 = x_L ), ( h_2 = x_R )
Now, ((c_L, p_L, h_2)) & ((c_R, p_R, c_R)) are similar.
So,
Where ( x_L - x_R = ) disparity, and ( \text{disparity} \propto \frac{1}{\text{depth}} )
Now, we need to find disparity but we don't know either ( x_L ) or ( x_R ).
We will talk about epipolar geometry.
Page 38 of 91
Epipolar Constraint
flowchart LR
O((O)) -- line --> P((P))
O -- line --> Pp((P'))
EpipolarLine((Epipolar line))
Pp -- EpipolarLine
- Given a calibrated stereo rig, the set of possible matches for the point ( p ) is constrained to lie on the associated epipolar line ( l' ).
How do we solve the correspondence problem?
- Matching point in ( O' )
No, there are still a lot of points on the line.
We need other "soft" constraints to help identify corresponding points:
- Similarity
- Uniqueness
- Ordering
- Disparity gradient is limited
Similarity
Assume image regions for the matches are similar in appearance.
For each pixel/window in the left image:
- Compare with every pixel/window on the same epipolar line in right image.
- Pick position with minimum match cost (SSD, norm, correlation).
Page 39 of 91
More Constraints
- Cross-correlation might not work well. It depends a lot on window size.
2. Uniqueness
- No more than one match in right image for each window in left image.
3. Ordering
- abc in left image is abc in right image.
- Won't always hold (e.g., occluded transparent object).
Example: Close one eye each time, with both index fingers in front of nose and in a line.
There are a lot of algorithms solving it, but there are still some challenges that stereo isn't possible to solve:
- Low-contrast, textureless image regions
- Occlusions
- Violation of brightness constancy (e.g., specular reflections)
- Larger baseline
- Camera calibration errors
Algorithms
- Scanline (DP) - Does not have stretching models.
- Full 2-D grid as energy minimization:
Page 40 of 91
Geometric Camera Calibration
Coordinate Transforms
flowchart LR
WorldCoords[World Coords]
CameraCoords[Camera Coords]
FilmCoords[Film Coords]
PixelCoords[Pixel Coords]
WorldCoords -- Pattern/Translation --> CameraCoords
CameraCoords -- Projection --> FilmCoords
FilmCoords -- PixelScale --> PixelCoords
We want to use the camera to get the true x, y, z of the world. So we need the relationship between coordinates of the world & coordinates in the image.
I. Extrinsic Camera Parameters
- From arbitrary world coordinate system to the camera's 3D coordinate system. (Camera pose)
II. Intrinsic Camera Parameters
- From the 3D coordinates in the camera frame to the 2D image plane via projection.
flowchart LR
World((World Coord))
Camera((Camera Coord))
World -- extrinsic --> Camera
A rigid body has 6 degrees of freedom (cube).
flowchart TD
Cube((Cube))
Cube -- 6 DOF -->
Translation
A point A in world coordinate is translated to point B in camera coordinate with a relation:
(Origin A expressed in B frame)
Page 41 of 91
Matrix Multiplication for Transformation
We can express the same equation using matrix multiplication with the help of homogeneous coordinates (i.e., linear transformation):
Rotation
Also, by homogeneous coordinates:
Commutativity: ❌
Now, let's do a total rigid transformation:
- A point ( A ) in system is rotated and then translated (offset) in the A system to get B system.
Page 42 of 91
Using Homogeneous Coordinates for Transformation
flowchart LR
AP[Part in Camera Form] -- Transformation --> BP[Part in World Form]
TransformationMatrix[Extrinsic Parameter Matrix]
AP -- TransformationMatrix --> BP
- 6 degrees of freedom
Intrinsic Camera Parameters
flowchart LR
Camera((Camera))
ImagePlane((Image Plane))
Camera -- projection --> ImagePlane
Origin of camera coordinate system is at the corner of the viewing and not at its center. So, ( u_0 ) & ( v_0 ), 4 degrees of freedom.
(Principal point)
Page 43 of 91
Skew in Camera Coordinate System
Also, the camera coordinate system may be skewed due to some manufacturing error, so angle ( \theta ) b/w two image axes is not equal to 90°.
So, 5 degrees of freedom.
Combining Extrinsic & Intrinsic Calibration Parameters
(pixel coords ← world coords)
Page 44 of 91
Camera Calibration
Where:
Intrinsics | Projection | Rotation | Translation
Calibration using Known Points (Direct Linear Transformation)
- Place a known object in the scene
- Identify correspondence between image and scene
- Compute mapping from scene to image
Remove scale/degenerate cases
Now, we need to find all 11 degrees of freedom. It is done using SVD tricks.
- Find the ( m ) that minimizes ( |Am| ) subject to ( |m|=1 )
- Let ( A = UDV^T ) (SVD, diagonal)
Translation: minimizing ( |UDV^T m| )
But, ( |UDV^T m| = |DV^T m| ) & ( |m|=1 \implies V^T m ) is unit vector.
Thus, minimize ( |DV^T m| ) subject to ( |V^T m|=1 )
Let ( y = V^T m ), so ( |Dy| ) is minimized s.t. ( |y|=1 ).
Page 45 of 91
Eigen Decomposition
Also, D is diagonal with decreasing values. So, ( |Dy| ) minimum is when ( y = (0,0,...,1)^T ).
Since ( y = V^T m ), ( m = Vy ) as ( V ) is orthogonal (( V^{-1} = V^T ))
This gives last column in V (eigen decomposition).
The final step is:
- Given ( Am=0 ), find the eigenvector of ( A^T A ) with smallest eigenvalue, that's ( m ).
After finding ( M ), we should be able to find things like the camera center from ( M ).
Two ways:
- Pure way
- Easy way
II. Multi-plane Calibration
Advantages:
- Requires a plane only (checker-board)
- Don't need to know positions/orientations
- Direct code available in OpenCV
Zhengyou Zhang (Goal of Calibration)
Page 46 of 91
Multiple Views
Image to Image Projections
I. Affine Transformation
-
6 degrees of freedom
-
Transformation: 2
-
Rotation: 1
-
Scale: 1
-
Shear: 2
II. Homography
(Divide by ( w ))
- Homography: 8 degrees of freedom (This transformation maps between any two planes)
Homographies and Mosaics
Rather than thinking of this as a 3D reprojection on 2D plane, we usually think of it as a 2D image warp from one image plane to another image plane.
flowchart LR
Plane1 -- warp --> Plane2
Page 47 of 91
Panoramas and Homography
This gives an idea of panoramas.
How to Stitch Together a Panorama
- Take a sequence of images from the same position, rotate the camera about its optical center.
- Compute transformation b/w second image plane & first.
- Transform the second image to overlap with the first.
- Blend the two together to create a mosaic (repeat if more than two).
flowchart LR
Camera[Camera]
Plane1[Image Plane 1]
Plane2[Image Plane 2]
Camera --rotate--> Plane1
Camera --rotate--> Plane2
Plane1 --transform--> Plane2
- The mosaic has an anatomical overlap/interpolation in 3D.
- The images are reprojected onto a common plane.
- The mosaic is formed on this plane.
For views upto 180°, beyond that need a sequence that bends the way or map onto a different surface, say, a cylinder.
How to Solve Homography?
- Since homography is transformation b/w any two planes:
(8 degrees)
Page 48 of 91
Solving for Homography
- Requires at least 4 points for 8 eqs.
Non-homogeneous way
Solve for ( h ) by min( ||Ah - b||^2 ) using least-squares.
Homogeneous way
- Solve using SVD.
Applications of Homography (Not mosaics, dummy dumb)
i) Video motion (computation of camera motion)
ii) Image rectification
- Rectifying scanned views (like CamScanner does)
flowchart LR
Skewed[Skewed Image]
Rectified[Rectified Image]
Skewed --rectify--> Rectified
How do we do this?
Page 49 of 91
In case of homogeneous coordinates
-
Line joining two points ( p_1 ) & ( p_2 ) is given by:
Projective line (image plane)
A line is a plane of rays through the origin, defined by the normal for
All rays ( (x, y, z) ) satisfying:
Point & Line Duality
-
A line ( l ) is a homogeneous 3-vector
%% Flowchart: Projective points, lines and intersection
graph TD
P1([p1])
P2([p2])
L([l = p1 x p2])
L1([l1])
L2([l2])
P([p = l1 x l2])
P1 -- "join" --> L
P2 -- "join" --> L
L1 -- "intersection" --> P
L2 -- "intersection" --> P
%% Block Diagram: Incidence of point p on lines l1 and l2
graph LR
l1([l1])
l2([l2])
p([p])
l1 -- "incident" --> p
l2 -- "incident" --> p
(go to top now)
Page 50 of 91
Essential Matrix
Consider again the stereo problem, a 3D world point:
%% Sequence diagram: Stereo camera geometry
sequenceDiagram
participant O_L as O_L (Left Camera)
participant O_R as O_R (Right Camera)
O_L->>P: projects to p_L
O_R->>P: projects to p_R
In case of calibrated cameras:
Let, ( E = T \times R )
or
Called essential matrix.
- ( E x ) is a vector ( l ) (line)
Thus, focuses a point, lies on an epipolar line.
Page 51 of 91
Fundamental Matrix
Motivation:
- Estimate epipolar geometry from a set of point correspondences between two uncalibrated cameras.
We know,
Camera (\rightarrow) Image
- This gives the ray in space (as it is homogeneous).
And for two cameras:
( F ) is the epipolar line in the image associated with ( p_{im,left} ) and vice versa (( l' = F p )).
Page 52 of 91
Properties of Fundamental Matrix
-
Epipoles found by ( Fp = 0 ) & ( F^T p = 0 )
- A point which exists on every epipolar line.
- (In parallel image planes), epipoles are at infinity.
-
( F ) is singular (mapping from homogeneous 2-D point to 1-D family is rank 2).
- If we estimate fundamental matrix from correspondences in pixel coords, can we construct epipolar geometry without intrinsic or extrinsic parameters.
How to find it? (FM gives relation b/w any two views that give an idea of video motion.)
-
Each point generates one constraint or single eqn on ( F ):
-
Requires at least 8 points to solve for ( F ) (9 variables, divide 1 variable).
-
Now, do OLS or orthogonal least squares.
-
8 variables left, 8 equations.
- Most famous: SVD approach (very minimal error)
- To enforce SVD (same rank of ( F )):
Page 53 of 91
Applications of Stereo
- Stereo Rectification
- Photo Synthesis
Summary
- For 2 views, there is a geometric relationship between rays in one view to rays in the other (\rightarrow) epipolar geometry.
- These relationships can be captured algebraically as well:
calibrated Essential Matrix→ calibrated: Essential Matrixuncalibrated Fund. matrix→ uncalibrated: Fund. matrix
- This relation can be estimated from point correspondences.
Another problem?
How to find these point correspondences?
- Manually
- Math
We will use features to find these.
Page 54 of 91
Computer Vision
Features & Matching
Introduction to "features":
To get a transformation of one image to another image, we need corresponding points and we do that using "local features". → get a transformation from one image to another, we need corresponding points, and we do that using "local features".
Goal: Find points that are:
- Found in other images
- Found precisely
- Found reliably
Characteristics of good features
- Precision/Repeatability
The same feature can be found in several images despite geometric & photometric transformations. - Saliency/Matchability
Each feature has a distinctive property. - Compactness/Efficiency
Many fewer features than image pixels.
Page 55 of 91
Finding Corners
+----------+ +---------+ +---------+
| Flat | | Edge | | Corner |
| | | | | |
| | | | | |
+----------+ +---------+ +---------+
No change No change Significant change
in all along the in all direction
directions edge direction with small shift
(INSERT IMAGE: schematic diagrams for flat/edge/corner)
How, in this eqn, we want to look for small shifts (localization).
Second-order Taylor expansion of ( E(u, v) ) about (0, 0):
For 1D:
For 2D:
Page 56 of 91
Now,
- ( E_u(0, v) = I_y )
- ( E_v(u, 0) = I_x )
Putting all in eqn, we get:
Let
So,
- Investigating it, (consider a slice of ( E(u, v) )):
Page 57 of 91
The eqn is of ellipse.
- The gradient vectors as a set of (dx, dy) points with a center of mass defined as being at (0, 0) fit an ellipse to the set of pairs via scatter matrix.
Now, again, consider the ( M ) matrix (second moment):
Case I
- Gradients are either horizontal or vertical but not slanted for ellipse
Case II
- Gradient is only horizontal or only vertical
- either ( \lambda_1 = 0 ) or ( \lambda_2 = 0 )
- (Not a good corner!)
Note:
( \lambda_1 ) & ( \lambda_2 ) determine the corner based on the property of diagonalization of ( M ) matrix.
How? By concluding them as eigenvalues.
Page 58 of 91
- ( \lambda_1 ) & ( \lambda_2 ) are eigenvalues of ( M ).
Case III
- ( \lambda_1 ) & ( \lambda_2 ) are too small
- Flat region
Case IV
- ( \lambda_1 \gg \lambda_2 ) or ( \lambda_2 \gg \lambda_1 )
- Edge
Case V
- ( \lambda_1 ) & ( \lambda_2 ) are too large
- ( \lambda_1 \approx \lambda_2 )
- Corner
But, Harris algorithm makes use of "response function":
and ( \alpha ) constant (( 0.04 ) to ( 0.06 ))
So,
- ( |R| ) small → flat
- ( R < 0 ) → edge
- ( R > 0 ) → corner
Harris detector: Algorithm
- Compute gaussian derivatives at each pixel.
- Compute second moment matrix ( M ) in a gaussian window around each pixel.
- Compute corner response function ( R ).
- Threshold ( R ).
- Find local maxima of response function (non-maximum suppression).
Page 59 of 91
Thus, we get good features on both images.
Properties of Harris detector:
- Invariant to Rotation
- Intensity – invariant to addition
Threshold issue in multiplication, but invariant→ Threshold issue in multiplication, but invariant - Not invariant to scale
How to do scale invariance?
Soln: (Idea)
Design a function on the region, which is "scale invariant", not affected by the size but will be the same for corresponding regions even if they are at different scales.
One method:
At a point, compute the scale-invariant function across different size neighborhoods (different scales).
%% XY Chart: Scale-space neighborhood
xychart
"region size S1" : 1, 2, 3, 4, 5
"image scale 1" : 2, 3, 4, 5, 6
"region size S2" : 1, 2, 3, 4, 5
"image scale 2" : 3, 4, 5, 6, 7
(INSERT IMAGE for scale-space neighborhood plot)
We get scale sizes ( S_1 ) & ( S_2 ).
Page 60 of 91
Another problem, we need a good function for scale-invariant detection, that has one stable sharp peak.
%% XY Chart: Good vs bad function for scale invariance
xychart
"bad" : 0, 1, 1, 1, 1
"bad" : 0, 1, 0, 1, 0
"good" : 0, 0, 1, 0, 0
(INSERT IMAGE for function response curves)
A good function would be one which responds to contrast.
Laplacian of Gaussian - LoG
It is a little costly to find.
So instead we use DoG, that is essentially the same:
(Difference of Gaussian)
Both kernels are invariant to scale & rotation.
Page 61 of 91
Now, we get interest points but we still need to match them, that is the work of descriptor.
SIFT
- Motivation:
The Harris operator was not invariant to scale (but we fixed that in 2004) and co-relation was not invariant to rotation
(normal brute-force matching).
Develop→ Done- Detector → invariant to scale & rotation
- Descriptor → robust to variations in viewing conditions.
- Idea of SIFT
Match enough SIFT features that are invariant to scale, transformation, photometry, etc. to do computer vision recognition.
- Overall SIFT procedure
- Scale-space extrema detection (or use Harris, Laplacian, or other method)
- Keypoint localization
- Orientation assignment
- Keypoint description
Page 62 of 91
Orientation Assignment
An orientation is assigned to each keypoint to achieve invariance to image rotation.
%% Sequence diagram: Orientation assignment in keypoints
sequenceDiagram
participant I1 as Image 1
participant FP as FP
participant I2 as Image 2
participant FP1 as FP'
I1->>FP: assign orientation
I2->>FP1: assign orientation
Note right of FP: Matching is done as orientation is known
Keypoint descriptors
Next is to compute a descriptor for the local image region about each keypoint that is:
- Highly distinctive
- As invariant as possible
But first normalization:
- Rotate to standard orientation
- Scale the window size
%% Block diagram: SIFT feature vector extraction
flowchart LR
A[feature point] --> B[16x16 neighbourhood]
B --> C[Histogram with 8 bins = 128 bin values]
C --> D[SIFT feature vector]
(INSERT IMAGE for feature descriptor schematic)
Page 63 of 91
Now, we know how to describe them but how about matching? (We’re still not done 😊)
Approach 1: ( n^2 ) Brute force
Approach 2: Nearest Neighbours
- k-D tree
- Best Bin First
Approach 3: Wavelet-Based Matching (Hashing)
- Compute a short (3-vector) descriptor from the neighbourhood
- Quantize each value into 10 (overlapping) bins ( 10^3 ) entries
3D Object Recognition
Train:
- Extract outliers with background subtraction
- Compute "keypoints" = interest points & descriptors
Test:
- Find possible matches
- Search for consistent solution (such as affine)
Page 64 of 91
Feature-Based Alignment
Overall strategy
- Compute features: detect → Harris, describe → SIFT
- Find some useful matches – kd-tree, BBF (Best Bin First)
- Compute & apply the best transformation – Affine, Homography
Algorithm
- Extract features
- Compute putative matches → "closest descriptor"
(Not known how to do) - Loop until happy:
- Hypothesize transformation ( T ) from some matches
- Verify transformation (search for other matches consistent with ( T )) – mark best
Here, step 2 isn’t encountered yet, so teaching it now.
If we pursue exhaustive search, taking Nearest Neighbour technique to give the best match, how do we know it’s a good one?
Page 65 of 91
How do we know a good match is found?
Feature Space Outlier Rejection
- Let's not match all features, but only those that have "similar enough" matches.
- How can we do it?
- SSD (patch1, patch2) < threshold
- But how to threshold?
It unknown→ also unknown
Probability Density for Match Quality
%% Flowchart for match thresholding
graph LR
A[Probability Density]
B[Good Matches]
C[Incorrect Matches]
D[Not a good threshold]
A --> B
A --> C
C --> D
-
A point's best matches are
sometimes correct→ sometimes correct, sometimes not. -
So, we use Lowe's method to use 2-NN and compare it with 1-NN.
%% Flowchart for Lowe's Method
graph LR
A[Probability Density]
B[Threshold]
C[1-NN / 2-NN]
D[Increased Error]
A --> B
B --> C
C --> D
Page 66 of 91
Outlier Removal and RANSAC
Even when picking the best match, still lots of wrong matches ("outliers") remain, because we did the distinctive property thing.
- Essentially, we need to fit a model such that the effect of outliers is minimized.
RANSAC (Random Sample Consensus)
- Randomly pick some points to define your line (model). Repeat enough times until you find a good line (model) - one with many inliers.
- It somehow copes with a large proportion of outliers.
- Every general model has a minimal set:
- For line: it's 2
- For image transformations: s points
- Translation: One pair of matched points
- Homography (for plane): 4
- Fundamental Matrix: 8 point pairs (in reality, 7)
Page 67 of 91
RANSAC Algorithm & Parameters
Algorithm
- Randomly select s points (or point pairs) to form a sample.
- Instantiate the model.
- Get consensus set ( C_i ) - the points within error bounds (distance threshold) of the model.
- If ( |C_i| > T ), terminate & return model.
- Repeat for N trials, return model with max ( C_i ).
Parameters
- ( s ): Initial no. of points (known)
- Distance threshold ( t ):
Known→ Unknown- Choose so prob. for inlier is high (95%)
- Remain Gaussian noise within std. dev. ( t^2 = 3.84 )
- Number of trials ( N )
- Choose ( N ) so that, with probability ( p ), at least one random sample set is free from outliers (e.g., ( p = 0.99 ))
- Need to set ( N ) based upon the outlier rate
Page 68 of 91
Calculating ( N )
-
( s ): no. of points to compute solution
-
( p ): prob. of success
-
( e ): proportion of outliers, so inliers = ( (1-e) )
-
Probability sample set with all inliers: ( (1-e)^s )
-
Probability sample set will have at least one outlier:
-
Probability all N samples have outlier:
- Lastly, do re-compute least-squares H estimate on all of the inliers.
Page 69 of 91
Adaptive Algorithm
- Set ( N = \infty ), sample_cnt = 0, ( e = 1.0 )
- while ( N > ) sample_cnt:
- choose s & count the no. of inliers
- set ( e_0 = 1 - \frac{no. ; of ; inliers}{no. ; of ; points} )
- if ( e_0 < e ), set ( e = e_0 ) & re-compute ( N ) from ( e ):
Page 70 of 91
PHOTOMETRY
Factors affecting image:
- Lights, Surfaces & Shadows
- Reflections
- Refractions
- Inter-reflections
- Scattering
Surface Appearance
%% Block Diagram for Surface Appearance
graph TD
Source --> SurfaceElement --> Sensor
SurfaceElement -->|Normal| SurfaceElement
- Image Intensity = f (Observed surface reflectance, illumination)
- Surface reflection depends on both the viewing & illumination directions.
Page 71 of 91
Introduction to Motion
- A video is a sequence of frames captured over time — usually quickly.
- ( I(x, y, t) )
- x, y: space
- t: time
Motion Applications
- Background Subtraction
- Remove background that is not moving.
- Shot boundary detection
- Motion Segmentation
- Segment the video into multiple coherently moving objects.
- Goals: Manufacture information in motion like brain does.
- Estimating 3D structure
- Learning dynamic models
- Recognizing events & activities
- Improving video quality
Page 72 of 91
Motion Estimation Techniques
I. Feature-based methods (Done ✓)
- Extract visual features, track them over multiple frames.
- Sparse motion fields, but more robust tracking.
- Suitable when image motion is large.
II. Direct, dense methods
-
Directly recover image pattern at each pixel from "spatio-temporal image brightness variation".
-
Dense motion fields but sensitive to appearance variations.
-
Suitable for video & when image motion is small.
-
State of the art → Combination of both above.
Dense Flow: Brightness Constraint
i) Optic Flow
- It is the apparent motion of objects or surfaces.
Problem setup:
%% Sequence Diagram for Optic Flow
sequenceDiagram
participant Img1 as I(x,y,t)
participant Img2 as I(x,y,t+1)
Img1->>Img2: Track motion of points
Page 73 of 91
How to estimate pixel motion: from image 1 to 2
- Solve pixel correspondence problem:
- Given a pixel in ( I(x, y, t) ), look for nearby pixels of the same color in ( I(x, y, t+1) )
- This is the optic flow problem.
Assumptions
-
color constancy:
Combining both eq.
$$
I(x+u, y+v, t+1) - I(x, y, t) = 0
$$
$$
I_x u + I_y v + I_t \approx 0
$$
$$
I_t + I_x u + I_y v = 0
$$
Page 74 of 91
I_t + \nabla I \cdot \langle u, v \rangle = 0
(Brightness constancy constraint eq.) - 2 unknowns but 1 eqn (\( u, v \)) - Similar to: \( a u + b v + c = 0 \) - \( I_x, I_y, I_t \) - [So optical flow is "constrained" to be on a line] Intuitively, this constraint means that: - The component of the flow in gradient direction is determined (called Normal Flow) \( I_x, I_y \) - The component of the flow parallel to an edge is unknown (called aperture problem) --- > Page 75 of 91 ```mermaid %% Illustration for aperture problem (cannot be coded, so placeholder) %% (INSERT IMAGE) ``` - So, optical flow can't be computed from just brightness constancy constraint; we need additional constraints such as: 1. Smooth optical flow (motion should be consistent in whole image) - Formulate error in optical flow constraint: $$ e_c = \iint_{img} (I_x u + I_y v + I_t)^2 dx dy $$ 2. Smoothness constraint: $$ e_s = \iint_{img} (u_x^2 + u_y^2 + v_x^2 + v_y^2) dx dy $$ - Find \( (u, v) \) at each img point that minimizes: $$ e = e_s + \lambda e_c $$ --- > Page 76 of 91 ## Local Methods: Lucas-Kanade - The flow field is smooth locally. - Method: Pretend the pixel's neighbors have the same \( (u, v) \). - For a 5x5 window, 25 eqns, 2 unknowns. - Do least squares & solve: $$ A d = -b $$ $$ A^T A d = -A^T b $$ $$ \begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix} $$ - Tip: is solvable when \( A^T A \) is invertible. - Which means \( \lambda_1 \) & \( \lambda_2 \) should not be too small. - \( \lambda_1 \) must not be too large compared to \( \lambda_2 \), where \( \lambda_1, \lambda_2 \) are eigenvalues. --- > Page 77 of 91 ### Errors in Lucas-Kanade (Because of assumptions) - A point doesn't move like its neighbors → motion segmentation - The motion is large (larger than a pixel) → Taylor expansion doesn't hold - Non-linear, iterative refinement - Total minimize cost to fine estimate - Brightness constancy doesn't hold → exhaustive neighborhood search required #### Iterative Refinement Algorithm 1. Estimate velocity at each pixel by solving Lucas-Kanade equations. 2. Warp \( I_t \) towards \( I_{t+1} \) using the estimated flow field. 3. Repeat until convergence. ```mermaid %% Placeholder for refinement path illustration %% (INSERT IMAGE) ``` --- > Page 78 of 91 ## How to eliminate the problem of large motion? - Reduce the resolution until it moves only in range of 1 px motion. - Coarse image → fine image ### Hierarchical LK 1. Create a Gaussian pyramid - At top, it is a blurry picture (coarse) 2. Run iterative LK on the top-level images ```mermaid %% Block Diagram for Image Pyramid graph TD A[Coarse Image] --> B[Finer Image] B --> C[Finest Image] ``` - Move downwards & repeat until you get the exact motion Before moving forward, let's take a detour to learn about aliasing & image pyramids. --- > Page 79 of 91 ## Formal Algorithm 1. Compute iterative LK at level K. 2. Init. \( u_{K+1}, v_{K+1} = 0 \) at size of level K+1. 3. For each level \( K \) to 0: - Upsample (expand) \( u_{K+1}, v_{K+1} \) to create \( u'_K, v'_K \) (flow field of new twice resolution as level K) - Multiply \( u'_K, v'_K \) by 2 to get predicted flow - Warp level K Gaussian version of \( I_2 \) as per predicted flow to create \( I'_2 \) - Apply LK between \( I'_2 \) & level K Gaussian version of \( I_1 \) to get \( u^s_K, v^s_K \) (corrections in flow) - Add corrections: $$ u^t_K = u'_K + u^s_K \\ v^t_K = v'_K + v^s_K $$ --- > Page 80 of 91 ## Sparse LK - The Lucas-Kanade algorithm described gives a dense field: \( (u, v) \) everywhere. - But we said that we only want to solve LK where eigenvalues are well-behaved. - Sparse LK ⇒ hierarchical LK applied to good feature locations. ## Today's State of the Art Methods Start with something similar to LK: - + gradient constancy - + region matching - + energy minimization with smoothing term - + keypoint matching ## Motion Models There are 4 types of motion in real world: 1. Translation (2 unknowns) 2. Rigid (similarity) (4) 3. Affine (6) 4. Homography/Perspective (8) --- > Page 81 of 91 Now, in real worldV = \omega \times R + T
\begin{bmatrix} V_x \ V_y \ V_z \end{bmatrix}
\begin{bmatrix} 0 & -\omega_z & \omega_y \ \omega_z & 0 & -\omega_x \ -\omega_y & \omega_x & 0 \end{bmatrix} \begin{bmatrix} X \ Y \ Z \end{bmatrix} + \begin{bmatrix} V_{Tx} \ V_{Ty} \ V_{Tz} \end{bmatrix}
x = f \frac{X}{Z}
y = f \frac{Y}{Z}
\frac{dx}{dt} = u = v_x = f \frac{V_x}{Z} - x \frac{V_z}{Z}
\frac{dy}{dt} = v = v_y = f \frac{V_y}{Z} - y \frac{V_z}{Z}
\begin{bmatrix} u(x, y) \ v(x, y) \end{bmatrix}
\frac{1}{Z(x, y)} A(x, y) T + B(x, y) \Omega
A(x, y) = \begin{bmatrix} f & 0 & x \ 0 & f & y \end{bmatrix}
B(x, y) = \begin{bmatrix} \frac{xy}{f} & -\left(f + \frac{x^2}{f}\right) & y \ \left(f + \frac{y^2}{f}\right) & -\frac{xy}{f} & -x \end{bmatrix}
--- > Page 82 of 91 # Introduction to Tracking - Recognizing some object through frames in different timelines is called tracking. Some great tracking examples: 1. Surveillance camera 2. Camera tracking under occlusion ## Tracking Challenges One idea to do tracking is ~~optic~~ → optical flow (between) first & second, second & third & so on, but: i) It's hard to compute optical flow. ii) There can be large displacements since objects could be moving rapidly - Probably dynamics need to be taken into account ~~would do it~~ → to address this. iii) Occlusions --- > Page 83 of 91 - **Shi-Tomasi Feature Tracker** **Algorithm:** - From frame site to frame, track with LK and a pure translational model. - Robust for small displacements. 2. Check consistency of trackers by affine registration to the first (or earlier) observed instance of the feature. - Affine model is more accurate for larger displacements. - Comparing to the first or early frames helps to minimize drift. > This above method will fail for large displacements. - **Tracking with Dynamics** Key idea: Given a model of expected motion, predict where objects will occur in the next frame even before seeing the image. - Restrict search for the object. - Improved estimates since measurement noise is reduced by trajectory smoothness. --- > Page 84 of 91 ## Assumptions - Objects do not disappear and reappear in different places in the scene. - Camera is not moving instantaneously to a new viewpoint. - Gradual change in pose between camera and scene. ## Tracking as Inference Hidden State (X): True parameters we care about Measurement (Y): Noisy observation of underlying state At each time step *t*, state changes (from \( X_{t-1} \) to \( X_t \)) and we get a new observation \( Y_t \). Our goal: Estimate distribution of state \( X_t \) given 1. all observations seen so far 2. knowledge about dynamics of state transitions ## Steps of Tracking i) **Prediction:** What is the next state of the object given past measurements?P(X_t \mid Y_0, Y_1, \ldots, Y_{t-1}, U_{1:t}, Y_t)
--- > Page 85 of 91 ii) **Correction:** Compute an updated estimate of the state from prediction & measurementP(X_t \mid Y_0, Y_1, \ldots, Y_{t-1}, Y_t = y_t)
Tracking: The process of propagating this posterior distribution of state given measurements across time. ### Some Assumptions i) Only the immediate past matters:P(X_t \mid X_0, X_1, \ldots, X_{t-1}) = P(X_t \mid X_{t-1})
P(Y_t \mid X_0, X_1, \ldots, X_{t-1}, X_t) = P(Y_t \mid X_t)
(Observation model) --- > Page 86 of 91 ## Tracking as Induction **Base case:** Assume we have some initial prior that predicts state in the absence of any evidence \( p(X_0) \). - At the first frame, correct this, given value of \( Y_0 \). ### Prediction Step Given: \( P(X_{t-1} \mid y_0, \ldots, y_{t-1}) \) Guess: \( P(X_t \mid y_0, \ldots, y_{t-1}) \) Law of Total Probability - Marginalization:\int P(X_t, X_{t-1} \mid y_0, \ldots, y_{t-1}) dX_{t-1}
= \int P(X_t \mid X_{t-1}, y_0, \ldots, y_{t-1}) P(X_{t-1} \mid y_0, \ldots, y_{t-1}) dX_{t-1}
= \int P(X_t \mid X_{t-1}) P(X_{t-1} \mid y_0, \ldots, y_{t-1}) dX_{t-1}
### Correction Step Given predicted value \( P(X_t \mid y_0, \ldots, y_{t-1}) \) & \( y_t \): ComputeP(X_t \mid y_0, \ldots, y_t) \propto P(y_t \mid X_t, y_0, \ldots, y_{t-1}) P(X_t \mid y_0, \ldots, y_{t-1})
P(y_t \mid X_t)
--- > Page 87 of 91 # Linear Dynamics Model ## Dynamics Model State undergoes linear transformation, plus Gaussian noiseX_t \sim N(D_t X_{t-1}, \Sigma_{dt})
- Normal distribution, mean, Gaussian noise/covariance ## Linear Measurement Model Observation model: Measurement is linearly transformed state plus Gaussian noiseY_t \sim N(C_m X_t, \Sigma_{mt})
--- ## Kalman Filter ~~Quite a tradeoff between inconsistency of~~ → Combines prediction & measurement. It is a method of tracking linear dynamic systems with Gaussian noise. ### Kalman Prediction - Mean at time t: dynamics multiplies old mean - Corrected mean at time (t-1) - Variance at t: \( D^2 \) (corrected variance at t-1) + additional noise --- > Page 88 of 91 ## Nomenclature - \( \mu_t^-, \sigma_t^- \): Mean, covariance before measurement - \( \mu_t^+, \sigma_t^+ \): After measurement (vice versa) ## Kalman Correction (for 1-D)\mu_t^+ = \frac{\mu_t^- \sigma_m^2 + m y (\sigma_t^-)^2}{\sigma_m^2 + m^2 (\sigma_t^-)^2}
(\sigma_t^+)^2 = \frac{\sigma_m^2 (\sigma_t^-)^2}{\sigma_m^2 + m^2 (\sigma_t^-)^2}
\mu_t^+ = \mu_t^- + K \left( y_t - m \mu_t^- \right)
Where \( K \) is the Kalman gain. Residual = \( y_t - m \mu_t^- \) - Also, variance lowers after measurement. ### Pictorial Representation ```mermaid flowchart LR A[p(x)] -->|Deterministic drift| B[p(x)] A2[p(x)] -->|Stochastic diffusion| B2[p(x)] A3[p(x)] -->|Negative effect of measurement| B3[p(x)] ``` --- > Page 89 of 91 - Simple updates, compact & efficient **Cons** - Unimodal distribution, only single hypothesis - Restricted class of motions defined by linear model - Extension called "Extended KF Filter" What about Non-Gaussian? ## Particle Filter Here \( z_t \) = measurements - Density is represented by particles, where the positions and their weight. ```mermaid flowchart LR A[p(x)] --> B[Particles (set of n samples)] ``` Goal:p(x_t \in X_t) \approx p(X_t \mid z_{1:t})
with equality when \( n \to \infty \) (number of particles). --- ## Bayes Filter --- > Page 90 of 91 1. Prior probability of the system states \( p(x) \) 2. Action (dynamical system) model \( p(x_{t+1} \mid u_t, x_t) \) 3. Sensor model ("likelihood") \( p(z \mid x) \) 4. Stream of observations \( z \) and action data \( u \) - data: \( t-1, u_{t-1}, z_2, \ldots, u_{t-1}, z_t \) Wanted: 1. Estimate of the state X at time t, 2. Posterior of the state is also called belief:Bel(x_t) = p(x_t \mid u_{1:t}, z_{1:t})
#### Fluctuations of input (diagram) ```mermaid flowchart LR U1[U_{t-1}] --> X1[X_{t-1}] X1 --> Z1[Z_{t-1}] X1 --> X2[X_t] U2[U_t] --> X2 X2 --> Z2[Z_t] X2 --> X3[X_{t+1}] U3[U_{t+1}] --> X3 X3 --> Z3[Z_{t+1}] ```p(Z_t \mid x_{0:t}, z_{1:t-1}, u_{1:t}) = p(Z_t \mid X_t)
p(X_{t+1} \mid x_{1:t-1}, z_{1:t-1}, u_{1:t}) = p(X_{t+1} \mid x_t, u_t)
--- > Page 91 of 91 ## Bayes FilterBel(x_t) = p(x_t \mid o_{1:t}, u_{1:t}, z_{1:t}, z_{1:t-1})
= \eta p(z_t \mid x_t) p(x_t \mid u_t, z_{1:t-1}, u_{1:t}, x_{t-1})
= \eta p(z_t \mid x_t) p(x_t \mid u_t, z_{1:t-1}, u_{1:t-1}, x_{t-1})
= \eta p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1}) Bel(x_{t-1}) dx_{t-1}
Bel(x_t) = \eta p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1}) Bel(x_{t-1}) dx_{t-1}