Linear Algebra for AI: Path 2 – Rapid Learner

Fast-track linear algebra for ML when you've seen the basics before but need to connect it all to machine learning.

Path 2: Beatriz (Rapid Learner)

Meet Beatriz: You took calculus and maybe linear algebra in college, but that was 3-5 years ago and you've forgotten most of it. You're working as a data analyst, software engineer, or product manager. You're comfortable with Python, SQL, and basic statistics. You've used scikit-learn but want to actually understand what's happening. You don't need to prove theorems, but you want solid intuition and practical skills.

Core Philosophy: Rapid review of foundations, then jump straight to ML applications. Balance theory and practice. Emphasis on "when to use what."


Week 1-2: Rapid Foundation Review

You need to reactivate dormant knowledge quickly.

🎥 3Blue1Brown: Essence of Linear Algebra (Full Series)

Watch all 16 videos: Complete Playlist

Why start here: Even though you've seen this before, the geometric intuition will be new and will drastically improve your understanding.

Binge-watch strategy:

  • Videos 1-8: Foundational (pause and code examples)
  • Videos 9-12: More advanced (dot products, cross products, Cramer's rule)
  • Videos 13-16: Change of basis, eigenvectors, abstract vector spaces (watch for intuition, don't stress details)

Active watching:

  • Pause after each major concept
  • Implement in NumPy immediately
  • Create your own examples

Key concepts to cement:

  • Linear transformations as matrices
  • Matrix multiplication as composition
  • Determinant as scaling factor
  • Eigenvectors as stable directions
  • Dot products as projections
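
A quick way to cement these five ideas is to check each one numerically. The sketch below is only an illustration; the matrices `rotate_90` and `stretch` and the vectors `v` and `u` are made-up examples, not taken from the videos.

```python
import numpy as np

# A linear transformation is a matrix: its columns show where the basis vectors land.
rotate_90 = np.array([[0.0, -1.0],
                      [1.0,  0.0]])   # rotates the plane 90 degrees counterclockwise
stretch = np.array([[2.0, 0.0],
                    [0.0, 3.0]])      # scales x by 2 and y by 3

# Matrix multiplication is composition: apply rotate_90 first, then stretch.
v = np.array([1.0, 1.0])
composed = stretch @ rotate_90
assert np.allclose(composed @ v, stretch @ (rotate_90 @ v))

# The determinant is the factor by which areas scale (here 2 * 3).
area_scale = np.linalg.det(stretch)

# Eigenvectors keep their direction under the transformation: A x = lambda x.
eigenvalues, eigenvectors = np.linalg.eig(stretch)
x = eigenvectors[:, 0]
assert np.allclose(stretch @ x, eigenvalues[0] * x)

# A dot product with a unit vector is the signed length of the projection onto it.
u = np.array([1.0, 0.0])
projection_length = v @ u
```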

🎓 Strang 18.06: Rapid Core Sequence (Lectures 1-10, 14-16)

You're watching for refresh and depth, not learning from scratch.

Week 1 Focus: Systems and Spaces

Lecture 1: The Geometry of Linear Equations

  • Three views of Ax=b: You saw this in college, now it'll actually make sense
  • Code along: Solve 3x3 system both geometrically and algebraically
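
For the code-along, one possible starting point is to solve a small system and then confirm the row and column pictures numerically. The system below is just an example chosen to have a clean solution; swap in your own.

```python
import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -3.0,  4.0]])
b = np.array([0.0, -1.0, 4.0])

# Matrix picture: solve Ax = b directly.
x = np.linalg.solve(A, b)

# Row picture: each row defines a plane, and x lies on all three.
for row, rhs in zip(A, b):
    assert np.isclose(row @ x, rhs)

# Column picture: b is a combination of the columns of A, weighted by x.
combination = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
assert np.allclose(combination, b)
```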

Lecture 2: Elimination with Matrices

  • Gaussian elimination: The algorithm you learned but probably forgot
  • Implement: Basic elimination with NumPy

Lecture 3: Multiplication and Inverse Matrices

  • Five ways to multiply: Focus on column and row pictures
  • When does inverse exist? Connect to data science (singular matrices = collinear features)

Lecture 5: Transposes, Permutations, Vector Spaces

  • Transpose: You use this constantly in ML without thinking about it
  • Symmetric matrices: Covariance matrices, gram matrices

Lecture 6: Column Space and Nullspace

  • Critical: Column space = every prediction Xw a linear model can produce; nullspace = weight directions that leave predictions unchanged (redundant features)
  • ML connection: Feature selection, dimensionality reduction
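
To see the nullspace as redundancy, here is a small sketch on synthetic data (the features `x1`, `x2` and the weight vector `w` are invented for illustration). The third feature is the sum of the first two, so there is a direction in weight space that changes nothing about the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
X = np.column_stack([x1, x2, x1 + x2])   # third feature is redundant

# Right singular vectors with (near-)zero singular values span the nullspace.
_, singular_values, Vt = np.linalg.svd(X)
null_vector = Vt[-1]                     # proportional to (1, 1, -1)

# Moving the weights along the nullspace leaves every prediction unchanged.
w = np.array([1.0, -2.0, 0.5])
assert np.allclose(X @ w, X @ (w + 10.0 * null_vector))

# The rank counts the features that carry independent information.
rank = np.linalg.matrix_rank(X)
```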

Week 2 Focus: Subspaces and Projections

Lecture 9: Independence, Basis, Dimension

  • Linear independence: Which features actually add information?
  • Dimension: The true degrees of freedom in your data

Lecture 10: The Four Fundamental Subspaces

  • The Big Picture: Every matrix has four fundamental spaces
  • ML insight: Understanding these explains model behavior

Lecture 14: Orthogonal Vectors and Subspaces

  • Orthogonality: Uncorrelated features, perpendicular directions
  • Gram-Schmidt: How to orthogonalize features

Lecture 15: Projections onto Subspaces

  • The geometry of regression: Projection is least squares
  • P = A(A^T A)^(-1)A^T: The projection matrix behind least-squares regression (requires independent columns in A)
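
The link between projection and least squares can be checked directly. This is a minimal sketch on random data, assuming `A` has independent columns; the sizes (20 samples, 3 features) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))      # 20 samples, 3 features
b = rng.normal(size=20)           # targets

# Projection matrix onto the column space of A: P = A (A^T A)^(-1) A^T.
P = A @ np.linalg.solve(A.T @ A, A.T)

projection = P @ b                # closest point to b inside the column space
residual = b - projection         # what the model cannot explain

# Least squares produces exactly the same fitted values.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
fitted = A @ x_hat
```

Three properties are worth verifying yourself: projecting twice changes nothing (`P @ P` equals `P`), `P` is symmetric, and the residual is orthogonal to every column of `A`.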

Lecture 16: Projection Matrices and Least Squares


💻 Coding Checkpoint #1: Implement Core Algorithms

Build these from scratch with NumPy:

1. Linear Regression (Three Ways)

import numpy as np

# Method 1: Normal equations (solve the system rather than forming an explicit inverse)
def linear_regression_normal(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Method 2: QR decomposition (more stable)
def linear_regression_qr(X, y):
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)

# Method 3: Gradient descent
def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    # Implement this yourself
    pass

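Compare all three on real data and note when each is preferred: normal equations for small, well-conditioned problems; QR when X^T X is ill-conditioned; gradient descent when the dataset is too large to factor.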

2. Gram-Schmidt Orthogonalization

def gram_schmidt(A):
    # Orthogonalize columns of A
    # Return Q (orthonormal) and R (upper triangular)
    pass

Test on random matrices. Verify Q^T Q = I.


Week 3-4: Eigenvalues and Matrix Factorizations

This is where linear algebra becomes powerful for ML.

🎓 Strang 18.06: Eigenvalue Lectures

Lecture 21: Eigenvalues and Eigenvectors

  • Ax = λx: The most important equation in linear algebra
  • ML connection: PCA finds eigenvectors of the covariance matrix

After watching: Compute eigenvectors of a 2x2 covariance matrix by hand. Then with NumPy. Visualize the eigenvectors overlaid on your data.
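
For the NumPy half of this exercise, a sketch along these lines may help. The data is synthetic (stretched along one axis, then rotated by 30 degrees, both arbitrary choices), so you know in advance which direction the top eigenvector should find.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: stretch along the x-axis, then rotate by 30 degrees.
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
data = (rng.normal(size=(500, 2)) * np.array([3.0, 0.5])) @ rotation.T

cov = np.cov(data, rowvar=False)                  # 2x2 symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

principal_direction = eigenvectors[:, -1]         # direction of greatest variance
expected_direction = np.array([np.cos(theta), np.sin(theta)])
alignment = abs(principal_direction @ expected_direction)   # near 1 if they match
```

To visualize, scatter-plot `data` and draw each eigenvector as an arrow from the data mean, scaled by the square root of its eigenvalue.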


Lecture 25: Symmetric Matrices and Positive Definiteness

  • Symmetric matrices: Real eigenvalues, orthogonal eigenvectors
  • Positive definite: x^T A x > 0 for every nonzero x (bowl-shaped loss surfaces in optimization)
  • ML connection: Hessian matrices, convergence guarantees
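
One simple way to test positive definiteness for a symmetric matrix is to check that all eigenvalues are positive. The helper name `is_positive_definite` and the two example matrices are illustrative only.

```python
import numpy as np

def is_positive_definite(A):
    """Return True if the symmetric matrix A has only positive eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

bowl = np.array([[2.0, 1.0],
                 [1.0, 2.0]])      # eigenvalues 1 and 3: a bowl, unique minimum
saddle = np.array([[1.0, 2.0],
                   [2.0, 1.0]])    # eigenvalues -1 and 3: a saddle

# For the saddle, some direction x gives x^T A x < 0.
x = np.array([1.0, -1.0])
saddle_energy = x @ saddle @ x
bowl_energy = x @ bowl @ x
```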

Lecture 29: Singular Value Decomposition (SVD)

  • THE most important matrix factorization: A = UΣV^T
  • Works for ANY matrix (even non-square!)
  • Geometric meaning: Rotate (V^T), scale (Σ), rotate again (U)
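
A short check on a non-square matrix makes the factorization concrete. The 5x3 shape here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))       # non-square on purpose

# Reduced SVD: U is 5x3, S holds 3 singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors multiply back to A.
reconstructed = U @ np.diag(S) @ Vt

# U and V have orthonormal columns; S is nonnegative and sorted descending.
U_gram = U.T @ U
V_gram = Vt @ Vt.T
```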

🎓 Strang 18.065: ML-Focused Deep Dive

Now jump to the ML-specific course. These lectures assume you know 18.06.

YouTube Playlist: https://www.youtube.com/playlist?list=PLUl4u3cNGP63oMNUHXqIUcrkS2PivhN3k

Full OCW Course Page: https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/


Lecture 1: The Column Space of A Contains All Vectors Ax

  • Connecting column space to ML model capacity
  • What can your features represent?

Lecture 2: Multiplying and Factoring Matrices

  • LU, QR, and why factorization matters computationally
  • Real-world: Large-scale linear systems

Lecture 6: Singular Value Decomposition (SVD)

  • Computational algorithms for SVD
  • Full vs. truncated SVD
  • Applications: Compression, denoising, recommendations

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

  • In this lecture, Professor Strang reviews Principal Component Analysis (PCA), which is a major tool in understanding a matrix of data. In particular, he focuses on the Eckart-Young low rank approximation theorem.
  • PCA as maximizing variance
  • PCA as minimizing reconstruction error
  • Connection to SVD: The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix
  • You've used sklearn.decomposition.PCA; now you know what it does
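
The two views of PCA and the Eckart-Young result can be verified in a few lines. This sketch uses synthetic correlated data and `k = 2`, both arbitrary; it is a check of the math, not a replacement for a PCA library.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated features
Xc = X - X.mean(axis=0)                                   # center each column

# Route 1: SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
variance_from_svd = S**2 / (len(Xc) - 1)

# Route 2: eigendecomposition of the covariance matrix (sorted descending).
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Same directions up to sign: |v_i . q_i| should be 1 for each component.
alignments = np.abs(np.sum(Vt.T * eigenvectors, axis=0))

# Eckart-Young: the rank-k truncation has error equal to the discarded singular values.
k = 2
Xk = (U[:, :k] * S[:k]) @ Vt[:k]
error = np.linalg.norm(Xc - Xk)           # Frobenius norm
expected_error = np.sqrt(np.sum(S[k:] ** 2))
```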

💻 Coding Checkpoint #2: SVD and PCA Applications

Project 1: Image Compression

# Load image
from PIL import Image
import numpy as np

img = np.array(Image.open('photo.jpg').convert('L'))

# Apply SVD
U, S, Vt = np.linalg.svd(img, full_matrices=False)

# Compress: Keep only top k singular values
k = 50
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Calculate compression ratio and plot quality vs. size

Experiment: Plot PSNR vs. k. Where's the elbow?
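
One way to set up this experiment is sketched below. The `psnr` and `compression_ratio` helpers are illustrative names, the storage count assumes you keep k columns of U, k singular values, and k rows of Vt, and a synthetic noisy pattern stands in for the photo so the snippet runs on its own.

```python
import numpy as np

def psnr(original, approximation, max_value=255.0):
    """Peak signal-to-noise ratio in decibels (higher means closer to the original)."""
    mse = np.mean((original.astype(float) - approximation) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value**2 / mse)

def compression_ratio(shape, k):
    """Original pixel count divided by the numbers stored for a rank-k approximation."""
    m, n = shape
    return (m * n) / (k * (m + n + 1))

# Synthetic stand-in for the grayscale image (smooth pattern plus noise).
rng = np.random.default_rng(0)
rows = np.linspace(0, 1, 128)[:, None]
cols = np.linspace(0, 1, 96)[None, :]
img = 255.0 * (0.5 + 0.5 * np.sin(6 * rows) * np.cos(4 * cols))
img = np.clip(img + rng.normal(scale=5.0, size=img.shape), 0, 255)

U, S, Vt = np.linalg.svd(img, full_matrices=False)
scores = {}
for k in (1, 5, 20, 50):
    approx = (U[:, :k] * S[:k]) @ Vt[:k, :]
    scores[k] = psnr(img, approx)
```

Plotting `scores` against k (and against `compression_ratio(img.shape, k)`) gives the quality-versus-size curve. Note that the ratio drops toward 1 as k grows, so large k values save little.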


Project 2: Customer Segmentation with PCA

  • Dataset: Mall Customers
  • Steps:
    1. Load customer data (age, income, spending score, etc.)
    2. Standardize features
    3. Apply PCA to reduce to 2D
    4. Visualize customers in principal component space
    5. Apply K-means clustering
    6. Interpret: What do PC1 and PC2 represent?

Compare: Clustering in original space vs. PC space.


Project 3: Recommender System

  • Dataset: MovieLens 100K
  • Build matrix factorization from scratch using SVD
  • Predict missing ratings
  • Evaluate: RMSE on held-out test set
  • Compare with sklearn's NMF (Non-negative Matrix Factorization)
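
Before loading MovieLens, it can help to try the idea on a toy matrix. The sketch below makes two simplifying assumptions that are not from the dataset or any library: missing ratings are filled with each user's mean, and `k = 2` latent factors are kept. The RMSE shown is on observed entries only, so it is a training error, not the held-out score the project asks for.

```python
import numpy as np

# Toy user-by-movie ratings; np.nan marks a missing rating.
ratings = np.array([[5.0, 4.0, np.nan, 1.0],
                    [4.0, np.nan, 1.0, 1.0],
                    [1.0, 1.0, 5.0, np.nan],
                    [np.nan, 1.0, 4.0, 5.0]])
observed = ~np.isnan(ratings)

# Assumption: fill gaps with the user's mean, then center each user at zero.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
filled = np.where(observed, ratings, user_means)
centered = filled - user_means

# Keep the top-k singular directions as latent factors.
k = 2
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
low_rank = (U[:, :k] * S[:k]) @ Vt[:k, :]
predictions = user_means + low_rank

# Training RMSE on the entries we actually observed.
rmse_observed = np.sqrt(np.mean((predictions[observed] - ratings[observed]) ** 2))
```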

Week 5-6: Optimization and Gradient Methods

You need to understand how ML models actually learn.

🎓 Strang 18.065: Optimization Sequence

Lecture 4: Eigenvalues and Eigenvectors

  • Connecting eigenvalues to optimization
  • Condition number: Why some problems are hard
  • ML connection: Loss surface curvature
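
To feel why the condition number matters, count gradient descent steps on two quadratic bowls. This is a sketch under stated assumptions: the loss is f(x) = 0.5 x^T A x, the step size is 1/lambda_max, and the helper name `gd_steps_to_converge` is made up.

```python
import numpy as np

def gd_steps_to_converge(A, tol=1e-6, max_steps=100_000):
    """Minimize f(x) = 0.5 x^T A x with step 1/lambda_max; count steps until ||x|| < tol."""
    eigenvalues = np.linalg.eigvalsh(A)       # ascending order
    lr = 1.0 / eigenvalues[-1]
    x = np.ones(A.shape[0])
    for step in range(1, max_steps + 1):
        x = x - lr * (A @ x)                  # gradient of f is A x
        if np.linalg.norm(x) < tol:
            return step
    return max_steps

well_conditioned = np.diag([1.0, 2.0])        # condition number 2
ill_conditioned = np.diag([1.0, 100.0])       # condition number 100

steps_well = gd_steps_to_converge(well_conditioned)
steps_ill = gd_steps_to_converge(ill_conditioned)
```

The flat direction (small eigenvalue) limits progress: each step shrinks it by only 1 - lambda_min/lambda_max, so a larger condition number means many more steps.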

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

  • Best rank-k approximation (optimal!)
  • Matrix completion problem (Netflix)
  • Low-rank structure in real data

Lecture 21: Minimizing a Function Step by Step

  • Three terms of Taylor series
  • Downhill direction from first partial derivatives
  • Newton's method uses higher derivatives (Hessian)
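
On a quadratic, the difference between a gradient step and a Newton step is easy to see. The matrix `A`, vector `b`, and the 0.1 learning rate below are arbitrary illustration values for f(x) = 0.5 x^T A x - b^T x.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])     # Hessian of f, symmetric positive definite
b = np.array([1.0, 1.0])

def gradient(x):
    return A @ x - b           # first derivatives of f

x0 = np.zeros(2)

# Gradient descent: a small move in the downhill direction.
gd_step = x0 - 0.1 * gradient(x0)

# Newton: rescale the gradient by the inverse Hessian (solve, do not invert).
newton_step = x0 - np.linalg.solve(A, gradient(x0))

# The true minimizer solves A x = b.
minimizer = np.linalg.solve(A, b)
```

For a quadratic the second-order Taylor model is exact, so Newton lands on the minimizer in a single step. On general losses it only does so approximately, and forming the Hessian can be expensive.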

Lecture 22: Gradient Descent: Downhill to a Minimum

  • Intuition: Follow the slope downhill
  • Step size matters: Too large → divergence, too small → slow
  • Convex vs. non-convex landscapes
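
The step-size tradeoff shows up even in one dimension. The sketch below runs gradient descent on f(x) = 0.5 * curvature * x^2 with three learning rates; the function name `run_gd` and the specific values are illustrative.

```python
def run_gd(lr, curvature=10.0, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2; returns |x| after `steps`."""
    x = x0
    for _ in range(steps):
        x = x - lr * curvature * x     # gradient of f is curvature * x
    return abs(x)

too_large = run_gd(lr=0.25)    # each step multiplies x by -1.5: diverges
well_chosen = run_gd(lr=0.1)   # lr = 1 / curvature: reaches the minimum
too_small = run_gd(lr=0.001)   # each step multiplies x by 0.99: slow
```

Each step multiplies x by (1 - lr * curvature), so the iteration converges only when that factor has magnitude below 1, that is, when lr < 2 / curvature.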

Lecture 23: Accelerating Gradient Descent (Use Momentum)

  • Momentum: Use past gradients to accelerate
  • Adaptive learning rates
  • Why modern optimizers work so well

Lecture 25: Stochastic Gradient Descent

  • Why mini-batches? Computational efficiency, and the gradient noise can help escape saddle points and poor local minima
  • SGD vs. full batch gradient descent
  • This is how neural networks train


💻 Coding Checkpoint #3: Optimization from Scratch

Build gradient descent variants:

import numpy as np

# 1. Vanilla gradient descent
def gradient_descent(X, y, lr=0.01, epochs=1000):
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        gradient = X.T @ (X @ w - y) / len(y)
        w -= lr * gradient
    return w

# 2. SGD with mini-batches
def sgd(X, y, lr=0.01, epochs=100, batch_size=32):
    # Implement this
    pass

# 3. Momentum
def momentum_gd(X, y, lr=0.01, momentum=0.9, epochs=1000):
    # Implement this
    pass

# 4. Adam (simplified)
def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, epochs=1000):
    # Implement this
    pass

Test on quadratic functions first, then linear regression, then logistic regression.

Visualization Project:

  • Create animated GIF showing gradient descent paths
  • Compare different optimizers on the same loss surface
  • Use contour plots to show convergence

Week 7-8: Neural Networks and Modern ML

Time to understand deep learning.

🎓 Strang 18.065: Neural Network Lectures

Lecture 26: Structure of Neural Nets for Deep Learning

  • Neural networks = composition of linear transformations + nonlinearities
  • Layers of nodes with weight matrices between layers
  • Nonlinear activation (ReLU): Negative values become zero
  • How the learning function is constructed from training data
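
A two-layer forward pass fits in a few lines. The layer sizes (3 inputs, 4 hidden units, 2 outputs) and random weights are arbitrary; the point is the structure, plus a check of why the nonlinearity matters.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # negative values become zero

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 inputs -> 4 hidden units
A2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: 4 hidden units -> 2 outputs

def forward(v):
    hidden = relu(A1 @ v + b1)         # linear map, then nonlinearity
    return A2 @ hidden + b2            # final linear layer

v = np.array([1.0, -2.0, 0.5])
output = forward(v)

# Without ReLU, two layers collapse into a single matrix A2 @ A1.
linear_only = A2 @ (A1 @ v)
collapsed = (A2 @ A1) @ v
```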

Lecture 27: Backpropagation: Find Partial Derivatives

  • Backprop = chain rule applied to computation graph
  • Why it's called "backpropagation": Reverse mode from output to input
  • How PyTorch and TensorFlow implement autograd
  • Forward pass: compute outputs; Backward pass: compute gradients
  • Key step: Backprop + stochastic gradient descent
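
Here is a minimal sketch of the forward and backward passes for a one-hidden-layer network with squared loss, followed by a finite-difference check. All names (`W1`, `w2`, `loss`) and sizes are made up for illustration; frameworks automate exactly this bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # one input example
y = 1.0                         # its target
W1 = rng.normal(size=(4, 3))    # hidden-layer weights
w2 = rng.normal(size=4)         # output-layer weights

def loss(W1, w2):
    h = np.maximum(W1 @ x, 0.0)
    return 0.5 * (w2 @ h - y) ** 2

# Forward pass: compute outputs and keep the intermediates.
z = W1 @ x
h = np.maximum(z, 0.0)
error = w2 @ h - y

# Backward pass: apply the chain rule from the output back to the input.
grad_w2 = error * h               # dL/dw2
grad_h = error * w2               # dL/dh
grad_z = grad_h * (z > 0)         # ReLU passes gradient only where z > 0
grad_W1 = np.outer(grad_z, x)     # dL/dW1

# Finite-difference check on a single weight.
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, w2) - loss(W1_minus, w2)) / (2 * eps)
```

Comparing `numeric` with `grad_W1[0, 0]` is a habit worth keeping when you write the backward pass for the final project.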

Lecture 32: ImageNet is a CNN, The Convolution Rule

  • ImageNet: A large labeled image dataset; the networks that won its classification benchmark are convolutional neural networks (CNNs)
  • Convolution as matrix multiplication (circulant/Toeplitz structure)
  • Convolution matrices have ≤ n parameters (not n²): Fewer weights to compute
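  • Convolution Rule: F(c ⊛ d) = (Fc) .* (Fd), i.e., the Fourier matrix F turns cyclic convolution into component-by-component multiplication
  • Why CNNs for images: Shared weights make convolution translation-equivariant (a shifted input gives a shifted feature map)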
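
The convolution rule and the circulant-matrix view can both be checked with NumPy's FFT. The vectors `c` and `d` are arbitrary examples, and `cyclic_convolution` is a deliberately naive helper written for clarity rather than speed.

```python
import numpy as np

def cyclic_convolution(c, d):
    """Direct cyclic convolution: (c * d)[i] = sum_j c[j] * d[(i - j) mod n]."""
    n = len(c)
    return np.array([sum(c[j] * d[(i - j) % n] for j in range(n)) for i in range(n)])

c = np.array([1.0, 2.0, 0.0, 0.0])
d = np.array([3.0, 0.0, 1.0, 0.0])

direct = cyclic_convolution(c, d)

# Convolution rule: transform, multiply component by component, transform back.
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(d)).real

# Convolution as a matrix: a circulant matrix built from shifts of c.
# It is n x n but has only n free parameters.
C = np.column_stack([np.roll(c, shift) for shift in range(len(c))])
via_matrix = C @ d
```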

Lecture 33: Neural Nets and the Learning Function

  • Construction of learning function F(v) = ReLU(A₁v + b₁)
  • Universal approximation through layer composition
  • Optimizing weights (A's and b's) to minimize loss function
  • How many parameters (weights and biases) are needed?
  • Memory vs. computation tradeoffs in network architecture
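
Counting parameters for a fully connected network is a one-liner. The helper name `count_parameters` and the layer sizes below (an MNIST-sized input with two hidden layers) are just an example.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each pair of adjacent layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 784 inputs -> 128 -> 64 -> 10 outputs
mlp_parameters = count_parameters([784, 128, 64, 10])
```

Most of the parameters sit in the first layer (784 * 128 weights), which is one reason convolutional layers with shared weights are attractive for images.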

💻 Final Project: Build a Neural Network from Scratch

Project: Multi-layer Perceptron for MNIST

Implement without any ML frameworks (just NumPy):

class NeuralNetwork:
    def __init__(self, layer_sizes):
        # Initialize weights and biases
        pass
    
    def forward(self, X):
        # Forward pass through all layers
        # Store activations for backprop
        pass
    
    def backward(self, X, y):
        # Compute gradients via backprop
        pass
    
    def train(self, X, y, epochs, lr, batch_size):
        # Mini-batch SGD training loop
        pass
    
    def predict(self, X):
        return self.forward(X)

Requirements:

  • 2-3 hidden layers
  • ReLU activation
  • Softmax output
  • Cross-entropy loss
  • Mini-batch SGD
  • Achieve >95% accuracy on MNIST
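
Two pieces of this list are easy to get numerically wrong: softmax and cross-entropy. The sketch below shows one common, stable way to write them, along with the gradient with respect to the logits. The function names and the tiny `1e-12` safeguard inside the log are choices made here, not requirements.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

logits = np.array([[1000.0, 1001.0, 1002.0],   # large values would overflow a naive exp
                   [0.0, 0.0, 0.0]])
labels = np.array([2, 0])

probs = softmax(logits)
loss = cross_entropy(probs, labels)

# Gradient of the mean loss with respect to the logits: (probs - one_hot) / n.
grad_logits = probs.copy()
grad_logits[np.arange(len(labels)), labels] -= 1.0
grad_logits /= len(labels)
```

That compact gradient is what your `backward` method would start from before propagating through the hidden layers.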

Then: Implement the same network in PyTorch. Compare:

  • Code complexity
  • Training speed
  • Ease of experimentation

Understand: Why frameworks exist, but also what they're doing under the hood.


Week 9-10: Advanced Topics & Specialization

Choose based on your interests.

Track A: Natural Language Processing

Focus: Embeddings, attention, transformers

Project:

  • Word embeddings: Implement word2vec from scratch (skip-gram model)
  • Document similarity using cosine similarity in embedding space
  • Visualize embeddings with t-SNE or PCA

Watch:

1. Stanford CS224N (Winter 2021) - Lectures 1 & 2

Lecture 1: Introduction and Word Vectors

Lecture 2: Neural Classifiers

Covers skip-gram implementation details, optimization, and neural classifiers

These lectures give you the complete mathematical foundation for implementing word2vec from scratch, including gradient computations for backprop through skip-gram.


2. StatQuest: Word Embedding and Word2Vec, Clearly Explained


3. StatQuest: t-SNE, Clearly Explained

These three cover everything you need: the theory and math (Stanford), the intuitive understanding (Word2Vec video), and the visualization technique (t-SNE video). Total watch time is manageable at ~3-4 hours for the core content.


Track B: Computer Vision

Focus: Convolutional architectures, image processing

Project:

  • Implement basic CNN in PyTorch
  • Train on CIFAR-10 or Fashion-MNIST
  • Experiment with architecture: depth, filters, pooling
  • Visualize: What do convolutional filters learn?

Application:

  • Image classification
  • Transfer learning (use pretrained ResNet)
  • Style transfer (bonus: implement neural style transfer)

Watch:


Track C: Recommender Systems

Focus: Matrix factorization, collaborative filtering

Project: Advanced Recommender System

  • Dataset: MovieLens 1M (larger)
  • Techniques:
    1. SVD-based collaborative filtering
    2. Non-negative matrix factorization (NMF)
    3. Neural collaborative filtering
  • Metrics: RMSE, precision@k, recall@k
  • Business application: Cold start problem, diversity vs. accuracy
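
Precision@k and recall@k are simple to compute once you have a ranked list. The helper below is a sketch for a single user; the function name and the toy item IDs are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of items the user liked."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k                                   # share of the top k that were relevant
    recall = hits / len(relevant) if relevant else 0.0     # share of relevant items recovered
    return precision, recall

recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
precision, recall = precision_recall_at_k(recommended, relevant, k=3)
```

In practice you would average these over all users in the test set and report them alongside RMSE, since a model can predict ratings well and still rank poorly.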

Reading: "Matrix Factorization Techniques for Recommender Systems" (Koren et al., 2009)


📚 Complete Resource Library

Course Homepages:

Textbooks (Free PDFs):

Supplementary:


💡 Success Strategies

Time Management (5-6 hours/week):

  • 2.5 hours: Lectures (evenings, 1.5x speed)
  • 1.5 hours: Coding projects (weekend mornings)
  • 1.5 hours: Practice problems (broken into 30-min sessions)

Learning Techniques:

  • Code immediately: After every major concept, code an example
  • Teach it: Explain to colleague or write blog post
  • Connect to work: How does this apply to your current job?
  • Build portfolio: Put projects on GitHub

Common Pitfalls:

  • ❌ Watching passively without implementing
  • ❌ Trying to understand every proof (you don't need to!)
  • ❌ Skipping the coding checkpoints
  • ❌ Not connecting concepts to practical ML

Career Applications:

  • Data Analyst → Data Scientist: You now understand the models you use
  • Software Engineer → ML Engineer: You can implement and debug ML systems
  • Product Manager: You can have technical conversations with ML teams
  • Consultant: You can assess ML solutions and vendors

Next Steps After Completion:

  1. Andrew Ng's Deep Learning Specialization (Coursera)
  2. Fast.ai course (practical deep learning)
  3. Kaggle competitions (apply your knowledge)
  4. Read ML papers (you can now understand the math!)

You're building the foundation to transition into ML roles or level up in your current position. The combination of theoretical understanding and practical skills is exactly what employers want. 🎯