Linear Algebra for AI: Path 2 – Rapid Learner

Fast-track linear algebra for ML when you've seen the basics before but need to connect it all to machine learning.

Ernesto Diaz-Aviles

09 Dec 2025 • 8 min read

Path 2: Beatriz (Rapid Learner)

Meet Beatriz: You took calculus and maybe linear algebra in college, but that was 3-5 years ago and you've forgotten most of it. You're working as a data analyst, software engineer, or product manager. You're comfortable with Python, SQL, and basic statistics. You've used scikit-learn but want to actually understand what's happening. You don't need to prove theorems, but you want solid intuition and practical skills.

Core Philosophy: Rapid review of foundations, then jump straight to ML applications. Balance theory and practice. Emphasis on "when to use what."

Week 1-2: Rapid Foundation Review

You need to reactivate dormant knowledge quickly.

🎥 3Blue1Brown: Essence of Linear Algebra (Full Series)

Watch all 16 videos: Complete Playlist

Why start here: Even though you've seen this before, the geometric intuition will be new and will drastically improve your understanding.

Binge-watch strategy:

Videos 1-8: Foundational (pause and code examples)
Videos 9-12: More advanced (focus on eigenvectors, dot products)
Videos 13-16: Abstract concepts (watch for intuition, don't stress details)

Active watching:

Pause after each major concept
Implement in NumPy immediately
Create your own examples

Key concepts to cement:

Linear transformations as matrices
Matrix multiplication as composition
Determinant as scaling factor
Eigenvectors as stable directions
Dot products as projections
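
To make the "implement in NumPy immediately" habit concrete, here is a minimal sketch of three of the ideas above. The specific matrices and the vector are arbitrary choices for illustration, not examples taken from the videos.

```python
import numpy as np

# Two linear transformations of the plane (illustrative choices).
rotate_90 = np.array([[0.0, -1.0],
                      [1.0,  0.0]])   # counterclockwise quarter turn
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])        # horizontal shear

v = np.array([2.0, 1.0])

# Matrix multiplication as composition: shear first, then rotate,
# gives the same result as applying the single product matrix.
step_by_step = rotate_90 @ (shear @ v)
composed = (rotate_90 @ shear) @ v

# Determinant as scaling factor: stretching x by 2 and y by 3
# multiplies every area by 6.
stretch = np.array([[2.0, 0.0],
                    [0.0, 3.0]])
area_factor = np.linalg.det(stretch)

# Dot product as projection: dotting with a unit vector gives the
# (signed) length of the projection onto that direction.
unit_x = np.array([1.0, 0.0])
projection_length = v @ unit_x
```

Try swapping the order of `rotate_90` and `shear` to see that composition is not commutative.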

🎓 Strang 18.06: Rapid Core Sequence (Lectures 1-10, 14-16)

You're watching for refresh and depth, not learning from scratch.

Week 1 Focus: Systems and Spaces

Lecture 1: The Geometry of Linear Equations

Three views of Ax=b: You saw this in college, now it'll actually make sense
Code along: Solve 3x3 system both geometrically and algebraically

Lecture 2: Elimination with Matrices

Gaussian elimination: The algorithm you learned but probably forgot
Implement: Basic elimination with NumPy
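
If you want something to compare your own attempt against, here is one possible sketch of elimination with partial pivoting followed by back substitution. The function name `gaussian_elimination` and the 3x3 example system are assumptions made for illustration; they are not taken from the lecture.

```python
import numpy as np

def gaussian_elimination(A, b):
    """Solve Ax = b by elimination with partial pivoting, then back substitution."""
    A = np.array(A, dtype=float)   # work on copies
    b = np.array(b, dtype=float)
    n = len(b)
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot spot.
        pivot_row = col + np.argmax(np.abs(A[col:, col]))
        if np.isclose(A[pivot_row, col], 0.0):
            raise ValueError("Matrix is singular (no usable pivot).")
        if pivot_row != col:
            A[[col, pivot_row]] = A[[pivot_row, col]]
            b[[col, pivot_row]] = b[[pivot_row, col]]
        # Subtract multiples of the pivot row to zero out entries below the pivot.
        for row in range(col + 1, n):
            factor = A[row, col] / A[col, col]
            A[row, col:] -= factor * A[col, col:]
            b[row] -= factor * b[col]
    # Back substitution on the upper triangular system.
    x = np.zeros(n)
    for row in range(n - 1, -1, -1):
        x[row] = (b[row] - A[row, row + 1:] @ x[row + 1:]) / A[row, row]
    return x

A = [[2.0, 1.0, -1.0],
     [-3.0, -1.0, 2.0],
     [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = gaussian_elimination(A, b)
```

Comparing `x` against `np.linalg.solve(A, b)` is a quick sanity check.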

Lecture 3: Multiplication and Inverse Matrices

Five ways to multiply: Focus on column and row pictures
When does inverse exist? Connect to data science (singular matrices = collinear features)
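
A small sketch of the "singular matrices = collinear features" connection, using synthetic data. The feature names and the exact linear relationship are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2.0 * x1 - x2            # exactly collinear with the first two features

X = np.column_stack([x1, x2, x3])
gram = X.T @ X                 # the matrix you would invert in the normal equations

rank = np.linalg.matrix_rank(X)   # 2, not 3: one column adds no information
cond = np.linalg.cond(gram)       # enormous: the matrix is numerically singular
```

In practice features are rarely exactly collinear, so a very large condition number is usually the warning sign rather than an outright failure to invert.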

Lecture 5: Transposes, Permutations, Vector Spaces

Transpose: You use this constantly in ML without thinking about it
Symmetric matrices: Covariance matrices, Gram matrices

Lecture 6: Column Space and Nullspace

Critical: Column space = every prediction a linear model can produce from your features; nullspace = weight combinations that change nothing (a sign of redundant features)
ML connection: Feature selection, dimensionality reduction

Week 2 Focus: Subspaces and Projections

Lecture 9: Independence, Basis, Dimension

Linear independence: Which features actually add information?
Dimension: The true degrees of freedom in your data

Lecture 10: The Four Fundamental Subspaces

The Big Picture: Every matrix has four fundamental spaces
ML insight: Understanding these explains model behavior

Lecture 14: Orthogonal Vectors and Subspaces

Orthogonality: Perpendicular directions (for mean-centered features, orthogonal means uncorrelated)
Gram-Schmidt: How to orthogonalize features

Lecture 15: Projections onto Subspaces

The geometry of regression: Projection is least squares
P = A(A^T A)^(-1)A^T: Most important formula in ML

Lecture 16: Projection Matrices and Least Squares

Normal equations: A^T Ax = A^T b
Connect to scikit-learn: What LinearRegression actually does
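
The projection picture and the normal equations can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and compares the normal-equations solution with `np.linalg.lstsq`, NumPy's general least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.column_stack([np.ones(20), rng.normal(size=20)])   # intercept + one feature
b = 1.0 + 2.0 * A[:, 1] + 0.1 * rng.normal(size=20)       # noisy line

# Normal equations: solve (A^T A) x = A^T b without forming an explicit inverse.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix P = A (A^T A)^(-1) A^T onto the column space of A.
P = A @ np.linalg.solve(A.T @ A, A.T)
b_proj = P @ b                 # the fitted values
residual = b - b_proj          # what the model cannot explain

# Reference answer from NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Three things worth verifying yourself: `P @ b` equals `A @ x_hat`, the residual is orthogonal to every column of `A`, and projecting twice changes nothing (`P @ P` equals `P`).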

💻 Coding Checkpoint #1: Implement Core Algorithms

Build these from scratch with NumPy:

1. Linear Regression (Three Ways)

import numpy as np

# Method 1: Normal equations
# Solve (X^T X) w = X^T y directly; this avoids forming an explicit inverse
def linear_regression_normal(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Method 2: QR decomposition (more stable)
# X = QR with orthonormal Q, so the system reduces to R w = Q^T y
def linear_regression_qr(X, y):
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)

# Method 3: Gradient descent
def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    # Implement this yourself
    pass

Compare all three on real data. Understand when each is preferred.

2. Gram-Schmidt Orthogonalization

def gram_schmidt(A):
    # Orthogonalize columns of A
    # Return Q (orthonormal) and R (upper triangular)
    pass

Test on random matrices. Verify Q^T Q = I.
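
Once you have tried it yourself, here is one possible reference to compare against. It is a sketch of the modified Gram-Schmidt variant (which subtracts each projection from the running vector), and it assumes the columns of A are linearly independent. The name `gram_schmidt_sketch` is chosen here to keep it separate from your own function.

```python
import numpy as np

def gram_schmidt_sketch(A):
    """Modified Gram-Schmidt on the columns of A (assumes full column rank)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            # Remove the component of v along each earlier orthonormal direction.
            R[i, j] = Q[:, i] @ v
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]      # normalize what is left
    return Q, R

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
Q, R = gram_schmidt_sketch(A)
```

Besides Q^T Q = I, it is worth checking that Q @ R reproduces A and that R is upper triangular.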

Week 3-4: Eigenvalues and Matrix Factorizations

This is where linear algebra becomes powerful for ML.

🎓 Strang 18.06: Eigenvalue Lectures

Lecture 21: Eigenvalues and Eigenvectors

Ax = λx: The most important equation in linear algebra
ML connection: PCA finds eigenvectors of the covariance matrix

After watching: Compute eigenvectors of a 2x2 covariance matrix by hand. Then with NumPy. Visualize the eigenvectors overlaid on your data.
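
As a worked instance of the exercise above, take the covariance matrix C = [[2, 1], [1, 2]] (an arbitrary example chosen here). By hand, det(C - λI) = (2 - λ)² - 1 = 0 gives λ = 1 and λ = 3, with eigenvectors along (1, -1) and (1, 1). The NumPy check might look like this:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
# and the eigenvectors are the columns of the second return value.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# The direction of largest variance is the eigenvector of the largest eigenvalue.
principal_direction = eigenvectors[:, -1]
```

The sign of each eigenvector is arbitrary, so `principal_direction` may point along (1, 1) or (-1, -1); both describe the same axis.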

Lecture 25: Symmetric Matrices and Positive Definiteness

Symmetric matrices: Real eigenvalues, orthogonal eigenvectors
Positive definite: x^T A x > 0 for every nonzero x (bowl-shaped loss surfaces in optimization)
ML connection: Hessian matrices, convergence guarantees
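
One common way to test a symmetric matrix for positive definiteness is to attempt a Cholesky factorization, which `np.linalg.cholesky` refuses to complete when the matrix is not positive definite. The helper name and the two example matrices below are illustrative assumptions.

```python
import numpy as np

def is_positive_definite(A):
    """Check a symmetric matrix for positive definiteness via Cholesky."""
    if not np.allclose(A, A.T):
        return False               # the test only applies to symmetric matrices
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

bowl = np.array([[2.0, 0.0],
                 [0.0, 1.0]])      # Hessian of a bowl: curves up in every direction
saddle = np.array([[1.0, 0.0],
                   [0.0, -1.0]])   # Hessian of a saddle: up one way, down the other

bowl_is_pd = is_positive_definite(bowl)
saddle_is_pd = is_positive_definite(saddle)
```

Checking that all eigenvalues from `np.linalg.eigvalsh` are positive is an equivalent test and makes the link to curvature more explicit.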

Lecture 29: Singular Value Decomposition (SVD)

THE most important matrix factorization: A = UΣV^T
Works for ANY matrix (even non-square!)
Geometric meaning: Rotate, scale, rotate again (the two rotations are generally different, so the second does not simply undo the first)
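
A quick way to convince yourself that the SVD works for non-square matrices is to factor one and rebuild it. The 5x3 random matrix here is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))        # tall, non-square matrix

# Thin SVD: U is 5x3, S holds the 3 singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the columns of U by S is the same as U @ diag(S).
reconstructed = (U * S) @ Vt
```

The singular values come back sorted from largest to smallest, which is what makes truncation (keeping only the first k) so convenient later on.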

🎓 Strang 18.065: ML-Focused Deep Dive

Now jump to the ML-specific course. These lectures assume you know 18.06.

YouTube Playlist: https://www.youtube.com/playlist?list=PLUl4u3cNGP63oMNUHXqIUcrkS2PivhN3k

Full OCW Course Page: https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/

Lecture 1: Column Space Contains All Vectors Ax

Connecting column space to ML model capacity
What can your features represent?

Lecture 2: Multiplying and Factoring Matrices

LU, QR, and why factorization matters computationally
Real-world: Large-scale linear systems

Lecture 6: Singular Value Decomposition (SVD)

Computational algorithms for SVD
Full vs. truncated SVD
Applications: Compression, denoising, recommendations

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

In this lecture, Professor Strang reviews Principal Component Analysis (PCA), which is a major tool in understanding a matrix of data. In particular, he focuses on the Eckart-Young low rank approximation theorem.
PCA as maximizing variance
PCA as minimizing reconstruction error
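Connection to SVD: The principal components are the right singular vectors of the mean-centered data matrix, which are also the eigenvectors of the covariance matrix
You've used sklearn.decomposition.PCA; now you know what it does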
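
The two routes to PCA can be compared directly. This is a sketch on synthetic data (the mixing matrix is an arbitrary assumption), showing that the eigen-decomposition of the covariance matrix and the SVD of the centered data agree up to the sign of each direction.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing     # correlated synthetic features
Xc = X - X.mean(axis=0)                    # PCA requires mean-centered data
n = Xc.shape[0]

# Route 1: eigenvectors of the covariance matrix (reordered to descending).
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S ** 2 / (n - 1)

# Project onto the first two principal components.
scores_2d = Xc @ Vt[:2].T
```

The SVD route never forms the covariance matrix, which tends to be the numerically safer choice when features are nearly collinear.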

💻 Coding Checkpoint #2: SVD and PCA Applications

Project 1: Image Compression

# Load image as a grayscale array of floats
from PIL import Image
import numpy as np

img = np.array(Image.open('photo.jpg').convert('L'), dtype=float)

# Apply SVD (thin form: U is m x r, Vt is r x n)
U, S, Vt = np.linalg.svd(img, full_matrices=False)

# Compress: Keep only top k singular values
k = 50
compressed = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Calculate compression ratio and plot quality vs. size

Experiment: Plot PSNR vs. k. Where's the elbow?
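
For the PSNR experiment, the helpers below are one possible starting point. The function names are hypothetical, the storage count assumes you keep k columns of U, k singular values, and k rows of Vt, and a random array stands in for the image so the sketch runs without a file.

```python
import numpy as np

def psnr(original, approx, max_value=255.0):
    """Peak signal-to-noise ratio in dB (higher means closer to the original)."""
    diff = np.asarray(original, dtype=float) - np.asarray(approx, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def rank_k_approximation(U, S, Vt, k):
    """Rebuild the matrix from its top k singular triplets."""
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

def compression_ratio(shape, k):
    """Original value count divided by the values stored for the rank-k factors."""
    m, n = shape
    return (m * n) / (k * (m + n + 1))

# Synthetic stand-in for a grayscale image (an assumption for illustration).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 48)).astype(float)
U, S, Vt = np.linalg.svd(img, full_matrices=False)

ks = [1, 5, 10, 20, 40]
scores = [psnr(img, rank_k_approximation(U, S, Vt, k)) for k in ks]
ratios = [compression_ratio(img.shape, k) for k in ks]
```

Random noise has no low-rank structure, so the curve here rises slowly; a real photo should show a much sharper elbow.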

Project 2: Customer Segmentation with PCA

Dataset: Mall Customers
Steps:
1. Load customer data (age, income, spending score, etc.)
2. Standardize features
3. Apply PCA to reduce to 2D
4. Visualize customers in principal component space
5. Apply K-means clustering
6. Interpret: What do PC1 and PC2 represent?

Compare: Clustering in original space vs. PC space.

Project 3: Recommender System

Dataset: MovieLens 100K
Build matrix factorization from scratch using SVD
Predict missing ratings
Evaluate: RMSE on held-out test set
Compare with sklearn's NMF (Non-negative Matrix Factorization)

Week 5-6: Optimization and Gradient Methods

You need to understand how ML models actually learn.

🎓 Strang 18.065: Optimization Sequence

Lecture 4: Eigenvalues and Eigenvectors

Connecting eigenvalues to optimization
Condition number: Why some problems are hard
ML connection: Loss surface curvature

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

Best rank-k approximation (optimal!)
Matrix completion problem (Netflix)
Low-rank structure in real data
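
The Eckart-Young theorem makes a checkable prediction: the error of the truncated SVD is determined by the singular values you threw away. A minimal sketch on a random matrix (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]        # best rank-k approximation

# Spectral-norm error equals the first discarded singular value.
spectral_error = np.linalg.norm(A - A_k, ord=2)

# Frobenius-norm error equals the root-sum-of-squares of all discarded ones.
frobenius_error = np.linalg.norm(A - A_k, ord="fro")
predicted_frobenius = np.sqrt(np.sum(S[k:] ** 2))
```

This checks the error formula for the truncated SVD; the theorem's stronger claim is that no other rank-k matrix can do better in either norm.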

Lecture 21: Minimizing a Function Step by Step

Three terms of Taylor series
Downhill direction from first partial derivatives
Newton's method uses higher derivatives (Hessian)
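
On a quadratic function the difference between a gradient step and a Newton step is easy to see: Newton's method uses the Hessian and lands on the minimizer in one step. The matrix H, vector b, starting point, and step size below are arbitrary illustrative choices.

```python
import numpy as np

# f(w) = 0.5 * w^T H w - b^T w, with gradient H w - b and constant Hessian H.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
w0 = np.array([5.0, 5.0])

gradient = H @ w0 - b

# Gradient descent: a small step in the downhill direction.
w_gd = w0 - 0.1 * gradient

# Newton: rescale the gradient by the inverse Hessian (via a linear solve).
w_newton = w0 - np.linalg.solve(H, gradient)

minimizer = np.linalg.solve(H, b)           # where the gradient is zero
```

For non-quadratic losses Newton's method needs repeated steps, and forming the Hessian becomes expensive as the number of parameters grows, which is one reason first-order methods dominate deep learning.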

Lecture 22: Gradient Descent: Downhill to a Minimum

Intuition: Follow the slope downhill
Step size matters: Too large → divergence, too small → slow
Convex vs. non-convex landscapes
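
The step-size warning can be made precise on a quadratic bowl: gradient descent converges only when the learning rate stays below 2 divided by the largest eigenvalue of the Hessian. The sketch below uses H = diag(1, 10), an arbitrary example with threshold 0.2.

```python
import numpy as np

H = np.diag([1.0, 10.0])        # Hessian of f(w) = 0.5 * w^T H w; condition number 10

def run_gd(lr, steps=100):
    """Run plain gradient descent and return the final distance to the minimum at 0."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (H @ w)    # the gradient of f is H w
    return np.linalg.norm(w)

too_large = run_gd(lr=0.21)     # just above 2 / 10: the steep direction blows up
well_chosen = run_gd(lr=0.18)   # below the threshold: converges quickly
too_small = run_gd(lr=0.001)    # stable, but has barely moved after 100 steps
```

The larger the condition number, the narrower the gap between "too slow for the flat direction" and "unstable for the steep direction".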

Lecture 23: Accelerating Gradient Descent (Use Momentum)

Momentum: Use past gradients to accelerate
Adaptive learning rates
Why modern optimizers work so well
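
To see what momentum buys you, the sketch below compares plain gradient descent with heavy-ball momentum on an ill-conditioned quadratic. The Hessian diag(1, 100) is an arbitrary example, and the step size and momentum values are the textbook-optimal settings for a quadratic with those two eigenvalues; treat them as assumptions rather than general defaults.

```python
import numpy as np

H = np.diag([1.0, 100.0])       # condition number 100
mu, L = 1.0, 100.0              # smallest and largest eigenvalues

def heavy_ball(lr, beta, steps=100):
    """Gradient descent with momentum; beta = 0 recovers plain gradient descent."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros(2)
    for _ in range(steps):
        velocity = beta * velocity - lr * (H @ w)   # blend past step with new gradient
        w = w + velocity
    return np.linalg.norm(w)    # distance to the minimum at the origin

# Best fixed step size for plain gradient descent on this quadratic.
gd_error = heavy_ball(lr=2.0 / (L + mu), beta=0.0)

# Heavy-ball settings tuned to the same eigenvalues.
sqrt_mu, sqrt_L = np.sqrt(mu), np.sqrt(L)
momentum_error = heavy_ball(
    lr=4.0 / (sqrt_L + sqrt_mu) ** 2,
    beta=((sqrt_L - sqrt_mu) / (sqrt_L + sqrt_mu)) ** 2,
)
```

After the same 100 steps, the momentum run ends up several orders of magnitude closer to the minimum than the plain run.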

Lecture 25: Stochastic Gradient Descent

Why mini-batches? Computational efficiency, and the gradient noise can help escape poor local minima and saddle points
SGD vs. full batch gradient descent
This is how neural networks train
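
Here is a minimal sketch of mini-batch SGD for least-squares regression on synthetic data. The function name `sgd_sketch`, the hyperparameters, and the data-generating weights are all assumptions chosen for illustration.

```python
import numpy as np

def sgd_sketch(X, y, lr=0.05, epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD on mean squared error; reshuffles the data every epoch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)                  # new random order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient estimated from this batch only.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * data_rng.normal(size=512)

w_sgd = sgd_sketch(X, y)
```

Each update touches only 32 rows instead of all 512, so the cost per step is much lower, at the price of a noisier path toward the solution.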

💻 Coding Checkpoint #3: Optimization from Scratch

Build gradient descent variants:

import numpy as np

# 1. Vanilla gradient descent
def gradient_descent(X, y, lr=0.01, epochs=1000):
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        # Gradient of the mean squared error (1/2n)||Xw - y||^2
        gradient = X.T @ (X @ w - y) / len(y)
        w -= lr * gradient
    return w

# 2. SGD with mini-batches
def sgd(X, y, lr=0.01, epochs=100, batch_size=32):
    # Implement this
    pass

# 3. Momentum
def momentum_gd(X, y, lr=0.01, momentum=0.9, epochs=1000):
    # Implement this
    pass

# 4. Adam (simplified)
def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, epochs=1000):
    # Implement this
    pass

Test on quadratic functions first, then linear regression, then logistic regression.

Visualization Project:

Create animated GIF showing gradient descent paths
Compare different optimizers on the same loss surface
Use contour plots to show convergence

Week 7-8: Neural Networks and Modern ML

Time to understand deep learning.

🎓 Strang 18.065: Neural Network Lectures

Lecture 26: Structure of Neural Nets for Deep Learning

Neural networks = composition of linear transformations + nonlinearities
Layers of nodes with weight matrices between layers
Nonlinear activation (ReLU): Negative values become zero
How the learning function is constructed from training data

Lecture 27: Backpropagation: Find Partial Derivatives

Backprop = chain rule applied to computation graph
Why it's called "backpropagation": Reverse mode from output to input
How PyTorch and TensorFlow implement autograd
Forward pass: compute outputs; Backward pass: compute gradients
Key step: Backprop + stochastic gradient descent
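
The forward/backward structure is easiest to trust once you have checked it against finite differences. Below is a sketch for a tiny two-layer network with one ReLU hidden layer and a squared-error loss; the layer sizes, the loss, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
target = np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def loss_fn(W1, b1, W2, b2):
    hidden = np.maximum(0.0, W1 @ x + b1)
    output = W2 @ hidden + b2
    return 0.5 * np.sum((output - target) ** 2)

# Forward pass: compute outputs and keep the intermediates.
z1 = W1 @ x + b1
h = np.maximum(0.0, z1)
out = W2 @ h + b2

# Backward pass: chain rule from the output back toward the input.
d_out = out - target                 # dLoss/d(out)
grad_W2 = np.outer(d_out, h)
grad_b2 = d_out
d_h = W2.T @ d_out                   # push the gradient through W2
d_z1 = d_h * (z1 > 0)                # ReLU passes gradient only where z1 > 0
grad_W1 = np.outer(d_z1, x)
grad_b1 = d_z1

# Finite-difference check of every entry of grad_W1.
eps = 1e-6
numeric_grad_W1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        bump = np.zeros_like(W1)
        bump[i, j] = eps
        numeric_grad_W1[i, j] = (
            loss_fn(W1 + bump, b1, W2, b2) - loss_fn(W1 - bump, b1, W2, b2)
        ) / (2 * eps)
```

The backward pass costs about as much as one forward pass, whereas the finite-difference check needs two forward passes per parameter, which is why it is only useful as a test.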

Lecture 32: ImageNet is a CNN, The Convolution Rule

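ImageNet: A large labeled image dataset; the lecture's example is a convolutional neural network (CNN) trained on it
Convolution as matrix multiplication (circulant/Toeplitz structure)
Convolution matrices have ≤ n parameters (not n²): Fewer weights to compute
Convolution Rule: F(c ⊛ d) = (Fc) .* (Fd), where F is the Fourier matrix, ⊛ is cyclic convolution, and .* is componentwise multiplication
Why CNNs for images: shift invariance (the same filter weights are reused at every position)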
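
The convolution rule and the circulant structure can both be verified in a few lines. The vectors c and d below are arbitrary examples; `np.fft.fft` plays the role of multiplying by the Fourier matrix.

```python
import numpy as np

c = np.array([1.0, 2.0, 0.0, -1.0])
d = np.array([3.0, 0.0, 1.0, 2.0])
n = len(c)

# Circulant matrix built from c: column j is c cyclically shifted down by j.
# It has only n free parameters, even though it is an n x n matrix.
C = np.column_stack([np.roll(c, j) for j in range(n)])

# Cyclic convolution as a matrix-vector product.
cyclic_conv = C @ d

# Convolution rule: transform both, multiply componentwise, transform back.
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(d)).real

# Shifting the input shifts the output by the same amount.
shifted_output = C @ np.roll(d, 1)
```

The last line is the shift property in miniature: the same weights respond the same way wherever the pattern appears in the input.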

Lecture 33: Neural Nets and the Learning Function

Construction of the learning function: each layer computes F₁(v) = ReLU(A₁v + b₁), and the full F(v) chains these layers together
Universal approximation through layer composition
Optimizing weights (A's and b's) to minimize loss function
How many parameters (weights and biases) are needed?
Memory vs. computation tradeoffs in network architecture
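
To get a feel for "how many parameters", you can count them directly: each fully connected layer has one weight per input-output pair plus one bias per output. The layer sizes below are a hypothetical MNIST-style architecture chosen for illustration.

```python
def count_parameters(layer_sizes):
    """Total weights and biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out                     # weight matrix + bias vector
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# 784 inputs (28x28 pixels), two hidden layers, 10 output classes.
layer_sizes = [784, 128, 64, 10]
total = count_parameters(layer_sizes)            # 100480 + 8256 + 650
first_layer = count_parameters(layer_sizes[:2])
```

Nearly all of the parameters sit in the first layer, because that is where the widest input meets a full weight matrix.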

💻 Final Project: Build a Neural Network from Scratch

Project: Multi-layer Perceptron for MNIST

Implement without any ML frameworks (just NumPy):

class NeuralNetwork:
    def __init__(self, layer_sizes):
        # Initialize weights and biases
        pass
    
    def forward(self, X):
        # Forward pass through all layers
        # Store activations for backprop
        pass
    
    def backward(self, X, y):
        # Compute gradients via backprop
        pass
    
    def train(self, X, y, epochs, lr, batch_size):
        # Mini-batch SGD training loop
        pass
    
    def predict(self, X):
        return self.forward(X)

Requirements:

2-3 hidden layers
ReLU activation
Softmax output
Cross-entropy loss
Mini-batch SGD
Achieve >95% accuracy on MNIST
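
Two of these requirements, softmax output and cross-entropy loss, are where from-scratch implementations most often break numerically. Below is one possible sketch; the function names are hypothetical, labels are assumed to be integer class indices, and the small constant inside the log is a guard against log(0).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def softmax_cross_entropy_grad(probs, labels):
    """Gradient of the mean loss with respect to the logits: (probs - one_hot) / n."""
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    return grad / len(labels)

logits = np.array([[1000.0, 0.0, -1000.0],      # extreme values on purpose
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = cross_entropy(probs, labels)
grad = softmax_cross_entropy_grad(probs, labels)
```

The combined gradient is simply "predicted probabilities minus one-hot targets", which is what makes the last layer of backprop so compact.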

Then: Implement the same network in PyTorch. Compare:

Code complexity
Training speed
Ease of experimentation

Understand: Why frameworks exist, but also what they're doing under the hood.

Week 9-10: Advanced Topics & Specialization

Choose based on your interests.

Track A: Natural Language Processing

Focus: Embeddings, attention, transformers

Project:

Word embeddings: Implement word2vec from scratch (skip-gram model)
Document similarity using cosine distance in embedding space
Visualize embeddings with t-SNE or PCA

Watch:

1. Stanford CS224N (Winter 2021) - Lectures 1 & 2

Lecture 1: Introduction and Word Vectors

Lecture 2: Neural Classifiers

Covers skip-gram implementation details, optimization, and neural classifiers

These lectures give you the complete mathematical foundation for implementing word2vec from scratch, including gradient computations for backprop through skip-gram.

2. StatQuest: Word Embedding and Word2Vec, Clearly Explained

YouTube: https://www.youtube.com/watch?v=viZrOnJclY0
Duration: ~20 minutes
Perfect visual intuition of how word2vec actually works - complements Stanford's mathematical rigor with clear diagrams

3. StatQuest: t-SNE, Clearly Explained

YouTube: https://www.youtube.com/watch?v=NEaUSP4YerM
Duration: ~11 minutes
Essential for the visualization component of your project - explains exactly how t-SNE preserves clustering when reducing dimensions

These three cover everything you need: the theory and math (Stanford), the intuitive understanding (Word2Vec video), and the visualization technique (t-SNE video). Total watch time is manageable at ~3-4 hours for the core content.

Track B: Computer Vision

Focus: Convolutional architectures, image processing

Project:

Implement basic CNN in PyTorch
Train on CIFAR-10 or Fashion-MNIST
Experiment with architecture: depth, filters, pooling
Visualize: What do convolutional filters learn?

Application:

Image classification
Transfer learning (use pretrained ResNet)
Style transfer (bonus: implement neural style transfer)

Watch:

Track C: Recommender Systems

Focus: Matrix factorization, collaborative filtering

Project: Advanced Recommender System

Dataset: MovieLens 1M (larger)
Techniques:
1. SVD-based collaborative filtering
2. Non-negative matrix factorization (NMF)
3. Neural collaborative filtering
Metrics: RMSE, precision@k, recall@k
Business application: Cold start problem, diversity vs. accuracy
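
RMSE measures rating accuracy, while precision@k and recall@k measure the quality of the top of the ranked list. One common way to define them is sketched below; the function name and the item IDs are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: item IDs ranked best-first
    relevant: set of item IDs the user actually liked in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}

p_at_3, r_at_3 = precision_recall_at_k(recommended, relevant, k=3)
p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, k=5)
```

In an evaluation you would average these over all test users. You also need to decide what counts as "relevant" (for example, ratings of 4 or higher), and that threshold is a modeling choice worth reporting.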

Reading: "Matrix Factorization Techniques for Recommender Systems" (Koren et al., 2009)

📚 Complete Resource Library

Course Homepages:

Textbooks (Free PDFs):

Supplementary:

Khan Academy: Linear Algebra

💡 Success Strategies

Time Management (5-6 hours/week):

2.5 hours: Lectures (evenings, 1.5x speed)
1.5 hours: Coding projects (weekend mornings)
1.5 hours: Practice problems (broken into 30-min sessions)

Learning Techniques:

Code immediately: After every major concept, code an example
Teach it: Explain to colleague or write blog post
Connect to work: How does this apply to your current job?
Build portfolio: Put projects on GitHub

Common Pitfalls:

❌ Watching passively without implementing
❌ Trying to understand every proof (you don't need to!)
❌ Skipping the coding checkpoints
❌ Not connecting concepts to practical ML

Career Applications:

Data Analyst → Data Scientist: You now understand the models you use
Software Engineer → ML Engineer: You can implement and debug ML systems
Product Manager: You can have technical conversations with ML teams
Consultant: You can assess ML solutions and vendors

Next Steps After Completion:

Andrew Ng's Deep Learning Specialization (Coursera)
Fast.ai course (practical deep learning)
Kaggle competitions (apply your knowledge)
Read ML papers (you can now understand the math!)

You're building the foundation to transition into ML roles or level up in your current position. The combination of theoretical understanding and practical skills is exactly what employers want. 🎯