Linear Algebra for AI: Path 2 – Rapid Learner
Fast-track linear algebra for ML when you've seen the basics before but need to connect it all to machine learning.
Path 2: Beatriz (Rapid Learner)
Meet Beatriz: You took calculus and maybe linear algebra in college, but that was 3-5 years ago and you've forgotten most of it. You're working as a data analyst, software engineer, or product manager. You're comfortable with Python, SQL, and basic statistics. You've used scikit-learn but want to actually understand what's happening. You don't need to prove theorems, but you want solid intuition and practical skills.
Core Philosophy: Rapid review of foundations, then jump straight to ML applications. Balance theory and practice. Emphasis on "when to use what."
Week 1-2: Rapid Foundation Review
You need to reactivate dormant knowledge quickly.
🎥 3Blue1Brown: Essence of Linear Algebra (Full Series)
Watch all 16 videos: Complete Playlist
Why start here: Even though you've seen this before, the geometric intuition will be new and will drastically improve your understanding.
Binge-watch strategy:
- Videos 1-8: Foundational (pause and code examples)
- Videos 9-12: More advanced (focus on eigenvectors, dot products)
- Videos 13-16: Abstract concepts (watch for intuition, don't stress details)
Active watching:
- Pause after each major concept
- Implement in NumPy immediately
- Create your own examples
Key concepts to cement:
- Linear transformations as matrices
- Matrix multiplication as composition
- Determinant as scaling factor
- Eigenvectors as stable directions
- Dot products as projections
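Each of the concepts above can be checked numerically in a few lines. The following is a minimal sketch; the rotation, shear, and scaling matrices are assumed examples chosen for illustration, not taken from the videos.

```python
import numpy as np

# Illustrative 2x2 transformations (assumed examples)
rotate = np.array([[0.0, -1.0],
                   [1.0,  0.0]])   # 90-degree counterclockwise rotation
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])     # horizontal shear
scale = np.array([[3.0, 0.0],
                  [0.0, 2.0]])     # stretch x by 3, y by 2

v = np.array([2.0, 1.0])

# Matrix multiplication as composition: shear first, then rotate
step_by_step = rotate @ (shear @ v)
composed = (rotate @ shear) @ v

# Determinant as the factor by which areas are scaled
area_factor = np.linalg.det(scale)

# Eigenvectors are directions the transformation only stretches
eigvals, eigvecs = np.linalg.eig(scale)
first = eigvecs[:, 0]
stretched = scale @ first

# Dot product with a unit vector gives the signed projection length
unit_x = np.array([1.0, 0.0])
projection_length = v @ unit_x
```

Swapping in your own matrices and re-running the checks is a quick way to create the "own examples" suggested above.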
🎓 Strang 18.06: Rapid Core Sequence (Lectures 1-10, 14-16)
You're watching for refresh and depth, not learning from scratch.
Week 1 Focus: Systems and Spaces
Lecture 1: The Geometry of Linear Equations
- Three views of Ax=b: You saw this in college, now it'll actually make sense
- Code along: Solve 3x3 system both geometrically and algebraically
Lecture 2: Elimination with Matrices
- Gaussian elimination: The algorithm you learned but probably forgot
- Implement: Basic elimination with NumPy
Lecture 3: Multiplication and Inverse Matrices
- Five ways to multiply: Focus on column and row pictures
- When does inverse exist? Connect to data science (singular matrices = collinear features)
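To see why collinear features make a matrix singular, here is a small sketch on synthetic data. The feature names and the exact linear relation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2.0 * x1 - x2              # exactly collinear with x1 and x2

X = np.column_stack([x1, x2, x3])
gram = X.T @ X                  # the matrix the normal equations would invert

rank = np.linalg.matrix_rank(X)     # 2, not 3: one column is redundant
cond = np.linalg.cond(gram)         # enormous: gram is numerically singular
```

In practice features are rarely exactly collinear, but a very large condition number is a warning sign that the inverse is unreliable.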
Lecture 5: Transposes, Permutations, Vector Spaces
- Transpose: You use this constantly in ML without thinking about it
- Symmetric matrices: Covariance matrices, gram matrices
Lecture 6: Column Space and Nullspace
- Critical: Column space = range of your model, nullspace = redundant features
- ML connection: Feature selection, dimensionality reduction
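One way to compute bases for both spaces is through the SVD. This is a sketch under the assumption of a toy matrix whose third column is the sum of the first two; the tolerance rule is one common choice, not the only one.

```python
import numpy as np

# Third column = first column + second column (assumed toy example)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 2.0]])

U, S, Vt = np.linalg.svd(A)

# Singular values below a small tolerance are treated as zero
tol = S.max() * max(A.shape) * np.finfo(float).eps
rank = int((S > tol).sum())

col_basis = U[:, :rank]       # orthonormal basis for the column space
null_basis = Vt[rank:].T      # orthonormal basis for the nullspace
```

Here the single nullspace vector is proportional to (1, 1, -1), which encodes the redundancy: column 1 + column 2 - column 3 = 0.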
Week 2 Focus: Subspaces and Projections
Lecture 9: Independence, Basis, Dimension
- Linear independence: Which features actually add information?
- Dimension: The true degrees of freedom in your data
Lecture 10: The Four Fundamental Subspaces
- The Big Picture: Every matrix has four fundamental spaces
- ML insight: Understanding these explains model behavior
Lecture 14: Orthogonal Vectors and Subspaces
- Orthogonality: Uncorrelated features, perpendicular directions
- Gram-Schmidt: How to orthogonalize features
Lecture 15: Projections onto Subspaces
- The geometry of regression: Projection is least squares
- P = A(A^T A)^(-1)A^T: Most important formula in ML
Lecture 16: Projection Matrices and Least Squares
- Normal equations: A^T Ax = A^T b
- Connect to scikit-learn: What LinearRegression actually does
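The projection formula and the normal equations can be verified directly. The sketch below uses synthetic data (the intercept, slope, and noise level are assumptions) and checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.column_stack([np.ones(20), rng.normal(size=20)])  # intercept + one feature
b = 3.0 + 2.0 * A[:, 1] + rng.normal(scale=0.1, size=20)

# Normal equations: solve (A^T A) x = A^T b without forming an inverse
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix onto the column space of A: P = A (A^T A)^{-1} A^T
P = A @ np.linalg.solve(A.T @ A, A.T)
b_proj = P @ b                 # fitted values
residual = b - b_proj          # orthogonal to every column of A

# Reference answer from NumPy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The two properties worth checking are that `P @ P` equals `P` (projecting twice changes nothing) and that `A.T @ residual` is zero (the error is perpendicular to the column space).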
💻 Coding Checkpoint #1: Implement Core Algorithms
Build these from scratch with NumPy:
1. Linear Regression (Three Ways)
import numpy as np

# Method 1: Normal equations (solve the system instead of forming the inverse)
def linear_regression_normal(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Method 2: QR decomposition (more stable)
def linear_regression_qr(X, y):
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)

# Method 3: Gradient descent
def linear_regression_gd(X, y, lr=0.01, epochs=1000):
    # Implement this yourself
    pass

Compare all three on real data. Understand when each is preferred.
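One reason QR is often preferred is conditioning: forming X^T X roughly squares the condition number of X. The sketch below illustrates this on polynomial features, an assumed example chosen because it is badly conditioned.

```python
import numpy as np

# Polynomial features 1, t, ..., t^7 on [0, 1] (assumed example)
t = np.linspace(0.0, 1.0, 200)
X = np.vander(t, 8, increasing=True)
w_true = np.ones(8)
y = X @ w_true                     # noiseless targets, so w_true is recoverable

cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)    # roughly cond_X squared

# Normal equations
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR: solve R w = Q^T y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

err_normal = np.linalg.norm(w_normal - w_true)
err_qr = np.linalg.norm(w_qr - w_true)
```

Printing `err_normal` and `err_qr` side by side shows how much accuracy each method keeps on this problem. On well-conditioned data the difference is usually negligible, and the normal equations are cheaper when there are many more rows than columns.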
2. Gram-Schmidt Orthogonalization
def gram_schmidt(A):
    # Orthogonalize columns of A
    # Return Q (orthonormal) and R (upper triangular)
    pass

Test on random matrices. Verify Q^T Q = I.
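If you get stuck, here is one possible sketch to compare against. It uses the modified Gram-Schmidt variant (a swapped-in choice, since it is numerically better behaved than the classical version) and assumes the columns of A are linearly independent. The name `gram_schmidt_sketch` is hypothetical.

```python
import numpy as np

def gram_schmidt_sketch(A):
    """Modified Gram-Schmidt QR; assumes linearly independent columns."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ v      # component along an earlier direction
            v -= R[i, j] * Q[:, i]     # subtract it off
        R[j, j] = np.linalg.norm(v)    # length of what remains
        Q[:, j] = v / R[j, j]          # normalize
    return Q, R

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))
Q, R = gram_schmidt_sketch(A)
```

For production work `np.linalg.qr` (Householder-based) is the better choice; the point here is only to see the projections being subtracted one at a time.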
Week 3-4: Eigenvalues and Matrix Factorizations
This is where linear algebra becomes powerful for ML.
🎓 Strang 18.06: Eigenvalue Lectures
Lecture 21: Eigenvalues and Eigenvectors
- Ax = λx: The most important equation in linear algebra
- ML connection: PCA finds eigenvectors of the covariance matrix
After watching: Compute eigenvectors of a 2x2 covariance matrix by hand. Then with NumPy. Visualize the eigenvectors overlaid on your data.
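A sketch of the NumPy half of this exercise, on synthetic correlated data (the mixing matrix is an assumption). It also checks the result against the closed-form 2x2 eigenvalue formula you would use by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic correlated 2-D data (assumed example)
z = rng.normal(size=(500, 2))
data = z @ np.array([[2.0, 0.0],
                     [1.2, 0.5]])

cov = np.cov(data, rowvar=False)           # 2x2 symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices, ascending order

# By hand for a 2x2 matrix: lambda = (trace +/- sqrt(trace^2 - 4 det)) / 2
tr = np.trace(cov)
det = np.linalg.det(cov)
disc = np.sqrt(tr ** 2 - 4.0 * det)
by_hand = np.array([(tr - disc) / 2.0, (tr + disc) / 2.0])
```

To visualize, scatter-plot `data` and draw each column of `eigvecs` from the data mean, scaled by the square root of its eigenvalue.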
Lecture 25: Symmetric Matrices and Positive Definiteness
- Symmetric matrices: Real eigenvalues, orthogonal eigenvectors
- Positive definite: x^T A x > 0 (loss surfaces in optimization)
- ML connection: Hessian matrices, convergence guarantees
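A common practical test for positive definiteness is to attempt a Cholesky factorization. The sketch below assumes the input is already symmetric; the helper name and the two example matrices are hypothetical.

```python
import numpy as np

def is_positive_definite(A):
    """True if the symmetric matrix A is positive definite.

    Assumes A is symmetric; np.linalg.cholesky does not verify this.
    """
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

bowl = np.array([[2.0, 0.5],
                 [0.5, 1.0]])      # Hessian of a bowl-shaped quadratic
saddle = np.array([[1.0,  0.0],
                   [0.0, -1.0]])   # Hessian of a saddle

bowl_eigs = np.linalg.eigvalsh(bowl)       # all positive
saddle_eigs = np.linalg.eigvalsh(saddle)   # mixed signs
```

The eigenvalue view and the Cholesky view agree: a symmetric matrix is positive definite exactly when all of its eigenvalues are positive.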
Lecture 29: Singular Value Decomposition (SVD)
- THE most important matrix factorization: A = UΣV^T
- Works for ANY matrix (even non-square!)
- Geometric meaning: Rotate (V^T), scale along the axes (Σ), rotate again (U); the two rotations are generally different
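A quick sketch showing the factorization on a non-square matrix (a random 5x3 matrix, chosen only for illustration), together with the link back to eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(5, 3))            # non-square on purpose

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # U: 5x3, S: (3,), Vt: 3x3
reconstructed = U @ np.diag(S) @ Vt

# Squared singular values are the eigenvalues of A^T A (sorted descending here)
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]
```

The columns of `U` and the rows of `Vt` are orthonormal, and `S` is returned in descending order.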
🎓 Strang 18.065: ML-Focused Deep Dive
Now jump to the ML-specific course. These lectures assume you know 18.06.
YouTube Playlist: https://www.youtube.com/playlist?list=PLUl4u3cNGP63oMNUHXqIUcrkS2PivhN3k
Full OCW Course Page: https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/
Lecture 1: Column Space Contains All Vectors Ax
- Connecting column space to ML model capacity
- What can your features represent?
Lecture 2: Multiplying and Factoring Matrices
- LU, QR, and why factorization matters computationally
- Real-world: Large-scale linear systems
Lecture 6: Singular Value Decomposition (SVD)
- Computational algorithms for SVD
- Full vs. truncated SVD
- Applications: Compression, denoising, recommendations
Lecture 7: Eckart-Young: The Closest Rank k Matrix to A
- In this lecture, Professor Strang reviews Principal Component Analysis (PCA), which is a major tool in understanding a matrix of data. In particular, he focuses on the Eckart-Young low rank approximation theorem.
- PCA as maximizing variance
- PCA as minimizing reconstruction error
- Connection to SVD: The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix
- You've used sklearn.decomposition.PCA—now you know what it does
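The relationships in this lecture can be confirmed numerically. This is a sketch on synthetic data (the shapes and the choice k = 2 are assumptions) showing that PCA variances come from the singular values and that the truncated SVD attains the Eckart-Young error.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(axis=0)                                   # center each column
n = Xc.shape[0]

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# PCA via the covariance matrix: eigenvalues in descending order
cov = Xc.T @ Xc / (n - 1)
cov_eigvals = np.linalg.eigvalsh(cov)[::-1]

# Best rank-k approximation (Eckart-Young)
k = 2
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]
spectral_error = np.linalg.norm(Xc - X_k, 2)
frobenius_error = np.linalg.norm(Xc - X_k, "fro")
```

The spectral-norm error equals the first discarded singular value `S[k]`, and the Frobenius error equals the root sum of squares of all discarded singular values.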
💻 Coding Checkpoint #2: SVD and PCA Applications
Project 1: Image Compression
# Load image as a grayscale float array
from PIL import Image
import numpy as np
img = np.array(Image.open('photo.jpg').convert('L'), dtype=float)
# Apply SVD
U, S, Vt = np.linalg.svd(img, full_matrices=False)
# Compress: Keep only top k singular values
k = 50
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
# Calculate compression ratio and plot quality vs. size

Experiment: Plot PSNR vs. k. Where's the elbow?
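For the experiment, the following sketch defines PSNR and a storage-based compression ratio. The helper names are hypothetical, the ratio assumes you store the k columns of U, k singular values, and k rows of V^T, and a synthetic image stands in for `photo.jpg` so the block runs on its own.

```python
import numpy as np

def rank_k_approx(U, S, Vt, k):
    """Rebuild the image from the top k singular triplets."""
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

def psnr(original, approx, peak=255.0):
    """Peak signal-to-noise ratio in dB (higher means closer to the original)."""
    mse = np.mean((original - approx) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def compression_ratio(shape, k):
    """Original value count divided by values stored for a rank-k factorization."""
    m, n = shape
    return (m * n) / (k * (m + n + 1))

# Synthetic grayscale "image" (stand-in for a real photo)
rng = np.random.default_rng(7)
rows, cols = np.mgrid[0:128, 0:96]
img = 127.5 + 100.0 * np.sin(rows / 9.0) * np.cos(cols / 7.0)
img = np.clip(img + rng.normal(scale=5.0, size=img.shape), 0.0, 255.0)

U, S, Vt = np.linalg.svd(img, full_matrices=False)
scores = {k: psnr(img, rank_k_approx(U, S, Vt, k)) for k in (1, 5, 20, 50)}
```

Plotting `scores` against k (and against `compression_ratio(img.shape, k)`) gives the quality-versus-size curve. PSNR always rises with k, so the interesting part is where the gains flatten out.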
Project 2: Customer Segmentation with PCA
- Dataset: Mall Customers
- Steps:
- Load customer data (age, income, spending score, etc.)
- Standardize features
- Apply PCA to reduce to 2D
- Visualize customers in principal component space
- Apply K-means clustering
- Interpret: What do PC1 and PC2 represent?
Compare: Clustering in original space vs. PC space.
Project 3: Recommender System
- Dataset: MovieLens 100K
- Build matrix factorization from scratch using SVD
- Predict missing ratings
- Evaluate: RMSE on held-out test set
- Compare with sklearn's NMF (Non-negative Matrix Factorization)
Week 5-6: Optimization and Gradient Methods
You need to understand how ML models actually learn.
🎓 Strang 18.065: Optimization Sequence
Lecture 4: Eigenvalues and Eigenvectors
- Connecting eigenvalues to optimization
- Condition number: Why some problems are hard
- ML connection: Loss surface curvature
Lecture 7: Eckart-Young: The Closest Rank k Matrix to A
- Best rank-k approximation (optimal!)
- Matrix completion problem (Netflix)
- Low-rank structure in real data
Lecture 21: Minimizing a Function Step by Step
- Three terms of Taylor series
- Downhill direction from first partial derivatives
- Newton's method uses higher derivatives (Hessian)
Lecture 22: Gradient Descent: Downhill to a Minimum
- Intuition: Follow the slope downhill
- Step size matters: Too large → divergence, too small → slow
- Convex vs. non-convex landscapes
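The step-size behavior is easy to reproduce on a quadratic f(x) = 0.5 x^T A x, where gradient descent converges only if the learning rate is below 2 / λ_max. The matrix and learning rates below are assumed values picked to show the three regimes.

```python
import numpy as np

A = np.diag([1.0, 25.0])               # condition number 25 (assumed example)
lam_max = np.linalg.eigvalsh(A).max()

def run_gd(lr, steps=200):
    """Run gradient descent on f(x) = 0.5 x^T A x and return the final distance to the minimum."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (A @ x)           # the gradient of f is A x
    return np.linalg.norm(x)

too_large = run_gd(lr=2.1 / lam_max)    # just past the stability limit: diverges
reasonable = run_gd(lr=1.0 / lam_max)   # converges
too_small = run_gd(lr=0.01 / lam_max)   # stable but has barely moved
```

Even the "reasonable" rate is limited by the steep direction (eigenvalue 25) while progress along the flat direction (eigenvalue 1) stays slow. That gap is what the condition number measures.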
Lecture 23: Accelerating Gradient Descent (Use Momentum)
- Momentum: Use past gradients to accelerate
- Adaptive learning rates
- Why modern optimizers work so well
Lecture 25: Stochastic Gradient Descent
- Why mini-batches? Computational efficiency + noise helps escape local minima
- SGD vs. full batch gradient descent
- This is how neural networks train
💻 Coding Checkpoint #3: Optimization from Scratch
Build gradient descent variants:
import numpy as np

# 1. Vanilla gradient descent
def gradient_descent(X, y, lr=0.01, epochs=1000):
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        gradient = X.T @ (X @ w - y) / len(y)
        w -= lr * gradient
    return w

# 2. SGD with mini-batches
def sgd(X, y, lr=0.01, epochs=100, batch_size=32):
    # Implement this
    pass

# 3. Momentum
def momentum_gd(X, y, lr=0.01, momentum=0.9, epochs=1000):
    # Implement this
    pass

# 4. Adam (simplified)
def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, epochs=1000):
    # Implement this
    pass

Test on quadratic functions first, then linear regression, then logistic regression.
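For reference after you have tried the exercises, here is one possible sketch of the mini-batch SGD and momentum variants for least-squares regression. The function names, the `seed` parameter, and the noiseless test data are assumptions; the momentum update shown is the classical heavy-ball form, one of several conventions in use.

```python
import numpy as np

def sgd_sketch(X, y, lr=0.01, epochs=100, batch_size=32, seed=0):
    """Mini-batch SGD: reshuffle each epoch, step on one batch at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def momentum_sketch(X, y, lr=0.01, momentum=0.9, epochs=1000):
    """Full-batch gradient descent with heavy-ball momentum."""
    w = np.zeros(X.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        velocity = momentum * velocity - lr * grad   # decaying sum of past gradients
        w += velocity
    return w

# Noiseless synthetic regression problem, so the true weights are recoverable
rng = np.random.default_rng(8)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_sgd = sgd_sketch(X, y, lr=0.05)
w_mom = momentum_sketch(X, y)
```

Adam is left as the remaining exercise; it extends the momentum idea with a per-coordinate running average of squared gradients.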
Visualization Project:
- Create animated GIF showing gradient descent paths
- Compare different optimizers on the same loss surface
- Use contour plots to show convergence
Week 7-8: Neural Networks and Modern ML
Time to understand deep learning.
🎓 Strang 18.065: Neural Network Lectures
Lecture 26: Structure of Neural Nets for Deep Learning
- Neural networks = composition of linear transformations + nonlinearities
- Layers of nodes with weight matrices between layers
- Nonlinear activation (ReLU): Negative values become zero
- How the learning function is constructed from training data
Lecture 27: Backpropagation: Find Partial Derivatives
- Backprop = chain rule applied to computation graph
- Why it's called "backpropagation": Reverse mode from output to input
- How PyTorch and TensorFlow implement autograd
- Forward pass: compute outputs; Backward pass: compute gradients
- Key step: Backprop + stochastic gradient descent
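The forward and backward passes can be written out by hand for a tiny network. This sketch assumes one hidden ReLU layer and a mean-squared-error loss (both choices made here for brevity) and checks one gradient entry against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features
y = rng.normal(size=(5, 1))
W1 = rng.normal(size=(3, 4))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

def loss_fn(W1, b1, W2, b2):
    """Forward pass only: 0.5 * mean squared error."""
    h = np.maximum(0.0, X @ W1 + b1)
    pred = h @ W2 + b2
    return 0.5 * np.mean((pred - y) ** 2)

# Forward pass, keeping intermediates for the backward pass
z1 = X @ W1 + b1
h = np.maximum(0.0, z1)
pred = h @ W2 + b2

# Backward pass: chain rule applied from the output back to the input
d_pred = (pred - y) / len(y)       # dL/dpred
dW2 = h.T @ d_pred
db2 = d_pred.sum(axis=0)
d_h = d_pred @ W2.T
d_z1 = d_h * (z1 > 0)              # ReLU passes gradient only where z1 > 0
dW1 = X.T @ d_z1
db1 = d_z1.sum(axis=0)

# Finite-difference check on a single weight
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2) - loss_fn(W1_minus, b1, W2, b2)) / (2 * eps)
```

A gradient check like this is worth keeping around when you build the full network later: it catches most indexing and transpose mistakes.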
Lecture 32: ImageNet is a CNN, The Convolution Rule
- ImageNet: Large image database used to train and benchmark convolutional neural networks (CNNs)
- Convolution as matrix multiplication (circulant/Toeplitz structure)
- Convolution matrices have ≤ n parameters (not n²): Fewer weights to compute
- Convolution Rule: F(c*d) = (Fc) times (Fd) component by component, where F is the Fourier matrix and * is cyclic convolution
- Why CNNs for images: translation invariance
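The circulant structure and the convolution rule can both be checked with small vectors. The vectors below are arbitrary examples; the convolution is cyclic, which is the case the rule covers.

```python
import numpy as np

c = np.array([1.0, 2.0, 0.0, -1.0])
d = np.array([3.0, 0.0, 1.0, 2.0])
n = len(c)

# Circulant matrix: column j is c cyclically shifted down by j.
# It has n^2 entries but only n free parameters.
C = np.column_stack([np.roll(c, j) for j in range(n)])
cyclic_conv = C @ d                      # cyclic convolution as a matrix product

# Convolution rule: transform, multiply component by component, transform back
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(d)).real
```

The FFT route costs O(n log n) instead of O(n^2), which is why the rule matters computationally.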
Lecture 33: Neural Nets and the Learning Function
- Construction of learning function F(v) = ReLU(A₁v + b₁)
- Universal approximation through layer composition
- Optimizing weights (A's and b's) to minimize loss function
- How many parameters (weights and biases) are needed?
- Memory vs. computation tradeoffs in network architecture
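To make the parameter question concrete, here is a sketch that counts weights and biases and evaluates a learning function built from composed layers. The layer sizes (an MNIST-sized network) and both helper names are assumptions for illustration.

```python
import numpy as np

def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)

def learning_function(v, weights, biases):
    """Compose ReLU(A v + b) over the hidden layers, then a final linear layer."""
    for A, b in zip(weights[:-1], biases[:-1]):
        v = np.maximum(0.0, A @ v + b)
    return weights[-1] @ v + biases[-1]

layer_sizes = [784, 128, 64, 10]       # assumed architecture
rng = np.random.default_rng(10)
weights = [rng.normal(scale=0.01, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

total = count_parameters(layer_sizes)
output = learning_function(rng.normal(size=784), weights, biases)
```

Almost all of the parameters sit in the first weight matrix (784 x 128), which is typical for fully connected networks on image inputs.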
💻 Final Project: Build a Neural Network from Scratch
Project: Multi-layer Perceptron for MNIST
Implement without any ML frameworks (just NumPy):
class NeuralNetwork:
    def __init__(self, layer_sizes):
        # Initialize weights and biases
        pass

    def forward(self, X):
        # Forward pass through all layers
        # Store activations for backprop
        pass

    def backward(self, X, y):
        # Compute gradients via backprop
        pass

    def train(self, X, y, epochs, lr, batch_size):
        # Mini-batch SGD training loop
        pass

    def predict(self, X):
        return self.forward(X)

Requirements:
- 2-3 hidden layers
- ReLU activation
- Softmax output
- Cross-entropy loss
- Mini-batch SGD
- Achieve >95% accuracy on MNIST
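The softmax output and cross-entropy loss are where most from-scratch implementations hit numerical trouble. The following sketch shows one stable formulation; the helper names and the small epsilon inside the log are choices made here, not requirements.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the correct class (integer labels)."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def softmax_ce_grad(probs, labels):
    """Gradient of the mean cross-entropy with respect to the logits."""
    n = len(labels)
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0    # probs minus one-hot targets
    return grad / n

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])   # second row would overflow naively
labels = np.array([0, 2])

probs = softmax(logits)
loss = cross_entropy(probs, labels)
grad = softmax_ce_grad(probs, labels)
```

The combined gradient (probabilities minus one-hot targets) is the starting point for the backward pass through the rest of the network.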
Then: Implement the same network in PyTorch. Compare:
- Code complexity
- Training speed
- Ease of experimentation
Understand: Why frameworks exist, but also what they're doing under the hood.
Week 9-10: Advanced Topics & Specialization
Choose based on your interests.
Track A: Natural Language Processing
Focus: Embeddings, attention, transformers
Project:
- Word embeddings: Implement word2vec from scratch (skip-gram model)
- Document similarity using cosine distance in embedding space
- Visualize embeddings with t-SNE or PCA
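For the document-similarity step, a minimal sketch follows. It averages word vectors to represent a document, which is a simple baseline rather than the only option. The hand-made 3-dimensional vectors are hypothetical stand-ins for embeddings you would learn with word2vec.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def document_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

# Hypothetical toy embeddings; real ones would come from your trained model
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "stock": np.array([0.0, 0.1, 0.9]),
    "market": np.array([0.1, 0.0, 0.8]),
}

doc_pets = document_vector(["cat", "dog"], embeddings)
doc_pets_2 = document_vector(["dog", "cat", "dog"], embeddings)
doc_finance = document_vector(["stock", "market"], embeddings)

sim_same_topic = cosine_similarity(doc_pets, doc_pets_2)
sim_cross_topic = cosine_similarity(doc_pets, doc_finance)
```

Cosine similarity ignores vector length, so a long document and a short one on the same topic can still score close to 1.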
Watch:
1. Stanford CS224N (Winter 2021) - Lectures 1 & 2
Lecture 1: Introduction and Word Vectors
Lecture 2: Neural Classifiers
Covers skip-gram implementation details, optimization, and neural classifiers
These lectures give you the complete mathematical foundation for implementing word2vec from scratch, including gradient computations for backprop through skip-gram.
2. StatQuest: Word Embedding and Word2Vec, Clearly Explained
- YouTube: https://www.youtube.com/watch?v=viZrOnJclY0
- Duration: ~20 minutes
- Perfect visual intuition of how word2vec actually works - complements Stanford's mathematical rigor with clear diagrams
3. StatQuest: t-SNE, Clearly Explained
- YouTube: https://www.youtube.com/watch?v=NEaUSP4YerM
- Duration: ~11 minutes
- Essential for the visualization component of your project - explains exactly how t-SNE preserves clustering when reducing dimensions
These three cover everything you need: the theory and math (Stanford), the intuitive understanding (Word2Vec video), and the visualization technique (t-SNE video). Total watch time is manageable at ~3-4 hours for the core content.
Track B: Computer Vision
Focus: Convolutional architectures, image processing
Project:
- Implement basic CNN in PyTorch
- Train on CIFAR-10 or Fashion-MNIST
- Experiment with architecture: depth, filters, pooling
- Visualize: What do convolutional filters learn?
Application:
- Image classification
- Transfer learning (use pretrained ResNet)
- Style transfer (bonus: implement neural style transfer)
Watch:
- Re-watch 18.065 Lecture 32 (CNNs)
- "CS231n: Convolutional Neural Networks for Visual Recognition" (Stanford)
Track C: Recommender Systems
Focus: Matrix factorization, collaborative filtering
Project: Advanced Recommender System
- Dataset: MovieLens 1M (larger)
- Techniques:
- SVD-based collaborative filtering
- Non-negative matrix factorization (NMF)
- Neural collaborative filtering
- Metrics: RMSE, precision@k, recall@k
- Business application: Cold start problem, diversity vs. accuracy
Reading: "Matrix Factorization Techniques for Recommender Systems" (Koren et al., 2009)
💡 Success Strategies
Time Management (5-6 hours/week):
- 2.5 hours: Lectures (evenings, 1.5x speed)
- 1.5 hours: Coding projects (weekend mornings)
- 1.5 hours: Practice problems (broken into 30-min sessions)
Learning Techniques:
- Code immediately: After every major concept, code an example
- Teach it: Explain to colleague or write blog post
- Connect to work: How does this apply to your current job?
- Build portfolio: Put projects on GitHub
Common Pitfalls:
- ❌ Watching passively without implementing
- ❌ Trying to understand every proof (you don't need to!)
- ❌ Skipping the coding checkpoints
- ❌ Not connecting concepts to practical ML
Career Applications:
- Data Analyst → Data Scientist: You now understand the models you use
- Software Engineer → ML Engineer: You can implement and debug ML systems
- Product Manager: You can have technical conversations with ML teams
- Consultant: You can assess ML solutions and vendors
Next Steps After Completion:
- Andrew Ng's Deep Learning Specialization (Coursera)
- Fast.ai course (practical deep learning)
- Kaggle competitions (apply your knowledge)
- Read ML papers (you can now understand the math!)
You're building the foundation to transition into ML roles or level up in your current position. The combination of theoretical understanding and practical skills is exactly what employers want. 🎯