Linear Algebra for AI: Path 3 – Theory Connector

Rigorous linear algebra for researchers with strong math backgrounds transitioning into machine learning.

Meet Carmen: You have a postgrad degree in physics, economics, neuroscience, chemistry, or another quantitative field. You're fluent in multivariable calculus, differential equations, probability, and statistics. You've seen linear algebra, but now you need the ML-specific perspective. You want rigor—understanding assumptions, proofs, and when methods fail. You're planning to publish papers using ML or transition into ML research.

Core Philosophy: Fast-track through basics, deep dive into ML theory, heavy emphasis on connections to your field, focus on mathematical rigor and research applications.


Week 1: Rapid Comprehensive Review

You need a quick refresh, not a ground-up build.

🎥 3Blue1Brown: Essence of Linear Algebra (Selected Videos)

Watch the core videos for geometric intuition; you already know the algebra.

Why: Refresh the geometric perspective you might have missed in your formal training.


🎓 Strang 18.06: Selective High-Level Review

Rapid watch (2x speed, pause only if unclear):

Lecture 10: The Four Fundamental Subspaces

  • The big picture you need for ML

Lectures 14-16: Orthogonality and Projections

  • Critical for ML: orthogonal projection is the foundation of least-squares regression

Lecture 21: Eigenvalues and Eigenvectors

  • You know this, but watch for Strang's perspective

Lecture 25: Symmetric Matrices and Positive Definiteness

  • Positive definite matrices: Hessians, covariance matrices

Lecture 29: Singular Value Decomposition

  • SVD: The most important decomposition in ML

📖 Textbook Speed-Read

Linear Algebra and Learning from Data - Chapter 1

Read Chapter 1 in one sitting (~1 hour). Focus on:

  • Section I.1: Multiplication Ax using columns of A
  • Section I.2: Matrix-matrix multiplication AB
  • Section I.3: The four fundamental subspaces
  • Section I.6: Eigenvalues and eigenvectors
  • Section I.7: Singular value decomposition

Approach: Skim proofs (you can reconstruct them), focus on ML applications and insights.
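
As a quick sanity check on the Section I.1 picture, here is a minimal NumPy sketch (the matrix and vector are arbitrary examples) confirming that Ax is the linear combination of A's columns weighted by the entries of x:

```python
import numpy as np

# Arbitrary example matrix and vector (illustrative only)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, -1.0])

# Ax = x[0] * (column 1 of A) + x[1] * (column 2 of A)
column_combination = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(A @ x, column_combination)
```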


Week 2-3: ML-Specific Linear Algebra Theory

Now dive into the ML-focused course with theoretical rigor.

🎓 Strang 18.065: Theoretical Foundations (Lectures 1-7)

Lecture 1: The Column Space of A Contains All Vectors Ax

  • Model capacity and representational power
  • Connection: Universal approximation vs. sample complexity

Lecture 2: Multiplying and Factoring Matrices

  • Computational complexity of factorizations
  • When to use LU vs. QR vs. Cholesky


Lecture 3: Orthonormal Columns in Q Give Q'Q = I

  • Gram-Schmidt and its numerical instability
  • Modified Gram-Schmidt, Householder reflections
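
The instability in the bullets above is easy to demonstrate. Below is a minimal sketch comparing classical Gram-Schmidt against NumPy's Householder-based QR on a matrix with a nearly dependent column; classical_gram_schmidt is an illustrative helper, not a production routine:

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt: projects each original column off the previous q's."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ A[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
A[:, -1] = A[:, 0] + 1e-9 * rng.standard_normal(50)   # nearly dependent column

Q_cgs = classical_gram_schmidt(A)
Q_house, _ = np.linalg.qr(A)                          # Householder-based, backward stable

# Loss of orthogonality: ||Q^T Q - I|| is typically orders of magnitude worse for CGS
print(np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(10)),
      np.linalg.norm(Q_house.T @ Q_house - np.eye(10)))
```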

Lecture 4: Eigenvalues and Eigenvectors

  • Spectral theorem for symmetric matrices
  • Diagonalization and its limitations

Lecture 5: Positive Definite and Semidefinite Matrices

  • Tests for positive definiteness
  • Connection to convex optimization

Lecture 6: Singular Value Decomposition (SVD)

  • Critical: Full mathematical development
  • Geometric interpretation: rotation-scaling-rotation
  • Computing SVD: Golub-Reinsch algorithm
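
A two-minute check of the rotation-scaling-rotation picture (arbitrary example matrix): U and V are orthogonal, so they act as rotations or reflections, and Σ is a pure axis scaling.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                    # arbitrary example
U, S, Vt = np.linalg.svd(A)

assert np.allclose(U @ U.T, np.eye(2))        # U is orthogonal (rotation/reflection)
assert np.allclose(Vt @ Vt.T, np.eye(2))      # V is orthogonal
assert np.allclose(A, U @ np.diag(S) @ Vt)    # A = (rotate) (scale) (rotate)
```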

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

  • Best low-rank approximation
  • Foundation for PCA and dimensionality reduction

📖 Textbook Deep Dive

Linear Algebra and Learning from Data - Chapters 2-3

Chapter 2: Large Matrices

  • Focus on computational complexity
  • Sparse matrices and iterative methods
  • Randomized algorithms

Chapter 3: Low Rank Approximation

  • Eckart-Young theorem (full proof)
  • Nuclear norm minimization
  • Matrix completion theory

Read actively: Work through all proofs. Pause and complete them before reading Strang's version.


💻 Theoretical Implementation Exercise

Project: Prove and Implement Eckart-Young Theorem

  1. Prove (on paper): Truncated SVD minimizes Frobenius norm error
  2. Implement: Numerical verification

```python
import numpy as np

def frobenius_norm(M):
    return np.linalg.norm(M, 'fro')

def random_rank_k_matrix(shape, k):
    m, n = shape
    return np.random.randn(m, k) @ np.random.randn(k, n)

def verify_eckart_young(A, k, trials=1000):
    # Optimal rank-k approximation via the truncated SVD
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

    # No random rank-k matrix should beat the truncated SVD in Frobenius norm
    for _ in range(trials):
        B_k = random_rank_k_matrix(A.shape, k)
        assert frobenius_norm(A - A_k) <= frobenius_norm(A - B_k)
    return A_k
```

  3. Analyze: How does error decay with k? Plot singular value spectrum.
  4. Application: Matrix completion on MovieLens data

Week 3-4: Optimization Theory for ML

You need rigorous understanding of how learning algorithms work.

🎓 Strang 18.065: Optimization Sequence (Lectures 21-25)

Lecture 21: Minimizing a Function Step by Step

  • Introduction to iterative optimization methods
  • Newton's method and its variants
  • Trade-offs between convergence speed and computation

Lecture 22: Gradient Descent: Downhill to a Minimum

  • Gradient descent on convex functions: convergence guarantees
  • Lipschitz continuity and strong convexity
  • Convergence rates: O(1/k) vs. O(e^(-k))
  • After watching:
    • Prove convergence for strongly convex functions (standard result, good exercise)
    • Connect to your field: If you're from physics, link to minimizing energy functionals; if econ, to optimization in markets
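
A minimal numerical companion to Lecture 22, assuming a hand-picked 2×2 positive definite quadratic and the classical step size 2/(μ + L); it shows the linear convergence rate that the condition number controls:

```python
import numpy as np

def gradient_descent(Q, b, lr, steps=200):
    """Minimize f(x) = 0.5 x^T Q x - b^T x for symmetric positive definite Q."""
    x = np.zeros_like(b)
    x_star = np.linalg.solve(Q, b)            # exact minimizer for reference
    errors = []
    for _ in range(steps):
        x = x - lr * (Q @ x - b)              # gradient step
        errors.append(np.linalg.norm(x - x_star))
    return x, np.array(errors)

Q = np.diag([1.0, 10.0])                      # eigenvalues mu = 1, L = 10, condition number 10
b = np.array([1.0, 1.0])
x, errors = gradient_descent(Q, b, lr=2.0 / (1.0 + 10.0))
# errors contract by roughly (L - mu)/(L + mu) = 9/11 per step: linear convergence
```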

Lecture 23: Accelerating Gradient Descent (Use Momentum)

  • Momentum: Heavy ball method from mechanics
  • Nesterov acceleration: Optimal first-order method
  • Your physics/applied math background is an advantage here

Lecture 25: Stochastic Gradient Descent

  • Mini-batch sampling and variance reduction
  • Convergence in expectation
  • Why SGD can escape poor local minima (gradient noise acts as exploration)
  • Learning rate schedules
  • Adaptive methods: AdaGrad, RMSprop, Adam
  • Research connection:
    • Statistical mechanics interpretation (Langevin dynamics)
    • Compare to Monte Carlo methods from your field
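
A minimal mini-batch SGD sketch on least squares (synthetic data, a 1/√t step-size decay; all choices here are illustrative) showing the unbiased mini-batch gradient estimate in action:

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.5, batch=32, epochs=50, seed=0):
    """Mini-batch SGD on the average squared error (1/2n) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            t += 1
            # Mini-batch gradient: an unbiased estimate of the full gradient
            g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= (lr / np.sqrt(t)) * g        # decaying step size for convergence in expectation
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.standard_normal(1000)
w_hat = sgd_least_squares(X, y)               # should approach w_true
```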

📖 Reading: Convex Optimization

Supplementary Text: Boyd & Vandenberghe, "Convex Optimization" (free PDF available from the authors' website).

Read selected sections:

  • Chapter 2: Convex sets
  • Chapter 3: Convex functions
  • Chapter 9: Unconstrained minimization (focus on gradient descent)
  • Chapter 10: Equality constrained minimization (Lagrange multipliers)

Why: Many core ML objectives are convex (linear regression, logistic regression, SVMs). Understanding convex optimization deeply will make you a better researcher.


💻 Research-Level Implementation

Project: Optimization Algorithm Comparison

Implement from scratch and rigorously analyze:

  1. Gradient Descent (full batch)
  2. SGD with constant and decaying learning rates
  3. Momentum (heavy ball)
  4. Nesterov Accelerated Gradient
  5. Adam

Test on:

  • Quadratic functions (analyze eigenvalue spectrum effects)
  • Logistic regression (convex but not quadratic)
  • Simple neural network (non-convex)

Analysis Requirements:

  • Convergence plots (log-scale loss vs. iteration)
  • Theoretical vs. empirical convergence rates
  • Sensitivity to hyperparameters (learning rate, momentum)
  • Effect of condition number on convergence

Writing: Produce a technical report with:

  • Mathematical definitions of each optimizer
  • Convergence guarantees (with proofs or citations)
  • Empirical results
  • Discussion of when each method excels
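
To get started, here is a hedged sketch of two of the update rules (heavy-ball momentum and Adam) written as single-step functions; the function names and defaults are illustrative, and the other optimizers follow the same pattern:

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum (heavy ball): the velocity v accumulates past gradients."""
    v = beta * v - lr * grad
    return w + v, v

def adam_step(w, m, s, t, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first (m) and second (s) moment estimates of the gradient."""
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                 # bias correction; t counts steps from 1
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```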

Week 5-6: Advanced Matrix Methods and Probabilistic Interpretations

🎓 Strang 18.065: Advanced Topics (Lectures 8-12)

Lecture 8: Norms of Vectors and Matrices

  • Vector norms: L1, L2, L∞
  • Matrix norms: Operator norm, Frobenius norm, nuclear norm
  • Condition number and numerical stability
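
All three matrix norms and the condition number fall out of the singular values; this short check (arbitrary example matrix) cross-verifies the formulas against NumPy's built-ins:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])                     # arbitrary example
s = np.linalg.svd(A, compute_uv=False)

op_norm = s[0]                                 # operator (spectral) norm = largest singular value
fro_norm = np.sqrt((s ** 2).sum())             # Frobenius norm = sqrt(sum of squared singular values)
nuc_norm = s.sum()                             # nuclear norm = sum of singular values
cond = s[0] / s[-1]                            # 2-norm condition number

assert np.isclose(op_norm, np.linalg.norm(A, 2))
assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))
assert np.isclose(nuc_norm, np.linalg.norm(A, 'nuc'))
assert np.isclose(cond, np.linalg.cond(A, 2))
```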

Lecture 9: Four Ways to Solve Least Squares Problems

  • Normal equations vs. QR vs. SVD vs. iterative
  • Numerical stability analysis
  • When each method is preferred
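
A compact sketch of three of the lecture's four routes (normal equations, QR, SVD) on a well-conditioned random problem, with NumPy's lstsq as a reference; the iterative route (e.g. LSQR or conjugate gradient) is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# 1. Normal equations (fast, but squares the condition number)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
# 2. QR factorization (the standard stable direct method)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
# 3. SVD via the pseudoinverse (handles rank deficiency gracefully)
x_svd = np.linalg.pinv(A) @ b
# Reference: library least-squares driver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_ne, x_qr) and np.allclose(x_qr, x_svd) and np.allclose(x_svd, x_ref)
```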

Lecture 10: Survey of Difficulties with Ax = b

  • Ill-conditioned systems
  • Regularization: Ridge, Lasso, Elastic Net
  • Statistical interpretation: prior distributions
  • Connection to your field:
    • Physics: Regularization as adding energy penalty
    • Economics: Ridge regression as Bayesian prior (Gaussian)
    • Neuroscience: Sparse coding (L1 regularization)
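
Ridge regularization is a one-line modification of the normal equations; the sketch below also records the Gaussian-prior reading mentioned above (λ is whatever penalty you choose):

```python
import numpy as np

def ridge(A, b, lam):
    """Tikhonov/ridge solution: argmin ||Ax - b||^2 + lam ||x||^2.
    Equivalently the MAP estimate under a zero-mean Gaussian prior on x."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```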

Lecture 11: Minimizing ‖x‖ Subject to Ax = b

  • Constrained optimization
  • Lagrange multipliers
  • Kernel methods and RKHS (Reproducing Kernel Hilbert Space)
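
For the minimum-norm problem, the pseudoinverse and the Lagrange-multiplier formula agree; here is a tiny underdetermined example (arbitrary numbers):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0]])               # one equation, three unknowns
b = np.array([6.0])

x_min = np.linalg.pinv(A) @ b                 # pseudoinverse gives the minimum-norm solution
x_lag = A.T @ np.linalg.solve(A @ A.T, b)     # Lagrange-multiplier form: x = A^T (A A^T)^{-1} b
assert np.allclose(x_min, x_lag)
```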

Lecture 12: Computing Eigenvalues and Singular Values

  • Power method and inverse iteration
  • QR algorithm for eigenvalues
  • Practical algorithms for SVD computation
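
A minimal power-iteration sketch for the dominant eigenpair (the stopping rule and defaults are illustrative); inverse iteration and the QR algorithm build on the same idea:

```python
import numpy as np

def power_method(A, iters=1000, tol=1e-12, seed=0):
    """Power iteration: converges to the eigenvector of the eigenvalue largest in magnitude,
    assuming that eigenvalue is well separated from the rest of the spectrum."""
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam = x @ A @ x
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)
        lam_new = x @ A @ x                   # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam_new, x
```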

📖 Textbook Study

Linear Algebra and Learning from Data - Chapters 4-5

Chapter 4: Eigenvalues and Eigenvectors (Applications)

  • Google PageRank
  • Markov chains and steady states
  • Graph Laplacians

Chapter 5: Computations with Large Matrices

  • Randomized SVD
  • Sketching and sampling methods
  • When to use iterative vs. direct solvers
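
A minimal sketch of the randomized SVD idea from Chapter 5 (random range finder, then a small exact SVD); the oversampling and power-iteration counts below are illustrative defaults:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD: sample the range of A, then decompose a small matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Y = A @ Omega                                      # sketch of the column space
    for _ in range(n_iter):                            # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis for the sketch
    B = Q.T @ A                                        # small (k+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k, :]
```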

Deep read: Don't rush. Work through computational examples.


💻 Research Application: Your Field × Linear Algebra

Choose a project connecting ML to your domain:

Physics:

  • Dimensionality reduction in high-dimensional phase spaces
  • PCA for quantum state tomography
  • Matrix product states (tensor decomposition)

Economics:

  • Factor models in finance (PCA on returns)
  • Network analysis of trade flows (spectral methods)
  • Causal inference with high-dimensional controls (Lasso)

Neuroscience:

  • Dimensionality reduction of neural activity (PCA, demixed PCA)
  • Connectivity matrices (graph theory, spectral analysis)
  • Decoding neural activity (linear classifiers, regularization)

Chemistry:

  • Molecular dynamics: PCA on conformational space
  • Quantum chemistry: Matrix diagonalization
  • Spectroscopy: Signal decomposition (NMF, ICA)

Ecology/Biology:

  • Species distribution modeling (matrix factorization)
  • Gene expression analysis (SVD, NMF)
  • Phylogenetic trees (distance matrices, spectral methods)

Deliverable: Technical write-up (5-10 pages) with:

  • Problem formulation
  • Linear algebra methods applied
  • Results and interpretation
  • Connection to existing literature in your field

Week 7-8: Deep Learning Theory

Now you're ready for neural networks with theoretical rigor.

🎓 Strang 18.065: Neural Network Theory (Lectures 26-27, 32-33)

Lecture 26: Structure of Neural Nets for Deep Learning

  • Neural network architecture fundamentals
  • Activation functions and their properties
  • Layer composition and depth

Lecture 27: Backpropagation: Find Partial Derivatives

  • Chain rule for compositions
  • Computational graph
  • Automatic differentiation: forward vs. reverse mode
  • Mathematical exercise:
    • Derive backprop equations for a 3-layer network by hand
    • Understand Jacobian matrices for each layer
    • Connect to adjoint methods (if you know optimal control)
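
If you want to check your hand derivation, here is a minimal two-layer example (tanh hidden layer, squared-error loss; all dimensions and data are arbitrary) with a finite-difference check at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))                        # input
y = rng.standard_normal((2, 1))                        # target
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))

# Forward pass
z1 = W1 @ x + b1
h = np.tanh(z1)
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (chain rule, layer by layer)
d_yhat = y_hat - y                                     # dL/dy_hat
dW2, db2 = d_yhat @ h.T, d_yhat
dh = W2.T @ d_yhat
dz1 = dh * (1 - h ** 2)                                # tanh'(z) = 1 - tanh(z)^2
dW1, db1 = dz1 @ x.T, dz1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x + b1) + b2 - y) ** 2)
assert np.isclose((loss_p - loss) / eps, dW1[0, 0], atol=1e-4)
```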

Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule

  • Convolution as Toeplitz matrix multiplication
  • Translation equivariance
  • Parameter sharing reduces model complexity

Theoretical question: Why do CNNs work so well on images? (Translation equivariance, local receptive fields, hierarchical features)
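
The convolution-as-matrix-multiplication point is easy to verify in one dimension; a minimal sketch (kernel and signal are arbitrary):

```python
import numpy as np

def conv_toeplitz(h, n):
    """Build the (len(h)+n-1) x n Toeplitz matrix T with T @ x == np.convolve(h, x)."""
    m = len(h)
    T = np.zeros((m + n - 1, n))
    for j in range(n):
        T[j:j + m, j] = h                     # each column is a shifted copy of the kernel
    return T

h = np.array([1.0, -2.0, 1.0])                # arbitrary kernel
x = np.random.default_rng(0).standard_normal(5)
T = conv_toeplitz(h, len(x))
assert np.allclose(T @ x, np.convolve(h, x))  # full convolution = Toeplitz matrix times signal
```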

Lecture 33: Neural Nets and the Learning Function

  • Universal approximation theorem
  • Depth vs. width: Expressivity results
  • Sample complexity: How much data do you need?

📖 Advanced Reading: Deep Learning Theory

Textbook: "Deep Learning" by Goodfellow, Bengio, Courville

Read:

  • Chapter 6: Deep Feedforward Networks
  • Chapter 8: Optimization for Training Deep Models
  • Chapter 7: Regularization for Deep Learning
  • Chapter 15: Representation Learning (if interested in theory)

Approach: Mathematical rigor. Work through derivations.


📄 Paper Reading

Required Papers (foundational deep learning theory):

  1. "Understanding Deep Learning Requires Rethinking Generalization" (Zhang et al., 2017)
    • Deep networks can memorize random labels
    • Challenges traditional statistical learning theory
  2. "The Loss Surfaces of Multilayer Networks" (Choromanska et al., 2015)
    • Why local minima aren't a problem in practice
    • Random matrix theory applied to neural networks
  3. "Visualizing the Loss Landscape of Neural Nets" (Li et al., 2018)
    • Empirical analysis of loss surface geometry

For each paper:

  • Read carefully (2-3 hours per paper)
  • Summarize main theoretical contributions
  • Implement key experiments if feasible
  • Critique: What assumptions are made? Are they justified?

Week 9-10: Specialization and Research Project

Choose Your Research Direction

Option A: Advanced Optimization Theory

Focus: Second-order methods, natural gradients, Fisher information

Additional Lectures:

  • Review 18.065 Lecture 21 (Minimizing a Function Step by Step), focusing on Newton's method, which uses second-order derivatives (the Hessian matrix)
  • The trade-off between Newton's method (faster convergence but higher computational cost) and gradient descent

Project:

  • Implement natural gradient descent
  • Compare with SGD and Adam on neural network training
  • Analyze Fisher Information Matrix structure
  • Connection: Information geometry, Riemannian optimization

Deliverable: Research-quality technical report


Option B: Probabilistic Methods and Bayesian Deep Learning

Focus: Uncertainty quantification, variational inference

Reading:

  • "Weight Uncertainty in Neural Networks" (Blundell et al., 2015)
  • "Dropout as a Bayesian Approximation" (Gal & Ghahramani, 2016)

Project:

  • Implement Bayesian neural network
  • Compare point estimates vs. posterior distributions
  • Analyze predictive uncertainty
  • Application: Active learning, OOD detection

Your advantage: Strong probability/statistics background


Option C: Graph Neural Networks and Spectral Methods

Focus: Graph Laplacians, spectral graph theory

Reading:

  • "Spectral Networks and Deep Locally Connected Networks on Graphs" (Bruna et al., 2014)
  • "Semi-Supervised Classification with Graph Convolutional Networks" (Kipf & Welling, 2017)

Project:

  • Implement graph convolutional network
  • Apply to citation network (Cora) or molecular data
  • Analyze: How does spectral filtering work?

Connection:

  • Physics: Many-body systems, quantum chemistry
  • Economics: Network effects, social networks
  • Neuroscience: Brain connectivity

Option D: Matrix Completion and Low-Rank Methods

Focus: Theoretical guarantees, nuclear norm minimization

Reading:

  • "Exact Matrix Completion via Convex Optimization" (Candès & Recht, 2009)
  • "A Singular Value Thresholding Algorithm for Matrix Completion" (Cai et al., 2010)

Project:

  • Implement matrix completion algorithms
  • Analyze incoherence conditions
  • Application: Recommender systems, missing data imputation

Theory:

  • When is exact recovery possible?
  • Sample complexity bounds
  • Connection to compressed sensing
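
A minimal singular value thresholding (SVT) sketch in the spirit of Cai et al. (2010); the threshold tau and step size delta need tuning as described in the paper, and the values you pass here are just starting points:

```python
import numpy as np

def svt_complete(M_obs, mask, tau, delta, iters=300):
    """Matrix completion by singular value thresholding.
    M_obs holds the observed entries (zeros elsewhere); mask is True where entries are observed."""
    Y = np.zeros_like(M_obs)
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(S - tau, 0.0)) @ Vt   # soft-threshold the singular values
        Y = Y + delta * mask * (M_obs - X)               # correct only on observed entries
    return X
```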

💻 Capstone: Publish-Quality Research

Goal: Produce work suitable for submission to workshop or conference

Structure:

  1. Introduction: Motivation, problem formulation
  2. Background: Related work, theoretical foundations
  3. Methods: Your approach, algorithms, theoretical contributions
  4. Experiments: Rigorous empirical validation
  5. Results: Analysis, discussion, limitations
  6. Conclusion: Contributions, future work

Length: 6-8 pages (conference short paper format)

Formatting: Use LaTeX, follow NeurIPS/ICML/ICLR template

Peer Review: Share with colleagues or post on arXiv for feedback


📚 Complete Resource Compilation

Primary Courses

MIT 18.06 - Linear Algebra: Gilbert Strang's classic course (MIT OpenCourseWare video lectures)

MIT 18.065 - Matrix Methods in ML: Strang's Matrix Methods in Data Analysis, Signal Processing, and Machine Learning (MIT OpenCourseWare)

Textbooks

Gilbert Strang: Linear Algebra and Learning from Data

Optimization: Boyd & Vandenberghe, Convex Optimization (free PDF online)

Deep Learning: Goodfellow, Bengio, and Courville, Deep Learning

Supplementary

3Blue1Brown: Essence of Linear Algebra video series


🎯 Research Career Strategies

Publishing Path:

  1. Workshops: NeurIPS, ICML, ICLR workshops (less competitive, good for first papers)
  2. Short papers: 4-page format at some conferences
  3. Main conferences: NeurIPS, ICML, ICLR, AAAI (highly competitive)
  4. Journals: JMLR, Machine Learning, Pattern Recognition

Skill Development:

  • Code release: All papers should have public code (GitHub)
  • Reproducibility: Document experiments meticulously
  • Writing: Study well-written papers, practice technical writing
  • Presentation: Learn to give clear research talks

Networking:

  • arXiv: Post preprints, get early feedback
  • Twitter/X: ML community is active (follow researchers in your area)
  • Conferences: Attend virtually or in-person, network
  • Collaborations: Find collaborators with complementary skills

Transition Strategies:

From Academia to ML Research:

  • Postdoc in ML lab (if continuing in academia)
  • Research scientist at DeepMind, OpenAI, FAIR, Google Brain, Anthropic
  • Applied scientist at companies with research arms (Microsoft Research, Amazon Science)

From Industry to ML:

  • ML research engineer roles
  • Publish 1-2 strong papers first
  • Demonstrate both theory and implementation skills

Staying in Your Field but Adding ML:

  • Become the "ML person" in your department
  • Collaborate with CS/ML researchers
  • Apply ML to domain-specific problems (huge opportunity!)

💡 Advanced Study Tips

Proof Techniques:

  • Always attempt proofs before reading solutions
  • Understand proof strategies (contradiction, induction, construction)
  • Build intuition first, then formalize

Paper Reading:

  • Budget 3-4 hours per paper for deep reading
  • Take detailed notes
  • Implement key algorithms
  • Critique assumptions and limitations

Mathematical Maturity:

  • You have this! Use it.
  • Connect new concepts to things you already know
  • Don't accept hand-waving—demand rigor

Balancing Theory and Practice:

  • Theory without implementation is incomplete
  • Implementation without theory is fragile
  • Do both, always

Time Investment (6-8 hours/week):

  • 2 hours: Lectures (can go faster, 1.5-2x speed)
  • 2 hours: Reading papers/textbooks
  • 3 hours: Implementation and research
  • 1 hour: Writing and documentation

Your quantitative background is a massive advantage. Many ML practitioners lack deep mathematical training—you can bring rigor and theoretical insight that's sorely needed. Go publish! 🚀