AI / Machine Learning / Deep Learning — PhD-Level Foundations
Taught by a researcher who built ML models for nuclear nonproliferation at DTRA. Theory + implementation + research methodology — from someone who uses these tools in the field.
Linear Algebra for ML
Key Concepts
- Vectors, matrices, and tensor representations
- Dot products, projections, and norms
- Matrix multiplication and composition
- Eigendecomposition and diagonalization
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
- Rank, null space, and linear independence
import numpy as np

# SVD decomposition
A = np.array([[3, 1], [2, 4], [0, 5]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(f"Singular values: {S}")

# Reconstruct A from SVD
A_approx = U @ np.diag(S) @ Vt

# PCA via SVD (zero-center first)
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# First 2 principal components
X_pca = X_centered @ Vt[:2].T
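Eigendecomposition and diagonalization from the concept list can be sketched in the same NumPy style. This is a minimal example on an arbitrary symmetric matrix, not part of the course material above:

```python
import numpy as np

# Eigendecomposition of a symmetric matrix: A = Q diag(w) Q^T
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
w, Q = np.linalg.eigh(A)  # eigh is the right call for symmetric matrices

# Diagonalization: rebuild A from its eigenvalues and eigenvectors
A_rebuilt = Q @ np.diag(w) @ Q.T
print(np.allclose(A, A_rebuilt))  # True

# For symmetric A the eigenvectors are orthonormal
print(np.allclose(Q.T @ Q, np.eye(2)))  # True
```

The same decomposition underlies PCA: for a covariance matrix (which is always symmetric), the eigenvectors are the principal directions and the eigenvalues are the variances along them.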
Common Misconceptions
Practice Problems
A weight matrix W is 512×256. An input batch has shape (32, 512). What is the output shape after the forward pass? Why does this order of multiplication matter?
Given a dataset matrix X (100 samples, 8 features), derive the covariance matrix's eigenvalues from the SVD of the centered data. Explain what the first singular value tells you about the data's variance.
Probability & Statistics for ML
Key Concepts
- Probability distributions (Gaussian, Bernoulli, Categorical)
- Bayes' theorem and conditional probability
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP) estimation
- Expectation, variance, covariance
- Central Limit Theorem and its ML implications
- Hypothesis testing and p-values
from scipy import stats
import numpy as np

# MLE: fit a normal distribution to data
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mu, sigma = stats.norm.fit(data)
print(f"MLE estimate: mu={mu:.3f}, sigma={sigma:.3f}")

# Log-likelihood (what MLE maximizes)
log_lik = np.sum(stats.norm.logpdf(data, mu, sigma))

# KL divergence between two Gaussians
p = stats.norm(0, 1)
q = stats.norm(0.5, 1.2)
x = np.linspace(-5, 5, 1000)
kl = np.trapz(p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), x)
print(f"KL divergence: {kl:.4f}")
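MAP estimation from the concept list can be sketched for the simplest conjugate case, a Bernoulli likelihood with a Beta prior. The counts below are illustrative, and the closed forms are the standard textbook results:

```python
# Observed coin flips: 8 heads out of 12 (illustrative counts)
heads, n = 8, 12

# MLE for Bernoulli p: the likelihood is maximized at the sample mean
p_mle = heads / n

# MAP with a Beta(a, b) prior: the posterior is Beta(heads + a, n - heads + b),
# whose mode gives the closed form (heads + a - 1) / (n + a + b - 2)
a, b = 2, 2
p_map = (heads + a - 1) / (n + a + b - 2)

print(f"MLE: {p_mle:.4f}, MAP: {p_map:.4f}")
```

Note how the prior pulls the MAP estimate toward 0.5 relative to the MLE; with more data the two estimates converge.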
Common Misconceptions
Practice Problems
You flip a biased coin 20 times and observe 14 heads. Derive the MLE estimate for p (probability of heads) from scratch using calculus. Then compute the MAP estimate assuming a Beta(2,2) prior.
A classifier outputs probabilities [0.9, 0.05, 0.05] for three classes. Compute the cross-entropy loss against the true label (class 1). Explain why cross-entropy is the negative log-likelihood of a categorical distribution.
Classical Machine Learning
Key Concepts
- Linear and logistic regression
- Decision trees and information gain
- Random forests and bagging
- Gradient boosting (XGBoost, LightGBM)
- Support Vector Machines and kernels
- K-means and EM clustering
- Cross-validation and model selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42
    ))
])

# 5-fold cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importances
pipeline.fit(X_train, y_train)
importances = pipeline['clf'].feature_importances_
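K-means from the concept list can be sketched with scikit-learn on synthetic data. The blob parameters here are illustrative, not from the course:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 3 well-separated clusters
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means; n_init restarts from different random centroids
# and keeps the best run (lowest inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (3, 2)
print(km.inertia_)  # within-cluster sum of squares (lower is better)
```

In practice you rarely know the true number of clusters; sweeping n_clusters and plotting inertia (the "elbow method") or silhouette scores is the usual model-selection step.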
Common Misconceptions
Practice Problems
Explain in your own words why a decision tree without depth limits perfectly memorizes training data but generalizes poorly. What does random forest do differently to fix this?
You train a logistic regression and get 98% training accuracy but 62% validation accuracy. List three concrete things you would do next and explain the reasoning for each.
Deep Learning Foundations
Key Concepts
- Neural network architecture and layer types
- Activation functions (ReLU, sigmoid, softmax, GELU)
- Loss functions and their probabilistic interpretations
- Gradient descent (SGD, Adam, AdamW)
- Backpropagation via the chain rule
- Batch normalization and layer normalization
- Dropout, L1/L2 regularization
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, out_dim)
        )

    def forward(self, x):
        return self.net(x)

model = MLP(128, 256, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training step
optimizer.zero_grad()
logits = model(x_batch)
loss = criterion(logits, y_batch)
loss.backward()
optimizer.step()
Common Misconceptions
Practice Problems
Derive the gradient of the cross-entropy loss with respect to the logits (before softmax). Show that it simplifies to (predicted probability - true label). Why does this make gradient descent intuitive?
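The identity in the problem above can be checked numerically with autograd. A minimal sketch, with arbitrary batch size and class count:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)  # batch of 4, 3 classes
targets = torch.tensor([0, 2, 1, 1])

# Mean cross-entropy over the batch (softmax applied internally)
loss = F.cross_entropy(logits, targets)
loss.backward()

# Analytic gradient: (softmax(z) - one_hot(y)) / batch_size
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(targets, num_classes=3).float()
analytic = (probs - one_hot) / logits.shape[0]

print(torch.allclose(logits.grad, analytic, atol=1e-6))  # True
```

The 1/batch_size factor comes from the default mean reduction; per-example, the gradient is exactly predicted probability minus true label.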
You have a 10-layer network with sigmoid activations and find gradients near-zero in early layers. List two architectural changes you would make and explain why each helps numerically.
Convolutional Neural Networks
Key Concepts
- The convolution operation and discrete filters
- Kernels, feature maps, and receptive fields
- Pooling layers (max, average, adaptive)
- Stride, padding, and output dimension formulas
- Classic architectures: VGG, ResNet, EfficientNet
- Transfer learning and fine-tuning strategies
- 1D and 3D convolutions for non-image data
import torch.nn as nn
import torchvision.models as models

# Build a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Linear(64 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Transfer learning: freeze backbone, train head
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(512, n_classes)  # new head
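The output-dimension formula from the concept list can be sketched as a small helper. This assumes the common convention of square kernels and no dilation; the shapes below are illustrative:

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output spatial size along one dimension:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 conv with stride 1 and padding 1 preserves spatial size
print(conv_out_size(32, kernel=3, stride=1, padding=1))  # 32

# Parameter count for Conv2d(in_ch, out_ch, k) with bias:
# out_ch * (in_ch * k * k + 1)
params = 32 * (3 * 3 * 3 + 1)
print(params)  # 896
```

The same formula applies to pooling layers (with padding usually 0), which is how you can trace shapes through an entire network by hand.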
Common Misconceptions
Practice Problems
An input image is 224×224×3. You apply a Conv2d(3, 64, kernel_size=7, stride=2, padding=3). What is the output shape? How many learnable parameters does this layer have?
You are fine-tuning ResNet-18 on a 5-class medical imaging dataset with only 800 training samples. Describe your strategy: which layers do you freeze, which do you train, and what augmentations do you apply? Justify each choice.
Transformers & LLMs
Key Concepts
- Self-attention mechanism and scaled dot-product attention
- Multi-head attention and why it works
- Positional encoding (sinusoidal and rotary)
- Encoder-only, decoder-only, encoder-decoder architectures
- Pre-training objectives (MLM, CLM, T5-style)
- Fine-tuning vs. parameter-efficient methods (LoRA, adapters)
- Prompt engineering and in-context learning
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers changed NLP forever."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token embedding as sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")

# Attention weights (requires output_attentions=True)
outputs_attn = model(**inputs, output_attentions=True)
attn = outputs_attn.attentions  # tuple of (batch, heads, seq, seq)
Common Misconceptions
Practice Problems
Walk through the scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Explain what each matrix represents, why we scale by √d_k, and what happens if we skip the scaling.
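The attention formula above can be sketched directly in NumPy. A single-head, unbatched, unmasked sketch; the dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_q, seq_k) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 queries, d_k = 8
K = rng.normal(size=(7, 8))    # 7 keys, same d_k
V = rng.normal(size=(7, 16))   # 7 values, d_v = 16
out = attention(Q, K, V)
print(out.shape)  # (5, 16)
```

Removing the 1/√d_k factor lets the score variance grow with d_k, pushing the softmax toward one-hot rows and shrinking gradients, which is exactly the failure mode the problem asks about.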
You want to fine-tune GPT-2 for a domain-specific classification task with a small labeled dataset. Compare full fine-tuning vs. LoRA. What are the trade-offs in terms of parameters updated, memory, and generalization risk?
ML Research Methods
Key Concepts
- Experimental design and controlling confounds
- Ablation studies: isolating what actually matters
- Standard benchmark datasets and their limitations
- Reproducibility: seeds, environment pinning, logging
- Reading ML papers effectively (title, abstract, results first)
- Statistical significance in model comparisons
- Compute budgets and scaling laws
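Statistical significance in model comparisons, from the list above, can be sketched with an exact McNemar-style test on the test examples where two classifiers disagree. The disagreement counts here are made up for illustration:

```python
from scipy.stats import binomtest

# Hypothetical disagreement counts on a shared test set:
# b = examples model A got right and model B got wrong
# c = examples model B got right and model A got wrong
b, c = 48, 31

# Exact McNemar test: under H0 (no accuracy difference), discordant
# outcomes split 50/50, so test b against Binomial(b + c, 0.5)
result = binomtest(b, n=b + c, p=0.5)
print(f"p-value: {result.pvalue:.4f}")
```

Because the test conditions only on the examples where the models differ, it is far more sensitive than comparing raw accuracies, which is why it is a standard choice for paired classifier comparisons on a single test set.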
Common Research Mistakes
Set torch.manual_seed(42) and numpy.random.seed(42), and document your environment.

Dr. Preston on HuggingFace
Deployed models and Spaces from real research work — including the AFOQT adaptive intelligence app built with ML.
Visit huggingface.co/Dr-P →

Practice Problems
You improve a baseline model's accuracy from 84.2% to 85.1% on a test set. What additional information do you need before concluding this is a meaningful improvement? Sketch the statistical test you would run.
Design an ablation study for a transformer model that uses both multi-head attention and a novel positional encoding scheme. What variants would you train, and what would each ablation tell you about which component drives performance?
Recommended Textbooks
These are the books Dr. Preston has personally worked through. Every recommendation is field-tested in real research and tutoring sessions.
Affiliate disclosure: links use tag fissionlab-20. You pay nothing extra; commissions help keep this resource free.
The definitive graduate-level reference. Covers probability theory, optimization, regularization, CNNs, RNNs, and generative models. Dense but indispensable.
View on Amazon →

The best practical introduction. Implementation-first approach — every concept is anchored in code. Excellent coverage of end-to-end ML projects.

View on Amazon →

The probabilistic ML bible. Bayesian methods, graphical models, kernel machines, and the mathematical foundations that most practitioners skip. Read this for depth.

View on Amazon →

Surprisingly complete for its length. Great for building intuition fast, reviewing before interviews, or getting a second perspective on algorithms you've already studied.

View on Amazon →

Join the AI/ML Study Group
Bring your implementation questions, paper discussions, and debugging headaches. The Discord community includes students working through these exact topics — plus office hours with Dr. Preston.
Join the Discord Community →

Free. No spam. No upsells.
Want personalized guidance?
Sessions are structured around your specific goals — whether that's passing a qual exam, understanding backprop from scratch, or getting a research project off the ground. Theory and implementation, not just intuition.
Book a Free Intro Session →