Machine Learning from Scratch: A PhD Researcher's Beginner Guide

I work as an AI engineer at the Defense Threat Reduction Agency. Before that I spent years doing physics simulations and data analysis in nuclear engineering. When people ask me how to get started with machine learning, they usually expect me to recommend a massive online course or point them to a textbook. My actual advice is different: start by understanding what ML is doing mathematically, build intuition with small examples, and only then reach for the big frameworks.

This guide is for people who have zero or near-zero ML background. We will cover the conceptual landscape, then go deep on the two most important ideas — linear regression and neural networks — with enough mathematical detail to understand what is actually happening. Python pseudocode is included throughout so you can see how concepts translate to code.

Prerequisites: High school algebra and some comfort with functions. Calculus helps but is not required to follow this guide. The section on neural networks uses derivatives conceptually — if you understand that a derivative measures slope, you will follow the gradient descent explanation.

What Machine Learning Actually Is

Here is the honest definition: machine learning is the process of fitting a mathematical function to data. That function can be simple (a line) or extraordinarily complex (a deep neural network with billions of parameters), but the core idea is the same. You have inputs, you have desired outputs, and you want a function that maps inputs to outputs in a way that generalizes to data you have not seen yet.

Traditional programming is explicit: you write rules. If the input is X, do Y. Machine learning is implicit: you show the algorithm examples and let it infer the rules. This works extremely well for problems where the rules are too complex to write explicitly — recognizing faces, translating language, predicting protein structures.

The Three Questions Every ML Project Must Answer

  1. What is the model? What mathematical structure will learn from the data? (Linear model, decision tree, neural network, etc.)
  2. What is the loss function? How do we measure how wrong the model's predictions are?
  3. What is the optimizer? How do we adjust the model's parameters to reduce the loss?

Almost every ML paper you will ever read answers these three questions. Once you recognize the pattern, reading the literature becomes much easier.

Supervised vs. Unsupervised Learning

The most important distinction in ML is between supervised and unsupervised learning. The difference comes down to whether your training data includes labels.

Type          | Data             | Goal                                   | Examples
Supervised    | Inputs + labels  | Learn input→output mapping             | Image classification, spam detection, price prediction
Unsupervised  | Inputs only      | Find structure in data                 | Clustering, dimensionality reduction, anomaly detection
Reinforcement | States + rewards | Learn a policy through trial and error | Game playing, robotics, autonomous systems

For beginners, supervised learning is the right place to start because the feedback signal (the label) makes training concrete and measurable. Most practical ML applications are supervised.

Linear Regression: The Foundation

Linear regression is the simplest ML model. It assumes the relationship between input x and output y is linear: y = w · x + b, where w is the weight (slope) and b is the bias (intercept). These two parameters are what the model learns from data.

The Loss Function

For regression, the standard loss is mean squared error (MSE). Given n training examples with true outputs y_i and predicted outputs ŷ_i:

# Mean Squared Error loss function
MSE = (1/n) * sum((y_true[i] - y_pred[i])**2 for i in range(n))

MSE penalizes large errors heavily (because of the square) and is zero only when every prediction is perfect. The goal of training is to find w and b that minimize MSE over the training data.
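To see the squaring effect on concrete numbers (the values below are made up for illustration):

```python
y_true = [3.0, 5.0, 7.0]
y_pred = [3.0, 5.0, 10.0]   # one prediction off by 3, the others perfect

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
print(mse)  # 3.0 — the single error of 3 contributes 9/3 on its own
```

An error of 3 contributes nine times as much to the loss as an error of 1 would, which is exactly the "large errors penalized heavily" behavior described above.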

Gradient Descent

Gradient descent is the optimizer. The intuition: imagine you are standing on a hilly landscape and you want to walk to the lowest point. At each step, look around, find the direction of steepest downhill, and take a small step that way. In ML, the "landscape" is the loss function over all possible parameter values, and the gradient (the multivariable derivative) tells us which direction is "uphill." We move in the opposite direction.

# For linear regression, the gradients of the MSE loss are:
d_loss_d_w = (-2/n) * sum(x[i] * (y[i] - (w*x[i] + b)) for i in range(n))
d_loss_d_b = (-2/n) * sum(y[i] - (w*x[i] + b) for i in range(n))

# Gradient descent update rule
# learning_rate (alpha) controls step size
w = w - learning_rate * d_loss_d_w
b = b - learning_rate * d_loss_d_b

We repeat this update many times (many epochs) until the loss stops decreasing meaningfully. The learning rate is a hyperparameter you choose — too large and the optimizer overshoots and diverges; too small and training takes forever. Choosing a good learning rate is one of the most important practical skills in ML.
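Putting the pieces together, here is a minimal sketch of the full training loop in plain Python. The toy dataset, learning rate, and epoch count are all arbitrary choices for illustration:

```python
# Minimal gradient descent for linear regression (toy example).
# Data is generated from y = 2x + 1, so training should recover w ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = 0.0, 0.0          # start from arbitrary parameters
learning_rate = 0.05
n = len(xs)

for epoch in range(2000):
    # Gradients of MSE with respect to w and b
    d_w = (-2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))
    d_b = (-2 / n) * sum(y - (w * x + b) for x, y in zip(xs, ys))
    # Step downhill
    w -= learning_rate * d_w
    b -= learning_rate * d_b

print(round(w, 3), round(b, 3))  # values close to 2.0 and 1.0
```

Try raising the learning rate to 0.5 and watch the loss diverge — that is the overshooting failure mode in action.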

Key insight: Everything in deep learning, including transformer models with hundreds of billions of parameters, uses gradient descent at its core. The specific flavor varies (Adam, SGD with momentum, RMSProp), but the fundamental idea — compute gradients, take small steps downhill — is identical to what you see here.

Classification and Logistic Regression

When the output is a category rather than a number (spam vs. not spam, cat vs. dog), we use classification. Logistic regression is the simplest classifier. It takes the linear model output and squashes it through the sigmoid function to produce a probability between 0 and 1.

# Sigmoid function
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Prediction: probability that input belongs to class 1
z = w * x + b
probability = sigmoid(z)

# Binary cross-entropy loss (for classification)
# Better than MSE for probability outputs
loss = -y * math.log(probability) - (1 - y) * math.log(1 - probability)

The binary cross-entropy loss penalizes confident wrong answers very heavily, which is the right behavior for classification. If the model says 99% probability and it is wrong, that is much worse than saying 60% probability and being wrong.
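A quick sketch (the probabilities below are made up) shows how sharply the penalty grows for confident mistakes:

```python
import math

def bce(y, p):
    """Binary cross-entropy for a single example with true label y."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# True label is 1, so both predictions below are wrong.
confident_wrong = bce(1, 0.01)   # model said 1% probability of class 1
mildly_wrong    = bce(1, 0.40)   # model said 40% probability of class 1

print(round(confident_wrong, 2))  # 4.61
print(round(mildly_wrong, 2))     # 0.92
```

The confident mistake costs roughly five times as much loss as the uncertain one, and the penalty grows without bound as the predicted probability approaches 0.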

Neural Networks: Stacking Linear Models

A neural network is a sequence of linear transformations with nonlinear activation functions applied between them. That is it. The power comes from depth — stacking many layers allows the network to represent extraordinarily complex functions.

A Single Neuron

One neuron takes a vector of inputs x, computes a weighted sum plus bias, then applies an activation function f:

# Single neuron computation
import math

def neuron(x, weights, bias, activation):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return activation(z)

# Common activation functions:
relu = lambda z: max(0.0, z)                # ReLU: most common in hidden layers
sigmoid = lambda z: 1 / (1 + math.exp(-z))  # Sigmoid: output layer (binary)
tanh = lambda z: math.tanh(z)               # Tanh: output range (-1, 1)

Without nonlinear activation functions, stacking layers would be mathematically equivalent to a single linear model. The activation functions are what give neural networks their ability to approximate any continuous function (the Universal Approximation Theorem).
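You can verify the collapse yourself in the 1-D case; the weights below are arbitrary:

```python
# Two 1-D "layers" with no activation between them: y = w2*(w1*x + b1) + b2
w1, b1 = 3.0, 1.0
w2, b2 = -2.0, 5.0

def two_layers(x):
    return w2 * (w1 * x + b1) + b2

# Algebraically this collapses to a single line: y = (w2*w1)*x + (w2*b1 + b2)
w_eff = w2 * w1          # effective slope
b_eff = w2 * b1 + b2     # effective intercept

for x in [-1.0, 0.0, 2.5]:
    assert two_layers(x) == w_eff * x + b_eff
```

No matter how many purely linear layers you stack, the result is always equivalent to one linear layer — which is why the nonlinearity between layers is essential.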

Feedforward Network Architecture

A standard feedforward network has three types of layers. The input layer receives raw data. Hidden layers learn intermediate representations. The output layer produces the final prediction. Each layer is a collection of neurons, all receiving outputs from the previous layer.

# A simple 2-layer network (1 hidden layer), using plain Python lists
def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward_pass(x, W1, b1, W2, b2):
    # Layer 1: 3 inputs -> 4 hidden neurons
    z1 = [zi + bi for zi, bi in zip(matvec(W1, x), b1)]   # shape: (4,)
    a1 = [max(0.0, zi) for zi in z1]                      # ReLU activation
    # Layer 2: 4 hidden -> 1 output
    z2 = [zi + bi for zi, bi in zip(matvec(W2, a1), b2)]  # shape: (1,)
    return [sigmoid(zi) for zi in z2]                     # sigmoid as defined earlier

Backpropagation

Training a neural network means adjusting all weights and biases to minimize the loss. We still use gradient descent, but computing the gradients for each parameter requires the chain rule from calculus, applied layer by layer from the output backward through the network. This process is called backpropagation. You almost never implement it by hand — frameworks like PyTorch and TensorFlow handle it automatically through automatic differentiation — but understanding what it is doing conceptually is essential for debugging and designing networks.
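For intuition, here is the chain rule applied once by hand: a single sigmoid neuron with cross-entropy loss, checked against a finite-difference approximation. The input, label, and starting weight are made-up toy values:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, x, y):
    """Cross-entropy loss of a single sigmoid neuron with weight w."""
    p = sigmoid(w * x)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

x, y, w = 2.0, 1.0, 0.3   # toy values

# Chain rule: dL/dw = dL/dp * dp/dz * dz/dw.
# For sigmoid + cross-entropy the terms simplify to (p - y) * x.
p = sigmoid(w * x)
analytic = (p - y) * x

# Numerical check: central finite difference
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

assert abs(analytic - numeric) < 1e-6
```

Backpropagation is this same bookkeeping, automated and applied layer by layer; the finite-difference check is also a standard trick for debugging hand-written gradients.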

Overfitting and the Bias-Variance Tradeoff

The most important concept in practical ML is the distinction between training performance and generalization performance. A model that memorizes training examples but fails on new data is useless. This is called overfitting.

The primary tools for combating overfitting are regularization (adding a penalty term to the loss that discourages large weights), dropout (randomly disabling neurons during training), early stopping (stopping training before the validation loss starts increasing), and more data (always the best solution when available).
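As one concrete illustration, L2 regularization (a common form of the weight penalty mentioned above) simply adds a term to the loss. A sketch for the linear model, with a made-up penalty strength lam:

```python
# L2-regularized MSE for the linear model y = w*x + b.
# lam (lambda) controls how strongly large weights are penalized.
def regularized_mse(w, b, xs, ys, lam):
    n = len(xs)
    mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    penalty = lam * w ** 2     # the bias b is conventionally not penalized
    return mse + penalty

xs, ys = [1.0, 2.0], [2.0, 4.0]
print(regularized_mse(2.0, 0.0, xs, ys, lam=0.1))  # 0.0 fit error + 0.4 penalty
```

The optimizer now has to trade fit quality against weight size, which discourages the extreme parameter values that often accompany memorization.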

Practical rule: Always split your data into at least two sets: training data (used to fit the model) and validation data (used to evaluate generalization). Never touch test data until you have finalized your model design. If you evaluate on test data repeatedly, you will unconsciously overfit to it through your design choices.
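A minimal sketch of an 80/20 split in plain Python (the dataset contents and split ratio are placeholders):

```python
import random

# Toy dataset: 100 (input, label) pairs; the contents are placeholders.
data = [(i, i % 2) for i in range(100)]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(data)     # shuffle BEFORE splitting to avoid ordering bias

split = int(0.8 * len(data))
train_set = data[:split]     # 80%: used to fit the model
val_set   = data[split:]     # 20%: used only to measure generalization

assert len(train_set) == 80 and len(val_set) == 20
```

The shuffle matters: if your data is sorted (by date, by class, by anything), slicing without shuffling gives you a validation set that does not resemble the training set.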

Recommended Books

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow — Aurélien Géron

The best practical ML book available. Géron walks through every major algorithm with clear intuition, real datasets, and complete code examples. The first half covers classical ML (regression, classification, SVMs, trees); the second half covers deep learning. If you read one ML book, make it this one. Updated editions keep pace with the field.

View on Amazon (affiliate link) →

Deep Learning — Goodfellow, Bengio & Courville

The canonical deep learning textbook, freely available online and also available in print. More mathematical than Géron, covering the theoretical foundations of neural networks, regularization, optimization, and specialized architectures. Read Géron first for practical intuition, then use Goodfellow et al. as your theoretical reference. The sections on gradient descent, regularization, and optimization are particularly strong.

View on Amazon (affiliate link) →

Want to Learn ML With Expert Guidance?

I offer one-on-one tutoring in machine learning and AI, from Python basics through neural network theory. My background as a USAF AI engineer means I focus on practical skills and real-world applications, not academic fluff.

Book Free 30-Min Intro →

Get Weekly AI and ML Insights

Join 500+ readers for Dr. Preston's weekly breakdown of ML concepts, career tips, and learning resources.

Subscribe Free →