Remove: Delete the entire row (if few values are missing) or column (if many are missing).
Impute: Fill with an “informed guess”, like the mean, median (more robust to outliers), or mode (best for categorical data).
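A minimal pandas sketch of both options (the toy DataFrame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "city": ["NY", "LA", None, "NY"]})

# Remove: drop rows (or columns) that contain missing values
df_drop_rows = df.dropna(axis=0)   # when only a few rows are affected
df_drop_cols = df.dropna(axis=1)   # when a column is mostly missing

# Impute: fill with an informed guess
df["age"] = df["age"].fillna(df["age"].median())      # median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode for categorical data
```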
Outlier Handling
Detection: Find using statistical rules (e.g., outside 3 standard
deviations) or visualization (e.g., box plots).
Removal: Delete the outlier rows. (Use with caution, as outliers can be valid data).
Clipping/Capping: Set a cap/floor. (e.g., set all values > 99th
percentile to the 99th percentile value).
Log Transform: Use for skewed data to pull in high values.
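A brief numpy sketch of detection, capping, and the log transform (the values are made up):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 2.5, 150.0])   # 150 looks like an outlier

# Detection: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
outliers = np.abs(z) > 3

# Clipping/Capping: cap everything above the 99th percentile
cap = np.percentile(values, 99)
clipped = np.clip(values, None, cap)

# Log transform: pulls in high values of skewed, positive data
logged = np.log1p(values)   # log(1 + x) avoids log(0)
```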
Encoding
Categorical Data (text labels)
One-Hot Encoding
Converts k categories into k binary (0/1) columns.
Use for nominal data (e.g., “red”, “green”, “blue”).
Label Encoding
Converts k categories into a single integer column (1, 2, 3…).
Use for ordinal data (e.g., “low”, “medium”, “high”).
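A short sketch of both encodings in pandas (the category values are examples):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
sizes = pd.DataFrame({"size": ["low", "medium", "high", "low"]})

# One-hot encoding for nominal data: k categories -> k binary columns
one_hot = pd.get_dummies(colors, columns=["color"])

# Label encoding for ordinal data: k categories -> one integer column
order = {"low": 0, "medium": 1, "high": 2}
sizes["size_encoded"] = sizes["size"].map(order)
```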
Numerical Data
Standardization (Z-score)
Rescales to mean = 0, std = 1 using (x - \text{mean}) / \text{std}
Normalization (min-max scaling)
Rescales to a fixed range [0, 1] using (x - \text{min}) / (\text{max} - \text{min})
Sensitive to outliers.
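A sketch of both rescalings in numpy (the values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

# Standardization (Z-score): mean 0, std 1
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): fixed range [0, 1]; sensitive to outliers like 100
normalized = (x - x.min()) / (x.max() - x.min())
```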
Cyclical Data (hours of day, months of year)
Sine, Cosine encoding: each cyclical input is encoded into both a sine and a cosine component
Numerically shows that hour 23 and hour 0 are close together
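A sketch of the sine/cosine encoding for hours of the day:

```python
import numpy as np

hours = np.arange(24)

# Map each hour onto the unit circle so that hour 23 and hour 0 end up close together
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```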
Image Data
Standardize: same dimensions and format
Grey-scale: an image is a single h \times w matrix of pixel values from 0 to 1
Color: an image is a 3 \times h \times w tensor, one channel per color
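A small numpy sketch of the two representations (the 28 \times 28 size is illustrative):

```python
import numpy as np

# Grey-scale: a single h x w matrix of pixel intensities scaled to [0, 1]
grey = np.random.randint(0, 256, size=(28, 28)) / 255.0      # shape (28, 28)

# Color: a 3 x h x w tensor, one matrix per RGB channel
color = np.random.randint(0, 256, size=(3, 28, 28)) / 255.0  # shape (3, 28, 28)
```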
Text
Simplify: e.g., lowercase, remove some punctuation
Tokenization: converts text into numbered subcomponents
Character-level: smaller vocabulary required, but encodes less
information
Word-level: requires large vocabulary
Sub-word: requires a smaller vocabulary than word-level, but encodes more information than character-level
One-hot encoding: tokens can be one-hot encoded, but this is
inefficient
Token embedding: each token is represented using an array of real
numbers
(Rotary) Positional embedding: incorporates the position of each pattern in the input by rotating each token’s embedding vector by an angle that depends on its position
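A minimal character-level sketch of tokenization and embedding lookup in numpy; real pipelines use sub-word tokenizers and learned (not random) embedding tables:

```python
import numpy as np

text = "hello world"

# Character-level tokenization: small vocabulary, one token per character
vocab = sorted(set(text))
token_ids = [vocab.index(ch) for ch in text]

# Token embedding: each token id maps to an array of real numbers
embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)  # normally learned during training
embedded = embedding_table[token_ids]   # shape (len(text), embedding_dim)
```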
Split: randomly shuffle the data first
Training Set (~70%): Data the model learns from.
Validation Set (~15%): Unseen data used during training to tune
hyperparameters and check overfitting.
Test Set (~15%): Unseen data used only at the end to measure the final model’s performance.
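A sketch using scikit-learn’s train_test_split called twice to get a roughly 70/15/15 split (the random data is a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 10), np.random.randint(0, 2, size=1000)

# Shuffle, then split ~70% train / ~15% validation / ~15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)
```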
Architecture/Topology
Common Network Types
Feed-forward (MLP): General purpose for
classification/regression.
Convolutional (CNN): Used for image recognition, video, and spatial
data.
AlexNet, VGG, ResNet
Recurrent (RNN): Used for sequential data (time series, text).
has feedback connections (providing context values) from some
neurons to neurons in a previous layer of the network
trained using gradient descent but requires backpropagation through
time
Problem: Vanishing/exploding gradients.
LSTM (long short-term memory): uses output, input, forget, and cell “gates”; solves the vanishing gradient problem and can remember long-term dependencies.
GRU (gated recurrent unit): uses reset, update, and new “gates”; solves the vanishing gradient problem and can remember long-term dependencies.
Bi-Directional RNN: processes the sequence in both forward and backward directions
Self-Organizing Feature Maps (SOM): Unsupervised, used for
clustering
Auto-encoders (AE): Unsupervised, used for dimensionality reduction
and feature learning.
trained to reproduce the network’s inputs as its outputs
inputs pass through an encoder to a small latent vector, which then passes through a decoder to produce the final outputs
prone to overfitting
Variational AEs (VAE)
the VAE encoder produces the mean and variance of a Normal distribution for each element of the latent vector
discourages overfitting
can be used to generate outputs given latent vector inputs
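A minimal PyTorch sketch of a plain auto-encoder; the 784-dimensional input and 32-dimensional latent vector are illustrative sizes:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_inputs=784, n_latent=32):
        super().__init__()
        # encoder compresses the input to a small latent vector
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_latent))
        # decoder reconstructs the input from the latent vector
        self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(16, 784)
loss = nn.MSELoss()(model(x), x)   # trained to reproduce its own inputs
```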
Generative Adversarial (GANs): Used for generating new data (e.g.,
images).
Transformer: Used for natural language processing.
Self-Attention Mechanism, considers all patterns in a sequence
uses query, key, and value tensors, with attention calculated as A = f_\text{softmax}(QK^T)V
the QK^T tensor contains the similarities of each pattern in the key tensor to each pattern in the query tensor
Multi-headed attention: allows applying different attention to
different parts of the input
the output of the attention mechanism is the concatenation of the
various attention heads A_i
Attention Masking: used to hide future inputs when calculating
outputs for each input
e.g., BERT, GPT
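A minimal numpy sketch of single-head attention and causal masking, following the formula above (dimensions are illustrative; practical Transformers also scale QK^T by 1/\sqrt{d_k}):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 16
Q = np.random.randn(seq_len, d_model)   # queries
K = np.random.randn(seq_len, d_model)   # keys
V = np.random.randn(seq_len, d_model)   # values

scores = Q @ K.T                   # similarity of each query pattern to each key pattern
A = softmax(scores, axis=-1) @ V   # A = softmax(QK^T)V

# Attention masking: hide future positions by setting their scores to -inf before softmax
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
A_causal = softmax(np.where(mask, -np.inf, scores), axis=-1) @ V
```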
Common Layers
Input Layer: One neuron per input feature.
Hidden (Dense): Fully-connected layer; performs most of the
learning.
Usually the number of neurons should be \le 2 \times (# input neurons) and > (# output neurons)
Convolutional (Conv2D): Applies filters to learn local
patterns (e.g., edges in an image).
Pooling (Max/Avg): Downsamples feature maps to reduce computational
load.
Dropout: Randomly disables neurons during training; usually implemented as its own layer.
Batch Normalization: Normalizes activations within a batch to
stabilize and speed up training.
Output Layer: One neuron per output (e.g., 1 for regression, 10 for
10-class classification).
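An illustrative PyTorch stack of these layers for a 10-class image classifier (input assumed to be 1 \times 28 \times 28; all sizes are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional: learns local patterns
    nn.BatchNorm2d(16),                          # batch normalization
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: downsamples feature maps
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 64),                 # hidden (dense) layer
    nn.ReLU(),
    nn.Dropout(0.5),                             # dropout as its own layer
    nn.Linear(64, 10),                           # output: one neuron per class
)
```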
Activation Functions
ReLU (Rectified Linear Unit): \max(0, x). Default for hidden layers.
Softplus: differentiable version of ReLU, k\log(1 + e^x)
Leaky ReLU: Variant of ReLU with non-zero gradient for negative
inputs.
Tanh: Scales output to [-1, 1]. Usually used by RNNs.
Linear: y = x. Used for regression output (to output any value).
Sigmoid: Scales output to [0, 1]. Used for binary classification
output.
Softmax: Converts outputs into probabilities (all outputs sum to 1). Used for multi-class classification.
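A compact numpy sketch of these functions (the input values are arbitrary):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

relu = np.maximum(0, x)                    # max(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)  # small non-zero slope for negative inputs
softplus = np.log1p(np.exp(x))             # smooth version of ReLU
tanh = np.tanh(x)                          # range (-1, 1)
sigmoid = 1 / (1 + np.exp(-x))             # range (0, 1)
softmax = np.exp(x) / np.exp(x).sum()      # outputs sum to 1
```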
Training
Key Hyperparameters
Learning Rate: How big of a step the optimizer takes.
Batch Size: Number of training samples used in one optimizer
update.
Number of Epochs: Number of full passes through the training
dataset.
Network Structure: # of layers, # of neurons per layer.
Weight Initialization
Uniform / Normal: Random values.
Uniform between \pm 1/\sqrt{I}, where I is the number of inputs to the layer
Can cause vanishing/exploding gradients.
Xavier / Glorot: “Smart” initialization, good for Sigmoid and Tanh
activations.
He: “Smart” initialization, the default choice for ReLU
activations.
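A brief sketch of applying the “smart” initializations with PyTorch’s built-in initializers (the layer sizes are illustrative):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier/Glorot: suited to sigmoid and tanh activations
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming): the usual default for ReLU activations
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")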
Loss Functions (measures “how wrong” the model is)
Mean Squared Error (MSE) / RMSE: For regression tasks.
Binary Cross-Entropy (BCE): For binary classification (with
Sigmoid).
Categorical Cross-Entropy: For multi-class classification (with
Softmax).
Optimizers (update the weights to minimize the loss)
SGD + Momentum: Helps accelerate SGD and smooth out oscillations.
RMSprop: Adaptive learning rate optimizer, good for RNNs.
Adam: (Adaptive Moment Estimation) Combines Momentum and RMSprop.
Often the best choice.
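A sketch of the typical loss/optimizer pairings in PyTorch (the placeholder model and learning rates are illustrative):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model

# Loss functions
mse = nn.MSELoss()            # regression
bce = nn.BCELoss()            # binary classification, paired with a sigmoid output
ce = nn.CrossEntropyLoss()    # multi-class classification (applies softmax internally)

# Optimizers
adam = optim.Adam(model.parameters(), lr=1e-3)               # often the best choice
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)   # SGD + momentum
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)         # adaptive, good for RNNs
```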
Regularization (techniques to prevent overfitting)
L1 / L2 (Weight Decay): Adds a penalty to the loss function for
large weights. L1 can lead to sparse weights (features = 0).
Dropout: Randomly turns off neurons during training to prevent
co-adaptation.
Early Stopping: Stop training when the validation loss stops
improving.
Data Augmentation: Creating more training data by modifying existing
data (e.g., rotating, flipping, or cropping images).
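A short sketch of L2 weight decay, dropout, and image augmentation using PyTorch and torchvision (all sizes and values are illustrative):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Dropout as a layer inside the network
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))

# L2 regularization (weight decay) via the optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Data augmentation: random flips/rotations create modified copies of existing images
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])
# Early stopping is handled inside the training loop (see the sketch there)
```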
Training Loop
For each epoch:
Training Phase
Forward Pass: Get predictions from the model on the training
data.
Compute Loss: Compare predictions to actual values.
Backward Pass (Backpropagation): Calculate gradients (error) for all
weights.
Optimization Step: Update weights using the optimizer to reduce the
loss.
Validation Phase
Forward Pass: Get predictions from the model on the validation
data.
Compute Loss: Compare predictions to actual values.
Check if the model is overfitting (e.g., training loss down,
validation loss up).
Check if the early stopping condition is met.
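A minimal PyTorch version of this loop; the toy data, model, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the loop runs end to end (sizes are illustrative)
X, y = torch.randn(200, 10), torch.randn(200, 1)
train_loader = DataLoader(TensorDataset(X[:140], y[:140]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[140:], y[140:]), batch_size=32)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    # Training phase
    model.train()
    for X_batch, y_batch in train_loader:
        preds = model(X_batch)            # forward pass
        loss = loss_fn(preds, y_batch)    # compute loss
        optimizer.zero_grad()
        loss.backward()                   # backward pass (backpropagation)
        optimizer.step()                  # optimization step: update weights

    # Validation phase
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(Xb), yb).item() for Xb, yb in val_loader)
    # compare training vs. validation loss to spot overfitting and
    # decide whether the early-stopping condition is met
```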
Evaluation
Test Phase
Forward Pass: Get predictions from the final model on the test
data.
Compute Loss: Compare predictions to actual values.
Calculate metrics to report the model’s performance on unseen
data.
Common Metrics
For Classification:
Accuracy: % of correct predictions.
Precision: % of positive predictions that were correct.
Recall: % of actual positives that were correctly identified.
F1-Score: The harmonic mean of Precision and Recall.
Confusion Matrix: Table showing correct vs. incorrect predictions
for each class.
For Regression:
MSE / RMSE: Average squared error / its square root (RMSE is in the units of the output).
MAE (Mean Absolute Error): Average absolute error (easier to
interpret).
R-squared (R²): Proportion of variance in the output that the model
can predict.
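A quick sketch of these metrics with scikit-learn (the labels and predictions are made up):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, mean_squared_error, mean_absolute_error,
                             r2_score)

# Classification metrics
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Regression metrics
y_true_r = np.array([3.0, 5.0, 2.5])
y_pred_r = np.array([2.8, 5.4, 2.1])
print(mean_squared_error(y_true_r, y_pred_r))    # MSE; RMSE is its square root
print(mean_absolute_error(y_true_r, y_pred_r))   # MAE
print(r2_score(y_true_r, y_pred_r))              # R²
```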
Key Concepts
Gradient Descent: The core optimization algorithm. Finds the minimum
of the loss function by iteratively moving in the opposite direction of
the gradient.
Backpropagation: The algorithm used to efficiently calculate the
gradients for all weights in the network, starting from the output layer
and moving backward.
Overfitting: When the model learns the training data too well
(including its noise) and fails to generalize to new data.
Underfitting: When the model is too simple and fails to capture the
underlying patterns in the training data.