Cheat sheet for machine learning

Pro Tip: Print this out and put it on your desk or place of work for quick reference

Machine learning Algorithms

Supervised learning: Logistic Regression, Linear Regression, Neural Nets, Support Vector Machines

Unsupervised Learning: K-Means Clustering, Principal Component Analysis, Anomaly Detection

Linear vs Logistic Regression

Linear Regression: In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values.

Logistic Regression: The outcome (dependent variable) has only a limited number of possible values.
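A minimal sketch of the difference, assuming a single feature x with made-up weight w and bias b (illustrative values, not from any trained model). The linear model returns any real number, while the logistic model squashes the same score through a sigmoid and thresholds it into one of two classes.

```python
import math


def linear_predict(x, w, b):
    # Linear regression: the output can be any real number.
    return w * x + b


def sigmoid(z):
    # Maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def logistic_predict(x, w, b, threshold=0.5):
    # Logistic regression: probability first, then a discrete class (0 or 1).
    probability = sigmoid(w * x + b)
    return 1 if probability >= threshold else 0


print(linear_predict(2.0, 3.0, 1.0))    # continuous outcome
print(logistic_predict(2.0, 3.0, 1.0))  # discrete outcome
```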

When to use SVM vs Logistic Regression

n = number of features

m = number of training examples

if n is large (relative to m) = use logistic regression or SVM without a kernel (linear kernel)

e.g. n = 10,000 , m = 10–1000

if n is small, m is intermediate = use SVM with Gaussian kernel

e.g. n = 1–1000, m = 10–10,000

if n is small, m is large = add more features, then use logistic regression or SVM without a kernel (linear kernel)

e.g. n = 1–1000, m = 50,000 +

Neural nets can work too but will be slower to train
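The rule of thumb above can be sketched as a small helper. The function name suggest_model and the cut-offs (1,000 features, 10,000 examples) are assumptions read off the example ranges, not hard limits.

```python
def suggest_model(n_features, m_examples):
    """Rough model suggestion from the n-versus-m rule of thumb.

    The thresholds are assumptions taken from the example ranges,
    so treat the result as a starting point only.
    """
    if n_features >= m_examples:
        # Many features relative to examples: a linear boundary is usually enough.
        return "logistic regression or linear-kernel SVM"
    if n_features <= 1000 and m_examples <= 10_000:
        # Few features, moderate amount of data: a Gaussian kernel is affordable.
        return "SVM with Gaussian kernel"
    # Plenty of data: kernel SVMs become slow to train.
    return "logistic regression or linear-kernel SVM"


print(suggest_model(10_000, 1_000))
print(suggest_model(100, 5_000))
print(suggest_model(100, 50_000))
```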

Neural Nets

Number of input units = number of features

Number of output units = number of classes (for classification)

Principal Component Analysis

Summarises the features as a smaller set of new variables (principal components). One of the main goals of multivariate analyses like PCA is to reduce the number of dimensions (variables = coordinates) while keeping most of the variance in the data.
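A rough sketch of the idea in plain Python, assuming a tiny made-up dataset. It finds only the first principal component using power iteration on the covariance matrix; real projects would normally use a library routine instead. The function names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (feature means, unit direction of greatest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return means, v


def project(rows, means, direction):
    # Reduce each row to a single coordinate along the component.
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Made-up 2-feature data lying close to the line y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
means, direction = first_principal_component(points)
reduced = project(points, means, direction)  # 2 features -> 1 feature
print(direction)
print(reduced)
```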

Cost/Loss function

Describes how well the model fits the data. It’s included in most algorithms.

Regularisation (e.g. L1, L2)

Used to reduce overfitting. Adds a penalty on the size of the coefficients to the objective function. This pushes the coefficients of many variables towards zero, which makes the model simpler.
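One common form, sketched under assumptions: an L2 penalty (sum of squared weights, scaled by a strength lam) added to a mean squared error cost. The name l2_regularised_cost and the parameter lam are hypothetical.

```python
def l2_regularised_cost(errors, weights, lam):
    """Mean squared error plus an L2 penalty on the weights.

    errors  : prediction errors (prediction minus label), one per example
    weights : model coefficients
    lam     : regularisation strength (0 switches the penalty off)
    """
    mse = sum(e * e for e in errors) / len(errors)
    penalty = lam * sum(w * w for w in weights)
    return mse + penalty


# Same errors, but larger weights cost more once lam > 0.
print(l2_regularised_cost([1.0, -1.0], [2.0, 0.0], lam=0.0))
print(l2_regularised_cost([1.0, -1.0], [2.0, 0.0], lam=0.5))
```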

Overfit/Underfit

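Overfit: the model fits the training points too closely, noise included, and generalises poorly to new data. Underfit: the model is too simple and fits only a few points well, missing the underlying trend.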

Backpropagation

Backpropagation is just a special name given to finding the gradient of the cost function in a neural network, working backwards from the output layer using the chain rule.

Cross-validation

Splitting data into training and test sets several times and averaging the results, e.g. k-fold cross-validation
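A minimal sketch of how k-fold splitting could work on example indices. The helper name k_fold_indices is hypothetical, and the data is not shuffled here, which a real run would usually do first.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

Every example lands in the test set exactly once, so each of the k models is scored on data it did not train on.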

Errors (1&2 w/ example)

Type 1 Error (false positive): A man is told he is pregnant

Type 2 Error (false negative): A pregnant woman is told she is not pregnant (more dangerous in this example)
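A small sketch for counting both error types, assuming labels are encoded as 1 for the positive class (pregnant) and 0 for the negative class. The function name and the sample labels are made up.

```python
def count_errors(y_true, y_pred):
    """Return (type_1, type_2) error counts for binary labels."""
    # Type 1: predicted positive, actually negative (false positive).
    type_1 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Type 2: predicted negative, actually positive (false negative).
    type_2 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return type_1, type_2


actual = [0, 0, 1, 1, 1]
predicted = [1, 0, 0, 0, 1]
print(count_errors(actual, predicted))  # (1, 2)
```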

Dimensionality reduction

Reduces number of features (loss of information)

Labels: What we are predicting

Features: Input variable describing data

Discrete Variable: Variables that can be counted

Continuous data: Variables that can be measured

Categorical variable: Categories e.g. male and female

Regression: Predicts continuous values e.g. house price from square feet

Classification: Predicts a discrete class e.g. email is spam or not spam (two classes), or one of several classes

A simple linear equation is:

y = wx + b

where:

y = value on the y-axis, x = value on the x-axis, w = slope, b = y-intercept

However, in machine learning:

y^ = prediction, b = bias, w = weight of feature, x = feature

Goal: Minimize the difference between y and y^

Perfect Model: Loss will be equal to 0. Loss is often measured by MSE (Mean Squared Error)

Cost: Loss for the entire dataset

MSE: Sum of squared errors (y - y^)^2 over all examples / Number of examples
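As a quick sketch with made-up numbers, MSE can be computed like this (the function name is illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_squared_error(labels, predictions))  # (1 + 0 + 4) / 3
print(mean_squared_error(labels, labels))       # perfect model: 0.0
```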

Goal in machine learning is to reduce loss

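Gradient: Vector of partial derivatives of the loss function with respect to the model parameters (weights and bias).

Gradient Descent: Repeatedly moving the parameters a small step in the direction of the negative gradient until the loss stops decreasing (the model has converged).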

A good visual explanation can be found here

Learning Rate: The step size of each Gradient Descent update
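A minimal sketch of gradient descent fitting y = wx + b with an MSE loss. The data, learning rate and iteration count are made-up values chosen so this toy problem converges; they are not recommendations.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = w*x + b by full-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(gradient_descent(xs, ys))
```

On this data a much larger learning rate (for example 0.2) makes the updates overshoot and diverge, while a much smaller one needs many more iterations.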

Hyperparameters: Things you tune to increase accuracy of model e.g.

Learning Rate, No. of Iterations, No. of Hidden Units, Activation function, regularization, Type of Gradient Descent etc.

Stochastic Gradient Descent: 1 example used per update

Mini-batch Gradient Descent: Loss and gradient averaged over a batch of examples (10–1000) per update (less noisy than Stochastic Gradient Descent and usually more efficient)
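A rough sketch of the batching step for the model y^ = wx + b with an MSE loss. The helper names and the toy data are assumptions. With batch_size=1 the same code gives the stochastic variant.

```python
def minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def averaged_gradient(batch, w, b):
    """Gradient of the MSE loss for y^ = w*x + b, averaged over one batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


# Toy (x, y) pairs generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
for batch in minibatches(data, batch_size=2):
    print(batch, averaged_gradient(batch, w=0.0, b=0.0))
```

In practice the data is usually shuffled before each pass so the batches differ from epoch to epoch.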

Optimization methods: SGD, MDD, GDM, RMSProp, Adam, LRD

Goldilocks: The optimal learning rate, neither too small (slow convergence) nor too large (overshoots the minimum)

Data Split: training data, validation data, test data

Feature Engineering: Extracting features from data (Note: Features should be meaningful).

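Normalization: Rescaling features to a comparable range, e.g. z-score or min-max scaling. Taking the log of a feature helps when its values are heavily skewed.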
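Two common options, sketched with made-up values: a log transform for skewed, non-negative features, and z-score scaling, which is an addition here rather than something the note above prescribes. Function names are illustrative.

```python
import math


def log_transform(values):
    # log1p computes log(1 + v), so zero values are handled safely.
    # Assumes all values are non-negative.
    return [math.log1p(v) for v in values]


def z_score(values):
    # Rescale to mean 0 and standard deviation 1.
    # Assumes the values are not all identical (std would be 0).
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


incomes = [20_000.0, 35_000.0, 50_000.0, 1_000_000.0]  # heavily skewed
print(log_transform(incomes))
print(z_score([2.0, 4.0, 6.0]))
```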

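Precision Recall: Precision = true positives / all predicted positives. Recall = true positives / all actual positives. The higher both values, the better the model. Must look at both precision and recall, since one can be raised at the expense of the other.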
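A short sketch of both metrics for binary labels, assuming 1 marks the positive class. The function name and the sample labels are made up.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
print(precision_recall(actual, predicted))  # precision 2/3, recall 1/2
```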

Prediction Bias: Average of predictions minus average of labels. If it is not close to 0, we have a problem

Activation functions: Non-linear functions e.g. ReLU, leaky ReLU, tanh, all used for different types of problems
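The three functions named above, sketched for a single input value. The leaky slope of 0.01 is a commonly used default, assumed here.

```python
import math


def relu(z):
    # Passes positive values through, clips negatives to 0.
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    # Like ReLU, but negatives keep a small slope instead of becoming 0.
    return z if z > 0 else slope * z


def tanh(z):
    # Squashes any input into the interval (-1, 1).
    return math.tanh(z)


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), tanh(z))
```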

Softmax: Turns a vector of scores into probabilities that sum to 1, used for multiclass classification
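A minimal softmax sketch with made-up class scores. Subtracting the largest score before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
# The predicted class is the one with the highest probability.
print(probabilities.index(max(probabilities)))  # 2
```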

Regularization: Used to avoid overfitting e.g. L1 & L2

Static model: Trained offline

Dynamic model: Trained online

Autoencoder -> Encoder & Decoder

Encoder: Converts input data into new representation

Decoder: Converts new representation back into original data

Model should have low variance and low bias

Google’s Machine Learning Glossary is a helpful reference


Ahsan Anis
