Machine learning fundamentals

Notes about fundamental ML concepts for Google Cloud ML Engineering certification.

Data exploration

Perform mainly univariate and bivariate analysis during initial exploration.

In other words, explore only one or two variables at a time:

Calculate min, max and average by different categories
Correlation plots

Supervised learning

Examples of supervised learning algorithms.

Supervised model	Use case
Boosted Tree - XGBoost	Fast and reliable boosted decision tree algorithm.
Logistic regression	Binary classification.
Linear regression	Predict a numeric value by linearly dependent features.
Support Vector Machines	Maximize the margin in decision boundary. Supports non-linearity.

Unsupervised learning

Examples of unsupervised learning algorithms.

Unsupervised model	Use case
K-means clustering	Create K random centroids and minimize distances.
Hierarchical Clustering	Clusters in hierarchical tree.

Time-series forecasting

Examples of time-series algorithms.

Time-series model	Use case
ARIMA	Auto-regressive model to extract trend and seasonality.

Simulation

Simulation methods	Use case
Monte Carlo	Generic approximation technique for complex environments.
Markov chain	A specific model where the next step depends only on knowledge of the last step.
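
As a small illustration of the Monte Carlo idea, the sketch below approximates pi by random sampling. The function name, sample size and seed are assumptions chosen for this example.

```python
import random

def estimate_pi(n_samples, rng):
    """Estimate pi from the share of random points inside the unit quarter circle."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

# A fixed seed keeps the run reproducible.
estimate = estimate_pi(20_000, random.Random(0))
print(estimate)
```

More samples give a closer approximation, at the cost of more computation.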

Recommendation

Recommendation method	Use case
Matrix factorization	Recommend movies for users.
Cosine similarity	Compare closeness of same type of entities.
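
Cosine similarity from the table above can be sketched in a few lines. The rating vectors are hypothetical values made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rating vectors for three movies.
movie_a = [5.0, 3.0, 0.0]
movie_b = [10.0, 6.0, 0.0]  # same direction as movie_a
movie_c = [0.0, 0.0, 4.0]   # orthogonal to movie_a

print(cosine_similarity(movie_a, movie_b))
print(cosine_similarity(movie_a, movie_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.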

Reinforcement learning

An agent learns by interacting with its environment and tries to achieve maximum reward.

Examples of reinforcement learning applications:

Optimization
Offline simulation
Real-time training
Decision making
Trial and error

In Q-learning, states and actions are represented in a matrix (the Q-table). This becomes difficult with a large number of states, because all state-action pairs should be known. Deep Q Network (DQN) is a better approach that solves some of these challenges by approximating the table with a neural network. It also works in situations that the agent has not seen before.
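
A minimal sketch of one tabular Q-learning update is shown below. The learning rate, discount factor, state names and action names are all assumptions for illustration.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

# Q-table: maps (state, action) to an estimated value, 0.0 by default.
q_table = defaultdict(float)

def q_update(state, action, reward, next_state, actions):
    """Apply one tabular Q-learning update to the Q-table."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])

actions = ["left", "right"]
q_update("s0", "right", 1.0, "s1", actions)
print(q_table[("s0", "right")])
```

The table needs one entry per state-action pair, which is why the approach gets difficult when the number of states grows.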

Deep deterministic policy gradient (DDPG) can be seen as Deep Q-learning for continuous action spaces. DDPG is off-policy.

Proximal Policy Optimization (PPO) aims to take the largest possible improvement step without causing a performance collapse.

Transfer learning

Use knowledge from one task to solve a similar task. Common in NLP solutions. Transfer learning is a similar concept to embeddings.

In practice, layers from an existing model can be “cut off” and reused for the other task. This reduces training time significantly.

The closer to the input layer the cut off point is, the more generic the transferred part will be.

Negative transfer learning: When knowledge is transferred from a less related source, the target performance might be degraded.

Ensemble models

Ensembling means combining results from multiple models together. Some examples of ensemble models are Random forest, AdaBoost, gradient boost and XGBoost.

Bagging

Bootstrap Aggregating. Typical for decision trees. Special case of model averaging. Example: Random Forest.

Boosting

Boosting. Many weak learners make a strong one. Example: XGBoost. Iteratively use previous model as input for the next cycle.

Blending

Fit a final model on the predictions that the base models made on a hold-out dataset. Also known as stacked generalization.

Data splitting

Split	Purpose	Google AutoML default ratio
Train	Train the model	80 %
Validate	Use as comparison against other models and configs	10 %
Test / Hold out	Final check if the model is acceptable	10 %

The names of the validation and test datasets are confusing, because it is often said that the final model is validated with the test set.

Data leakage

Data leakage causes problems similar to overfitting. It happens especially with time series data when the training set contains data that would not be available at prediction time. Time series data must be split in time order, not randomly. If data leakage happens, evaluation metrics that look great during training will drop significantly in production.
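
A time-ordered split can be sketched as below. The row layout (a date plus a value) and the 80 % train fraction are assumptions for this example.

```python
from datetime import date, timedelta

# Hypothetical daily observations: (date, value).
rows = [(date(2024, 1, 1) + timedelta(days=i), float(i)) for i in range(10)]

def split_by_time(rows, train_fraction):
    """Put the oldest rows in the training set and the newest in the test set."""
    ordered = sorted(rows, key=lambda row: row[0])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]

train, test = split_by_time(rows, 0.8)
print(train[-1][0], test[0][0])
```

Every training row is older than every test row, so the model never sees data from the period it is evaluated on.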

Cross validation and bootstrapping

Multiple pairs of train and validation datasets can be created. For example, 10-fold cross validation splits the data into 10 parts. Each of these then works as the validation set, one at a time. Cross validation is most helpful with limited data. Otherwise the hold-out approach is fine.

Bootstrapping differs from cross validation in that samples are drawn randomly with replacement.
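
The difference between the two approaches can be sketched with row indices only. The function names and the dataset size of 20 rows are assumptions for this example.

```python
import random

def k_fold_indices(n_rows, k):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold takes any leftover rows.
        end = n_rows if fold == k - 1 else start + fold_size
        yield indices[:start] + indices[end:], indices[start:end]

def bootstrap_indices(n_rows, rng):
    """Draw n_rows indices with replacement; unused rows are out-of-bag."""
    sample = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(sample))
    return sample, out_of_bag

folds = list(k_fold_indices(n_rows=20, k=10))
sample, out_of_bag = bootstrap_indices(20, random.Random(0))
print(len(folds), len(sample), len(out_of_bag))
```

In cross validation every row is used for validation exactly once. In a bootstrap sample some rows appear several times and the out-of-bag rows can be used for evaluation.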

Nested cross validation

Traditional cross validation might cause overfitting, because a limited dataset is used to test many different model configurations.

Run another cross validation within each cross validation fold. This should reduce overfitting.

Here is a procedure for nested cross validation:

Outer loop for 10 cross validation folds
    Inner loop with the training set of the current fold
        Try set of hyperparameters
    Get the best model from the inner loop
    Report parameters and metrics for the best inner loop model
Choose the best hyperparameters for the model

Once the best set of hyperparameters is found, the final model can be trained with the full data.

In traditional cross validation, 3 folds and 50 parameter combinations mean 150 different models. Adding 10 folds in the outer loop would mean 1500 models in nested cross validation.

Nested cross validation is also known as double cross validation.
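
The model counts above can be checked by walking through the loops. This is only a counting sketch under the fold and parameter numbers from the text, and no model is actually trained.

```python
OUTER_FOLDS = 10
INNER_FOLDS = 3
PARAM_COMBINATIONS = 50

def count_model_fits(outer_folds, inner_folds, param_combinations):
    """Count training runs by walking the loops instead of multiplying."""
    fits = 0
    for _outer in range(outer_folds):
        for _params in range(param_combinations):
            for _inner in range(inner_folds):
                fits += 1  # one model would be trained and scored here
    return fits

plain_cv_fits = count_model_fits(1, INNER_FOLDS, PARAM_COMBINATIONS)
nested_cv_fits = count_model_fits(OUTER_FOLDS, INNER_FOLDS, PARAM_COMBINATIONS)
print(plain_cv_fits, nested_cv_fits)  # 150 1500
```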

Feature engineering

Features can be of the following types:

Feature type	Example
Numerical	134, 257
Categorical	cat, dog
Bucketized	Numeric to categorical: 10-19, 20-29
Crossed	Vector multiplication of two features
Embedding	Condense information to vector
Hashed	Example: Convert text to word count

Feature crosses

A feature cross is a synthetic feature created by cross joining the values from two or more columns. The name feature cross comes from cross product. In practice, the feature created by a feature cross is a vector of combinations of all crossed values.

As an example of feature cross, you can convert separate day_of_week and hour_of_day columns to one sparse vector of length 168. Due to high dimensionality, feature crosses work best with large datasets.

There is a high chance that a well thought-out feature cross works better than the same features individually.
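
The day_of_week and hour_of_day example can be sketched as a one-hot vector of length 7 * 24 = 168. The function name and the zero-based encoding of days and hours are assumptions for this example.

```python
def cross_day_hour(day_of_week, hour_of_day):
    """One-hot encode the (day, hour) pair as a vector of length 168."""
    if not (0 <= day_of_week < 7 and 0 <= hour_of_day < 24):
        raise ValueError("day_of_week must be 0-6 and hour_of_day 0-23")
    vector = [0] * (7 * 24)
    # Each (day, hour) combination gets its own position in the vector.
    vector[day_of_week * 24 + hour_of_day] = 1
    return vector

tuesday_9am = cross_day_hour(day_of_week=1, hour_of_day=9)
print(len(tuesday_9am), tuesday_9am.index(1))
```

With the cross, a model can learn a separate weight for "Tuesday at 9" instead of one weight for Tuesday and another for 9 o'clock.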

Kernel trick

Despite the funny name, it is a mathematical method. The kernel trick allows linear learning algorithms to learn non-linear functions.

SVM is an example of an algorithm that can benefit from kernel tricks.

Multicollinearity

Some models, such as linear regression, expect features to be independent from each other. Multicollinearity happens if some of the features are correlated.

Multicollinearity prevention method	How it works
Principal Component Analysis	Create reduced synthetic dimension space.
Partial Least Squares	A bit similar to PCA. Also performs the regression.
Multivariate Multiple Regression	Explain how variables behave together.

Data imbalance

Here are reference values for severity of data imbalance.

Data imbalance severity	Minor class % of all rows
Mild	20-40%
Moderate	1-20%
Extreme	<1%

Fraud and anomaly detection datasets are typically heavily imbalanced.

Upweight and downsample

Consider labels with two possible categories in proportion 1:99. Google’s recommendation is to downsample and upweight in training. Distribution should not be changed in test or validation sets!

Downsampling means removing majority labels until the ratio between the two categories is 1:9. Upweighting means giving the downsampled class a corresponding weight in model training (a factor of 11, because 99 / 9 = 11).

In a similar situation I have actually upweighted the minority class, which seems to be the wrong approach.
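
The 1:99 example above can be sketched as follows. The row counts, the seed and the tuple layout are assumptions made up for this illustration.

```python
import random

rng = random.Random(42)

# Hypothetical dataset: 100 positive rows and 9,900 negative rows (1:99).
rows = [("pos", i) for i in range(100)] + [("neg", i) for i in range(9900)]

DOWNSAMPLING_FACTOR = 11  # keep 1 in 11 majority rows: 9,900 -> 900 (1:9)

positives = [r for r in rows if r[0] == "pos"]
negatives = [r for r in rows if r[0] == "neg"]
kept_negatives = rng.sample(negatives, len(negatives) // DOWNSAMPLING_FACTOR)

# Each kept majority row gets weight 11, minority rows keep weight 1.
training = [(row, 1.0) for row in positives] + [
    (row, float(DOWNSAMPLING_FACTOR)) for row in kept_negatives
]
weighted_negatives = sum(w for (label, _), w in training if label == "neg")
print(len(kept_negatives), weighted_negatives)
```

The weighted sum of the kept majority rows equals the original majority count, so the model still sees the original class proportions through the weights.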

Normalization

Four common methods for normalization.

Normalization methods	How it works	When to use
Scaling to a range	Eg scale between 0 and 1	Few outliers, uniform distribution
Clipping	Eg cap max value to 100	Data has extreme outliers
Log scaling	Apply logarithm to values	Power law distributions
Z-score	Computes standard deviations away from mean	Few mild outliers
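
The four methods in the table can be sketched with the standard library. The sample values and the clipping limits are assumptions for this example.

```python
import math
import statistics

def scale_to_range(values):
    """Min-max scale the values to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lower, upper):
    """Cap the values to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def log_scale(values):
    """Apply the natural logarithm (values must be positive)."""
    return [math.log(v) for v in values]

def z_score(values):
    """Express each value as standard deviations away from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [1.0, 10.0, 100.0, 1000.0]
print(scale_to_range(data))
print(clip(data, 0.0, 100.0))
print(log_scale(data))
print(z_score(data))
```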

Feature importance

Shapley values

The sampled Shapley method produces Shapley values indicating the feature importance.

The sum of Shapley values for all features equals the prediction minus a baseline value. The baseline can be something like a long-time average.

Shapley values are calculated for individual observations. By aggregating these, global interpretability can be achieved.
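
The sum property can be verified with a toy model. Note that this sketch computes exact Shapley values by enumerating all feature orderings, which is only feasible for a handful of features. The sampled method approximates the same quantity. The model, instance and baseline are assumptions for illustration.

```python
from itertools import permutations

def model(x):
    """Toy model with an interaction between features 0 and 2."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]

def shapley_values(predict, instance, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            # Switch one feature from its baseline value to the real value.
            current[feature] = instance[feature]
            value = predict(current)
            totals[feature] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]

instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
values = shapley_values(model, instance, baseline)
print(values, sum(values), model(instance) - model(baseline))
```

The interaction term is shared between the two features that take part in it, and the values add up to the prediction minus the baseline prediction.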

Integrated gradients

Neural networks might use integrated gradients for feature importances.

XRAI

XRAI builds on integrated gradients and identifies the important regions of an image instead of only individual pixels. The name comes from eXplanation with Ranked Area Integrals.

Encoding

Information in the features can be encoded to decrease the number of features the model uses.

Feature encoding method	Use case
PCA	Reduce the number of dimensions linearly while retaining most of the information.
Feature Cross	One sparse vector from multiple vectors.
Embeddings	Convert large sparse vectors of categorical data to smaller vectors.
Functional data analysis	Replace features by functions.

Feature selection

Feature selection methods that require model training:

L1 regularization
Feature importances
Recursive feature selection

Start the model training with a few chosen features. Expanding is easier than vice versa.

Embeddings

Embeddings are able to condense information in categorical data such as text. Sometimes the resulting feature space is referred to as the latent space.

Embeddings have three main purposes:

Find items close to each other
Feature input in supervised task
Relationships between categories

Good starting point for embedding dimensions is the fourth root of the number of original dimensions.

They can be used to lower the feature dimensionality and transfer the information to other problems.

The final features are sometimes called latent factors.
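
The fourth-root rule of thumb mentioned above is simple arithmetic. The function name and the rounding are assumptions for this sketch, and the result is only a starting point for experiments.

```python
def embedding_dimensions(n_categories):
    """Fourth root of the number of original dimensions, at least 1."""
    return max(1, round(n_categories ** 0.25))

for n in (625, 10_000, 1_000_000):
    print(n, embedding_dimensions(n))
```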

Missing and sensitive data

These are some ways to handle missing data

Case	Imputation method
Number missing	Mean
Categorical value missing	Create a missing category, set 0 weight in prediction. Some sources advise replacing by the most frequent category.
All values missing	Delete record
Advanced logic required	Predict by another ML model

Here are some ways to treat sensitive data but still keep it usable for analytics.

Encryption method	Description	Example
Format-preserving encryption	Keep format, eg number of characters	1234 5678 -> 1124 85##
Masking	Mask all characters	1234 5678 -> #### ####
Deterministic encryption	Always produce same ciphertext from the plain text	John Smith -> Nhoj Htims (reversed)
Replacement	Replace by generic text	John Smith -> [Replaced]
K-anonymity	Preserve relevant information	Remove name, use city instead of zip code. K = How many times a similar record must exist.

Loss functions

A loss function measures the model’s error during training. The best criterion for a model is generalization to new data.

Loss function	Use case
Cross entropy / Log loss	For binary classification. Based on the logarithm of the predicted probability.
Mean squared error (MSE)	For regression.
Root Mean Squared Error (RMSE)	Square root of MSE. Result is more intuitive. Not good for classification, does not penalize appropriately.
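
The loss functions in the table can be sketched as below. The clipping constant eps is an assumption that avoids taking the logarithm of zero.

```python
import math

def log_loss(labels, probabilities):
    """Average binary cross entropy for labels 0/1 and predicted probabilities."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(actual, predicted))

print(log_loss([1, 0], [0.9, 0.2]))
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(rmse([0.0, 0.0], [3.0, 3.0]))
```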

Model evaluation and performance metrics

Performance metrics are, for example, the count of correct predictions or money saved. They are easy to calculate and connect to business goals. Performance metrics are computed after training. The process is known as model evaluation. Here is a list of binary evaluation methods.

Method	Description
Precision	How many positive predictions were correct.
Recall	How many positive labels were captured (true positive rate).
F1 score	Harmonic mean of precision and recall.
Precision-Recall curve	Trade-off between Precision and Recall by changing the probability threshold. For imbalanced datasets.
AUC in ROC	Trade-off between TP and FP rate by changing the probability threshold. For balanced datasets.
Accuracy	Ratio of correct predictions in classification.
Log loss	Take the log of predicted probability.
Confusion matrix	Matrix of actual and predicted values.

It is easy to get confused with the terms true positive, false positive (type 1 error), true negative and false negative (type 2 error). Positive and negative refer to the predicted value. True and false indicate whether the prediction was correct or not.
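
The terms can be made concrete by counting them from labels and predictions. The two lists are made-up data for this sketch.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn, precision, recall, f1, accuracy)
```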

ROC AUC

Term	Definition
ROC	Receiver Operating Characteristic. Name of the method.
AUC	Area Under Curve. Integrated area under any plotted line.

The ROC AUC value can be interpreted as the probability that the model ranks a random positive sample higher than a random negative sample. This means that an AUC value of 0.5 indicates random behavior.

Let’s take an example. We predict the probability that our favorite ice hockey team wins each game of the season. Most often we would say that the team wins when the probability is predicted to be over 50%. But we can choose any other probability threshold such as 10% or 80%.

ROC curve plots TP and FP rates on each of these probability thresholds.

Rate	Explanation
True positive rate	Number of times a win was predicted successfully per total number of wins during the season.
False positive rate	How many times win was predicted for lost games per total number of games lost.

ROC is not the best option for imbalanced datasets. In our case that would mean the team wins or loses almost all the games. The reason is that TP and FP rates are computed independently and they do not consider their relative volume. This is easy to visualize by observing which numbers are utilized from the confusion matrix.
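
The ranking interpretation of ROC AUC can be computed directly by comparing every positive sample with every negative sample. The game outcomes and predicted probabilities below are made up for this sketch.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive is ranked higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# 1 = game won, 0 = game lost, with the predicted win probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic, so it suits small examples only. It gives the same number as the area under the plotted ROC curve.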

Precision-Recall curve

Precision and recall are a trade-off. You cannot maximize both.

The precision-recall curve is a more suitable evaluation metric for imbalanced datasets than the ROC curve.

If you want to balance true and false positives, optimizing the precision-recall curve at a specific point might be a good idea, for example precision when recall is 0.5. This gives more control than optimizing the area under the precision-recall curve.

Success metrics for business

While performance metric is technical, success metric is for business.

Google recommends ambitious success metrics. For example improve sales by 20%. This makes ML effort worthwhile.

ML models might predict different target than the business metric. The models serve as a proxy for the business goal.

ML model quality	Improvement potential	What to do
Bad	Yes	Continue development.
Good	Yes	Use in production and continue development.
Good	No	Run in production, no further potential.
Bad	No	Stop, it will never make it.

Business decisions require metrics on different probability thresholds (aka decision boundaries).

Regularization and overfitting

Regularization in machine learning refers to techniques that improve the model’s ability to generalize to new data.

For that, the model should minimize the complexity but maximize the prediction ability at the same time. A too complex model overfits to the training data and does not work for new observations.

Occam’s razor suggests that the answer with the fewest assumptions should be selected.

Here are the common regularization techniques. L1 and L2 calculate the model complexity and try to minimize it together with the loss. The complexity metric is computed from the weight vector of all features.

Regularization method	Complexity metric	How it works
L1	Sum of absolute weights	Shrinks some feature weights to zero.
L2	Euclidean distance	Shrinks weights to small values.
Dropout	Random	Randomly set some inputs to zero on a neural network layer.

L2 is better than L1 when you have already identified the most important features. L2 adds the sum of squared parameter weights to the loss function.
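
The two penalties can be sketched as plain functions. The weights, the data loss and the regularization strength lam are assumed values for illustration.

```python
def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam, penalty):
    """Loss that is minimized in training: data loss plus weighted complexity."""
    return data_loss + lam * penalty(weights)

weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights), l2_penalty(weights))
print(regularized_loss(1.0, weights, 0.1, l2_penalty))
```

Squaring punishes large weights much harder than small ones, which is why L2 pushes weights towards small values while L1 can push them all the way to zero.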

Hyperparameter tuning

Hyperparameters are defined outside of the model. As comparison, model parameters are part of the internal output of the trained model.

Hyperparameter tuning should improve model performance by automatically finding the best combination. It is not a method to avoid overfitting.

Hyperparameter tuning method	Description
Grid search	Create combination of hyperparameters and train model on each combination. Slow but thorough. Not suitable for large set of hyperparameters.
Random search	Like grid search but only select random set of hyperparameter combinations. Faster but suboptimal.
Bayesian optimization	Takes into account the past evaluations to optimize the hyperparameters. Typically requires fewer iterations.
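
Grid search and random search differ only in which combinations they try. In the sketch below the hyperparameter names and values are hypothetical, and no model is trained.

```python
import itertools
import random

# Hypothetical search space for a boosted tree model.
search_space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200],
}

def grid(space):
    """Every combination of the listed hyperparameter values."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in itertools.product(*space.values())]

def random_search(space, n_trials, rng):
    """A random subset of the grid, drawn without replacement."""
    return rng.sample(grid(space), n_trials)

all_combinations = grid(search_space)
random_subset = random_search(search_space, 5, random.Random(0))
print(len(all_combinations), len(random_subset))
```

The grid grows multiplicatively with every new hyperparameter, which is why grid search does not suit a large set of hyperparameters.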

Baseline model

A simple model can be used as a comparison point for the advanced models.

It can be for example a linear regression, long time average or 0.5 in binary classification.

Bias-Variance dilemma

High bias is said to equal underfitting and high variance to equal overfitting. Bias-variance is like a slider where the optimum is in the middle.

A biased model could always predict the same result, while a high-variance model would have a (too) wide spectrum of predictions.

Model total error can be decomposed into bias, variance and irreducible noise. Bias can be thought of as the base error of the model as it stays constant across different training sets. Variance is the error component that changes depending on the dataset.

Predictions

Post-processing predictions is not recommended. Multiple steps make interpretation and deployments difficult.

The average of predictions minus the average of labels is said to be the prediction bias. If prediction bias is high for only some slices of the data, the dataset might not represent all subsets adequately. Another possible reason is that the model is too regularized.
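
Prediction bias as defined above is a one-line calculation. The predictions and labels are made-up values for this sketch.

```python
def prediction_bias(predictions, labels):
    """Average of predictions minus average of labels."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)

predictions = [0.2, 0.4, 0.6, 0.8]
labels = [0, 0, 1, 0]
print(prediction_bias(predictions, labels))
```

Running the same calculation separately for each subset of the data shows whether the bias is concentrated in specific subsets.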

Parametric vs non-parametric models

Parametric model

In practice a parametric model is a function that receives parameters to produce the output. A linear regression or simple neural network is an example of this.

Non-parametric model

A non-parametric model is rather rule based and follows a discrete policy. For example, a decision tree creates a specific set of rules to give the answer.

Lazy learning

In lazy learning, nothing actually happens in ML model training phase. In this sense lazy learners do not need re-training. Lazy learning suits well for continuously updating datasets.

An example algorithm is K-nearest neighbors. It simply checks at prediction time which observations are closest to the predicted one. Naive Bayes is sometimes mentioned as another example of a lazy learner, although it does estimate probabilities from the data at training time.
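
A minimal K-nearest neighbors sketch shows why nothing happens in the training phase. The class name, the points and the labels are assumptions for this example.

```python
import math
from collections import Counter

class KNearestNeighbors:
    """Lazy learner: fit only stores the data, all work happens in predict."""

    def __init__(self, k):
        self.k = k
        self.points = []

    def fit(self, features, labels):
        # "Training" is just memorizing the observations.
        self.points = list(zip(features, labels))

    def predict(self, query):
        # Sort stored observations by Euclidean distance to the query.
        nearest = sorted(self.points, key=lambda item: math.dist(item[0], query))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]

model = KNearestNeighbors(k=3)
model.fit(
    [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)],
    ["a", "a", "a", "b", "b", "b"],
)
print(model.predict((1.5, 1.5)), model.predict((8.5, 8.5)))
```

New observations can be appended to the stored points at any time, which is why the approach suits continuously updating datasets.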
