Notes on fundamental ML concepts for the Google Cloud ML Engineering certification.

Data exploration

Perform mainly univariate and bivariate analysis during initial exploration.

In other words, explore only one or two variables at a time:

  • Calculate min, max and average by different categories
  • Correlation plots
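
A minimal pandas sketch of these two checks, using a hypothetical toy frame (column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one categorical and two numeric columns.
df = pd.DataFrame({"category": ["a", "a", "b", "b"],
                   "height": [1.0, 3.0, 2.0, 8.0],
                   "weight": [2.0, 5.0, 3.0, 9.0]})

# Univariate summaries per category.
print(df.groupby("category")[["height", "weight"]].agg(["min", "max", "mean"]))

# Bivariate check: correlation between the numeric columns.
print(df[["height", "weight"]].corr())
```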

Supervised learning

Examples of supervised learning algorithms.

Supervised model | Use case
Boosted Tree - XGBoost | Fast and reliable boosted decision tree algorithm.
Logistic regression | Binary classification.
Linear regression | Predict a numeric value from linearly dependent features.
Support Vector Machines | Maximize the margin of the decision boundary. Supports non-linearity.

Unsupervised learning

Examples of unsupervised learning algorithms.

Unsupervised model | Use case
K-means clustering | Create K random centroids and minimize distances.
Hierarchical clustering | Arrange clusters in a hierarchical tree.

Time-series forecasting

Examples of time-series algorithms.

Time-series model | Use case
ARIMA | Auto-regressive model to extract trend and seasonality.

Simulation

Simulation method | Use case
Monte Carlo | Generic approximation technique for complex environments.
Markov chain | A model where decisions are based only on knowledge of the previous step.

Recommendation

Recommendation method | Use case
Matrix factorization | Recommend movies to users.
Cosine similarity | Compare the closeness of entities of the same type.

Reinforcement learning

An agent learns by interacting with its environment and trying to achieve the maximum reward.

Examples of reinforcement learning applications:

  • Optimization
  • Offline simulation
  • Real-time training
  • Decision making
  • Trial and error

In Q-learning, states and actions are represented in a matrix (the Q-table). This can become difficult with a large number of states, and all state-action pairs must be known. Deep Q-Network (DQN) is a better approach that solves some of these challenges and also works in situations the agent has not seen before.
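
A minimal sketch of the tabular Q-learning update, assuming a hypothetical toy environment with a small number of discrete states and actions:

```python
import numpy as np

# Hypothetical toy setup: a small discrete environment.
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9          # learning rate and discount factor

Q = np.zeros((n_states, n_actions))   # the Q-table: one row per state

def q_update(state, action, reward, next_state):
    # Classic Q-learning update: move Q(s, a) toward the observed reward
    # plus the discounted best value of the next state.
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
```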

Deep deterministic policy gradient (DDPG) can be seen as deep Q-learning for continuous action spaces. DDPG is off-policy.

Proximal Policy Optimization (PPO) aims to take the largest possible improvement step without causing a performance collapse.

Transfer learning

Use knowledge from one task to solve a similar task. Common in NLP solutions. Transfer learning is a similar concept to embeddings.

In practice, layers from an existing model can be “cut off” and reused for the other task. This reduces training time significantly.

The closer to the input layer the cut off point is, the more generic the transferred part will be.
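
A minimal Keras sketch of this idea, assuming a hypothetical binary image classification task: the pretrained layers are frozen and only the new head is trained.

```python
import tensorflow as tf

# Reuse a pretrained ImageNet backbone as the transferred, generic part.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the transferred layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task-specific head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```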

Negative transfer learning: When knowledge is transferred from a less related source, the target performance might be degraded.

Ensemble models

Ensembling means combining results from multiple models. Some examples of ensemble models are Random forest, AdaBoost, gradient boosting and XGBoost.

Bagging

Bootstrap Aggregating. Typical for decision trees. A special case of model averaging. Example: Random Forest.

Boosting

Many weak learners combine into a strong one. Each iteration uses the previous model's output as input for the next cycle. Example: XGBoost.

Blending

Fit a model on predictions made on a hold-out dataset. Also known as stacked generalization.

Data splitting

Split | Purpose | Google AutoML default ratio
Train | Train the model | 80 %
Validate | Use as comparison against other models and configs | 10 %
Test / Hold out | Final check whether the model is acceptable | 10 %

The validation and test dataset names can be confusing, because it is often said that the final model is validated with the test set.

Data leakage

Data leakage causes problems similar to overfitting. It happens especially with time-series data when the training set contains data that would not be available at prediction time. Time-series data must be split by time order, not randomly. If data leakage happens, great evaluation metrics seen during training will drop significantly in production.
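
A minimal sketch of a time-ordered split, assuming a hypothetical dataframe with a date column:

```python
import pandas as pd

# Hypothetical time-series frame.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": range(100),
})

df = df.sort_values("date")
split_idx = int(len(df) * 0.8)              # first 80 % of the timeline for training
train, test = df.iloc[:split_idx], df.iloc[split_idx:]   # nothing "from the future" leaks
```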

Cross validation and bootstrapping

Multiple pairs of train and validation datasets can be created. For example, 10-fold cross validation splits the data into 10 parts, and each part works as the validation set in turn. Cross validation is most helpful with limited data; otherwise the hold-out approach is fine.

Bootstrapping differs from cross validation in that samples are drawn randomly with replacement.

Nested cross validation

Traditional cross validation might cause overfitting. The reason is that a limited dataset is used to test many different model configurations.

Run another cross validation within each cross validation fold. This should reduce overfitting.

Here is the procedure for nested cross validation:

Outer loop over 10 cross validation folds
    Inner loop with the training set of the current fold
        Try a set of hyperparameters
    Get the best model from the inner loop
    Report the parameters and metrics of the best inner-loop model
Choose the best hyperparameters for the model

Once the best set of hyperparameters is found, the final model can be trained with the full data.

In traditional cross validation, 3 folds and 50 parameter combinations mean 150 different models. With 10 folds in the outer loop, nested cross validation would mean 1500 models.

Nested cross validation is also known as double cross validation.
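
A minimal scikit-learn sketch of nested cross validation, using GridSearchCV as the inner loop and cross_val_score as the outer loop (the dataset and parameter grid are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: search hyperparameters within each outer training fold.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: estimate the generalization error of the whole tuning procedure.
scores = cross_val_score(inner, X, y, cv=10)
print(scores.mean())
```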

Feature engineering

Features can be of the following types:

Feature type | Example
Numerical | 134, 257
Categorical | cat, dog
Bucketized | Numeric to categorical: 10-19, 20-29
Crossed | Vector multiplication of two features
Embedding | Condense information into a vector
Hashed | Example: convert text to word counts

Feature crosses

A feature cross is a synthetic feature created by cross-joining the values from two or more columns. The name feature cross comes from cross product. In practice, the feature created by a feature cross is a vector of combinations of all the crossed values.

As an example of feature cross, you can convert separate day_of_week and hour_of_day columns to one sparse vector of length 168. Due to high dimensionality, feature crosses work best with large datasets.

There is a high chance that a well-thought-out feature cross works better than the same features individually.
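
A minimal sketch of the day_of_week and hour_of_day cross described above, with made-up example values:

```python
import numpy as np

# Hypothetical sample: Wednesday (index 2) at 17:00.
day_of_week, hour_of_day = 2, 17

# Cross the two features into one sparse indicator of length 7 * 24 = 168.
cross_index = day_of_week * 24 + hour_of_day
feature_cross = np.zeros(168)
feature_cross[cross_index] = 1.0
```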

Kernel trick

Despite the funny name, it is a mathematical method. Kernel trick allows linear learning algorithms to learn non-linear functions.

SVM is an example of an algorithm that can benefit from the kernel trick.
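
A minimal scikit-learn sketch comparing a linear and an RBF-kernel SVM on a non-linearly separable toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)   # non-linearly separable data

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)          # kernel trick: implicit non-linear mapping

print(linear_svm.score(X, y), rbf_svm.score(X, y))
```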

Multicollinearity

Some models, such as linear regression, expect features to be independent of each other. Multicollinearity happens when some of the features are correlated.

Multicollinearity prevention method | How it works
Principal Component Analysis | Create a reduced synthetic dimension space.
Partial Least Squares | A bit similar to PCA. Also performs the regression.
Multivariate Multiple Regression | Explain how variables behave together.

Data imbalance

Here are reference values for the severity of data imbalance.

Data imbalance severity | Minority class % of all rows
Mild | 20-40 %
Moderate | 1-20 %
Extreme | < 1 %

Fraud and anomaly detection datasets are typically heavily imbalanced.

Upweight and downsample

Consider labels with two possible categories in a proportion of 1:99. Google’s recommendation is to downsample and upweight in training. The distribution should not be changed in the test or validation sets!

Downsampling means removing so many majority-class examples that the ratio between the two categories becomes 1:9. Upweighting means giving the downsampled class a corresponding example weight in model training (here a factor of 11).
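
A minimal NumPy sketch of downsampling and upweighting, assuming hypothetical X and y arrays with a 1:99 class ratio:

```python
import numpy as np

# Hypothetical imbalanced labels: 1 % positive, 99 % negative.
y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y)).reshape(-1, 1)            # placeholder features

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

factor = 11                                      # downsampling factor (99 : 9)
keep_neg = np.random.choice(neg_idx, size=len(neg_idx) // factor, replace=False)
keep = np.concatenate([pos_idx, keep_neg])

X_train, y_train = X[keep], y[keep]
sample_weight = np.where(y_train == 0, factor, 1)   # upweight the downsampled class
# model.fit(X_train, y_train, sample_weight=sample_weight)
```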

In a similar situation I have actually upweighted the minority class, which seems to be the wrong approach.

Normalization

Four common methods for normalization:

Normalization method | How it works | When to use
Scaling to a range | E.g. scale between 0 and 1 | Few outliers, uniform distribution
Clipping | E.g. cap the max value to 100 | Data has extreme outliers
Log scaling | Apply logarithm to values | Power-law distributions
Z-score | Compute standard deviations away from the mean | Few mild outliers
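
A minimal NumPy sketch of the four methods on a made-up feature:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # hypothetical feature with one outlier

min_max = (x - x.min()) / (x.max() - x.min())    # scaling to a range [0, 1]
clipped = np.clip(x, None, 10.0)                 # clipping extreme values
log_scaled = np.log1p(x)                         # log scaling for power-law data
z_score = (x - x.mean()) / x.std()               # standard deviations from the mean
```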

Feature importance

Shapley values

The sampled Shapley method produces Shapley values indicating feature importance.

The sum of Shapley values over all features equals the prediction minus a baseline value. The baseline can be something like a long-term average.

Shapley values are calculated for individual observations. By summing these, global interpretability can be achieved.

Integrated gradients

Neural networks might use integrated gradients for feature importances.

XRAI

Compared to integrated gradients, XRAI identifies salient regions rather than individual pixels. The name comes from eXplanation with Ranked Area Integrals.

Encoding

Information in the features can be encoded to decrease the number of features the model uses.

Feature encoding method | Use case
PCA | Reduce the number of dimensions linearly while retaining most of the information.
Feature Cross | One sparse vector from multiple vectors.
Embeddings | Convert large sparse vectors of categorical data to smaller dense vectors.
Functional data analysis | Replace features with functions.

Feature selection

Feature selection methods that require model training:

  • L1 regularization
  • Feature importances
  • Recursive feature elimination

Start the model training with a few chosen features. Expanding is easier than vice versa.

Embeddings

Embeddings are able to condense the information in categorical data such as text. Sometimes the resulting features are referred to as the latent space.

Embeddings have three main purposes:

  1. Find items close to each other
  2. Feature input in supervised task
  3. Relationships between categories

A good starting point for the number of embedding dimensions is the fourth root of the number of original dimensions.

They can be used to lower the feature dimensionality and transfer the information to other problems.

The final features are sometimes called latent factors.
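
A small sketch of the fourth-root rule of thumb with a hypothetical vocabulary size, using a Keras embedding layer:

```python
import tensorflow as tf

vocab_size = 10_000                          # hypothetical number of categories
embedding_dim = round(vocab_size ** 0.25)    # fourth-root rule of thumb -> 10

embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embedding_dim)
```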

Missing and sensitive data

These are some ways to handle missing data:

Case | Imputation method
Number missing | Mean
Categorical value missing | Create a missing category, set 0 weight in prediction. Some sources advise replacing with the most frequent category.
All values missing | Delete the record
Advanced logic required | Predict with another ML model
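
A minimal scikit-learn sketch of mean imputation on a made-up array with missing numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 8.0],
              [3.0, np.nan]])                # hypothetical data with gaps

imputer = SimpleImputer(strategy="mean")     # replace missing numbers with the column mean
X_filled = imputer.fit_transform(X)
```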

Here are some ways to treat sensitive data but still keep it usable for analytics.

Encryption method | Description | Example
Format-preserving encryption | Keep the format, e.g. the number of characters | 1234 5678 -> 1124 85##
Masking | Mask all characters | 1234 5678 -> #### ####
Deterministic encryption | Always produce the same ciphertext from the same plaintext | John Smith -> Nhoj Htims (reversed)
Replacement | Replace with generic text | John Smith -> [Replaced]
K-anonymity | Preserve relevant information | Remove name, use city instead of zip code. K = how many times a similar record must exist.

Loss functions

The loss function measures the model’s accuracy during training. The best criterion is generalization to new data.

Loss function | Use case
Cross entropy / Log loss | For binary classification. Logarithm of the loss value.
Mean squared error (MSE) | For regression.
Root Mean Squared Error (RMSE) | Square root of MSE. The result is more intuitive. Not good for classification; does not penalize appropriately.
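
A minimal NumPy sketch of these loss functions on made-up predictions:

```python
import numpy as np

# Binary classification: log loss from predicted probabilities.
y_true = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.6])
log_loss = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Regression: MSE and its square root, RMSE.
y_reg_true = np.array([3.0, 5.0])
y_reg_pred = np.array([2.5, 4.0])
mse = np.mean((y_reg_true - y_reg_pred) ** 2)
rmse = np.sqrt(mse)
```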

Model evaluation and performance metrics

Performance metrics are, for example, the count of correct predictions or the amount of money saved. They are easier to calculate and connect to business goals. Performance metrics are computed after training; the process is known as model evaluation. Here is a list of binary classification evaluation methods.

Method | Description
Precision | How many positive predictions were correct.
Recall | How many positive labels were captured (true positive rate).
F1 score | Harmonic mean of precision and recall.
Precision-Recall curve | Trade-off between precision and recall when changing the probability threshold. For imbalanced datasets.
AUC in ROC | Trade-off between TP and FP rates when changing the probability threshold. For balanced datasets.
Accuracy | Ratio of correct predictions in classification.
Log loss | Take the log of the predicted probability.
Confusion matrix | Matrix of actual and predicted values.

It is easy to get confused with the terms true positive, false positive (type 1 error), true negative and false negative (type 2 error). Positive and negative refer to the predicted value. True and false indicate whether the prediction was correct or not.
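
A minimal scikit-learn sketch of these metrics on made-up predictions:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))    # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))     # share of positive predictions that were correct
print(recall_score(y_true, y_pred))        # share of positive labels that were captured
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall
```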

ROC AUC

Term | Definition
ROC | Receiver Operating Characteristic. Name of the method.
AUC | Area Under Curve. Integrated area under any plotted line.

The ROC AUC value can be interpreted as the probability that a random positive sample is ranked higher than a random negative sample by the model. This means that an AUC value of 0.5 indicates random behavior.

Let’s take an example. We predict the probability that our favorite ice hockey team wins each game of the season. Most often you would say that the team wins when the predicted probability is over 50 %, but you can choose any other probability threshold such as 10 % or 80 %.

The ROC curve plots the TP and FP rates at each of these probability thresholds.

Rate | Explanation
True positive rate | Number of times a win was predicted successfully, per the total number of wins during the season.
False positive rate | Number of times a win was predicted for lost games, per the total number of games lost.

ROC is not the best option for imbalanced datasets; in our case, if the team won or lost almost all of its games. The reason is that the TP and FP rates are computed independently and do not consider the relative volumes of the classes. This is easy to see by observing which numbers of the confusion matrix each rate uses.
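
A minimal scikit-learn sketch of the ROC curve and its AUC on made-up win probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # 1 = win, 0 = loss
y_prob = [0.9, 0.3, 0.6, 0.8, 0.2, 0.55, 0.7, 0.1]     # predicted win probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)       # one point per threshold
print(roc_auc_score(y_true, y_prob))                   # area under that curve
```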

Precision-Recall curve

Precision and recall are a trade-off. You cannot maximize both.

The precision-recall curve is a more suitable evaluation metric for imbalanced datasets than the ROC curve.

If you want to balance true and false positives, optimizing the precision-recall curve at a specific point might be a good idea, for example precision when recall is 0.5. This gives more control than optimizing the area under the precision-recall curve.

Success metrics for business

While a performance metric is technical, a success metric is for the business.

Google recommends ambitious success metrics, for example improving sales by 20 %. This makes the ML effort worthwhile.

ML models might predict a different target than the business metric. The models serve as a proxy for the business goal.

ML model quality | Improvement potential | What to do
Bad | Yes | Continue development.
Good | Yes | Use in production and continue development.
Good | No | Run in production, no further potential.
Bad | No | Stop, it will never make it.

Business decisions require metrics at different probability thresholds (aka decision boundaries).

Regularization and overfitting

Regularization in machine learning refers to techniques that improve the model’s ability to generalize to new data.

For that, the model should minimize complexity but maximize prediction ability at the same time. A too complex model overfits the training data and does not work for new observations.

Occam’s razor suggests that the answer with the fewest assumptions should be selected.

Here are the common regularization techniques. They calculate the model complexity and try to minimize it together with the loss. The complexity metric is based on the weight vector of all features.

Regularization method | Complexity metric | How it works
L1 | Sum of absolute weights | Shrinks some feature weights to zero.
L2 | Sum of squared weights (squared Euclidean norm) | Shrinks weights to small values.
Dropout | Random | Randomly set some inputs to zero on a neural network layer.

L2 is better than L1 when you have already identified the most important features. L2 adds the sum of squared parameter weights to the loss function.
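
A minimal scikit-learn sketch contrasting L1 (Lasso) and L2 (Ridge) regularization on a made-up regression problem:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)

l1 = Lasso(alpha=1.0).fit(X, y)    # L1: some coefficients become exactly zero
l2 = Ridge(alpha=1.0).fit(X, y)    # L2: coefficients shrink toward small values

print((l1.coef_ == 0).sum(), "coefficients zeroed by L1")
```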

Hyperparameter tuning

Hyperparameters are defined outside of the model. In comparison, model parameters are part of the internal output of the trained model.

Hyperparameter tuning should improve model performance by automatically finding the best combination. It is not a method to avoid overfitting.

Hyperparameter tuning method | Description
Grid search | Create combinations of hyperparameters and train a model on each combination. Slow but thorough. Not suitable for a large set of hyperparameters.
Random search | Like grid search but only selects a random set of hyperparameter combinations. Faster but suboptimal.
Bayesian optimization | Takes past evaluations into account to optimize the hyperparameters. Typically requires fewer iterations.
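
A minimal scikit-learn sketch of random search over a small, illustrative parameter space:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 5, 10, None]},
    n_iter=5, cv=5, random_state=0)      # try only 5 random combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)
```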

Baseline model

A simple model can be used as a comparison point for the advanced models.

It can be, for example, a linear regression, a long-term average, or a constant 0.5 prediction in binary classification.
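
A minimal scikit-learn sketch of a majority-class baseline, using DummyClassifier as one possible choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

X, y = load_breast_cancer(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))   # any real model should beat this score
```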

Bias-Variance dilemma

Bias is said to equal underfitting and variance to equal overfitting. Bias-variance is like a slider where the optimum is in the middle.

A biased model could always predict the same result, while a high-variance model would have a (too) wide spectrum of predictions.

The model’s total error is bias plus variance. Bias is the non-estimable part; it can be thought of as the base error of the model, as it stays constant across different training sets. Variance is the error component that changes depending on the dataset.

Predictions

Post-processing predictions is not recommended. Multiple steps make interpretation and deployments difficult.

The average of predictions minus the average of labels is called prediction bias. If prediction bias is high for only some sections of the data, the dataset might not represent all subsets adequately. Another reason could be that the model is too regularized.

Parametric vs non-parametric models

Parametric model

In practice, a parametric model is a function that receives parameters to produce the output. A linear regression or a simple neural network is an example of this.

Non-parametric model

A non-parametric model is rather rule-based and follows a discrete policy. For example, a decision tree creates a specific set of rules to give the answer.

Lazy learning

In lazy learning, nothing actually happens in the ML model training phase. In this sense, lazy learners do not need re-training. Lazy learning suits continuously updating datasets well.

An example algorithm is K-nearest neighbors. It simply checks at prediction time which observations are closest to the one being predicted. Another commonly cited example of a lazy learner is naive Bayes.
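
A minimal scikit-learn sketch of a lazy learner: K-nearest neighbors stores the training data and computes distances only at prediction time.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# "Training" only stores the data; the distance computations happen at predict().
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))
```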