Notes about fundamental ML concepts for Google Cloud ML Engineering certification.
Perform mainly univariate and bivariate analysis during initial exploration.
In other words, explore only one or two variables at a time:
- Calculate min, max and average by different categories
- Correlation plots
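A minimal sketch of both kinds of analysis using only the Python standard library. The `rows` data and its columns (category, price, size) are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: (category, price, size)
rows = [
    ("a", 10.0, 1.0),
    ("a", 14.0, 2.0),
    ("b", 20.0, 3.0),
    ("b", 30.0, 5.0),
]

# Univariate summary: min, max and average of price per category
by_cat = defaultdict(list)
for cat, price, _ in rows:
    by_cat[cat].append(price)
summary = {c: (min(v), max(v), sum(v) / len(v)) for c, v in by_cat.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Bivariate analysis: correlation between price and size
corr = pearson([r[1] for r in rows], [r[2] for r in rows])
```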
Examples of supervised learning algorithms.
|Supervised model||Use case|
|Boosted Tree - XGBoost||Fast and reliable boosted decision tree algorithm.|
|Logistic regression||Binary classification.|
|Linear regression||Predict a numeric value by linearly dependent features.|
|Support Vector Machines||Maximize the margin in decision boundary. Supports non-linearity.|
Examples of unsupervised learning algorithms.
|Unsupervised model||Use case|
|K-means clustering||Create K random centroids and minimize distances.|
|Hierarchical Clustering||Clusters in hierarchical tree.|
Examples of time-series algorithms.
|Time-series model||Use case|
|ARIMA||Auto-regressive model to extract trend and seasonality.|
|Simulation methods||Use case|
|Monte Carlo||Generic approximation technique for complex environments.|
|Markov chain||A model where the next state depends only on the current state.|
|Recommendation method||Use case|
|Matrix factorization||Recommend movies for users.|
|Cosine similarity||Compare closeness of same type of entities.|
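Cosine similarity could be sketched as below. The three rating vectors are hypothetical and only illustrate the two extreme cases: same direction and orthogonal.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical rating vectors for three movies
movie_a = [5.0, 4.0, 0.0]
movie_b = [10.0, 8.0, 0.0]  # same direction as movie_a, different magnitude
movie_c = [0.0, 0.0, 3.0]   # orthogonal to movie_a
```

Because only the direction matters, `movie_a` and `movie_b` count as maximally similar even though their magnitudes differ.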
An agent learns by interacting with its environment, aiming to achieve maximum reward.
Examples of reinforcement learning applications:
- Offline simulation
- Real-time training
- Decision making
- Trial and error
In Q-learning, states and actions are represented in a matrix (the Q-table). This becomes difficult with a large number of states, because all state-action pairs should be known. Deep Q Network (DQN) is a better approach that solves some of these challenges by approximating the table with a neural network. It also works in situations that the agent has not seen before.
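The matrix form of Q-learning can be sketched as a single table update. This is the standard tabular update rule. The learning rate `alpha`, the discount `gamma` and the 2 x 2 table are assumptions chosen for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the table."""
    best_next = max(q[next_state])  # value of the best action in the next state
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q


# 2 states x 2 actions, initialised to zero
q_table = [[0.0, 0.0], [0.0, 0.0]]
q_update(q_table, state=0, action=1, reward=1.0, next_state=1)
```

The table needs one cell per state-action pair, which is why the approach struggles with a large number of states.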
Deep deterministic policy gradient (DDPG) can be seen as deep Q-learning for continuous action spaces. DDPG is off-policy.
Proximal Policy Optimization (PPO) aims to take the largest possible improvement step without causing performance collapse.
Use knowledge from one task to solve a similar task. Common in NLP solutions. Transfer learning is a similar concept to embeddings.
In practice, layers from an existing model can be “cut off” and reused for the other task. This reduces training time significantly.
The closer to the input layer the cut off point is, the more generic the transferred part will be.
Negative transfer learning: When knowledge is transferred from a less related source, the target performance might be degraded.
Ensembling means combining results from multiple models. Some examples of ensemble models are random forest, AdaBoost, gradient boosting and XGBoost.
Bagging (bootstrap aggregating). Typical for decision trees. A special case of model averaging. Example: random forest.
Boosting. Many weak learners make a strong one. Each new model is trained iteratively to correct the errors of the previous ones. Example: XGBoost.
Stacking. Fit a final model on the predictions that the base models made on a hold-out dataset. Also known as stacked generalization.
|Split||Purpose||Google AutoML default ratio|
|Train||Train the model||80 %|
|Validate||Use as comparison against other models and configs||10 %|
|Test / Hold out||Final check if the model is acceptable||10 %|
The names of the validation and test datasets are confusing, because it is often said that the final model is validated with the test set.
Data leakage causes problems similar to overfitting. It happens especially with time series data when the training set contains data that would not be available at prediction time. Time series data must be split in time order, not randomly. If data leakage happens, evaluation metrics that look great during training will drop significantly in production.
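A minimal sketch of a time-ordered split. The helper name `time_ordered_split`, the `(timestamp, value)` record layout and the 80 % training fraction are assumptions for illustration.

```python
def time_ordered_split(records, train_fraction=0.8):
    """Split (timestamp, value) records so that all training rows precede test rows."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records arriving in shuffled order: (day, value)
records = [(day, day * 1.5) for day in (5, 1, 4, 2, 3, 10, 7, 6, 9, 8)]
train, test = time_ordered_split(records)
```

A random split of the same records could put day 10 in training and day 3 in test, which is exactly the leakage described above.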
Cross validation and bootstrapping
Multiple pairs of train and validation datasets can be created. E.g. 10-fold cross validation splits the data into 10 parts. Each of these then serves as the validation set in turn. Cross validation is most helpful with limited data. Otherwise the hold-out approach is fine.
Bootstrapping differs from cross validation in that samples are drawn randomly with replacement.
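The difference between the two resampling schemes can be sketched with plain index lists. The function names are hypothetical, and the fold logic assumes the number of rows is divisible by the number of folds.

```python
import random


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each index is used for validation exactly once."""
    idx = list(range(n))
    fold_size = n // k  # assumes n is divisible by k for simplicity
    for f in range(k):
        val = idx[f * fold_size:(f + 1) * fold_size]
        train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
        yield train, val


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; duplicates are expected."""
    return [rng.randrange(n) for _ in range(n)]


folds = list(k_fold_indices(20, 10))
sample = bootstrap_indices(20, random.Random(0))
```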
Nested cross validation
Traditional cross validation might cause overfitting. The reason is that a limited dataset is used to test many different model configurations.
Nested cross validation runs another cross validation within each cross validation fold. This should reduce overfitting.
Here is the procedure for nested cross validation:
- Outer loop for 10 cross validation folds
  - Inner loop with the training set of the current fold
    - Try a set of hyperparameters
  - Get the best model from the inner loop
  - Report parameters and metrics for the best inner loop model
- Choose the best hyperparameters for the model
Once the best set of hyperparameters is found, the final model can be trained with the full data.
In traditional cross validation, 3 folds and 50 parameter combinations mean 150 different models. With 10 folds in the outer loop, nested cross validation would mean 1500 models.
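The model counts can be checked with a loop skeleton that mirrors the nested structure. This sketch only counts training runs. The actual training and scoring are left as a placeholder comment, and any refit of the best model per outer fold is ignored.

```python
def count_nested_cv_fits(outer_folds, inner_folds, n_param_combos):
    """Count model fits in nested cross validation (refits of the best model excluded)."""
    fits = 0
    for _ in range(outer_folds):              # outer loop: estimates generalization
        for _ in range(n_param_combos):       # candidate hyperparameter sets
            for _ in range(inner_folds):      # inner loop: scores one candidate
                fits += 1                     # placeholder for: train and validate
    return fits


plain_cv_fits = count_nested_cv_fits(outer_folds=1, inner_folds=3, n_param_combos=50)
nested_cv_fits = count_nested_cv_fits(outer_folds=10, inner_folds=3, n_param_combos=50)
```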
Nested cross validation is also known as double cross validation.
Features can be of the following types:
|Bucketized||Numeric to categorical: 10-19, 20-29|
|Crossed||Vector multiplication of two features|
|Embedding||Condense information to vector|
|Hashed||Example: Convert text to word count|
Feature cross is a synthetic feature created by cross joining the values from two or more columns. The name feature cross comes from cross product. In practice the feature created by feature cross is a vector of combinations of all crossed values.
As an example of a feature cross, you can convert separate day-of-week and hour_of_day columns to one sparse vector of length 168 (7 days × 24 hours). Due to high dimensionality, feature crosses work best with large datasets.
There is a high chance that a well-thought-out feature cross works better than the same features individually.
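A sketch of how such a cross could be encoded. It assumes the crossed pair is a day-of-week index (0 to 6) and an hour-of-day index (0 to 23), since 7 × 24 = 168. The function name and index layout are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def cross_day_hour(day_of_week, hour_of_day):
    """Return a one-hot vector of length 168 for one (day, hour) combination."""
    vec = [0] * (DAYS_PER_WEEK * HOURS_PER_DAY)
    vec[day_of_week * HOURS_PER_DAY + hour_of_day] = 1
    return vec


tuesday_5am = cross_day_hour(day_of_week=2, hour_of_day=5)
```

Each combination gets its own position, so a model can learn a separate weight for, say, Tuesday 5 am instead of one weight for Tuesday and one for 5 am.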
Despite its funny name, the kernel trick is a mathematical method. It allows linear learning algorithms to learn non-linear functions.
SVM is an example of an algorithm that can benefit from kernel tricks.
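A small numeric illustration, assuming a degree-2 polynomial kernel on two-dimensional inputs. The kernel value computed in the original space equals the dot product in an explicitly expanded feature space, so the expansion itself never has to be computed.

```python
from math import sqrt


def poly_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2, computed in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def explicit_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, y)
mapped_value = dot(explicit_map(x), explicit_map(y))
```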
Some models, such as linear regression, expect features to be independent of each other. Multicollinearity happens if some of the features are correlated.
|Multicollinearity prevention method||How it works|
|Principal Component Analysis||Create reduced synthetic dimension space.|
|Partial Least Squares||A bit similar to PCA. Also performs the regression.|
|Multivariate Multiple Regression||Explain how variables behave together.|
Here are reference values for the severity of data imbalance.
|Data imbalance severity||Minor class % of all rows|
Fraud and anomaly detection datasets are typically heavily imbalanced.
Upweight and downsample
Consider labels with two possible categories in proportion 1:99. Google’s recommendation is to downsample and upweight in training. The distribution should not be changed in the test or validation sets!
Downsampling means that majority labels are removed until the ratio between the two categories comes to 1:9. Upweighting means giving the downsampled class a corresponding weight in the model training.
In a similar situation I have actually upweighted the minority class, which seems to be the wrong approach.
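A minimal sketch of downsampling with upweighting on the 1:99 example. The factor of 11 is derived here from 99 / 9 and is not a value from the source. The function name and the `(label, weight)` output format are hypothetical.

```python
import random


def downsample_and_upweight(labels, majority, factor, rng):
    """Keep 1/factor of the majority rows and give each kept row weight=factor."""
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, len(majority_idx) // factor))
    examples = []
    for i, y in enumerate(labels):
        if y != majority:
            examples.append((y, 1.0))            # minority rows are untouched
        elif i in kept:
            examples.append((y, float(factor)))  # kept majority rows are upweighted
    return examples


labels = [1] * 10 + [0] * 990  # 1:99 imbalance
examples = downsample_and_upweight(labels, majority=0, factor=11, rng=random.Random(0))
```

The weights restore the original total influence of the majority class, so the model sees a 1:9 ratio of rows without its predicted probabilities shifting toward the minority class.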
Four common methods for normalization.
|Normalization methods||How it works||When to use|
|Scaling to a range||Eg scale between 0 and 1||Few outliers, uniform distribution|
|Clipping||Eg cap max value to 100||Data has extreme outliers|
|Log scaling||Apply logarithm to values||Power law distributions|
|Z-score||Computes standard deviations away from mean||Few mild outliers|
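The four methods could be sketched as plain functions. The sample `values` list is made up, and the z-score here uses the population standard deviation, which is one of several reasonable choices.

```python
from math import log, sqrt


def min_max_scale(xs):
    """Scale values linearly to the range 0..1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def clip(xs, lo, hi):
    """Cap values to the interval [lo, hi]."""
    return [min(max(x, lo), hi) for x in xs]


def log_scale(xs):
    """Natural logarithm of each value; assumes strictly positive inputs."""
    return [log(x) for x in xs]


def z_score(xs):
    """Number of (population) standard deviations away from the mean."""
    mean = sum(xs) / len(xs)
    std = sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 1000.0]  # hypothetical data with one extreme outlier
clipped = clip(values, 0.0, 100.0)
```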
The sampled Shapley method produces Shapley values indicating the feature importance.
The sum of Shapley values over all features equals the prediction minus a baseline value. The baseline can be something like a long-term average.
Shapley values are calculated for individual observations. By aggregating these over many observations, global interpretability can be achieved.
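The additivity property can be checked on a toy model. Note that this sketch computes exact Shapley values by enumerating every feature ordering, which is only feasible for a handful of features. Sampled Shapley approximates the same quantity by sampling orderings. The `model`, `instance` and all-zero `baseline` are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = instance[i]   # switch feature i from baseline to actual value
            new = predict(current)
            totals[i] += new - prev    # marginal contribution of feature i
            prev = new
    return [t / len(orders) for t in totals]


def model(x):
    # Hypothetical model with one linear term and one interaction term
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, baseline)
```

The interaction term is shared equally between the two features involved, and the values sum to the prediction minus the baseline prediction.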
Neural networks might use integrated gradients for feature importances.
XRAI builds on integrated gradients for image data and attributes importance to regions rather than to individual pixels. The name comes from eXplanation with Ranked Area Integrals.
Information in the features can be encoded to decrease the number of features the model uses.
|Feature encoding method||Use case|
|PCA||Reduce the number of dimensions linearly while retaining most of the information.|
|Feature Cross||One sparse vector from multiple vectors.|
|Embeddings||Convert large sparse vectors of categorical data to smaller vectors.|
|Functional data analysis||Replace features by functions.|
Feature selection methods that require model training:
- L1 regularization
- Feature importances
- Recursive feature selection
Start the model training with a few chosen features. Expanding is easier than vice versa.
Embeddings are able to condense information in categorical data such as text. Sometimes the embedding features are referred to as a latent space.
Embeddings have three main purposes:
- Find items close to each other
- Feature input in supervised task
- Relationships between categories
A good starting point for the number of embedding dimensions is the fourth root of the number of original dimensions.
They can be used to lower the feature dimensionality and transfer the information to other problems.
The final features are sometimes called latent factors.
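The fourth-root rule of thumb for embedding size is simple arithmetic. The helper name is hypothetical, and rounding to the nearest integer with a minimum of 1 is an assumption of this sketch.

```python
def suggested_embedding_dim(n_original_dims):
    """Fourth root of the original dimensionality, rounded, with a minimum of 1."""
    return max(1, round(n_original_dims ** 0.25))


dim_for_10k_categories = suggested_embedding_dim(10_000)
```

Treat the result as a starting point for tuning rather than a fixed answer.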
Missing and sensitive data
These are some ways to handle missing data
|Categorical value missing||Create a missing category, set 0 weight in prediction. Some sources advise replacing by the most frequent category.|
|All values missing||Delete record|
|Advanced logic required||Predict by another ML model|
Here are some ways to treat sensitive data but still keep it usable for analytics.
|Format-preserving encryption||Keep format, eg number of characters||1234 5678 -> 1124 85##|
|Masking||Mask all characters||1234 5678 -> #### ####|
|Deterministic encryption||Always produce same ciphertext from the plain text||John Smith -> Nhoj Htims (reversed)|
|Replacement||Replace by generic text||John Smith -> [Replaced]|
|K-anonymity||Preserve relevant information||Remove name, use city instead of zip code. K = How many times a similar record must exist.|
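Two of the techniques can be sketched in a few lines. The records, the field names `city` and `age_band`, and both helper names are hypothetical. Here K is computed as the size of the smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter


def mask(text):
    """Replace every non-space character with '#', keeping the layout."""
    return "".join(ch if ch == " " else "#" for ch in text)


def k_anonymity(records, quasi_identifiers):
    """Smallest number of records that share the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


records = [
    {"city": "Helsinki", "age_band": "30-39"},
    {"city": "Helsinki", "age_band": "30-39"},
    {"city": "Espoo", "age_band": "20-29"},
    {"city": "Espoo", "age_band": "20-29"},
]
k = k_anonymity(records, ["city", "age_band"])
```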
The loss function measures the model’s error during training. The best criterion is generalization to new data.
|Loss function||Use case|
|Cross entropy / Log loss||In binary classification. Negative logarithm of the probability predicted for the true label.|
|Mean squared error (MSE)||For regression.|
|Root Mean Squared Error (RMSE)||Square root of MSE. Result is more intuitive. Not good for classification, does not penalize appropriately.|
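The three loss functions could be written as follows. This is a plain-Python sketch, and `log_loss` assumes predicted probabilities strictly between 0 and 1.

```python
from math import log, sqrt


def log_loss(y_true, p_pred):
    """Binary cross entropy: mean negative log-probability of the true label."""
    total = sum(y * log(p) + (1 - y) * log(1 - p) for y, p in zip(y_true, p_pred))
    return -total / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return sqrt(mse(y_true, y_pred))
```

Log loss punishes confident wrong predictions heavily: predicting 0.1 for a true positive costs far more than predicting 0.9.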
Model evaluation and performance metrics
Performance metrics are, for example, the count of correct predictions or money saved. They are easier to calculate and connect to business goals. Performance metrics are computed after training, in a process known as model evaluation. Here is a list of binary evaluation methods.
|Precision||How many positive predictions were correct.|
|Recall||How many positive labels were captured (true positive rate).|
|F1 score||Harmonic mean of precision and recall.|
|Precision-Recall curve||Trade-off between Precision and Recall by changing the probability threshold. For imbalanced datasets.|
|AUC in ROC||Trade-off between TP and FP rate by changing the probability threshold. For balanced datasets.|
|Accuracy||Ratio of correct predictions in classification.|
|Log loss||Negative log of the probability predicted for the true label.|
|Confusion matrix||Matrix of actual and predicted values.|
It is easy to get confused with the terms true positive, false positive (type 1 error), true negative and false negative (type 2 error). Positive and negative refer to the predicted value. True and false indicate whether the prediction was correct or not.
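A sketch that derives the confusion matrix counts and the metrics built on them. The labels and predictions are made up, with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # type 1 error
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # type 2 error
    return tp, fp, tn, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)                          # correct share of positive predictions
recall = tp / (tp + fn)                             # captured share of positive labels
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / len(y_true)
```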
|ROC||Receiver Operating Characteristic. Name of the method.|
|AUC||Area Under Curve. Integrated area under any plotted line.|
The ROC AUC value can be interpreted as the probability that the model ranks a random positive sample higher than a random negative sample. This means that an AUC value of 0.5 indicates random behavior.
Let’s take an example. We predict the probability that our favorite ice hockey team wins in each game of the season. Most often you would say that the team wins when the probability is predicted to be over 50%. But you can choose any other probability threshold, such as 10% or 80%.
ROC curve plots TP and FP rates on each of these probability thresholds.
|True positive rate||Number of times win was predicted successfully per total number of wins during the season.|
|False positive rate||How many times win was predicted for lost games per total number of games lost.|
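The ice hockey example can be sketched with made-up game results. `roc_point` gives one point of the ROC curve for a chosen threshold, and `auc_by_ranking` computes AUC directly from its ranking interpretation by comparing every won game with every lost game. Both function names and the data are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one probability threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


def auc_by_ranking(y_true, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    ranked_higher = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return ranked_higher / (len(pos) * len(neg))


won = [1, 1, 0, 1, 0, 0]                # 1 = the team won the game
probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]  # predicted win probabilities

fpr_50, tpr_50 = roc_point(won, probs, threshold=0.5)
fpr_75, tpr_75 = roc_point(won, probs, threshold=0.75)
auc = auc_by_ranking(won, probs)
```

Raising the threshold from 0.5 to 0.75 removes the false positive but also misses one win, which is the trade-off that the ROC curve plots.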
ROC is not the best option for imbalanced datasets, in our case if the team would win or lose almost all of its games. The reason is that TP and FP rates are computed independently and they do not consider their relative volume. This is easy to visualize by observing which numbers are utilized from the confusion matrix.
Precision and recall trade off against each other. You usually cannot maximize both.
The precision-recall curve is a more suitable evaluation metric for imbalanced datasets than the ROC curve.
If you want to balance true and false positives, optimizing the precision-recall curve at a specific point might be a good idea. For example, precision when recall is 0.5. This gives more control than optimizing the area under the precision-recall curve.
Success metrics for business
While a performance metric is technical, a success metric is for business.
Google recommends ambitious success metrics. For example, improve sales by 20%. This makes the ML effort worthwhile.
ML models might predict a different target than the business metric. The models serve as a proxy for the business goal.
|ML model quality||Improvement potential||What to do|
|Good||Yes||Use in production and continue development.|
|Good||No||Run in production, no further potential.|
|Bad||No||Stop, it will never make it.|
Business decisions require metrics on different probability thresholds (aka decision boundaries).
Regularization and overfitting
Regularization in machine learning refers to techniques that improve the model’s ability to generalize to new data.
For that, the model should minimize complexity while maximizing prediction ability. A too complex model overfits to the training data and does not work for new observations.
Occam’s razor suggests that the answer with the fewest assumptions should be selected.
Here are the common regularization techniques. They calculate the model complexity and try to minimize it together with the loss. The complexity metric is computed from the weight vector of all features.
|Regularization method||Complexity metric||How it works|
|L1||Sum of absolute weights||Shrinks some feature weights to zero.|
|L2||Sum of squared weights (squared Euclidean norm)||Shrinks weights to small values.|
|Dropout||Random||Randomly set some inputs to zero on a neural network layer.|
L2 is better than L1 when you have already identified the most important features.
L2 adds the sum of squared parameter weights to the loss function.
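The two penalty terms could be sketched as below. The regularization strength `lam`, the example weights and the `regularized_loss` helper are assumptions for illustration.

```python
def l1_penalty(weights, lam):
    """L1: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Total objective: data loss plus the chosen complexity penalty."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [1.0, -2.0, 0.5]
```

Squaring makes L2 punish large weights much harder than small ones, while L1 applies the same pressure to every weight, which is what can drive some of them all the way to zero.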
Hyperparameters are defined outside of the model. In comparison, model parameters are learned during training and are part of the trained model.
Hyperparameter tuning should improve model performance by automatically finding the best combination. It is not a method to avoid overfitting.
|Hyperparameter tuning method||Description|
|Grid search||Create all combinations of hyperparameters and train a model on each combination. Slow but thorough. Not suitable for a large set of hyperparameters.|
|Random search||Like grid search but only select a random set of hyperparameter combinations. Faster but suboptimal.|
|Bayesian optimization||Takes into account the past evaluations to optimize the hyperparameters. Typically requires fewer iterations.|
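Grid search and random search differ only in which combinations get trained, which a short sketch can show. The hyperparameter names and values in `grid` are hypothetical, and the model training itself is left out.

```python
import itertools
import random

# Hypothetical search space
grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 6, 8]}


def grid_search_combos(grid):
    """Every combination of the listed hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in itertools.product(*(grid[k] for k in keys))]


def random_search_combos(grid, n, rng):
    """A random subset of n combinations, drawn without replacement."""
    return rng.sample(grid_search_combos(grid), n)


all_combos = grid_search_combos(grid)
some_combos = random_search_combos(grid, n=5, rng=random.Random(0))
```

The grid grows multiplicatively with every added hyperparameter, which is why grid search does not scale to a large set of them.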
A simple model can be used as a comparison point for the advanced models.
It can be, for example, a linear regression, a long-term average, or 0.5 in binary classification.
High bias corresponds to underfitting and high variance to overfitting. The bias-variance trade-off is like a slider where the optimum is in the middle.
A biased model could always predict the same result, while a high-variance model would have a (too) wide spectrum of predictions.
The expected error of a model decomposes into squared bias, variance and irreducible noise. Bias can be thought of as the base error of the model, as it stays constant across different training sets. Variance is the error component that changes depending on the dataset.
Post-processing predictions is not recommended. Multiple steps make interpretation and deployments difficult.
The average of predictions minus the average of labels is called prediction bias. If prediction bias is high for only some sections of the data, the dataset might not represent all subsets adequately. Another possible reason is that the model is too regularized.
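Prediction bias is a one-line computation. The predictions and labels below are made up.

```python
def prediction_bias(predictions, labels):
    """Average prediction minus average label; 0 means no systematic offset."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


# Hypothetical predicted probabilities and the observed binary labels
bias = prediction_bias([0.2, 0.4, 0.6], [0, 0, 1])
```

Computing the same number separately for subsets of the data helps to spot sections where the model is systematically off.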
Parametric vs non-parametric models
In practice, a parametric model is a function with a fixed set of parameters that are learned from the data to produce the output. A linear regression or a simple neural network is an example of this.
A non-parametric model is rather rule-based and follows a discrete policy. For example, a decision tree creates a specific set of rules to give the answer.
In lazy learning, nothing actually happens in the model training phase. In this sense, lazy learners do not need re-training. Lazy learning is well suited to continuously updating datasets.
An example algorithm is K-nearest neighbors. At prediction time it simply checks which observations are closest to the one being predicted. Another commonly cited example of a lazy learner is naive Bayes.
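A minimal K-nearest neighbors sketch shows why the learner is called lazy: `fit` only stores the data, and all the work happens in `predict`. The class, the toy points and the choice of `k=3` with Euclidean distance are assumptions for illustration.

```python
from collections import Counter
from math import dist


class KNearestNeighbors:
    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # "Training" only stores the observations; nothing is learned here
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All computation happens at prediction time
        neighbors = sorted(
            zip(self.points, self.labels), key=lambda pl: dist(pl[0], query)
        )[: self.k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
knn = KNearestNeighbors(k=3).fit(points, labels)
```

Adding new observations only means appending to the stored lists, which fits continuously updating datasets.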