Machine learning via H2O: Ensemble decision trees (StataNow)

Stata’s integration of H2O machine learning provides a scalable, user-friendly framework for applying modern machine learning techniques. Connect to an H2O cluster from within Stata to train and evaluate predictive models while using Stata’s extensive data management capabilities to prepare your data. Use the suite of h2oml commands, or let the Control Panel interface guide you through your end-to-end data-analysis process.

Learn about H2O machine learning in Stata.

Ensemble decision trees: Gradient boosting machine (GBM) and random forest

  • GBM for regression for continuous and count responses
  • GBM for binary classification
  • GBM for multiclass classification
  • Random forest for regression
  • Random forest for binary classification
  • Random forest for multiclass classification
  • Many loss functions for GBM models
  • Many encoding schemes for categorical variables
  • Monotonicity constraints on predictors in GBM models
  • Model selection using cross-validation
  • Early stopping
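
The early-stopping idea in the list above can be illustrated with a short Python sketch. This is a conceptual illustration only: it is not h2oml syntax and not necessarily the exact stopping rule H2O applies. The function name, the `patience` and `tolerance` parameters, and the loss values are all hypothetical.

```python
def early_stopping_round(losses, patience=3, tolerance=0.0):
    """Return the index of the scoring round at which training would stop.

    `losses` holds a validation loss per scoring round (lower is better).
    Training stops once `patience` consecutive rounds fail to improve on
    the best loss so far by more than `tolerance`. If that never happens,
    the index of the last round is returned.
    """
    best = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(losses):
        if loss < best - tolerance:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return i
    return len(losses) - 1


# Hypothetical validation losses: the model stops improving after round 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.59]
stop_round = early_stopping_round(losses, patience=3)
print(stop_round)  # 5: rounds 3, 4, and 5 did not beat the best loss of 0.6
```

Stopping at round 5 means the later, slightly better loss at round 6 is never seen. That is the usual trade-off of early stopping: less training time in exchange for possibly missing a late improvement.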

Hyperparameter tuning

  • Select best-performing model by tuning
    • Number of trees
    • Learning rate of each tree in GBM models
    • Learning rate decay in GBM models
    • Maximum depth of each tree
    • Minimum number of observations for splitting a leaf node
    • Sampling rate for selecting predictor subset per tree in GBM
    • Sampling value for selecting the number of predictors in random forest
    • Sampling rate for selecting observations per tree
    • Minimum node-split threshold
    • Number of histogram bins for continuous and categorical predictors
  • Many tuning metrics for regression and classification analysis
  • Two grid-search methods: Cartesian and random
  • Different early-stopping methods for random grid search
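
To make the difference between the two grid-search methods concrete, here is a minimal Python sketch. It is not h2oml syntax: the hyperparameter names, their values, the `max_models` cap, and both function names are hypothetical. It assumes that a Cartesian search evaluates every combination of the listed values and that a random search draws a capped number of combinations from the same grid.

```python
import itertools
import random

# Hypothetical hyperparameter grid; names and values are illustrative only.
grid = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learn_rate": [0.05, 0.1],
}


def cartesian_search(grid):
    """Return every combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*grid.values())]


def random_search(grid, max_models, seed=0):
    """Return a random subset of the grid, capped at `max_models` combinations."""
    candidates = cartesian_search(grid)
    rng = random.Random(seed)
    return rng.sample(candidates, min(max_models, len(candidates)))


all_configs = cartesian_search(grid)
some_configs = random_search(grid, max_models=5)
print(len(all_configs), len(some_configs))  # 18 5
```

The Cartesian grid grows multiplicatively with each added hyperparameter (3 x 3 x 2 = 18 here), which is why a capped random search is often preferred for larger grids.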

Tuning and estimation summaries

  • Display various model performance metrics
  • Summarize cross-validation results
  • Summarize results from hyperparameter grid search
  • Select the best model after performing a grid search
  • Explore alternative models after grid search
  • Compare goodness of fit for machine learning models
  • Plot score history

Model performance evaluation

  • Binary classification
    • Display a confusion matrix
    • Display threshold-based metrics
    • Produce receiver operating characteristic (ROC) curve plot
    • Produce precision–recall curve plot
  • Multiclass classification
    • Display a confusion matrix
    • Display area under the curve (AUC) and area under the precision–recall curve (AUCPR)
    • Display hit-ratio tables
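
As an illustration of what a hit-ratio table reports for multiclass models, the following Python sketch computes top-k hit ratios: the share of observations whose true class is among the k classes with the highest predicted probability. The data, class labels, and function name are hypothetical, and this is not the h2oml implementation.

```python
def hit_ratios(probabilities, true_labels, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k].

    `probabilities` is a list of dicts mapping class label to predicted
    probability, one dict per observation.
    """
    n = len(true_labels)
    ratios = []
    for k in range(1, max_k + 1):
        hits = 0
        for probs, truth in zip(probabilities, true_labels):
            # Class labels ordered from most to least probable, cut at k.
            top_k = sorted(probs, key=probs.get, reverse=True)[:k]
            if truth in top_k:
                hits += 1
        ratios.append(hits / n)
    return ratios


# Hypothetical predicted probabilities for four observations, three classes.
probabilities = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.1, "b": 0.3, "c": 0.6},
    {"a": 0.2, "b": 0.6, "c": 0.2},
]
true_labels = ["a", "b", "a", "b"]

ratios = hit_ratios(probabilities, true_labels, max_k=3)
print(ratios)  # [0.5, 0.75, 1.0]
```

The hit ratio at k = 1 equals ordinary classification accuracy, and the ratio always reaches 1 when k equals the number of classes.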

Postestimation frame and estimation results

  • Define frame for postestimation analysis
  • Store and restore model estimation results

Prediction

  • Predicted values after regression
  • Class predictions after classification
  • Predicted probabilities for outcome levels after classification

Machine learning explainability

  • Shapley additive explanations (SHAP) value plots for interpretability
  • SHAP beeswarm plots
  • Partial dependence plots (PDPs)
  • Individual conditional expectation (ICE) plots
  • Variable importance plots
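
The idea behind partial dependence and ICE plots can be sketched in a few lines of Python. This is a generic illustration, not h2oml syntax: `toy_model`, the predictor names `x1` and `x2`, and the data are hypothetical. For each grid value, the chosen predictor is set to that value in every row and the model's predictions are averaged. The per-row predictions before averaging are the ICE curves.

```python
def partial_dependence(predict, rows, feature, grid_values):
    """Return the average prediction at each grid value of `feature`."""
    curve = []
    for value in grid_values:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # force the predictor to the grid value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical fitted model standing in for a GBM or random forest."""
    return 2.0 * row["x1"] + row["x2"]


rows = [{"x1": 1.0, "x2": 0.0}, {"x1": 2.0, "x2": 4.0}]
pd_curve = partial_dependence(toy_model, rows, "x1", [0.0, 1.0])
print(pd_curve)  # [2.0, 4.0]
```

Because the toy model is linear in `x1`, the partial dependence rises by 2 for each unit increase in `x1`. A tree ensemble would typically produce a step-shaped curve instead.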

Decision tree analysis

  • Save decision tree structures as DOT files and display rule sets

Control panel

[Figure: screenshot of the H2O Control Panel]

Additional resources

See New in Stata 18 to learn what was added in Stata 18.