Introduction to integration with H2O

Below, we provide an introduction to the H2O integration with Stata and discuss how it works.

What is H2O?

H2O is a scalable and distributed machine learning and predictive analytics platform. You can perform in-memory data analysis and machine learning using this framework.

H2O is an open-source platform whose core is written in Java. Stata connects to H2O through H2O's Java API and REST API. More information about the H2O framework and its machine learning algorithms can be found on the H2O website at http://docs.h2o.ai/. You can also refer to the User Guide for more information.

How does it work from within Stata?

You can either start a new H2O cluster or connect to an existing H2O cluster from within Stata. Then you may use a suite of Stata commands to interact with the H2O cluster.

Start a local H2O cluster

You can start an H2O cluster on your local machine by typing h2o init in Stata. By default, h2o init first checks whether a cluster is already running at localhost:54321, that is, at IP address 127.0.0.1 and port 54321. If no cluster is found, it attempts to start a new H2O cluster at this location. When the cluster has been successfully initialized, you will get a summary of the H2O cluster status similar to the following:

. h2o init
Connecting to the H2O cluster running at http://127.0.0.1:54321.....not found.
Starting a new cluster running at http://127.0.0.1:54321
------------------------------------------------------------------------------
H2O cluster uptime:        2 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.36.0.1
H2O cluster version age:   12 days
H2O cluster total nodes:   1
H2O cluster free memory:   4 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        accepting new members, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------

Note that h2o init accepts some options for customizing how the H2O cluster is initialized. For example, you can specify the nthreads() option to set the maximum number of parallel threads to use when launching the H2O cluster. See h2o init for more information.
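
As a minimal sketch of the nthreads() option mentioned above, the following starts a local cluster limited to four threads. The value 4 is an arbitrary illustration, not a recommendation; see h2o init for the exact syntax and the values it accepts.

```stata
* Start (or connect to) a local H2O cluster, using at most 4 parallel threads.
* The thread count is a hypothetical value chosen for illustration.
h2o init, nthreads(4)
```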

If there is already an H2O cluster running on your local machine, h2o init will attempt to connect to it. If you explicitly specify the IP and port of a remote machine when calling h2o init, by using the ip() and port() options, it will attempt to connect to the H2O cluster running on the remote machine, if there is one. This is the same as calling h2o connect. See Connect to an existing H2O cluster for more details.

Connect to an existing H2O cluster

Another way to interact with H2O is to connect to an existing H2O cluster. This is done by calling h2o connect. By default, it will attempt to connect to a cluster running at localhost:54321 on your local machine. If the connection is established successfully, you will get a summary of the cluster status similar to the following:

. h2o connect
Connecting to the H2O cluster running at http://127.0.0.1:54321. Successful.
------------------------------------------------------------------------------
H2O cluster uptime:        7 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.34.0.7
H2O cluster version age:   20 days
H2O cluster total nodes:   1
H2O cluster free memory:   7.982 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        locked, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------
Warning: the H2O version of the remote cluster 3.34.0.7 does not match the H2O
> version 3.36.0.1 that Stata shipped.

You can also connect to an H2O cluster running on a remote machine by specifying its IP and port in the ip() and port() options, respectively.
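
A remote connection might look like the sketch below. The address 192.0.2.10 is a placeholder from the range reserved for documentation, and writing the IP as a quoted string is an assumption; see h2o connect for the exact argument format.

```stata
* Connect to an H2O cluster on another machine.
* 192.0.2.10 is a placeholder address; replace it with the IP of your cluster.
* Quoting the address is an assumption about how ip() takes its argument.
h2o connect, ip("192.0.2.10") port(54321)
```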

When you connect to an existing H2O cluster, a new Stata H2O session is created between the Stata client and the H2O cluster. Multiple clients can connect to the H2O cluster at the same time, and they all share its resources, such as the data and models within the cluster.

Once you have successfully established a connection, h2o connect will compare the version of H2O running on the remote cluster with the H2O version shipped with Stata. If there is a mismatch, a warning is displayed underneath the summary report to indicate the difference. There is no need to panic if you see the warning; you can still interact with the remote cluster. A mismatch causes problems only when the REST API changed between the two versions in a way that makes requests from the client fail. You can suppress the version check by specifying the novercheck option with h2o connect.
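
For example, to connect to the default local address without the version comparison, you could type the following. This is only a sketch of the novercheck option described above.

```stata
* Connect to the cluster at localhost:54321 and skip the version check,
* so no mismatch warning is displayed under the summary report.
h2o connect, novercheck
```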

Interact with the H2O cluster

Once the H2O cluster is up, you can interact with it from within Stata. For example, you can type h2o query to check the status of the cluster at any time.

. h2o query
Cluster is running at http://127.0.0.1:54321.
------------------------------------------------------------------------------
H2O cluster uptime:        50 secs
H2O cluster timezone:      America/Chicago
H2O data parsing timezone: UTC
H2O cluster version:       3.36.0.1
H2O cluster version age:   12 days
H2O cluster total nodes:   1
H2O cluster free memory:   4 Gb
H2O cluster total cores:   24
H2O cluster allowed cores: 24
H2O cluster status:        accepting new members, healthy
H2O connection url:        http://127.0.0.1:54321
------------------------------------------------------------------------------

If there are multiple nodes within the cluster, you can also specify the detail option to list the information for each node.

Node Details:
------------------------------------------------------------------------------
Node 1
------------------------------------------------------------------------------
IP:                        127.0.0.1:54321
Healthy:                   yes
Total cores:               24
Allowed cores:             24
Free memory:               4 Gb
Free disk:                 1.359 Tb
Pid:                       7624
------------------------------------------------------------------------------
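
A node listing like the one above would come from a call of the following form. The sketch assumes that detail is given after the comma, as with other Stata options.

```stata
* Show the cluster summary followed by one block of details per node.
h2o query, detail
```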

You can import data from your local drive to the cluster as an H2O frame. For example, the following code will load Stata’s auto dataset to the cluster.

. sysuse auto
. _h2oframe _put, into(h2oauto)

By default, _h2oframe _put loads the entire dataset in Stata's memory into the cluster, where it is stored as an H2O frame with the name given in into(), here h2oauto. To load a subset instead, you can specify a columnlist and the if and in qualifiers. See _h2oframe _put for more information. Once the dataset is loaded, any operations you perform on the H2O frame are handled by the cluster, not by Stata. In other words, the copy of the auto dataset in Stata and the copy in the cluster are independent of each other.
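
Loading a subset might look like the sketch below. It assumes a cluster is already running and that the column list and qualifiers go between _h2oframe _put and the comma, as in most Stata commands. The frame names h2oforeign and h2ofirst20 are hypothetical; see _h2oframe _put for the exact syntax.

```stata
* Load the auto dataset into Stata's memory.
sysuse auto, clear

* Send only three variables, and only the foreign cars, to the cluster.
* h2oforeign is a hypothetical frame name chosen for this example.
_h2oframe _put price mpg weight if foreign == 1, into(h2oforeign)

* Send the first 20 observations of all variables to a second frame.
* h2ofirst20 is likewise a hypothetical name.
_h2oframe _put in 1/20, into(h2ofirst20)
```

Typing _h2oframe _dir afterward would list the new frames together with their dimensions.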

You can type _h2oframe _dir to list all H2O frames in the cluster, along with the dimensions of the data and the amount of memory the data consume in the cluster.

. _h2oframe _dir
Name                                     |        Rows        Cols        Size
-----------------------------------------+------------------------------------
h2oauto                                  |          74          12    3.982 Kb

Total: 1

For more information about H2O frames, see Introduction to H2O frames.

Close and disconnect from the H2O cluster

Once you have finished your analysis on the H2O cluster, you can type h2o shutdown to close the connection to the cluster.

If the cluster was started locally by Stata through h2o init, then h2o shutdown will fail by default with the message “Shutting it down will also close Stata”. This is because shutting down the H2O server destroys the Java virtual machine (JVM) and exits the application that initialized it. Because Stata initialized the JVM, Stata is closed as well. Specifying the force option shuts down the cluster and exits Stata. If you are not ready to exit Stata, you can leave the cluster running for now. Later, when you close Stata, the cluster will be shut down automatically. Once the cluster is completely shut down, all resources within it are lost, including any data it contained.

If the cluster is running on a remote machine and you connected to it by typing h2o connect in Stata, then h2o shutdown closes only the H2O session between Stata and the cluster. It destroys neither the remote H2O cluster nor the resources it contains. You do not need to exit Stata to close the session, and you can reconnect to the H2O cluster later by typing h2o connect and access everything again. If you specify the force option, however, the H2O cluster on the remote machine is shut down and everything in it is lost. Stata does not need to be closed in this case either.
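
The two forms of the command discussed in this section are sketched below. They are alternatives, so run one or the other, not both in sequence.

```stata
* Default form. For a remote cluster, this closes only the Stata H2O session.
* For a cluster started by h2o init, this fails with a message instead.
h2o shutdown

* Forced form. The cluster itself is shut down and everything in it is lost.
* For a cluster started by h2o init, this also exits Stata.
h2o shutdown, force
```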