What is an H2O frame?¶
An H2O frame is a 2D array of data with named columns and numbered rows, where the contents of each column are of the same type. The column type may be one of the following:
Type      Description
int       Numeric column with integer values
real      Numeric column with float or double values
enum      Categorical or factor column
time      Long milliseconds since the Unix Epoch
string    String column
uuid      UUID, an immutable universally unique identifier
bad       Column with all NA values
The H2O frame is the main data structure used on the H2O cluster for data manipulation.
Below is a tutorial showing you how to load data to the H2O cluster as an H2O frame and how to manipulate the frame from within Stata. At times, we will show you a syntax diagram. For example, we might show you
_h2oframe copy framename newframename
When we show syntax diagrams in the tutorial, usually we will not show the full specification. For example, _h2oframe copy also allows a replace option. You can click on the command to see the full syntax.
Load data to the H2O cluster into an H2O frame¶
There are many ways to generate H2O frames on the H2O cluster from within Stata. For example, you can load Stata’s current dataset as an H2O frame, create a new H2O frame with random data, or load various data files into an H2O frame.
Load Stata’s current dataset into an H2O frame¶
You can load Stata’s current dataset as an H2O frame by using the _h2oframe put command:
_h2oframe put [varlist] [if] [in], into(newframename)
You can load the entire dataset or a subset of it to an H2O frame. With this command, you can load your data into Stata, do some data wrangling using Stata’s data management commands, and then load them to the H2O cluster for further analysis. In other words, you can combine the data-processing capabilities of Stata and H2O in a single environment. For example,
. sysuse auto // load auto data in Stata
. by foreign: generate id = _n // generate indices for each group
. _h2oframe put, into(h2oauto) // load the data into an H2O frame
. _h2oframe dir // list the frames on the cluster
After the data are successfully loaded into an H2O frame, all operations on that frame are performed by the cluster instead of by Stata. The copy in Stata and the copy on the cluster are independent of each other, so you can even clear the copy in Stata to save memory while keeping the copy on the H2O cluster.
. clear // clear the data in Stata
. _h2oframe dir // the H2O frame is still there
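The syntax diagram above also accepts a varlist and an if qualifier, so only part of the dataset needs to travel to the cluster. The following is a minimal sketch of that form. It assumes Stata is already connected to a running H2O cluster, and the frame name h2oforeign is chosen purely for illustration.

```stata
sysuse auto, clear                     // load the auto data in Stata
// send only price, mpg, and weight for the foreign cars to the cluster
_h2oframe put price mpg weight if foreign == 1, into(h2oforeign)
_h2oframe dir                          // h2oforeign should now be listed
```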
Create a new H2O frame¶
A new H2O frame with random data can be directly created on the cluster by using the _h2oframe create command:
_h2oframe create newframename
By default, the frame contains 10,000 rows and 10 columns. The column types and values in each column are randomly generated by H2O. _h2oframe create provides a lot of options to control how this frame is created. For example, you can customize its dimensions, the proportion of each type of column that it contains, and so on.
Remember that the H2O frame created above exists on the cluster, not in Stata. You can use _h2oframe get to export it to Stata as the current dataset, as shown below:
. _h2oframe create myframe, rseed(17) // create a new H2O frame
. _h2oframe get myframe, clear // load it into Stata
. describe // describe its contents
. summarize // summarize the data
See Save data locally for more details.
Import or upload data files into an H2O frame¶
You can load data files, such as .csv files, directly into the cluster as H2O frames by using the _h2oframe import and _h2oframe upload commands:
_h2oframe import path, into(newframename)
_h2oframe upload path, into(newframename)
The path can be a local path or a remote path specified by a URL. The file format must be compatible with H2O. See Getting Data into Your H2O Cluster for more information about the file formats and data sources that H2O supports.
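As a rough sketch of how an import might look in practice, the commands below load a file and then check what arrived on the cluster. The file mydata.csv and the frame name mydata are hypothetical; substitute a path or URL that your cluster can actually read.

```stata
// mydata.csv is a hypothetical file; replace it with your own path or URL
_h2oframe import mydata.csv, into(mydata)
_h2oframe change mydata                // make the new frame the current frame
_h2oframe describe                     // check the type assigned to each column
```

Describing the frame right after loading it is a cheap way to catch a column that arrived with an unexpected type.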
Manipulate H2O frames¶
Remember that H2O frames exist on the H2O cluster instead of within Stata. You cannot use Stata’s data management commands to manipulate them. Instead, a suite of Stata commands allows you to manipulate the frames on the cluster. We discuss some of these commands below; see Create and manipulate H2O frames for a complete list of such commands.
Display all H2O frames¶
You can list all H2O frames on the cluster by using the _h2oframe dir command:
_h2oframe dir
This command lists each frame’s name, its dimensions, and the amount of memory it occupies, along with the total number of frames on the cluster. The frame name is the ID that Stata uses to track the frame on the cluster.
Another important role that _h2oframe dir plays is to track all the H2O frames on the cluster from within Stata, in case Stata gets disconnected from the cluster. For example, suppose you connect to an existing H2O cluster built by your colleague by using h2o connect from within Stata. You load some data into the cluster as an H2O frame and perform some operations on the frame. Then, you lose the connection between Stata and the cluster. Later, when the connection is re-established, you must use _h2oframe dir to rebuild the tracking information about all the H2O frames on the cluster.
. h2o connect, ip(#.#.#.#) port(#) // connect to an existing H2O cluster
// created by your colleague
. sysuse auto // load your data into Stata
. _h2oframe put, into(h2oauto) // load the data into the cluster
// ... perform some tasks on h2oauto, then exit Stata ...
// ... later, reopen Stata ...
. h2o connect, ip(#.#.#.#) port(#) // reconnect to the cluster
. _h2oframe dir // retrack the frames in the cluster
. _h2oframe get h2oauto // load the frame you created into Stata
Copy H2O frames¶
_h2oframe copy makes a copy of an existing H2O frame on the cluster:
_h2oframe copy framename newframename
It is a good habit to make a copy of an H2O frame before you perform any data manipulation on it. This is because some data operations are not reversible, such as dropping columns and observations from the frame.
Imagine that you loaded Stata’s current working dataset into an H2O frame and then dropped some observations from the frame based on some condition. Later on, you find that the condition you used was wrong, and you want to revert your frame to its original state. You would have to reload the data from Stata into a new H2O frame and drop the observations again based on the refined condition. If your dataset is small, this is not a big deal. But if your dataset contains millions of observations, redoing all the changes may take a significant amount of time. Making a copy of your dataset on the cluster will be very useful under such circumstances.
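A minimal sketch of this habit, assuming the h2oauto frame created earlier in this tutorial is still on the cluster: the original frame is left untouched, and all experiments happen on a working copy. The name h2oauto_work is hypothetical, and the last command assumes that the replace option of _h2oframe copy overwrites an existing frame of the same name.

```stata
// keep h2oauto untouched and experiment on a working copy instead
_h2oframe copy h2oauto h2oauto_work
_h2oframe change h2oauto_work          // make the copy the current frame
_h2oframe replace price = price/1000   // a change you may later regret

// start over from the untouched original (assumes replace overwrites
// the existing frame h2oauto_work)
_h2oframe copy h2oauto h2oauto_work, replace
```

Because the copy is made on the cluster, starting over does not require sending the data from Stata again.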
Split H2O frames¶
When training a machine learning model, it is common to split your dataset into two datasets: one for training the model and the other for evaluating the model performance. This can be done using the _h2oframe split command:
_h2oframe split framename, into(newframelist) ratio(numlist)
For example, the following command splits an existing H2O frame myframe into two subframes, named train and test:
. _h2oframe split myframe, into(train test) ratio(0.8 0.2)
Unlike Stata’s splitsample, which performs an exact split on Stata’s current dataset, _h2oframe split may not provide an exact split. Instead, the relative sizes of the resulting H2O frames only approximate the ratios you specified. This is because H2O uses a probabilistic splitting method to segment the original frame, which is more efficient for big data than an exact splitting method.
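One way to see the approximate nature of the split is to split a frame of known size and compare the row counts afterward. The sketch below assumes a running cluster with no existing frames named splitdemo, train, or test; the name splitdemo is hypothetical.

```stata
_h2oframe create splitdemo, rseed(17)  // random frame; 10,000 rows by default
_h2oframe split splitdemo, into(train test) ratio(0.8 0.2)
_h2oframe dir                          // compare the row counts of train and test
```

With 10,000 rows, the two frames should hold close to 8,000 and 2,000 rows, though not necessarily exactly those counts.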
Combine H2O frames¶
You can combine multiple H2O frames rowwise or columnwise into one:
_h2oframe rbind framenamelist [, into(newframename)]
_h2oframe cbind framenamelist [, into(newframename)]
By default, the data from the subsequent H2O frames will be appended to the first one. You can specify the into() option to combine all data into a new H2O frame.
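As an illustration of the into() option, the following sketch splits a frame in two and then stacks the pieces back together rowwise into a new frame, leaving both pieces unchanged. All frame names here (bigframe, part1, part2, rejoined) are hypothetical, and a running cluster is assumed.

```stata
_h2oframe create bigframe, rseed(17)
_h2oframe split bigframe, into(part1 part2) ratio(0.5 0.5)
_h2oframe rbind part1 part2, into(rejoined)   // stack the rows into a new frame
_h2oframe dir                                 // rejoined should have as many rows as bigframe
```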
The current H2O frame¶
While you may have many H2O frames on the cluster, you can only work on one frame at a time. We refer to the H2O frame you are working on as the current frame. Any H2O operations you perform, such as dropping columns and observations or listing values of columns, will be done on the current frame. You can type _h2oframe pwf to display the name of the current H2O frame:
_h2oframe pwf
If you haven’t yet set any H2O frame as the current frame, or you want to switch to a different one, you can do so with _h2oframe change:
_h2oframe change framename
For example, suppose you have an H2O frame named myframe within the cluster, and you want to describe the data and obtain summary statistics. You can type
. _h2oframe change myframe
. _h2oframe describe
. _h2oframe summarize
Work on the current H2O frame¶
This section introduces a few commands you can use to work on the current H2O frame. For a complete list, see Create and manipulate H2O frames. Before using these commands, you need to use _h2oframe change to set a frame as the current one; see The current H2O frame.
Inspect the current H2O frame¶
You can produce a summary report for the current H2O frame by typing
_h2oframe describe [columnlist]
This command displays metadata for each column in the frame, such as its type; the number of missing values, zeros, and positive and negative infinity values; and, for categorical columns, the number of levels.
You can choose to report the summary for the whole frame or for selected columns by specifying a column list, columnlist. We refer to this as a columnlist and not a variable list (varlist) because we are working on an H2O frame. However, a columnlist can be specified in almost the same way as a varlist; see Specifying a list of columns for more information.
_h2oframe summarize displays a variety of univariate statistics for each column:
_h2oframe summarize [columnlist] [if] [in]
This command will calculate the number of non-missing observations, mean, standard deviation, and minimum and maximum values for each column in the frame. You can also use the if and in qualifiers to restrict the observations; see Filtering observations for a discussion about how to use qualifiers on H2O frames.
If you want to take a quick look at the data in the frame, you can use the _h2oframe list command:
_h2oframe list [columnlist] [if] [in]
By default, it only shows the first 10 observations. This is intentional, because the data are fetched from the H2O cluster dynamically. Unlike Stata’s list command, which displays values from Stata’s memory, _h2oframe list may not be efficient if the frame contains millions of observations. Having said that, you can use the if and in qualifiers to restrict the list to only the observations in which you are interested.
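The three inspection commands can be combined into a quick first look at a frame. This sketch assumes the h2oauto frame created earlier in this tutorial is still on the cluster, with the column names of Stata's auto dataset.

```stata
_h2oframe change h2oauto                   // make h2oauto the current frame
_h2oframe describe price mpg               // metadata for two columns only
_h2oframe summarize price mpg if mpg > 20  // statistics for a subset of rows
_h2oframe list make price in 1/5           // fetch just the first five rows
```

Restricting the columns and rows keeps the amount of data fetched from the cluster small.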
Rename columns¶
Before you proceed to conduct data wrangling and analysis on an H2O frame within Stata, it is a good idea to check the column names. Some column names of the frame you are working on may not match Stata’s naming conventions. For example, “Mileage (mpg)” is a valid column name for an H2O frame, but it is an invalid variable name in Stata. When you feed it to the _h2oframe commands, they may fail to execute due to invalid syntax.
You can use _h2oframe rename to rename the columns:
_h2oframe rename old new
_h2oframe rename (old1 old2 ...) (new1 new2 ...)
You can rename columns one at a time or rename multiple columns at once.
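Both forms might be used as follows. The sketch works on a copy so that the original frame keeps its column names; it assumes the h2oauto frame from earlier in this tutorial exists, and the frame name h2orename and the new column names are hypothetical.

```stata
_h2oframe copy h2oauto h2orename       // rename on a copy, not the original
_h2oframe change h2orename
_h2oframe rename mpg mileage           // one column at a time
_h2oframe rename (rep78 headroom) (repair head)   // several columns at once
_h2oframe describe                     // confirm the new names
```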
Change column types¶
There may be times when you want to convert a numeric column to a categorical (enum) column. For example, suppose you want to train an H2O classification model with an H2O frame, and thus you expect to see some classification metric, such as the classification accuracy, in the output. However, suppose that the output instead shows that you actually trained a regression model. This can happen if the target variable (dependent variable) is stored as a numeric type instead of a categorical (enum) type in the frame. Because the target variable is numeric, H2O automatically trains a regression model. You can use _h2oframe toenum to convert the original column to a categorical (enum) column before training the model.
_h2oframe toenum columnlist, {generate(newcolumnlist)|replace}
The original column does not need to be numeric; a string column also works. When you use _h2oframe toenum with string columns, H2O internally creates a map between the string values and corresponding integer values in the resulting column.
If you have a column containing all numeric values but the column type is stored as string, you can use _h2oframe tonumeric to convert it to a numeric type:
_h2oframe tonumeric columnlist, {generate(newcolumnlist)|replace}
In addition, you can use _h2oframe tostring to convert columns to string columns:
_h2oframe tostring columnlist, {generate(newcolumnlist)|replace}
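The following sketch chains the three conversions on one column and then checks the resulting types. It assumes the h2oauto frame from earlier in this tutorial is on the cluster; the new column names are hypothetical, and generate() is used so that the original rep78 column is left as it was.

```stata
_h2oframe change h2oauto
_h2oframe toenum rep78, generate(rep78_cat)        // numeric -> enum
_h2oframe tostring rep78, generate(rep78_str)      // numeric -> string
_h2oframe tonumeric rep78_str, generate(rep78_num) // string -> numeric
_h2oframe describe rep78*                          // compare the column types
```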
Create or change the contents of a column¶
_h2oframe generate and _h2oframe replace are used to create new columns and to modify the contents of existing columns, respectively:
_h2oframe generate newcolname =exp
_h2oframe replace oldcolname =exp [if] [in]
The values of the new column and the values used as replacements for the old column are specified with =exp. See Using functions and expressions for a detailed discussion about specifying expressions with H2O frames.
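A short sketch of the two commands working together, assuming the h2oauto frame from earlier in this tutorial is on the cluster. The column name pricek is hypothetical, and the example assumes that ordinary arithmetic and comparison operators are accepted in H2O frame expressions.

```stata
_h2oframe change h2oauto
_h2oframe generate pricek = price/1000         // price in thousands
_h2oframe replace pricek = 10 if pricek > 10   // top-code values above 10
_h2oframe summarize price pricek               // check the result
```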
Save data locally¶
_h2oframe export downloads the current H2O frame from the cluster to a .csv file on your local disk:
_h2oframe export [using] filename [if] [in]
_h2oframe export [columnlist] using filename [if] [in]
Before saving the dataset locally, make sure that you have enough disk space to hold the .csv file because the H2O frame on the cluster may be very large. Instead of downloading the whole frame, you can just download the subset you are interested in by specifying a column list (columnlist) and the if and in qualifiers.
In addition to saving the frame data to a .csv file, you can also load all the frame data or a subset of it to Stata as the current dataset:
_h2oframe get [using] framename [if] [in]
_h2oframe get columnlist using framename [if] [in]
_h2oframe get works with any H2O frame on the cluster, while _h2oframe export works only on the current H2O frame.
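The difference between the two commands can be sketched as follows, assuming the h2oauto frame from earlier in this tutorial is on the cluster. The file name auto_subset.csv is hypothetical, and combining an if qualifier with the clear option of _h2oframe get is an assumption based on the syntax shown in this tutorial.

```stata
_h2oframe change h2oauto               // export works on the current frame
// write three columns for the high-mileage cars to a local .csv file
_h2oframe export make price mpg using auto_subset.csv if mpg > 25

// get names its frame explicitly, so no _h2oframe change is needed
_h2oframe get make price using h2oauto if price > 5000, clear
describe                               // the subset is now Stata's current dataset
```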
Miscellaneous¶
Unlike other Stata commands, which work with data in Stata’s memory, the _h2oframe suite of commands works with data in H2O frames on the H2O cluster. All the work done on the data is performed by H2O, not by Stata. Although the syntax of the _h2oframe commands is similar to that of other Stata commands, the suite works in a different way. Below, we discuss three main differences.
Specifying a list of columns¶
A lot of Stata commands allow you to specify a variable name (varname) or a list of variable names (varlist). These names refer to variables in the current dataset in Stata. The commands use those target variables to perform specific tasks. If one of the variables you specified does not exist in the current dataset, a “variable not found” error will be issued.
In the same way, some of the _h2oframe commands also allow you to specify a variable name or a list of variable names to work on. A variable refers to a column in the frame, so we use columnname and columnlist in the command syntax, which play the same role as varname and varlist in other Stata commands. Having said that, because the _h2oframe commands deal with data from the H2O frames, when you specify the column names in an _h2oframe command, we will check the existence of those columns or variables in the current H2O frame instead of the current dataset in Stata. In other words, if you specify a column or variable name that exists in the current dataset but not in the current H2O frame, a “column not found” error will be issued.
When you specify a list of column names in an _h2oframe command, you can specify each one explicitly with its full name or you can use abbreviations, just like you would with a varlist. For example,
mycol                    just one column
mycol thiscol thatcol    three columns
mycol*                   columns starting with mycol
*col                     columns ending with col
my*col                   columns starting with my & ending with col, with any number of
                         other characters between
my~col                   one column starting with my & ending with col, with any number
                         of other characters between
my?col                   columns starting with my & ending with col, with one other
                         character between
this-that                columns this through that, inclusive
The * character matches one or more characters. All columns matching the pattern are returned.
The ~ character also matches one or more characters, but unlike *, only one column is allowed to match. If more than one column matches, an error message is issued.
The ? character matches one character. All columns matching the pattern are returned.
The - character returns the column to the left of the -, the column to the right of the -, and all columns in between them.
Note that factor-variable and time-series operators are not allowed in the column specifications.
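To make the abbreviation rules concrete, here is a sketch that applies them to the h2oauto frame created earlier in this tutorial. It assumes the frame keeps the column names and column order of Stata's auto dataset; the comments show which columns each pattern should select under that assumption.

```stata
_h2oframe change h2oauto
_h2oframe describe t*                  // trunk and turn
_h2oframe describe ?pg                 // mpg
_h2oframe describe d~t                 // displacement, the only match
_h2oframe describe price-rep78         // price, mpg, and rep78
```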
Using functions and expressions¶
Expressions are found in several places in the Stata language. The =exp language element is used in a lot of commands, such as generate and replace. The if exp language element is another place where expressions are allowed. For instance, in
. summarize mrg dvc if region=="West" & mrg>.02
the region=="West" & mrg>.02 is a logical expression.
A few _h2oframe commands allow you to use the =exp language element and the if exp language element. For instance,
. _h2oframe generate mpglog = log(mpg)
creates a new column mpglog with values equal to the logarithm of the existing column mpg in the current H2O frame.
However, the expressions in those language elements are evaluated differently than they would be by other Stata commands because these expressions are evaluated by H2O. So if you use a function in the expressions, the function must be supported by H2O; see H2O frame functions for a complete list of H2O functions that can be used. Additionally, if you use a column name in an expression, the column must exist in the current H2O frame instead of in Stata’s current dataset.
Filtering observations¶
The if and in qualifiers can also be used in the _h2oframe commands. The syntax is the same as with other Stata commands:
h2ocommand if exp
h2ocommand in range
where exp means an expression, such as age>21. See Using functions and expressions for an explanation of expressions in an _h2oframe command.
range is #, #/#, f/#, or #/l, where f means the first observation and l means the last observation in the current H2O frame.
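Each form of range, along with an if qualifier, is illustrated in the sketch below. It assumes the h2oauto frame created earlier in this tutorial is on the cluster and holds the 74 observations of Stata's auto dataset.

```stata
_h2oframe change h2oauto
_h2oframe list make price in 5         // observation 5 only
_h2oframe list make price in 1/5       // observations 1 through 5
_h2oframe list make price in f/3       // the first 3 observations
_h2oframe list make price in 70/l      // observation 70 through the last
_h2oframe summarize price if mpg > 21  // only rows where mpg exceeds 21
```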