Put Stata variables into an H2O frame and vice versa

Syntax

Save data in memory to an H2O frame on the H2O cluster

    _h2oframe _put [varlist] [if] [in] , into(newframename) [_put_options]

Load an existing H2O frame as the current Stata dataset

    _h2oframe _get [using] framename [if] [in] [, _get_options]

Load a subset of columns in an existing H2O frame as the current Stata dataset

    _h2oframe _get columnlist using framename [if] [in] [, _get_options]

varlist is a list of variable names in Stata’s current dataset.

columnlist is a list of column names in the H2O frame; see Specifying a list of columns for more information.

 _put_options                              Description
 -----------------------------------------------------------------------------------
 * into(newframename)                      destination H2O frame
   nolabel                                 output numeric values (not labels) of
                                             labeled variables
 -----------------------------------------------------------------------------------
 * into() is required.
 
 _get_options                              Description
 -----------------------------------------------------------------------------------
 case(preserve|lower|upper)                preserve the case or read column names as
                                             lowercase (the default) or uppercase
 asfloat                                   load all floating-point data as floats
 asdouble                                  load all floating-point data as doubles
 clear                                     replace data in memory
 -----------------------------------------------------------------------------------

Description

_h2oframe _put exports Stata’s current dataset to an H2O frame on the H2O cluster.

_h2oframe _get loads an existing H2O frame into Stata as the current dataset.

When _h2oframe _put exports Stata’s current dataset to an H2O frame, all of Stata’s categorical/factor variables are stored as enum (categorical) columns. Conversely, when _h2oframe _get loads an H2O frame into Stata, all enum (categorical) columns are stored as string variables in the dataset. See What is an H2O frame? for more information about the data types in an H2O frame.
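
The round trip described above can be sketched as follows. This is an illustration only: it assumes that Stata is already connected to a running H2O cluster, and the frame name auto_rt and the variable name origin are hypothetical. If foreign comes back as a string variable, as the conversion rule suggests, encode can rebuild a labeled numeric variable from it.

     . sysuse auto, clear
     . _h2oframe _put make mpg foreign, into(auto_rt)
     . _h2oframe _get auto_rt, clear
     . describe foreign
     . * valid only if foreign was loaded as a string variable
     . encode foreign, generate(origin)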

Options

Options for _h2oframe _put

into(newframename) specifies the destination H2O frame in which to store the Stata variables. into() is required.

nolabel specifies that the numeric values of labeled variables be exported to the H2O frame rather than the label associated with each value.
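
A minimal sketch contrasting the two behaviors, assuming a connected H2O cluster (the frame names auto_lab and auto_num are hypothetical):

     . sysuse auto, clear
     . _h2oframe _put make foreign, into(auto_lab)
     . _h2oframe _put make foreign, into(auto_num) nolabel

In auto_lab, foreign would be expected to hold the labels Domestic and Foreign; in auto_num, it would hold the underlying values 0 and 1.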

Options for _h2oframe _get

case(preserve|lower|upper) specifies the case of the column names after loading. The default is case(lower).
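
For instance, assuming the H2O frame auto1 already exists with the lowercase columns make and mpg, the following sketch should load them as the variables MAKE and MPG:

     . _h2oframe _get make mpg using auto1, case(upper) clear
     . describe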

asfloat loads all floating-point data from the H2O frame as type float. By default, the storage type of these columns is determined by set type.

asdouble loads all floating-point data from the H2O frame as type double. By default, the storage type of these columns is determined by set type.
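
As an illustrative sketch, assuming the H2O frame auto1 was created from the auto dataset (in which headroom and gear_ratio are floating-point variables), the following should load both columns as doubles regardless of the current set type setting:

     . _h2oframe _get headroom gear_ratio using auto1, asdouble clear
     . describe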

clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.

Examples

 Setup
     . sysuse auto

 Export this dataset to an H2O frame named auto1
     . _h2oframe _put, into(auto1)

 Look at what we just exported
     . _h2oframe _change auto1
     . _h2oframe _describe

 Export a subset of the data to another H2O frame named auto2 and then list
 the contents of the frame
     . _h2oframe _put make mpg foreign in 1/50, into(auto2)
     . _h2oframe _change auto2
     . _h2oframe _list

 -----------------------------------------------------------------------------------
 Load the data from the H2O frame auto1 into Stata as the current dataset and
 then list the data
     . _h2oframe _get auto1, clear
     . list

 -----------------------------------------------------------------------------------
 Same as above, but only load a subset of the data
     . _h2oframe _get make mpg rep78 foreign using auto1 in 1/10, clear
     . list

Stored results

 _h2oframe _get stores the following in r():

 Scalars
   r(N)                number of rows loaded from the H2O frame
   r(k)                number of columns loaded from the H2O frame
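
 A brief sketch of how these results might be inspected, assuming the H2O frame
 auto1 exists as in the examples above. With four columns and in 1/10, r(N)
 should be 10 and r(k) should be 4.

     . _h2oframe _get make mpg rep78 foreign using auto1 in 1/10, clear
     . display "rows: " r(N) ", columns: " r(k)
     . return list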