Split an existing H2O frame into multiple H2O frames

Syntax

    _h2oframe split framename, into(newframelist) [ratio(numlist) rseed(#)]

 options                        Description
 -----------------------------------------------------------------------------------
 * into(newframelist)           specify list of destination H2O frames
   ratio(numlist)               specify numlist of proportions or ratios for the
                                  split
   rseed(#)                     specify random-number seed
 -----------------------------------------------------------------------------------
 * into() is required.

Description

_h2oframe split splits an existing H2O frame into two or more new H2O frames whose sizes follow the proportions specified in ratio().

Options

into(newframelist) specifies a list of names for the new frames that will be created by splitting the existing H2O frame. into() is required and at least two names must be specified.

ratio(numlist) splits the H2O frame into a list of H2O frames whose sizes are proportional to the values of numlist. Each value in numlist must be positive, and the values must sum to 1. The number of values specified in ratio() must equal the number of names specified in into(). The default is ratio(0.75 0.25).
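
As a minimal sketch of how into() and ratio() pair up, the commands below assume an H2O frame named iris already exists (as in the Setup under Examples). The destination frame names are hypothetical and chosen only for illustration. The first command relies on the default ratio(0.75 0.25); the second supplies one ratio per destination frame, with the values summing to 1.

     . _h2oframe split iris, into(train test)
     . _h2oframe split iris, into(part1 part2 part3) ratio(0.5 0.25 0.25)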

Unlike Stata’s splitsample, which performs an exact split of the current Stata dataset, _h2oframe split may not produce an exact split. The relative sizes of the resulting H2O frames may only approximate the ratios you specified. This is because H2O segments the original frame with a probabilistic splitting method, which is more efficient for big data than exact splitting.
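
The difference between a probabilistic and an exact split can be illustrated in plain Stata, without H2O. The sketch below is not H2O's actual algorithm, which this entry does not describe. It assumes the simplest probabilistic scheme, in which each observation is assigned independently with probability 0.8 to group 1. The variable names part and svar are hypothetical.

     . clear
     . set obs 150
     . set seed 17

 Probabilistic assignment: group sizes vary around 80% and 20%
     . generate byte part = cond(runiform() < 0.8, 1, 2)
     . tabulate part

 Exact assignment with splitsample: group sizes match the requested proportions
     . splitsample, generate(svar) split(0.8 0.2) rseed(17)
     . tabulate svar

Under the independent-assignment scheme no sorting or counting pass over the data is needed, which is one reason such methods tend to scale well, but the realized proportions only approach the requested ones as the number of observations grows.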

rseed(#) sets the random-number seed. Specifying the same seed lets you reproduce a split.
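
A minimal sketch of reproducing a split, assuming an H2O frame named iris already exists and using hypothetical destination names: because both commands use the same frame, ratios, and seed, the second split is expected to match the first.

     . _h2oframe split iris, into(first1 first2) ratio(0.8 0.2) rseed(17)
     . _h2oframe split iris, into(again1 again2) ratio(0.8 0.2) rseed(17)
     . _h2oframe dir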

Examples

 Setup
     . webuse iris, clear
     . _h2oframe put, into(iris)

 Split the iris H2O frame into two H2O frames, with approximately 80% of the data
 stored in frame iris1 and 20% of the data stored in frame iris2
     . _h2oframe split iris, into(iris1 iris2) ratio(0.8 0.2) rseed(17)

 List all H2O frames
     . _h2oframe dir

 -----------------------------------------------------------------------------------
 Split the iris data into two samples by using splitsample, with 80% of observations
 in sample 1 and 20% of observations in sample 2
     . webuse iris, clear
     . splitsample, generate(svar, replace) split(0.8 0.2)

 Store each sample in a separate H2O frame
     . _h2oframe put if svar==1, into(iris3)
     . _h2oframe put if svar==2, into(iris4)

 List all H2O frames
     . _h2oframe dir