Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Merge by range of values
From
Phil Schumm <[email protected]>
To
[email protected]
Subject
Re: st: Merge by range of values
Date
Mon, 13 Jun 2011 19:57:26 -0500
On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
I am trying to find a way to merge data sets according to a range of
values, sort of a combination of m:1 merge and inrange().
In one data set, each observation represents a subject with the
individual's value for variable X.
In another data set, each observation represents a range of values
for variable X. The start and end values of the range are separate
variables, such as start_X and end_X. The remaining variables
contain the values of Y and Z for all values of X within that range.
Is there a way to merge the Y and Z data from the second data set
into the first by comparing the value of X to the range specified by
start_X and end_X?
I thought of transforming the second data set in order to create new
variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2, end_X_2,
Y_2, Z_2, etc., adding those data to each observation in the first
dataset, and using a loop and inrange() in order to compute Y and Z
for each subject, but there are about 3,000,000 different ranges of
X in the second data set, so this is impractical.
There are (at least) two ways to approach this. The first is only
viable for small datasets, though it is worth knowing about. I'll
take for granted that the intervals in your second dataset are non-
overlapping; you should verify this, and if they are not, then you'll
need to decide how to handle this. Also, I am ignoring the issue of
numerical precision on the boundaries of your intervals; if the
boundaries of your intervals are non-integer values, then you'll need
to consider this issue as well.
Here is the first approach:
use dataset1
cross using dataset2
keep if inrange(x,start_x,end_x)
Note that any records in the first dataset that do not have a
corresponding interval in the second will be excluded from the
result. A lower-memory variant of this is to work initially with only
the variables x, start_x and end_x; once you've created your mapping,
you can then merge your datasets in two steps (i.e., merge the mapping
onto the first dataset, and then merge the result onto the second).
If your dataset is too large for this approach (as it sounds like it
is in this case), then an alternative is the following:
use dataset1
merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen
gen start = .
gen end = .
forv i=1/`c(N)' {
if mi(start_x[`i']) continue, break
replace start = start_x[`i'] if
inrange(x,start_x[`i'],end_x[`i'])
replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
}
drop start_x end_x
ren start start_x
ren end end_x
merge m:1 start_x end_x using dataset2
Unlike the first approach, this approach will retain all records
(including those from the first dataset without a corresponding
interval in the second, and those in the second without a matching
observation in the first); you may use the -keep()- option on the
second merge command to exclude one or both of these if you wish.
Note that this second approach does something you should in general
avoid doing; that is, using Stata code to loop manually through the
observations in a large dataset. I did this only to illustrate the
technique; moving the loop into Mata would speed it up considerably.
-- Phil
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/