Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: merge by range of values using mata
From
Luke Miner <[email protected]>
To
[email protected]
Subject
st: merge by range of values using mata
Date
Wed, 7 Mar 2012 17:42:37 +0000
Like the thread posted below, I'm trying to merge datasets by ranges.
In dataset A each observation has a unique value of X
In dataset B, each observation contains a range of X values,
classified by start_X and end_X. The remaining variable, Y, in dataset
B contain values for all Xs that fall within this range.
I want to merge the Y values using the X ranges.
I saw the approach suggested below on the list. However, I have
millions of observations and it is prohibitively slow. I've been
trying to convert this into mata with no luck. Any suggestions on mata
code?
> On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
>
> I am trying to find a way to merge data sets according to a range
> of values, sort of a combination of m:1 merge and inrange().
>
> In one data set, each observation represents a subject with
> the individual's value for variable X.
>
> In another data set, each observation represents a range of values for
> variable X. The start and end values of the range are separate variables,
> such as start_X and end_X. The remaining variables contain the values of Y
> and Z for all values of X within that range.
>
> Is there a way to merge the Y and Z data from the second data set into the
> first by comparing the value of X to the range specified by start_X and
> end_X?
>
> I thought of transforming the second data set in order to create
> new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2,
> end_X_2, Y_2, Z_2, etc., adding those data to each observation in the
> first dataset, and using a loop and inrange() in order to compute Y and
> Z for each subject, but there are about 3,000,000 different ranges of X in
> the second data set, so this is impractical.
>
> There are (at least) two ways to approach this. The first is only viable
> for small datasets, though it is worth knowing about. I'll take for granted
> that the intervals in your second dataset are non- overlapping; you should
> verify this, and if they are not, then you'll need to decide how to handle
> this. Also, I am ignoring the issue of numerical precision on the boundaries
> of your intervals; if the boundaries of your intervals are non-integer
> values, then you'll need to consider this issue as well.
>
> Here is the first approach:
>
>
> use dataset1
> cross using dataset2
> keep if inrange(x,start_x,end_x)
>
>
> Note that any records in the first dataset that do not have
> a corresponding interval in the second will be excluded from the result. A
> lower-memory variant of this is to work initially with only the variables x,
> start_x and end_x; once you've created your mapping, you can then merge your
> datasets in two steps (i.e., merge the mapping onto the first dataset, and
> then merge the result onto the second).
>
> If your dataset is too large for this approach (as it sounds like it is in
> this case), then an alternative is the following:
>
> use dataset1
> merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen
>
> gen start = .
> gen end = .
> forv i=1/`c(N)' {
> if mi(start_x[`i']) continue, break
>
> replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>
> replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
> }
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/