Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: merge by range of values using mata
From
Robert Picard <[email protected]>
To
[email protected]
Subject
Re: st: merge by range of values using mata
Date
Wed, 7 Mar 2012 15:09:24 -0500
If the ranges in B do not overlap, then here is a simple and very fast
way to doing this. Robert
*----------- begin example -------------
version 12
set seed 1234
clear
* Create observations with X between 0 and 100
set obs 20
gen idX = _n
gen X = runiform() * 100
list
tempfile A
save "`A'"
* Create non-overlapping ranges for X
clear
set obs 5
gen idB = _n
gen start_X = (_n-1) * 15
gen end_X = start_X + 10
gen Y = runiform() * 10
list
* Expand each range and define values for X
expand 2
sort idB
by idB: gen X = start_X if _n == 1
by idB: replace X = end_X if _n == 2
list, sepby(idB)
* Create a trick id (see below)
by idB: gen idB2 = cond(_n==1,idB,-idB)
* Append observations from A and order by X
append using "`A'"
sort X
gen idrange = sum(idB2)
list, noobs sepby(idrange)
* Target obs from A that fall within an idrange
keep if idrange != 0
sort idrange Y
by idrange: replace Y = Y[1]
list, noobs sepby(idrange)
drop if mi(idX)
list, noobs sepby(idrange)
*------------ end example --------------
On Wed, Mar 7, 2012 at 12:42 PM, Luke Miner <[email protected]> wrote:
> Like the thread posted below, I'm trying to merge datasets by ranges.
>
>
> In dataset A each observation has a unique value of X
>
>
> In dataset B, each observation contains a range of X values,
> classified by start_X and end_X. The remaining variable, Y, in dataset
> B contain values for all Xs that fall within this range.
>
>
> I want to merge the Y values using the X ranges.
>
>
> I saw the approach suggested below on the list. However, I have
> millions of observations and it is prohibitively slow. I've been
> trying to convert this into mata with no luck. Any suggestions on mata
> code?
>
>
>
>
>> On Jun 13, 2011, at 4:49 PM, Jeremy A. Grey wrote:
>>
>> I am trying to find a way to merge data sets according to a range
>> of values, sort of a combination of m:1 merge and inrange().
>>
>> In one data set, each observation represents a subject with
>> the individual's value for variable X.
>>
>> In another data set, each observation represents a range of values for
>> variable X. The start and end values of the range are separate variables,
>> such as start_X and end_X. The remaining variables contain the values of Y
>> and Z for all values of X within that range.
>>
>> Is there a way to merge the Y and Z data from the second data set into the
>> first by comparing the value of X to the range specified by start_X and
>> end_X?
>>
>> I thought of transforming the second data set in order to create
>> new variables, such as start_X_1, end_X_1, Y_1, Z_1, start_X_2,
>> end_X_2, Y_2, Z_2, etc., adding those data to each observation in the
>> first dataset, and using a loop and inrange() in order to compute Y and
>> Z for each subject, but there are about 3,000,000 different ranges of X in
>> the second data set, so this is impractical.
>>
>> There are (at least) two ways to approach this. The first is only viable
>> for small datasets, though it is worth knowing about. I'll take for granted
>> that the intervals in your second dataset are non- overlapping; you should
>> verify this, and if they are not, then you'll need to decide how to handle
>> this. Also, I am ignoring the issue of numerical precision on the boundaries
>> of your intervals; if the boundaries of your intervals are non-integer
>> values, then you'll need to consider this issue as well.
>>
>> Here is the first approach:
>>
>>
>> use dataset1
>> cross using dataset2
>> keep if inrange(x,start_x,end_x)
>>
>>
>> Note that any records in the first dataset that do not have
>> a corresponding interval in the second will be excluded from the result. A
>> lower-memory variant of this is to work initially with only the variables x,
>> start_x and end_x; once you've created your mapping, you can then merge your
>> datasets in two steps (i.e., merge the mapping onto the first dataset, and
>> then merge the result onto the second).
>>
>> If your dataset is too large for this approach (as it sounds like it is in
>> this case), then an alternative is the following:
>>
>> use dataset1
>> merge 1:1 _n using dataset2, keepusing(start_x end_x) nogen
>>
>> gen start = .
>> gen end = .
>> forv i=1/`c(N)' {
>> if mi(start_x[`i']) continue, break
>>
>> replace start = start_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>>
>> replace end = end_x[`i'] if inrange(x,start_x[`i'],end_x[`i'])
>> }
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/