Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified
From
Joseph Monte <[email protected]>
To
[email protected]
Subject
Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified
Date
Tue, 18 Sep 2012 18:40:23 +0100
Dear Austin and Gordon,
Thanks for your suggestions.
Sorry I was not clear in my email. I have a counter variable called
"identifier" which is unique for each observation. For each
"identifier" within a certain "region" (I have approx 4000 identifiers
and 40 regions), I basically want the average value of "var2" for the
five closest values of "var1" based on value. I tried the following
code below and it seems to "almost" solve my problem (although it is
most likely inefficient since I am creating 16 variables and then
using -reshape- to get what I want). My concern is that it does not
handle the possibility that "var1" may be the same with a given
"region". There are 3 instances when "var1" has the same value twice
within a given "region". In this case, I want the average "var2" to be
taken of the two observations. I am hence using values from 1 to 8 in
the loop below to capture the possibility that all three duplicates
may occur for a given observation (I would normally use 1 to 5 if
there were no duplicates of "var1"). I am not clear on how to capture
the average value of "var2" if "var1" is the same within a given
"region" and would appreciate any help.
-----------------------begin code ----------------------
forv i=1/8 {
bys region (var1): gen diffvar1minus`i'=var1-var1[_n-`i']
bys region (var1): gen diffvar1plus`i'=abs(var1-var1[_n+`i'])
bys region (var1): gen var2minus`i'=var2[_n-`i']
bys region (var1): gen var2plus`i'=var2[_n+`i']
}
reshape long diffvar1 var2, i(identifier) j(minusplus5, string)
drop if diffvar1==.
bys identifier (diffvar1): gen id1=_N
drop if id1<5
bys identifier (diffvar1): gen id=_n
keep if id<=5
drop id id1
bys identifier: egen avgvar2=mean(var2)
-------------------------end code----------------------------
Thanks,
Joe
On Tue, Sep 18, 2012 at 3:47 PM, Austin Nichols <[email protected]> wrote:
> Joseph Monte <[email protected]>:
> The best way to approach this depends on the data size and structure.
> If you have easy data like below, you can -cross- and compute
> directly; for a large dataset, you may want to loop over observations
> (cf. e.g. http://www.stata.com/statalist/archive/2007-10/msg00346.html).
> To loop over observations and sort repeatedly by distance based on one
> or more variables, it will behoove you to create a numeric id
> corresponding to the obs number at the outset, so you can re-sort when
> you are done with each iteration of the loop, which will make it easy
> to refer to a specific observation. Something like:
>
> clear all
> input str1 reg v1 v2
> A 3.29515 47
> A 5.39742 38
> A 7.94641 43
> A 11.25495 235
> A 22.35908 61
> A 27.19206 76
> A 41.03306 66
> A 45.56846 89
> A 53.63861 116
> A 73.2925 76
> A 104.3025 63
> A 229.7772 74
> A 634.0973 61
> A 1053.78 80
> A 1163.681 47
> B 2.339128 55
> B 2.378151 46
> B 9.831361 47
> B 15.83442 57
> B 16.48956 42
> B 28.70144 44
> B 56.01777 29
> B 113.9736 103
> B 178.731 47
> B 340.715 103
> C 0.5892565 44
> C 2.016974 37
> C 3.041719 76
> C 4.009228 80
> C 5.856674 51
> C 7.587287 188
> C 8.827202 66
> C 11.53763 48
> C 11.67932 152
> C 11.86612 51
> C 12.95344 84
> C 14.85097 63
> C 17.12918 47
> C 17.74263 67
> C 17.97567 75
> C 20.60005 84
> C 22.13938 44
> C 28.99966 44
> C 31.23538 55
> C 31.52542 36
> end
> g long id=_n
> g double m=.
> forv i=1/`=_N' {
> sort id
> g d=(v1-v1[`i'])^2
> g noti=_n==`i'
> loc mr=reg[`i']
> bys noti reg (d): g f5=(_n<6) if reg=="`mr'"¬i==0
> qui count if f5==1
> if r(N)==5 {
> su v2 if f5==1, mean
> replace m=r(mean) if id==`i'
> }
> drop d noti f5
> }
> sort id
> list, noo
>
>
> On Mon, Sep 17, 2012 at 12:34 PM, Joseph Monte <[email protected]> wrote:
>> Dear Statalisters,
>>
>> The data below shows three variables:- region, var1 and var2. For each
>> observation in a given region, I want the 5 closest observations based
>> on var1 (not counting the observation in question). I basically need
>> the average value of var2 for the 5 observations that are identified.
>> I don't have any missing values in my data for all three variables
>> below. I can also confirm that I have a few regions with less than 6
>> observations each; hence these regions will be ignored. I am using
>> Stata 12.
>>
>> Thanks,
>>
>> Joe
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/