Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified

From   Austin Nichols <[email protected]>
To   [email protected]
Subject   Re: st: Identify 5 closest observations of a variable and then calculate average of another variable based on the observations identified
Date   Tue, 18 Sep 2012 10:47:50 -0400

Joseph Monte <[email protected]>:
The best way to approach this depends on the data size and structure.
If you have easy data like below, you can -cross- and compute
directly; for a large dataset, you may want to loop over observations
(cf. e.g.
To loop over observations and sort repeatedly by distance based on one
or more variables, it will behoove you to create a numeric id
corresponding to the obs number at the outset, so you can re-sort when
you are done with each iteration of the loop, which will make it easy
to refer to a specific observation.  Something like:

clear all
input str1 reg v1 v2
A  3.29515    47
A  5.39742    38
A  7.94641    43
A  11.25495   235
A  22.35908   61
A  27.19206   76
A  41.03306   66
A  45.56846   89
A  53.63861   116
A  73.2925    76
A  104.3025   63
A  229.7772   74
A  634.0973   61
A  1053.78    80
A  1163.681   47
B  2.339128   55
B  2.378151   46
B  9.831361   47
B  15.83442   57
B  16.48956   42
B  28.70144   44
B  56.01777   29
B  113.9736   103
B  178.731    47
B  340.715    103
C  0.5892565  44
C  2.016974   37
C  3.041719   76
C  4.009228   80
C  5.856674   51
C  7.587287   188
C  8.827202   66
C  11.53763   48
C  11.67932   152
C  11.86612   51
C  12.95344   84
C  14.85097   63
C  17.12918   47
C  17.74263   67
C  17.97567   75
C  20.60005   84
C  22.13938   44
C  28.99966   44
C  31.23538   55
C  31.52542   36
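* numeric id = original obs number, so the data can be re-sorted each pass
g long id=_n
g double m=.
forv i=1/`=_N' {
 sort id
 * squared distance in v1 from obs i
 g d=(v1-v1[`i'])^2
 * noti is 1 for obs i itself and 0 otherwise
 g noti=_n==`i'
 loc mr=reg[`i']
 * flag the 5 nearest other obs in the same region as obs i
 bys noti reg (d): g f5=(_n<6) if reg=="`mr'"&noti==0
 qui count if f5==1
 * regions with fewer than 6 obs cannot supply 5 neighbors; m stays missing
 if r(N)==5 {
  su v2 if f5==1, mean
  replace m=r(mean) if id==`i'
 }
 drop d noti f5
}
sort id
list, noo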
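
For small data, the -cross- route mentioned at the top might look something like the sketch below. It is only an illustration, not tested against the loop: the names id2, reg2, v1b, v2b, nn, and m_cross are made up here, ties in distance are broken by id2 (so results could differ from the loop where ties exist), and it is meant to be run as a single do-file so the tempfile macros stay defined. Since -cross- forms every pairwise combination, memory use grows with the square of the number of observations.

```stata
* sketch only: assumes reg, v1, v2 are in memory as in the example data
* create id if it does not already exist (an existing id is left alone)
capture g long id=_n
tempfile nbrs means
* save a renamed copy of the data to serve as the candidate neighbors
preserve
keep id reg v1 v2
rename (id reg v1 v2) (id2 reg2 v1b v2b)
save `nbrs'
restore
* pair every obs with every obs, then keep same-region pairs excluding self
preserve
keep id reg v1
cross using `nbrs'
keep if reg==reg2 & id!=id2
g double d=(v1-v1b)^2
* keep the 5 nearest neighbors of each obs; ties broken by id2
bys id (d id2): keep if _n<=5
collapse (mean) m_cross=v2b (count) nn=v2b, by(id)
* obs with fewer than 5 available neighbors get a missing mean
replace m_cross=. if nn<5
save `means'
restore
* attach the means; obs with no neighbors at all stay missing
merge 1:1 id using `means', keepusing(m_cross) nogen
sort id
list, noo
```

If the loop above has also been run, m and m_cross can be compared side by side in the listing.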

On Mon, Sep 17, 2012 at 12:34 PM, Joseph Monte <[email protected]> wrote:
> Dear Statalisters,
> The data below shows three variables:- region, var1 and var2. For each
> observation in a given region, I want the 5 closest observations based
> on var1 (not counting the observation in question). I basically need
> the average value of var2 for the 5 observations that are identified.
> I don't have any missing values in my data for all three variables
> below. I can also confirm that I have a few regions with less than 6
> observations each; hence these regions will be ignored. I am using
> Stata 12.
> Thanks,
> Joe