Re: st: calculating nearest neighbors; looping back to the beginning of observations
From: "David M. Drukker" <[email protected]>
To: [email protected]
Subject: Re: st: calculating nearest neighbors; looping back to the beginning of observations
Date: Thu, 11 Oct 2007 09:54:54 -0500 (CDT)

Sarah Cohodes <[email protected]> asked how to calculate a summary
statistic of a variable for the k nearest neighbors.
Austin Nichols <[email protected]> replied with a good solution.
We put the minindex() function into Mata to handle problems like Sarah's.
Below I outline a possible solution method using the minindex() function in
Mata.
The two advantages of this solution are that it is fast and that the
minindex() function returns exactly what you want, a vector of the indices
of the smallest distances.
I have appended a version of the code for a problem like Sarah's below.
The code
  1) simulates some data;
  2) copies the variables into Mata vectors; and
  3) for each observation,
       a) finds the vector of indices of the closest observations,
       b) extracts the corresponding values from y, and
       c) calculates the mean of those values of y.
To illustrate how the code works, ind, y[ind], and mean(y[ind]) are
displayed. In adapting this code for her own use, Sarah could remove
these display statements.
To keep it simple, the code does not handle ties. Tied distances would
expand the number of indices returned by minindex(), as discussed in
help mata minindex().
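
The effect of ties can be seen on a small vector. What follows is only a sketch: the vector v is made up for illustration, and the description of w (one row per distinct value, giving the starting position in ind and the number of tied indices) is my reading of help mata minindex(), so check it against the help file.

```stata
mata:
v = (3, 2, 3, 2, 3, 3)     // illustrative data; both 2 and 3 are repeated
ind = .                    // initialize ind vector
w = .                      // initialize w matrix
minindex(v, 2, ind, w)     // ask for the 2 smallest values
rows(ind)                  // more than 2, because of the ties
ind'                       // indices, ordered from smallest value up
w                          // per distinct value: start in ind, number tied
end
```

With continuous distances, as in the simulated data, ties should be rare, so ind will normally have exactly the number of rows requested.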
I hope that this helps.
--David
[email protected]
---------------------------Begin example code----------------------------------
version 10
clear all
set seed 12345
set obs 1000
gen x1 = uniform()
gen x2 = uniform()
gen y = invnormal(uniform()) + x1^2 + x2^2
mata:
x1 = st_data(., "x1") // put x1 variable into x1 vector
x2 = st_data(., "x2") // put x2 variable into x2 vector
y = st_data(., "y") // put y variable into y vector
n = rows(x1)
ind = . // initialize ind vector
w = . // initialize w vector
// loop over observations; only the first 3 are used here for
// illustration; change the 3 to n for the full problem
for(i=1; i<=3; ++i) {
    // calculate distances from the i(th) observation
    // (d[i] is 0, so observation i is among its own nearest neighbors)
    d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2)
    // put vector of minimum indices into ind; if no ties, ignore w;
    // if ties, use w to handle them
    minindex(d, 5, ind, w)
    // display ind
    "ind is "
    ind
    // display corresponding values from y
    "y extract is "
    y[ind]
    // calculate mean of appropriate values of y
    mean(y[ind])
}
end
---------------------------End example code----------------------------------
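
The example above displays results for the first 3 observations only, and each observation counts among its own nearest neighbors because its distance to itself is 0. Below is a minimal sketch of the full problem, not part of the original post. It assumes that the observation itself should be excluded, that ties may be broken arbitrarily by keeping the first k indices, and that the result should be stored in a new Stata variable. The names k, nbmean, and ynbmean are hypothetical, and the data are simulated with fewer observations to keep the run short.

```stata
version 10
clear all
set seed 12345
set obs 200
gen x1 = uniform()
gen x2 = uniform()
gen y = invnormal(uniform()) + x1^2 + x2^2
mata:
x1 = st_data(., "x1")
x2 = st_data(., "x2")
y  = st_data(., "y")
n  = rows(x1)
k  = 5                         // number of neighbors (assumed)
nbmean = J(n, 1, .)            // one neighbor mean per observation
ind = .
w   = .
for (i=1; i<=n; ++i) {
    d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2)
    d[i] = .                   // missing is never among the smallest,
                               // so observation i is left out
    minindex(d, k, ind, w)
    m = min((k, rows(ind)))    // ties can make ind longer than k
    nbmean[i] = mean(y[ind[|1 \ m|]])
}
idx = st_addvar("double", "ynbmean")   // new variable for the result
st_store(., idx, nbmean)
end
summarize y ynbmean
```

Keeping only the first k rows of ind is one way to hold the neighborhood size fixed. If all tied observations should count instead, use mean(y[ind]) as in the original example.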