Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: conditional merging |
Date | Wed, 7 Nov 2012 19:11:17 +0000 |
I am not planning to implement weights. The point about nearest-neighbour as I define it is that unknown points get interpolated with the value of the nearest neighbour with a known value. I've got to think about ways of handling cases in which two neighbours tie for nearest. On Wed, Nov 7, 2012 at 7:03 PM, Ben Hoen <bhoen@lbl.gov> wrote: > I see. I like the nearest neighbor approach in that one could calculate > separately a weight of the "interpolation" such that as one interpolated > values "further" (in time) away from the "known" values their weight would > decrease. > > Thanks for those insights. As always, very interesting & helpful. > > I will see if anyone comes forward with a merge idea. > > Best, > > Ben > > Ben Hoen > LBNL > Office: 845-758-1896 > Cell: 718-812-7589 > > > -----Original Message----- > From: owner-statalist@hsphsun2.harvard.edu > [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox > Sent: Wednesday, November 07, 2012 1:25 PM > To: statalist@hsphsun2.harvard.edu > Subject: Re: st: conditional merging > > I will split this into two: > > 0. Interpolation. Carry-forward is crude but has the advantage that > only legitimate values that occur can be carried forward. > I decided this morning to write a nearest-neighbour interpolation > program, which would have the same characteristic, except that the > nearest neighbour could be later as well as before. > The program would just be an analogue of -ipolate- and therefore not > assume spacing in time, but would assume position in one dimension > (not two). > > 1. Merging. I am not a merge-master. There should be others on this > list who merge day in, day out and can give better advice. > > Nick > > On Wed, Nov 7, 2012 at 3:37 PM, Ben Hoen <bhoen@lbl.gov> wrote: >> Thanks Nick. >> >> I am not sure there is a standard way that these "condition" values trend >> over time across the whole dataset, and therefore interpolating them might >> not be appropriate. Moreover, for each home, there might not be many data >> points. Finally, the values that are allowable for condition are discreet >> (non-continuous), and therefore would complicate a linear, cubic, cubic >> spline process (though, of course that could be dealt with by using > .=int(x) >> ). Would the interpolation allow me to take into account all of these >> characteristics? >> >> For, in part, this reason, I was hoping to find some way to execute a >> "conditional merge" (again, my words). Additionally, the process of >> learning how one might do it with this "condition" data, could be applied > to >> extracting other characteristic data that are also only present > sporadically >> across time (e.g., size of the home) but that also might periodically > change >> (e.g., the home might be added to). >> >> Is there a way to use if/then statements in a merge process? > Nick Cox > >> Carry forward can be as little as one line of code: see >> >> FAQ . . . . . . . . . . . . . . . . . . . . . . . Replacing missing >> values >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. >> Cox >> 2/03 How can I replace missing values with previous or >> following nonmissing values? >> > http://www.stata.com/support/faqs/data-management/replacing-missing-values/ >> >> I don't see that this is an imputation problem at all. It calls for >> interpolation. Indeed, have you considered some kind of interpolation, >> say linear, cubic, cubic spline? >> >> On Tue, Nov 6, 2012 at 7:33 PM, Ben Hoen <bhoen@lbl.gov> wrote: >> >>> I have two files sales.dta and condition.dta. sales.dta has two > variables >>> (home_id saleyear), and condition.dta has three variables (home_id >>> inspection_year condition). The variable inspection_year can take the >> vales >>> of 2000-2011 for any home but for many homes only some years are present >> (in >>> many years the home was not inspected. Therefore a sample of the data >> might >>> look like: >>> >>> home_id inspection_year condition >>> 50121 2002 4 >>> 50121 2006 4 >>> 50121 2011 3 >>> 50681 2004 2 >>> 50681 2010 3 >>> 51040 2006 2 >>> 51040 2010 2 >>> 51040 2011 3 >>> >>> I would like to populate the sales.dta file with the condition of the >> parcel >>> in the inspection_year that is the closest to, but not following the >>> sale_year. >>> >>> So, for example, the following dataset would result >>> >>> home_id sale_year condition >>> 50121 2007 4 >>> 50121 2011 3 >>> 50681 2008 2 >>> 51040 2003 . >>> 51040 2010 3 >>> >>> I know I am not the first person to have this problem, but could not find >>> threads on this. Maybe I am using the wrong search terms. Any help > would >>> be greatly appreciated. >>> >>> (As I wrote this I realized one not as elegant work-around would be to >>> fill-in missing data for each missing year in the condition.dta file, >>> potentially using the user-written "carryforward" or even imputing the >> data >>> using, e.g., mi impute, and then matching home_id sale_year to home_id >>> inspection_year.) * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/