Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: conditional merging
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: conditional merging
Date
Wed, 7 Nov 2012 19:11:17 +0000
I am not planning to implement weights. The point about
nearest-neighbour as I define it is that unknown points get
interpolated with the value of the nearest neighbour with a known
value. I've got to think about ways of handling cases in which two
neighbours tie for nearest.
On Wed, Nov 7, 2012 at 7:03 PM, Ben Hoen <[email protected]> wrote:
> I see. I like the nearest neighbor approach in that one could calculate
> separately a weight of the "interpolation" such that as one interpolated
> values "further" (in time) away from the "known" values their weight would
> decrease.
>
> Thanks for those insights. As always, very interesting & helpful.
>
> I will see if anyone comes forward with a merge idea.
>
> Best,
>
> Ben
>
> Ben Hoen
> LBNL
> Office: 845-758-1896
> Cell: 718-812-7589
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: Wednesday, November 07, 2012 1:25 PM
> To: [email protected]
> Subject: Re: st: conditional merging
>
> I will split this into two:
>
> 0. Interpolation. Carry-forward is crude but has the advantage that
> only legitimate values that occur can be carried forward.
> I decided this morning to write a nearest-neighbour interpolation
> program, which would have the same characteristic, except that the
> nearest neighbour could be later as well as before.
> The program would just be an analogue of -ipolate- and therefore not
> assume spacing in time, but would assume position in one dimension
> (not two).
>
> 1. Merging. I am not a merge-master. There should be others on this
> list who merge day in, day out and can give better advice.
>
> Nick
>
> On Wed, Nov 7, 2012 at 3:37 PM, Ben Hoen <[email protected]> wrote:
>> Thanks Nick.
>>
>> I am not sure there is a standard way that these "condition" values trend
>> over time across the whole dataset, and therefore interpolating them might
>> not be appropriate. Moreover, for each home, there might not be many data
>> points. Finally, the values that are allowable for condition are discreet
>> (non-continuous), and therefore would complicate a linear, cubic, cubic
>> spline process (though, of course that could be dealt with by using
> .=int(x)
>> ). Would the interpolation allow me to take into account all of these
>> characteristics?
>>
>> For, in part, this reason, I was hoping to find some way to execute a
>> "conditional merge" (again, my words). Additionally, the process of
>> learning how one might do it with this "condition" data, could be applied
> to
>> extracting other characteristic data that are also only present
> sporadically
>> across time (e.g., size of the home) but that also might periodically
> change
>> (e.g., the home might be added to).
>>
>> Is there a way to use if/then statements in a merge process?
> Nick Cox
>
>> Carry forward can be as little as one line of code: see
>>
>> FAQ . . . . . . . . . . . . . . . . . . . . . . . Replacing missing
>> values
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J.
>> Cox
>> 2/03 How can I replace missing values with previous or
>> following nonmissing values?
>>
> http://www.stata.com/support/faqs/data-management/replacing-missing-values/
>>
>> I don't see that this is an imputation problem at all. It calls for
>> interpolation. Indeed, have you considered some kind of interpolation,
>> say linear, cubic, cubic spline?
>>
>> On Tue, Nov 6, 2012 at 7:33 PM, Ben Hoen <[email protected]> wrote:
>>
>>> I have two files sales.dta and condition.dta. sales.dta has two
> variables
>>> (home_id saleyear), and condition.dta has three variables (home_id
>>> inspection_year condition). The variable inspection_year can take the
>> vales
>>> of 2000-2011 for any home but for many homes only some years are present
>> (in
>>> many years the home was not inspected. Therefore a sample of the data
>> might
>>> look like:
>>>
>>> home_id inspection_year condition
>>> 50121 2002 4
>>> 50121 2006 4
>>> 50121 2011 3
>>> 50681 2004 2
>>> 50681 2010 3
>>> 51040 2006 2
>>> 51040 2010 2
>>> 51040 2011 3
>>>
>>> I would like to populate the sales.dta file with the condition of the
>> parcel
>>> in the inspection_year that is the closest to, but not following the
>>> sale_year.
>>>
>>> So, for example, the following dataset would result
>>>
>>> home_id sale_year condition
>>> 50121 2007 4
>>> 50121 2011 3
>>> 50681 2008 2
>>> 51040 2003 .
>>> 51040 2010 3
>>>
>>> I know I am not the first person to have this problem, but could not find
>>> threads on this. Maybe I am using the wrong search terms. Any help
> would
>>> be greatly appreciated.
>>>
>>> (As I wrote this I realized one not as elegant work-around would be to
>>> fill-in missing data for each missing year in the condition.dta file,
>>> potentially using the user-written "carryforward" or even imputing the
>> data
>>> using, e.g., mi impute, and then matching home_id sale_year to home_id
>>> inspection_year.)
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/