Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Calculating Euclidean Distance
From: Anthony Laverty <[email protected]>
To: [email protected]
Subject: Re: st: Calculating Euclidean Distance
Date: Fri, 11 Jun 2010 16:55:43 +0100
I certainly don't! You got it in one: I am matching on patient volume
from before the policy changes, using a dummy variable for before and
after they take effect. I think I am indeed heading toward estimating
the effect a few different ways, on the data and in simulations, and
comparing the results, so your code and pointers toward -xtdpd- are
very helpful.
Many thanks,
Anthony
On Fri, Jun 11, 2010 at 3:29 PM, Austin Nichols <[email protected]> wrote:
> Anthony Laverty <[email protected]> :
>
> Well, you certainly don't want to match on your outcome variable, so I
> assume you are matching on patient volumes from the pre period, before
> any policy changes, and maybe you have a dummy t measuring whether a
> particular policy was instituted, and you have an outcome y which is
> patient volume at some later date. Then define x1 to x12 for months 1
> to 12 of the pre period (or whatever months are in the pre period),
> and use -nnmatch- (remembering that you can get x1 to x12 from the
> data structure you outlined via -reshape- to wide form). See also
> -help xtdpd- and related manual entries, if you want to compare to a
> regression taking account of the lagged dep var on the RHS. But
> compare some other approaches:
>
> set seed 1234
> clear
> input str1 hospital time patients
> A 1 456
> A 2 759
> A 3 236
> B 1 214
> B 2 854
> B 3 325
> C 1 250
> C 2 321
> C 3 852
> end
> * make more fake data
> expand 100
> ren patients x
> bys time (hospital): g g=_n
> drop hospital
> replace x=ceil(uniform()*x)
> reshape wide x, i(g) j(time)
> *make a fake treatment corr with observed x
> g byte t=(uniform()<x2/500)
> g y=ceil(x1^2+x2^2/2+x3^2/3+t+rnormal()*10)
> * estimate effect of treatment t with nnmatch or reg
> nnmatch y t x1-x3, met(maha) bias(bias) robust(4)
> reg y t
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> *now parametric propensity score reweighting
> qui logit t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> predict p
> g pw=cond(t,1/p,1/(1-p))
> reg y t [pw=pw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=pw]
> *now nonparametric propensity score reweighting
> forv i=1/3 {
>     xtile z`i'=x`i', nq(4)
> }
> egen np=mean(t), by(z1 z2 z3)
> g npw=cond(t,1/np,1/(1-np))
> reg y t [pw=npw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=npw]
>
> The last, a double-robust approach with nonparametric propensity score
> reweighting, has a variety of proven advantages over alternatives.
> None has sufficient power, but some think they do... you may want to
> design a simulation based on your data and some hypothesized treatment
> effects, to see what seems to have the lowest bias or MSE in your
> design. Or just estimate 10 different ways, and hope you get similar
> answers!
>
>
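A minimal sketch of the kind of simulation suggested above, assuming a made-up design rather than the real hospital data. The program name simfx, the result names b_naive and b_adj, the sample size of 300, and the 200 replications are all illustrative choices, not part of the original code. The true treatment effect is set to 1, which mirrors the "+t" term in the fake-data example above. Only two of the estimators are compared here; others could be added in the same way.

```stata
* Sketch only: hypothetical design, not the real data
capture program drop simfx
program define simfx, rclass
    drop _all
    set obs 300
    * three pre-period volumes (assumed uniform on 1..500)
    forvalues i = 1/3 {
        generate x`i' = ceil(runiform()*500)
    }
    * treatment is more likely where x2 is large; true effect of t is 1
    generate byte t = (runiform() < x2/500)
    generate y = x1^2 + x2^2/2 + x3^2/3 + t + rnormal()*10
    * estimator 1: naive difference in means
    regress y t
    return scalar b_naive = _b[t]
    * estimator 2: regression adjustment with quadratics in x
    regress y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
    return scalar b_adj = _b[t]
end

set seed 1234
simulate b_naive=r(b_naive) b_adj=r(b_adj), reps(200) nodots: simfx

* error and squared error relative to the true effect of 1
foreach v in b_naive b_adj {
    generate double err_`v' = `v' - 1
    generate double sq_`v' = err_`v'^2
}
* means of err_* estimate bias; means of sq_* estimate MSE
summarize err_b_naive err_b_adj sq_b_naive sq_b_adj
```

In this setup the outcome model is exactly quadratic in x, so the regression-adjusted estimator has an easy job; a simulation built from the real data would need a less convenient outcome model to be informative.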
> On Fri, Jun 11, 2010 at 4:43 AM, Anthony Laverty
> <[email protected]> wrote:
>> Fair enough, I didn't really give much away. After the
>> matching I am planning on running a difference-in-differences analysis
>> to assess the effect of policy changes on patient numbers, using
>> the matches as a comparison group. Mahalanobis distance may in fact be
>> an improvement, so I will look that up.
>>
>> Many thanks
>>
>> On Thu, Jun 10, 2010 at 4:50 PM, Austin Nichols <[email protected]> wrote:
>>> Anthony Laverty <[email protected]> :
>>> You didn't give more detail on your problem--what are you going to use
>>> the matches for? Why use the sum of squared differences in each
>>> month, as opposed to, say, the Mahalanobis distance over all months
>>> (-reshape- to have T variables measuring # of patients in each month,
>>> and find the closest 15 obs in the standard deviation metric)? That
>>> would match not only on levels but on seasonal patterns, for example.
>>> Is there a regression you plan to run after matching? You may want to
>>> -findit nnmatch- in that case.
>>>
>>> On Thu, Jun 10, 2010 at 11:30 AM, Anthony Laverty
>>> <[email protected]> wrote:
>>>> Hi Austin
>>>>
>>>> That's helpful, thanks, and good points about my memory considerations
>>>> and perhaps using a log scale.
>>>>
>>>> Unfortunately, what I really want to be able to do is choose a group
>>>> of hospitals (say 15) which are closest in Euclidean distance terms to
>>>> hospital A over all months, rather than just the one closest hospital.
>>>> I was planning to aggregate these for the whole of the time period at
>>>> the end, if that makes things any easier.
>>>>
>>>> In terms of more detail, I'm not sure if it helps to say that this was
>>>> relatively easy to work out in Excel, using a different column for
>>>> each time period, a row for each hospital, and the number of patients
>>>> for each time period in a table. Then it was quite easy to
>>>> work out the distances with a formula subtracting different
>>>> hospitals' numbers from each other, using IF statements to match on
>>>> time. The new data I have is too big for Excel to do this, which is
>>>> why I have turned to Stata (and Statalist).
>>>>
>>>> Thanks for your consideration
>>>>
>>>> Anthony
>>>>
>>>>
>>>> On Thu, Jun 10, 2010 at 2:59 PM, Austin Nichols <[email protected]> wrote:
>>>>> Anthony Laverty <[email protected]> :
>>>>> If you have N hospitals at T points in time, then you will have NTxN
>>>>> squared distances in your variables, and if they are doubles you may
>>>>> well run out of memory long before that, but if all you want is the
>>>>> nearest hospital, then you want one variable per hospital giving the
>>>>> identity of the nearest (over all months, you seem to suggest). You
>>>>> might also want to compute distance on a log scale, or some other
>>>>> metric. With more detail on your problem, you may get a better answer.
>>>>> Nevertheless, this is like what you asked for, I think:
>>>>>
>>>>> clear
>>>>> input str1 hospital time patients
>>>>> A 1 456
>>>>> A 2 759
>>>>> A 3 236
>>>>> B 1 214
>>>>> B 2 854
>>>>> B 3 325
>>>>> C 1 250
>>>>> C 2 321
>>>>> C 3 852
>>>>> end
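>>>>> * number the hospitals 1..N and create one distance variable per hospital
>>>>> egen g=group(hospital)
>>>>> su g, mean
>>>>> loc N=r(max)
>>>>> forv i=1/`N' {
>>>>>     g double d`i'=.
>>>>> }
>>>>> * balance the panel so each month holds N consecutive rows, sorted by hospital
>>>>> levelsof time, loc(ts)
>>>>> fillin time g
>>>>> sort time g
>>>>> g long obs=_n
>>>>> * within each month, the n-th d variable is the squared difference from hospital n
>>>>> qui foreach t of loc ts {
>>>>>     su obs if time==`t', mean
>>>>>     loc n0=r(min)
>>>>>     loc n1=r(max)
>>>>>     forv i=`n0'/`n1' {
>>>>>         loc n=`i'-`n0'+1
>>>>>         replace d`n'=(patients-patients[`i'])^2 if inrange(_n,`n0',`n1')
>>>>>     }
>>>>> }
>>>>> l, sepby(time) noo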
>>>>>
>>>>> On Thu, Jun 10, 2010 at 5:08 AM, Anthony Laverty
>>>>> <[email protected]> wrote:
>>>>>> Dear Statalist
>>>>>>
>>>>>> I have data on patient numbers at various hospitals and am trying to
>>>>>> calculate a new variable which is the Euclidean distance between one
>>>>>> specific hospital (say A) and all of the others, so that I can select
>>>>>> which hospitals had the most similar number of patients across all
>>>>>> months. The data is more or less arranged like this (although it has
>>>>>> a few more columns not of direct interest to this question):
>>>>>>
>>>>>> Hospital  Time  Patients
>>>>>> A         1     456
>>>>>> A         2     759
>>>>>> A         3     236
>>>>>> B         1     214
>>>>>> B         2     854
>>>>>> B         3     325
>>>>>> C         1     250
>>>>>> C         2     321
>>>>>> C         3     852
>>>>>>
>>>>>> So, I want to cycle through each time period and calculate the
>>>>>> squared difference between hospital A and each of the other hospitals
>>>>>> as one new variable.
>>>>>>
>>>>>> Any suggestions greatly appreciated
>>>>>>
>>>>>> Anthony Laverty
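
A minimal sketch of the calculation asked for in the original question, assuming the long layout shown there (one row per hospital and month). It computes the Euclidean distance from one reference hospital to every other hospital over all months and lists the closest few. The local macros ref and k and the variable names refpat, sqdiff, and dist are illustrative, not from the thread; k is 2 here only because the toy data has two other hospitals, where the real problem would use 15.

```stata
* Sketch only: toy data from the original question
clear
input str1 hospital time patients
A 1 456
A 2 759
A 3 236
B 1 214
B 2 854
B 3 325
C 1 250
C 2 321
C 3 852
end

* reference hospital and number of neighbours to keep (both assumptions)
local ref "A"
local k = 2

* copy the reference hospital's count onto every row of the same month
bysort time: egen double refpat = max(cond(hospital == "`ref'", patients, .))

* squared monthly difference, summed within hospital, then square-rooted
generate double sqdiff = (patients - refpat)^2
collapse (sum) sqdiff, by(hospital)
generate double dist = sqrt(sqdiff)

* drop the reference itself and list the k nearest hospitals
drop if hospital == "`ref'"
sort dist
list hospital dist in 1/`k', noobs
```

With the toy data, B comes out closer to A than C does. One caveat: collapse (sum) treats missing values as zero, so a hospital with missing months would look artificially close; on unbalanced data it would be safer to drop or flag incomplete hospitals first.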
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/