-heckman- and -zip- both try to deal with too many zeroes (so does
-tobit-, but it imposes too many assumptions, even though it was
originally developed for expenditure models). -zip- says that, for
some reason, there is a probability of hitting zero before the rest
of the Poisson kicks in. -heckman- says that there is selection, with
(unobserved) utility functions at work. Selection models have more of
a behavioral flavor, while zip models are more descriptive, if not
population-averaged, in nature: they do not try to explain why
certain people did or did not participate in <whatever>.
Arguably, you can apply a model similar to Heckman's to hospital
expenditure, too: if a person does not have (good enough) insurance,
they may not be able to afford hospitalization, and choose not to go.
If the (total discounted) budget is less than the predicted hospital
bill, then we observe zero hospitalization costs. So there is a
similar utility / budget interplay, and arguably the (inverse) Mills'
ratio does belong in the linear regression part.
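
As a rough sketch of that selection setup, assuming the variables from
the original post at the bottom of this thread (care as the 0/1
indicator, cost as the outcome, and the globals $xvar and $zvar already
defined), something like the following could be tried. The log
transform, the names lncost and imr, and the -twostep- option are
choices made here for illustration only, and $xvar should ideally
contain at least one variable that is not in $zvar:

```stata
* Outcome is observed only for those who sought care (assumed coding: care is 0/1)
generate double lncost = ln(cost) if care == 1

* Heckman two-step: probit for care, then a regression of lncost that
* includes the selection term; its coefficient is reported as "lambda"
heckman lncost $zvar, select(care = $xvar) twostep

* Nonselection hazard (inverse Mills' ratio) implied by the selection equation
predict double imr, mills
```

If lambda is far from zero, that would be some evidence that the
selection story matters for the cost equation, subject to the usual
normality assumption.
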
Alternatively one can say that there are healthy people and sick
people -- the former are spending zero on hospitals, and others spend
some non-zero amounts, with the implicit assumption of perfect markets
and absence of budget constraints. This does not seem quite right to
me, but I can imagine there are occasions where that's how things
might be working.
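
For comparison, this healthy/sick story is roughly what the two-part
model from the original post encodes. A minimal sketch of the overall
expected cost, again assuming care, cost, $xvar and $zvar as in that
post (p_care, cost_pos and e_cost are made-up names):

```stata
* Part 1: probability of any expenditure
probit care $xvar
predict double p_care, pr

* Part 2: expenditure among those with care == 1; -predict- without an
* -if- restriction fills in the linear prediction for every observation
regress cost $zvar if care == 1
predict double cost_pos, xb

* Overall expectation: Pr(care = 1 | x) * E(cost | care = 1, z)
generate double e_cost = p_care * cost_pos
summarize e_cost cost
```

Marginal effects on e_cost need both parts, for instance by changing a
covariate and recomputing e_cost; -mfx- after -regress- alone only
sees the second part.
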
In reality, both things should be at play: "too low" expenditure for
the healthy, and "too high" expenditure for the poor. Ideally both
should be modelled (and neither "true" expenditure is observed), but I
am not aware of any models that are aimed specifically at that.
On Tue, Aug 19, 2008 at 2:55 PM, Austin Nichols <[email protected]> wrote:
> Shehzad Ali <[email protected]>:
> An approach using -heckman- is discussed in the Mullahy ref mentioned
> earlier (http://www.nber.org/papers/t0228), I believe, along with
> -tobit-.
> If the conditional distribution of y seems to fall in two large
> groups, one at zero and one at higher values, with zero density in
> between, there may be more justification for one of the two-part types
> of models where a case is either zero or nonzero, and then the nonzero
> values are determined by a possibly different process.
> If you want to model ln(y) as a function of X, so ln(y) for y=0 is
> missing, then you might prefer -heckman-; if you want to model y as a
> function of X in one of those models, so y=0 is the lower limit, then
> you might prefer -tobit-, but both models incorporate a normality
> assumption that is usually violated in practice... see the Stata
> reference manuals and cited works for more discussion of the
> identifying assumptions.
>
> Presumably your two sets of expenditure data are for the same
> individuals, and exhibit correlated errors, so -nlsur- rather than
> -glm- may be in order.
>
> On Tue, Aug 19, 2008 at 1:33 AM, Shehzad Ali <[email protected]> wrote:
>> Thank you all for your very useful thoughts on this issue.
>> I am running regression on two separate sets of expenditure data: one for
>> general health expenditure which includes all costs including those for
>> self-medication etc., and second for expenditure related to formal health
>> care, including primary and hospital care but excluding self-medication.
>>
>> I agree that the two-part model is not the best option, but is the -heckman-
>> model a reasonable alternative if the selection step is for zero/non-zero
>> expenditure and the outcome step is for positive expenditure? Looking at
>> Austin's argument, I understand that -heckman- runs into a similar problem
>> as the two-part model. Is that right?
>>
>> Shehzad
>> On Aug 18 2008, Austin Nichols wrote:
>>
>>> In expectation? People who have truly zero probability of incurring
>>> hospital costs?
>>>
>>> On Mon, Aug 18, 2008 at 1:08 PM, Lachenbruch, Peter
>>> <[email protected]> wrote:
>>>>
>>>> The problem was about hospitalization costs. These can be true zeros.
>>>>
>>>> Tony
>>>>
>>>> Peter A. Lachenbruch
>>>> Department of Public Health
>>>> Oregon State University
>>>> Corvallis, OR 97330
>>>> Phone: 541-737-3832
>>>> FAX: 541-737-4001
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Austin
>>>> Nichols
>>>> Sent: Monday, August 18, 2008 9:38 AM
>>>> To: [email protected]
>>>> Subject: Re: st: stata code for two-part model
>>>>
>>>> Peter <[email protected]>:
>>>> I think this claim is a bit of a red herring: "use of a continuous
>>>> model for data in which there is a clump of zeros seems incorrect."
>>>> Note that the -glm- approach assumes the mean of y given observables X
>>>> is nonzero, and E(y|X)=exp(Xb), not that observed y is nonzero!
>>>> Including the observations where y=0 is the whole point of the -glm-
>>>> approach--otherwise we would run ols regression of ln(y) on X. And if
>>>> you are claiming that the "true" model for (expected) healthcare
>>>> expenditures does have true zeros that are identifiable, then I
>>>> disagree. Some of your obs may spend nothing on health care (though
>>>> annual spending, including myriad items such as aspirin, is unlikely
>>>> to truly be zero for anyone) but that does not mean their conditional
>>>> mean should be zero. Maybe people who are dead have a conditional
>>>> mean of zero, but they should probably be excluded from the
>>>> analysis...
>>>>
>>>> When spending is measured in discrete dollars, a big clump of people
>>>> who have predicted spending less than 50 cents may have a conditional
>>>> mean of zero measured in the same units as the data. But that does
>>>> not mean their "true" conditional mean is zero.
>>>>
>>>> That said, a demand/expenditure model will have more and more "true"
>>>> (or rounded off) zeros as the category of demand/expenditure gets
>>>> narrower and narrower and the time window over which it is measured
>>>> gets narrower... think aspirin expenditures by week or day... but it
>>>> is not clear to me that a two-part model is the right approach even in
>>>> those cases.
>>>>
>>>> On Mon, Aug 18, 2008 at 11:33 AM, Lachenbruch, Peter
>>>> <[email protected]> wrote:
>>>>>
>>>>> In some instances, the model for healthcare expenditures does have true
>>>>> zeros that are identifiable. In one study I consulted on, the data came
>>>>> from a health insurer, and zeros were people who had not gone to
>>>>> hospital.
>>>>>
>>>>> The use of a continuous model for data in which there is a clump of
>>>>> zeros seems incorrect. There is no transformation that can remove this
>>>>> clump. The severity of the problem depends a bit on the size of the
>>>>> clump. In the hospital insurance data (wanting to estimate
>>>>> hospitalization costs in the policy holders) 95% of the population had
>>>>> no costs. Pretending that these were continuous would lead to some
>>>>> nonsense results. At the present time, I have a data set that has 32
>>>>> out of 145 people with zeros. However, these are not necessarily
>>>>> identifiable since they could be slightly greater than zero. I'm
>>>>> gritting my teeth on this and pretending all is well. However, a
>>>>> histogram shows enormous skewness. I'll probably try a square root.
>>>>>
>>>>> Tony
>>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected]
>>>>> [mailto:[email protected]] On Behalf Of Austin
>>>>> Nichols
>>>>> Sent: Saturday, August 16, 2008 8:50 AM
>>>>> To: [email protected]
>>>>> Subject: Re: st: stata code for two-part model
>>>>>
>>>>> Shehzad Ali et al. --
>>>>> See also
>>>>> http://www.nber.org/papers/t0228
>>>>> The two part models of health expenditures have always struck me as a
>>>>> bad idea; think about how you would get predictions for each indiv in
>>>>> your sample. The "stage 1" probit classifies people as having
>>>>> expenditures or not (some correctly, some not) and then the "stage 2"
>>>>> ols model gives predicted expenditures only for those people who
>>>>> actually have positive expenditures (not those who are classified by
>>>>> the probit as likely to have positive expenditures) unless you predict
>>>>> out of sample. At least one preferred approach of calculating
>>>>> marginal effects by comparing predictions over the whole sample turns
>>>>> out to be practically and analytically difficult in that setting.
>>>>> However, a -glm- with a log link (or equivalently a -poisson-
>>>>> regression) has no trouble: those people with extremely low predicted
>>>>> expenditures would round to zero predicted expenditures if you thought
>>>>> about a survey with expenditures measured discretely in dollars, say.
>>>>> Everyone has E(y)=exp(Xb) and there is no real issue with calculating
>>>>> marginal effects. Once you are in the -glm- framework it is also easy
>>>>> to think about model fit and alternative links...
>>>>>
>>>>> On Sat, Aug 16, 2008 at 3:41 AM, Eva Poen <[email protected]> wrote:
>>>>>>
>>>>>> Shehzad,
>>>>>>
>>>>>> this looks like a hurdle model. Have you searched the SSC archives to
>>>>>> see if someone else has programmed it for you? Have a look at
>>>>>> -hplogit-, for example.
>>>>>>
>>>>>> If you end up doing it yourself, I think you need to do a bit of
>>>>>> programming. In order for -mfx- to work after your estimation, you
>>>>>> need a way of telling it what you want the marginal effects to be
>>>>>> calculated for. In your case, this would be the overall expected cost
>>>>>> of care from your model. The way to feed this to -mfx- is via the
>>>>>> predict(predict_option), but for this to work you need to write a
>>>>>> -predict- command and an estimation command for your model.
>>>>>>
>>>>>> See for example this post:
>>>>>> http://www.stata.com/statalist/archive/2005-10/msg00091.html
>>>>>>
>>>>>> Hope this helps,
>>>>>> Eva
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2008/8/16 Shehzad Ali <[email protected]>:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering if someone can help with stata code for calculating marginal
>>>>>>> effects after two-part models for say, cost of care. Here, first part is a
>>>>>>> probit model for seeking care or not, and the second part is an OLS model of
>>>>>>> cost of care, conditional on decision to seek care. Here is the simplified
>>>>>>> code:
>>>>>>>
>>>>>>> probit care $xvar
>>>>>>>
>>>>>>> reg cost $zvar if care==1
>>>>>>>
>>>>>>> mfx
>>>>>>>
>>>>>>> I understand that mfx after the second part gives us the marginal effects
>>>>>>> for the OLS part only, and not the conditional marginal effects.
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Shehzad
>>>>
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/statalist/faq
>>>> * http://www.ats.ucla.edu/stat/stata/
>>
>> --
>> Shehzad I Ali
>> Department of Social Policy & Social Work
>> University of York
>> YO10 5NG
>> +44 (0) 773-813-0094
>>
--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.