Tony <[email protected]> :
I suppose the model required depends on what question the poster
wishes to answer, but there is no clear advantage of a logit or probit
over a poisson in this case unless (a) you have no interest in the
variation in positive outcomes, or (b) you suspect overdispersion is a
serious issue even conditional on X, in which case you have other
count models to use. Note that heteroskedasticity, measurement error
in the binary outcome, and individual heterogeneity are all a much
bigger deal in the logit/probit world.
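
To make the link to the binary outcome concrete: a poisson model with
a log link implies P(any visit | X) = 1 - exp(-exp(Xb)), so the same
fit also answers the use/non-use question. A minimal sketch in Python
(not Stata), assuming made-up coefficients b0 and b1 and a single
hypothetical regressor x; nothing here is estimated from the poster's
data:

```python
import math

def prob_any_visit(xb):
    """P(Y >= 1 | X) implied by a Poisson model with mean exp(xb)."""
    mu = math.exp(xb)
    # 1 - exp(-mu), written with expm1 to stay accurate when mu is tiny
    return -math.expm1(-mu)

# Hypothetical coefficients, for illustration only
b0, b1 = -3.0, 0.8

for x in (0, 1, 2, 3):
    xb = b0 + b1 * x
    print(f"x={x}: expected visits={math.exp(xb):.3f}, "
          f"P(any visit)={prob_any_visit(xb):.3f}")
```

When the expected count is small, P(any visit) is roughly equal to the
expected count itself, which is why a low mean and a lot of zeros go
together.
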
It may be that having no visits seems different from having one or
more but having one visit also seems different from having two or
more. Where does that reasoning stop? If your expected number of
visits conditional on X is 0.01 then odds are you have no visits this
month; you might have one, but you are very unlikely to have six. If
your expected number of visits conditional on X is 1 then odds are
still good you have no visits this month; you might have one, and you
are not terribly unlikely to have six. The reasoning all gets easier
in a poisson model IMHO.
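
The two cases above can be checked with the poisson probability
P(Y = k) = exp(-mu) * mu^k / k!. A small sketch in Python (not Stata),
using only the two means mentioned above (0.01 and 1) and the counts
0, 1 and 6; the function names are my own:

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson count with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def prob_table(mu, counts=(0, 1, 6)):
    """Probabilities of the listed counts for a given mean."""
    return {k: poisson_pmf(k, mu) for k in counts}

for mu in (0.01, 1.0):
    cells = ", ".join(f"P({k})={p:.3g}" for k, p in prob_table(mu).items())
    print(f"mean {mu}: {cells}")
```

With mean 0.01, about 99% of people have no visits and six visits is
essentially impossible; with mean 1, no visits and one visit are
equally likely and six visits has small but non-negligible
probability.
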
A "preponderance of zeros" just means the conditional mean exp(Xb) is
low, as is to be expected. Often the long right tail is predictable
from various X variables in the data, so conditional on X, the poisson
variance may be closer to correct; if it isn't, you may need a richer
model! Or program up the "Flexible Regression Model for Count Data"
(Kimberly F. Sellers and Galit Shmueli), which allows for both
under- and overdispersion.
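
To illustrate how X can account for apparent excess zeros and a long
tail, here is a sketch in Python (not Stata) with an entirely
hypothetical population: 90% of people have expected visits 0.05 and
10% have expected visits 3, and the count is exactly poisson within
each group. Ignoring the grouping, the data look overdispersed and
zero-heavy relative to a single poisson with the pooled mean:

```python
import math

# Hypothetical groups: (population share, conditional mean of visits)
groups = [(0.9, 0.05), (0.1, 3.0)]

def marginal_mean_var(groups):
    """Pooled mean and variance when Y | group is Poisson(mu)."""
    mean = sum(w * mu for w, mu in groups)
    within = sum(w * mu for w, mu in groups)  # E[Var(Y|X)], Var = mu
    between = sum(w * (mu - mean) ** 2 for w, mu in groups)  # Var(E[Y|X])
    return mean, within + between

def marginal_p_zero(groups):
    """Pooled share of zeros across the groups."""
    return sum(w * math.exp(-mu) for w, mu in groups)

mean, var = marginal_mean_var(groups)
print(f"pooled mean={mean:.3f}, pooled variance={var:.3f}")
print(f"share of zeros={marginal_p_zero(groups):.3f}, "
      f"single-poisson prediction={math.exp(-mean):.3f}")
```

The pooled variance is more than three times the pooled mean and
there are more zeros than one poisson predicts, yet conditional on the
group indicator the poisson variance is exactly right.
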
Above all, why try to implement some kind of selection correction when
you can just avoid the selection in the first place?
On Fri, Jun 5, 2009 at 12:28 PM, Lachenbruch, Peter
<[email protected]> wrote:
> I think the situations may be distinct: having no hospital visits seems different from having one or more. If these are not part of a mixture distribution (i.e., 0 visits is identifiable) one can estimate the probability of a person having 0 visits and then the count of number of non-zero visits. If not identifiable, one can use zero-inflated Poisson or zero-inflated negative binomial.
>
> The problem seems to separate naturally into the two parts. If you want a mean number of visits you can get it, but I'm unsure of the interpretation since there's a fraction that don't have any visits that is greater than that expected under the Poisson model. In one dissertation, a student had 95% zeros and the rest were positive. The idea was to predict costs of hospitalization - this had big implications for insurance companies. In this case, the likelihood of finding hospitalization in a household survey may also have a preponderance of zeros.
>
> Tony
>
> Peter A. Lachenbruch
> Department of Public Health
> Oregon State University
> Corvallis, OR 97330
> Phone: 541-737-3832
> FAX: 541-737-4001
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Austin Nichols
> Sent: Friday, June 05, 2009 9:15 AM
> To: [email protected]
> Subject: Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models
>
> John Ataguba <[email protected]> :
>
> Again, why split the analysis? If you are interested in the count,
> use a count model, and then talk about what the results from that
> model predict about the probability of a nonzero count when you are
> interested in whether people have any visits. You don't seem to have
> any theory requiring "standard logit/probit model" assumptions.
> -poisson- seems the natural starting point.
>
> Why would you drop the zeros when trying to assess how many GP visits
> a person seems likely to make conditional on X? Zero is one possible
> outcome...
>
> On Fri, Jun 5, 2009 at 10:03 AM, John Ataguba <[email protected]> wrote:
>> Hi Austin,
>>
>> Specifically, I am not looking at the time dimension of the visits. The data set is such that I have total number of visits to a GP (General Practitioner) in the past one month collected from a national survey of individuals. Given that this is a household survey, there are zero visits for some individuals.
>>
>> One of my objectives is to determine the factors that predict positive utilization of GPs. This is easily implemented using a standard logit/probit model. The other part is the factors that affect the number of visits to a GP. Given that the dependent variable is a count variable, the likely candidates are count regression models. My fear is with how to deal with unobserved heterogeneity and sample selection issues if I limit my analysis to the non-zero visits. If I use the standard two-part or hurdle model, I do not know if this will account for sample selection in the fashion of the Heckman procedure.
>>
>> I think the class of mixture models (fmm) will be an alternative that I want to explore. I don't know much about them but will be happy to have some brighter ideas.
>>
>> Regards
>>
>> Jon
>>
>>
>> ----- Original Message ----
>> From: Austin Nichols <[email protected]>
>> To: [email protected]
>> Sent: Friday, 5 June, 2009 14:27:20
>> Subject: Re: st: RE: AW: Sample selection models under zero-truncated negative binomial models
>>
>> Steven--I like this approach in general, but from the original post,
>> it's not clear that the timing of first visit or even time at
>> risk is in the data--perhaps the poster can clarify? Also, would you
>> propose using the predicted hazard in the period of first visit as
>> some kind of selection correction? The outcome is visits divided by
>> time at risk for subsequent visits in your setup, so represents a
>> fractional outcome (constrained to lie between zero and one) in
>> theory, though only the zero limit is likely to bind, which makes it
>> tricky to implement, I would guess--if you are worried about the
>> nonnormal error distribution and the selection b
>>
>> Ignoring the possibility of detailed data on times of utilization, why
>> can't you just run a standard count model on number of visits and use
>> that to predict probability of at least one visit? One visit in 10
>> years is not that different from no visits in 10 years, yeah? It
>> makes no sense to me to predict utilization only for those who have
>> positive utilization and worry about selection etc. instead of just
>> using the whole sample, including the zeros. I.e. run a -poisson- to
>> start with. If you have a lot of zeros, that can just arise from the
>> fact that a lot of people have predicted number of visits in the .01
>> range and number of visits has to be an integer. Zero inflation or
>> overdispersion also can arise often from not having the right
>> specification for the explanatory variables... but you can also move
>> to another model in the -glm- or -nbreg- family.
>>
>> On Tue, Jun 2, 2009 at 1:21 PM, <[email protected]> wrote:
>>> A potential problem with Jon's original approach is that the use of
>>> services is an event with a time dimension--time to first use of
>>> services. People might not use services until they need them.
>>> Instead of a logit model (my preference also), a survival model for
>>> the first part might be appropriate.
>>>
>>> With later first-use, the time available for later visits is reduced,
>>> and number of visits might be associated with the time from first use
>>> to the end of observation. Moreover, people with later first-visits
>>> (or none) might differ in their degree of need for subsequent visits.
>>>
>>> To account for unequal follow-up times, I suggest a supplementary
>>> analysis in which the outcome for the second part of the hurdle model
>>> is not the number of visits, but the rate of visits (per unit time at
>>> risk).
>>>
>>> -Steve.
>>>
>>> On Tue, Jun 2, 2009 at 12:22 PM, Lachenbruch, Peter
>>> <[email protected]> wrote:
>>>> This could also be handled by a two-part or hurdle model. The 0 vs. non-zero model is given by a probit or logit (my preference) model. The non-zeros are modeled by the count data or OLS or what have you. The results can be combined since the likelihood separates (the zero values are identifiable - no visits vs number of visits).
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Martin Weiss
>>>> Sent: Tuesday, June 02, 2009 7:02 AM
>>>> To: [email protected]
>>>> Subject: st: AW: Sample selection models under zero-truncated negative binomial models
>>>>
>>>> *************
>>>> ssc d cmp
>>>> *************
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of John Ataguba
>>>> Sent: Tuesday, 2 June 2009 16:00
>>>> To: Statalist statalist mailing
>>>> Subject: st: Sample selection models under zero-truncated negative binomial
>>>> models
>>>>
>>>> Dear colleagues,
>>>>
>>>> I want to enquire if it is possible to perform a ztnb (zero-truncated
>>>> negative binomial) model on a dataset that has the zeros observed in a
>>>> fashion similar to the heckman sample selection model.
>>>>
>>>> Specifically, I have a binary variable on use/non-use of outpatient health
>>>> services and I fitted a standard probit/logit model to observe the factors
>>>> that predict the probability of use. Subsequently, I want to explain the
>>>> factors that influence the number of visits to the health facilities. Since
>>>> these are count data, I cannot fit the standard Heckman model using the
>>>> standard two-part procedure in the Stata command -heckman-.
>>>>
>>>> My fear now is that my sample of users will be biased if I fit a ztnb model
>>>> on only the users, given that I have information on the non-users which I
>>>> used to run the initial probit/logit estimation.
>>>>
>>>> Is it possible to generate the inverse Mills ratio from the probit model
>>>> and include this in the ztnb model? Will this be consistent? etc...
>>>>
>>>> Are there any smarter suggestions? Any reference that has used a similar
>>>> sample selection form will be appreciated.
>>>>
>>>> Regards
>>>>
>>>> Jon
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/