Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: RE: Re: st: Re: st: Re: st: RE: Truncated sample or Heckman selection

From	"Millimet, Daniel" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: Re: st: Re: st: Re: st: RE: Truncated sample or Heckman selection
Date	Fri, 5 Oct 2012 03:09:15 +0000

The same data-generating process and censoring apply even to variables that "cannot" be, say, less than zero.  Suppose we assume labor supply is determined by

Y = xb + e, e ~ N(0, s^2)

But since labor supply cannot be negative, we relabel Y in the above DGP as the latent Y*, which can take any value on the real line.  If we don't relabel Y as Y*, then the bound at 0 has to be imposed in some other way in the assumed DGP.  So, now we have

Y* = xb + e, e ~ N(0, s^2)

But the observed Y = Y* if Y* > 0 and Y = 0 if Y* <= 0.

Basically, the point is that prior to discussing an estimator, you need to be clear on what DGP you assume generates the data such that values below 0 are not feasible.  The latent framework that corresponds to the tobit is one such DGP that models the mass at zero, and is consistent with the observed Y being strictly non-negative.  
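
As a minimal sketch of this latent-variable DGP, the simulation below generates a latent Y*, records Y = 0 whenever Y* <= 0, and then compares OLS with tobit. The intercept (1), the slope (2), the error variance, the sample size, and the seed are illustrative assumptions only and are not taken from any real data:
*-------------------
clear
set obs 1000
set seed 1234
gen x=rnormal(0,1)
gen e=rnormal(0,1)
* latent Y*, which can take negative values
gen ystar=1+2*x+e
* observed Y: equals Y* if Y*>0, and 0 otherwise
gen y=max(ystar,0)
* number of observations piled up at zero
count if y==0
* OLS on the observed Y, then tobit with a lower limit of 0
reg y x
tobit y x, ll(0)
*-------------------

Under these assumptions the tobit likelihood matches the DGP that generated y, so its estimates should be close to the assumed intercept and slope, whereas OLS on the observed y generally will not be.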

****************************************************
Daniel L. Millimet, Professor
Department of Economics
Box 0496
SMU
Dallas, TX 75275-0496
phone: 214.768.3269
fax: 214.768.1821
web: http://faculty.smu.edu/millimet
****************************************************


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joerg Luedicke
Sent: Thursday, October 04, 2012 9:52 PM
To: [email protected]
Subject: st: Re: st: Re: st: Re: st: RE: Truncated sample or Heckman selection

I find it difficult to understand why you would regard a variable as censored when it actually isn't.

Let's assume the outcome variable (y) is income and the predictor variable (x) is years of education. We generate some data for which the expected income for people without education is 500 and, on average, persons earn 300 more per year of education:
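*-------------------
clear
set obs 1000
set seed 1234
* x: years of education; e: normally distributed error term
gen x=rnormal(10,3)
gen e=rnormal(0,20)
* y: income, 500 at x=0 plus 300 per year of education
gen y=500+300*x+e
*-------------------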

Fitting a linear model to these data yields the expected parameters:
*-------------------
reg y x
*-------------------

Now suppose income was only measured exactly for amounts of 3,000 or more, so in this case y is censored from below at a value of 3,000:
*-------------------
gen cy=y
replace cy=3000 if y<3000
*-------------------

If we now fit the simple linear model to these data, the estimates are biased: the slope is attenuated toward zero and the intercept is inflated:
*-------------------
reg cy x
*-------------------

However, if we use the Tobit model, we can again recover the correct parameters:
*-------------------
tobit cy x, ll(3000)
*-------------------
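
To see how much of the sample the censoring affects, one can count the affected observations directly. This small sketch assumes that y and cy from the steps above are still in memory:
*-------------------
* observations whose true income lies below the censoring point
count if y<3000
* the same observations are the ones recorded at the limit in cy
count if cy==3000
*-------------------

The header of the tobit output also reports the number of left-censored observations, which should agree with these counts.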

So the Tobit model makes a lot of sense here and seems useful in an otherwise possibly unpleasant situation, given the censored outcome.
However, if an outcome is simply bounded at zero, as expenditure data are, for example, then it is not censored: a zero is just a zero, no more and no less. So why would it be advisable to use a censored regression model when the outcome is not censored? For me, that would only make sense if, say, the model had some other hidden qualities and generally did well when analyzing bounded data. But this does not even seem to be the case if we consider Austin Nichols' (2010) simulation results for nonnegative skewed data.

Joerg


References:

Nichols, A. 2010. Regression for nonnegative skewed dependent variables. BOS10 Stata Conference 2, Stata Users Group.
URL: http://repec.org/bost10/nichols_boston2010.pdf



On Thu, Oct 4, 2012 at 8:08 PM, Millimet, Daniel <[email protected]> wrote:
> Yes, in my opinion, if you include the zeros, a fractional logit or tobit or censored LAD is appropriate (given the other assumptions implicit in these models).  The only issue is whether some Xs are missing for the zeros.  That you will have to confront yourself if you have Xs you want to include that are missing from some obs.
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Ebru Ozturk
> Sent: Thursday, October 04, 2012 5:32 PM
> To: [email protected]
> Subject: RE: st: Re: st: Re: st: RE: Truncated sample or Heckman selection
>
> Thank you. It will be quite complicated for me to understand this e-mail.
>
> Yes, in my data there is a mass at zero and I include all of those observations. So are you saying that it is a censoring problem and that a tobit regression or a fractional logit model is applicable?
>
> The other issue is the Xs. The Xs that I am interested in are not observed for non-innovator firms, but the other Xs, which I use as control variables, are observed for all firms in the sample.
>
> Ebru
>
> ----------------------------------------
>> From: [email protected]
>> To: [email protected]
>> Subject: RE: st: Re: st: Re: st: RE: Truncated sample or Heckman selection
>> Date: Thu, 4 Oct 2012 22:16:57 +0000
>>
>> If you include all firms in a model, with a mass at zero, then it is the standard censoring problem. Labor supply models are the classic example. Labor supply has a "natural" lower bound at zero, but one does not use OLS. Typically, tobit models are used, or semiparametric alternatives like censored LAD or symmetric trimmed least squares. See, for example, Wilhelm (OBES, 2008, "Practical Considerations for Choosing Between Tobit and SCLS or CLAD Estimators for Censored Regression Models with an Application to Charitable Giving"). For percentages, even though these variables are by definition between 0 and 1 (or 100), a fractional logit is the most common model, I believe, if there is a mass at either boundary point.
>>
>> So, in your case, if you include the zeros, yes it is a censoring problem.
>>
>> The next issue is which Xs you observe for which observations. If all Xs are observed for all obs (zeros and positive values), then a fractional logit is the answer (or a tobit or one of the above alternatives). If SOME of the Xs are missing for the obs at zero, then you can either (i) drop the zeros and estimate a selection-corrected OLS model (if you ignore the upper limit of 100), or combine the selection correction with a fractional logit/probit model, as long as you are sure the control function term for the correction is correct (this is what some empirical trade papers do when they drop country pairs with zero trade, although it is not recommended); or (ii) include the zeros, in which case you need two different equations for the zeros and the non-zeros, since it sounded like not all Xs are available for the obs at zero. That would be something like a hurdle (zero-inflated) model tailored to your example.
>>
>> ________________________________________
>> From: [email protected] 
>> [[email protected]] on behalf of Ebru Ozturk 
>> [[email protected]]
>> Sent: Thursday, October 04, 2012 4:53 PM
>> To: [email protected]
>> Subject: RE: st: Re: st: Re: st: RE: Truncated sample or Heckman selection
>>
>> Innovation success is heavily left-censored - many firms do not have any market novelties and thus no sales from this type of innovation (Grimpe & Kaiser, 2010).
>>
>> Is that wrong then?
>>
>> I'm really confused now.
>>
>> Ebru
>>
>> ----------------------------------------
>> > Date: Thu, 4 Oct 2012 16:45:59 -0500
>> > Subject: st: Re: st: Re: st: RE: Truncated sample or Heckman selection
>> > From: [email protected]
>> > To: [email protected]
>> >
>> > On Thu, Oct 4, 2012 at 4:34 PM, Ebru Ozturk <[email protected]> wrote:
>> > > For the Tobit regression, the dependent variable is the percentage of total firm sales revenue that is derived from the sales of new products. Therefore, it is censored, as sales of new products can only be zero or positive.
>> > >
>> > This just isn't a censoring problem. Consider having a look at:
>> >
>> > http://en.wikipedia.org/wiki/Censoring_%28statistics%29
>> >
>> > Joerg
>> > *
>> > * For searches and help try:
>> > * http://www.stata.com/help.cgi?search
>> > * http://www.stata.com/support/faqs/resources/statalist-faq/
>> > * http://www.ats.ucla.edu/stat/stata/
