Re: st: how to do subsampling in stata
From: Stas Kolenikov <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: how to do subsampling in stata
Date: Fri, 16 Aug 2013 11:09:29 -0500
Laszlo, your code (which is probably right for the problem at hand)
assumes asymptotically normal estimates that converge at the rate of
O(n^{-1/2}) for i.i.d. data. In that situation, subsampling, the
bootstrap and standard inference (OK, maybe with robust/sandwich
corrections) are asymptotically equivalent (although of course the
former two bring small-sample advantages for inference based on
pivotal quantities). I would venture a guess that for quantiles the
bootstrap is more efficient and more stable, as it uses larger sample
sizes within each replicate. The interesting cases for subsampling,
where the bootstrap might fail, are difficult estimators, like the
mode, which has a convergence rate of O(n^{-1/3}), or kernel density
estimates, which have their own weird fractional convergence rates
(http://www.citeulike.org/user/ctacmo/article/575126), or some other
irregular problems
(http://www.citeulike.org/user/ctacmo/article/1318051). If the
bootstrap is known to work (which often hinges on subtle conditions;
see http://www.citeulike.org/user/ctacmo/article/1227040 and
http://www.citeulike.org/user/ctacmo/article/12571645), I would stick
with the bootstrap and wouldn't bother with fancy subsampling.
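(For a statistic with normalizing rate tau_n -- tau_n = n^{1/2} in the
regular O(n^{-1/2}) case, n^{1/3} for the mode -- the subsampling
interval is roughly of the form

    [ theta_hat - c_{1-alpha/2}/tau_n ,  theta_hat - c_{alpha/2}/tau_n ],

where c_q is the q-quantile of tau_b*(theta_b - theta_hat) across the
subsamples of size b; the rate enters both in forming these roots and
in rescaling them back.)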
So at the very least the code must know the rate at which the
statistic in question converges, i.e., the normalization that makes
it asymptotically pivotal. "The basic theorem" 2.2.1 of Politis,
Romano and Wolf's book is very general about the relations between
the sample size, the subsample size, and the normalizing constants
(square roots in your case). The user would have to supply all of
these as options to a prospective -subsampling- command, as I don't
see how the problem can be solved in general: the code must be
agnostic about what it is being applied to in order to be as
generalizable as the existing -bootstrap- code.
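To make that concrete, here is a minimal by-hand sketch of the kind
of thing a user can already script in a do-file. Everything in it --
the dataset, the median as the statistic, b = 20, 500 replications,
and the sqrt rate -- is an illustrative choice, i.e., exactly the
sort of input a prospective -subsampling- command would have to take
as options rather than guess:

* subsampling CI for the median of mpg in auto.dta (illustrative sketch)
sysuse auto, clear
quietly summarize mpg, detail
scalar theta_n = r(p50)            // full-sample estimate (the median)
local n = _N                       // original sample size
local b = 20                       // subsample size -- user-supplied
tempname sim
tempfile results
postfile `sim' double(thetab) using `results', replace
forvalues r = 1/500 {
    preserve
    sample `b', count              // draw b observations WITHOUT replacement
    quietly summarize mpg, detail
    post `sim' (r(p50))            // the estimate on this subsample
    restore
}
postclose `sim'
use `results', clear
* roots tau(b)*(theta_b - theta_n), with the user-supplied rate tau(.) = sqrt(.)
generate double root = sqrt(`b')*(thetab - scalar(theta_n))
_pctile root, p(2.5 97.5)          // r(r1), r(r2): quantiles of the roots
scalar lb = scalar(theta_n) - r(r2)/sqrt(`n')
scalar ub = scalar(theta_n) - r(r1)/sqrt(`n')
display "95% subsampling CI: [" lb ", " ub "]"

None of the choices above can be guessed by general-purpose code,
which is exactly why b and tau(.) would have to be user options.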
-- Stas Kolenikov, PhD, PStat (ASA, SSC)
-- Senior Survey Statistician, Abt SRBI
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer
-- http://stas.kolenikov.name
On Fri, Aug 16, 2013 at 10:37 AM, László Sándor <[email protected]> wrote:
> To make my comments on -bootstrap- more meaningful:
>
> I simply followed Wasserman's blog post and thought I needed to turn the
> quantiles of the estimates in subsamples into a confidence interval by
> modifying the corresponding part of _bs_sum.ado as follows:
>
> GetIntChar _dta[size]   // works only because I save the subsample
>                         // size with the replication data myself
> scalar `size' = r(val)
> GetIntChar _dta[N]      // number of observations in the original data
> scalar `obs' = r(val)
> _pctile `x' if `touse', p(`=`p1'', `=`p2'')
> scalar `p1' = `b_i' - (sqrt(`size')*(r(r2)-`b_i'))/sqrt(`obs')  // lower bound, from the upper quantile r(r2)
> scalar `p2' = `b_i' - (sqrt(`size')*(r(r1)-`b_i'))/sqrt(`obs')  // upper bound, from the lower quantile r(r1)
>
> Note that I had to use both the number of observations in the original
> data and the size of the subsamples. Yes, these cancel out if the two
> are equal. But since -bootstrap- allows the size to differ too, why
> doesn't _bs_sum.ado need a similar adjustment?
>
> It does look intuitive to my eye, without doing any of the math: if you
> draw small subsamples, the estimates are noisier, so perhaps you want to
> scale down the deviations of the quantiles. Or is this wrong even for
> subsampling? Or is sampling with replacement what changes this for the
> bootstrap? (As some bootstrap texts call sampling without replacement a
> version of the bootstrap, I wonder whether this matters that much.)
>
> The third option is that I don't understand what the percentile CI is
> supposed to be. For example, I am pretty sure the higher quantile of the
> statistic should matter for the lower bound, and that comes from the
> higher quantile of the estimates, no? Though of course, this does not
> match my simplest intuition of the CI being the "middle 95%" of the
> estimates as such.
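>
> (Spelling out what I think the snippet above computes, with b the
> subsample size and n the original number of observations: the quantiles
> c_{.025} and c_{.975} of the roots sqrt(b)*(theta_b - theta_n)
> approximate those of sqrt(n)*(theta_n - theta), so
>
>     Pr{ c_{.025} <= sqrt(n)*(theta_n - theta) <= c_{.975} } ~ 0.95
>
> inverts to
>
>     theta in [ theta_n - c_{.975}/sqrt(n) , theta_n - c_{.025}/sqrt(n) ],
>
> which is why the upper quantile of the replicate estimates, r(r2), ends
> up in the lower bound `p1', and vice versa.)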
>
> Thanks for any thoughts,
>
> Laszlo
>
> On Fri, Aug 16, 2013 at 6:10 AM, Nick Cox <[email protected]> wrote:
>> B wasn't well worded. Matching and subsampling are not equivalent or
>> parallel. The matching example is intended to show the kind of user
>> commitment that tends to change StataCorp's mind about what should be
>> supported officially.
>> Nick
>> [email protected]
>>
>>
>> On 16 August 2013 09:15, Nick Cox <[email protected]> wrote:
>>> All your points are valid to me, but
>>>
>>> A. At any users' meeting or Stata conference, people will say "Stata
>>> should support X, which is big in field Y, and that would be really
>>> popular and people would buy Stata just to use that!" Meanwhile, one
>>> is looking round the room and there are puzzled faces and people are
>>> muttering to their friends "What's that? Never heard of it." Mostly,
>>> everyone is right, but there is a long list of desires. (Often X is
>>> really big, or an entire approach.)
>>>
>>> B. A big difference with matching is the evident volume of real
>>> interest, shown as sustained activity over a period of years from the
>>> Stata user community: major user-written programs downloaded
>>> frequently, lots of papers and talks, numerous questions on Statalist.
>>> That is a level of commitment not matched by evident interest in
>>> subsampling. Whether everyone is looking in the wrong direction
>>> remains a good question.
>>>
>>> C. StataCorp is very cautious and slow to react on big statistical
>>> additions, arguably in the user community's best interests.
>>> Statistical science, like anything else, is full of five-year fads,
>>> things transiently popular but dropped abruptly when something else
>>> becomes hot, or people see that they have been oversold. StataCorp
>>> doesn't want to spend massive effort on implementing something that
>>> will be quickly superseded in users' affections. Academics tend to
>>> read papers, form favourable views of something, and think "This is
>>> great and should be implemented now", but StataCorp works on a
>>> different time scale.
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 16 August 2013 02:07, László Sándor <[email protected]> wrote:
>>>> Stas, I am not sure I'm with you on this one.
>>>>
>>>> 1. Subsampling looks much, much easier to implement than other novelties.
>>>> 2. Many if not most people use the bootstrap not because they have
>>>> derived that their estimator is smooth but exactly because they worry
>>>> that something is not quite canonical in their problem or application,
>>>> but hey, they can just bootstrap it. My admittedly limited
>>>> understanding of the difference between the two methods suggests that
>>>> subsampling is the safer bet.
>>>> 3. The original (?) question on Statalist even mentioned that Abadie
>>>> and Imbens tried to warn people that matching is exactly a problem
>>>> where the bootstrap can be problematic, whereas they recommend
>>>> subsampling. With version 13, Stata became a matching powerhouse. Why
>>>> not support this simple thing, then?
>>>> http://www.stata.com/statalist/archive/2009-04/msg00920.html
>>>>
>>>> Best,
>>>>
>>>> Laszlo
>>>>
>>>> On Thu, Aug 15, 2013 at 7:13 PM, Stas Kolenikov <[email protected]> wrote:
>>>>> On Thu, Aug 15, 2013 at 12:12 PM, Phil Schumm <[email protected]> wrote:
>>>>>> On Aug 15, 2013, at 11:45 AM, László Sándor <[email protected]> wrote:
>>>>>>> Or of course, if StataCorp reading this is confident about how easy the transition from -bsample- to -sample- would be for a clone of -bootstrap-
>>>>>>
>>>>>> I'm not familiar with the literature on subsampling, so what I'm about to say may not entirely apply here. However, it is worth noting that a lot of what StataCorp does is not simply implementing estimators and methods, but making sure that the theory behind them is sound, and that the various things users might do once the method is implemented in Stata are reasonable. Thus, even though it might be fairly simple for a user to patch an existing command to accommodate a specific situation (for which they are willing to take full responsibility), it might take StataCorp longer to verify for themselves that the enhancement is really something with which they feel comfortable.
>>>>>
>>>>> Of the many other wonderful theoretical developments in statistics
>>>>> and econometrics, why not (a) empirical likelihood and exponential
>>>>> tilting? (b) block bootstrap for time series? (c) delete-k jackknife
>>>>> for complex survey data? (d) degrees of freedom corrections in mixed
>>>>> models? (e) tetrad analysis in latent variable models? An endless
>>>>> wish list follows. Each of these is well established in its specific
>>>>> literature, but its use is required in a fairly limited range of
>>>>> situations. It took StataCorp about 10 years from seeing the first
>>>>> user-written multiple imputation and generalized linear latent
>>>>> variable and mixed model packages (-ice-/-mim- and -gllamm-, of
>>>>> course) to the production versions (-mi-, -meglm- and -gsem-), and
>>>>> these have three orders of magnitude greater generalizability and
>>>>> potential user base than subsampling (which is really called for in
>>>>> weird situations with non-smooth estimators, so one needs to put in
>>>>> a lot of work even to produce such an estimator) or empirical
>>>>> likelihood (which is asymptotically equivalent to the existing
>>>>> -gmm- anyway).
>>>>>
>>>>> That's a long introduction to say that I would not expect to see
>>>>> StataCorp working on this for the next three or so releases. If
>>>>> Laszlo's needs are more urgent, he should start working on his own
>>>>> implementation of subsampling, as I did with empirical likelihood :).
>>>>>
>>>>> -- Stas Kolenikov, PhD, PStat (ASA, SSC)
>>>>> -- Senior Survey Statistician, Abt SRBI
>>>>> -- Opinions stated in this email are mine only, and do not reflect the
>>>>> position of my employer
>>>>> -- http://stas.kolenikov.name
>>>>>
>>>>
>>
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/