Re: st: approximate quantiles in Stata
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: approximate quantiles in Stata
Date: Sun, 25 Aug 2013 11:26:22 -0400
Thanks, David.
I think I found a reference on estimating quantiles from a subsample
("downsampling"), though it needed a little clarification. I think I
now see why it is the size of the sample that matters, and not the
sampling rate. See the discussion in this CSTheory answer:
http://cstheory.stackexchange.com/a/18734/17375
Or, staying closer to our stats brethren, I edited my question on Cross
Validated with my current concerns:
http://stats.stackexchange.com/questions/68208/how-should-sampling-ratios-to-estimate-quantiles-change-with-population-size
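
For concreteness, here is a minimal Monte Carlo sketch in Stata of that
point, i.e. that the precision of an estimated quantile depends on the
subsample size n and essentially not on the sampling fraction n/N. The
normal variable x, the two population sizes, the target subsample size
of 5,000, and the scratch file name quantile_sims are all illustrative
assumptions:

clear all
set seed 20130825
tempname sims
postfile `sims' long N double q95 using quantile_sims, replace

foreach N of numlist 1000000 10000000 {
    clear
    quietly set obs `N'
    generate double x = rnormal()    // an artificial "population" of size N
    forvalues rep = 1/100 {
        quietly generate double u = runiform()
        * Bernoulli subsample, expected size 5,000: same n, very different n/N
        quietly _pctile x if u < 5000/_N, percentiles(95)
        post `sims' (`N') (r(r1))    // 95th percentile = top vingtile cut point
        drop u
    }
}
postclose `sims'

use quantile_sims, clear
tabstat q95, by(N) statistics(sd)    // similar SDs despite the 10-fold gap in n/N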
On the point about the tails: I think equal-sized bins provide a good
summary of a distribution, and they work well for binned scatter plots
of correlations. Unequal bins can confuse people just as much, because
of the unequal precision of the bin means (true, bin size is not the
only thing driving that; the tails will also have larger standard
deviations).
But thanks!
On Sun, Aug 25, 2013 at 9:28 AM, David Hoaglin <[email protected]> wrote:
> Laszlo,
>
> In line with the information you received from the CS people, running
> time for a good sorting algorithm is O(N log N) (assuming the data
> are not the worst case). If Stata is sorting the entire sample in
> preparation for extracting quantiles, and N is "tens of millions,"
> that probably explains the long execution times.
>
> The earlier work that I had in mind was by John Chambers (1971, 1977)
> on "partial sorting." From a quick look at the blog post that you
> cited, I think those folks are doing something similar. I don't know
> whether they connect with that earlier work.
>
> If you are dealing with unstructured samples, you should be able to
> get a good approximation of "vingtiles" from a subsample of manageable
> size. (If you know that a sample has structure, you can stratify on
> that structure and then subsample.) If you're prepared to consider
> your large sample as a population, then you would be estimating its
> quantiles from a sample, but not necessarily a small sample. What
> matters is the size of the sample (n), not (usually) the fraction of
> the population (n/N), because the variance of a sample quantile is
> proportional to 1/n. Thus, a sample size of 10,000 or even 1,000
> might be adequate, depending on how accurate you want the estimates to
> be.
>
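
A minimal Stata sketch of that approach, interjected here for
concreteness: the variable name x, the target subsample size of roughly
10,000, and the final cutpoints() step are my illustrative assumptions,
not part of David's suggestion.

* Bernoulli subsample of expected size 10,000 (what matters is n, not n/N)
set seed 20130825
generate byte insub = (runiform() < 10000/_N)

* approximate vingtile cut points from the subsample, returned in r(r1)-r(r19)
_pctile x if insub, nquantiles(20)

* copy the 19 cut points into a variable and bin the full data against them
forvalues i = 1/19 {
    local c`i' = r(r`i')
}
generate double cuts = .
forvalues i = 1/19 {
    quietly replace cuts = `c`i'' in `i'
}
xtile v20 = x, cutpoints(cuts)    // approximate vingtile bins for all N observations

This way only the small subsample enters the percentile calculation, so
the full N observations should never need to be sorted just to form the
bins.
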
> You mentioned 20 bins, but it is often useful to select a set of
> quantiles that puts more emphasis on the tails of the sample and
> distribution. One such set, from Exploratory Data Analysis, consists
> of the median and the pairs (lower and upper) whose tail areas are
> (roughly) powers of 1/2. The tails are often "where the action is."
> More on this topic in another message.
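
For reference, a quick sketch of one such tail-emphasizing set in Stata
(again with a hypothetical variable x; stopping at tail areas of 1/64 is
an arbitrary choice): the median plus the lower and upper pairs with
tail areas 1/4, 1/8, 1/16, 1/32, and 1/64.

_pctile x, percentiles(1.5625 3.125 6.25 12.5 25 50 75 87.5 93.75 96.875 98.4375)
return list    // r(r1)-r(r11) hold the eleven quantiles, from the lower 1/64 tail up

(These fixed percentages only approximate the letter values of
Exploratory Data Analysis, which are defined through depths in the
sorted sample rather than through fixed tail areas.)
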
>
> David Hoaglin
>
>
> Chambers, J.M. (1971). Algorithm 410: Partial sorting. Communications
> of the ACM 14: 357-358.
>
> Chambers, J.M. (1977). Computational Methods for Data Analysis. New
> York: Wiley.
>
> On Sat, Aug 24, 2013 at 7:42 AM, László Sándor <[email protected]> wrote:
>> Thanks, David,
>>
>> I think the typical use case is about tens of millions of
>> observations. (And, since I think it matters for precision: the
>> typical case is about 20 bins, or vingtiles.)
>>
>> FWIW, I also tried to profile -xtile- with the maximum number of
>> observations possible. With Stata 13 running on 64 cores, it took 7
>> hours to generate vingtiles.
>>
>> Laszlo
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/