Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Michael Stepner <stepner@mit.edu> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: new package -fastxtile- available in SSC |
Date | Mon, 7 Oct 2013 08:52:24 -0400 |
Thanks for letting me know, David. I'm going to get to the bottom of
this, and release an update that corrects my claims accordingly.

At a first pass in the few minutes I have this morning, it seems to be
a numerical precision issue. I added a -return list- after fastxtile
in your code, and then compared the reported quantile boundaries to
the discrepant observations identified by -list if xt != fxt-. The
observation that "hops the fence" in each case you documented is
identical to one of the quantiles in eight significant digits.

The first thing I'll check is whether this difference is being caused
by xtile/fastxtile using a float where the other uses a double.

Michael

On 7 October 2013 04:14, David Muller <davidmull@gmail.com> wrote:
> Hi Michael,
>
> This looks great, and it is certainly much faster than built in
> -xtile- when operating on a lot of observations!
>
> One thing to note is that -fastxtile- does not necessarily produce
> identical results to -xtile-. This seems to occur for values that are
> essentially equal to a quantile cutpoint:
>
> **************************************
> clear
> set seed 300
> set obs 10
> gen x = rnormal()
> fastxtile fxt = x, nq(6)
> xtile xt = x, nq(6)
> assert xt == fxt
> list if xt != fxt
>
> // And a larger example
> clear
> set obs 10000000
> gen x = rnormal()
> fastxtile fxt = x, nq(6)
> xtile xt = x, nq(6)
> assert xt == fxt
> list if xt != fxt
> **************************************
>
>
> All the best,
> David
>
> On 6 October 2013 23:02, Michael Stepner <stepner@mit.edu> wrote:
>> -fastxtile- is a Stata routine to create a variable of quantile
>> categories. It is now available in the SSC, with thanks to Kit Baum.
>>
>> fastxtile is a drop in replacement for the built-in Stata program
>> xtile. It has the same syntax and produces identical results, but the
>> process has been altered to be more computationally efficient. The
>> difference in running time is substantial in large datasets.
>>
>> fastxtile also has a few added features. It supports computing the
>> quantile boundaries using a random sample of the data, which further
>> increases the speed, but generates approximate quantiles due to
>> sampling error. fastxtile can also create categories based on a
>> user-specified numlist, rather than computing the quantile boundaries
>> itself.
>>
>> For anyone currently using -xtile- with large datasets, -fastxtile- is
>> worth checking out. It has no downside, and runs significantly
>> faster.
>>
>> If you're interested, you can install the program via -ssc install fastxtile-.
>>
>> Best regards,
>> Michael

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
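
Editor's illustration of the float-versus-double hypothesis raised in the reply above. This is a minimal sketch, not a diagnosis of what -xtile- or -fastxtile- actually do internally: the variable names xf and xd and the scalar cut are hypothetical, and 0.1 is used only because it has no exact binary representation. It shows how an observation that matches a cutpoint to about eight significant digits can fall on either side of it depending on the storage type used for the comparison.

```stata
* Assumption: this only demonstrates Stata's float/double behaviour,
* not the internals of -xtile- or -fastxtile-.
clear
set obs 1
generate float  xf = 0.1      // stored with about 7 significant digits
generate double xd = 0.1      // stored with about 16 significant digits

* Expressions are evaluated in double precision, so the rounding
* error in the float copy becomes visible when the two are compared.
display %20.18f xf[1]
display %20.18f xd[1]
count if xf == xd             // counts 0 observations
count if xf == float(xd)      // counts 1: rounding xd to float restores equality

* Hypothetical cutpoint check: the same observation lands on a
* different side of the "fence" depending on the cutpoint's precision.
scalar cut = xd[1]
display (xf[1] <= cut)        // 0: float(0.1) is slightly above 0.1 held as a double
display (xf[1] <= float(cut)) // 1
```

Note that -gen x = rnormal()- in the quoted example creates a float variable by default, so a comparison against a double-precision quantile boundary could in principle behave like the first -display- above.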