Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: RE: Stata treatment of sort order
From
Andrew Maurer <[email protected]>
To
"[email protected]" <[email protected]>
Subject
RE: st: RE: Stata treatment of sort order
Date
Thu, 6 Mar 2014 21:00:40 +0000
Rich,
Thanks for this reference. This is interesting, since I don't know how Stata could sort datasets without the "`: sortedby'" flag "instantly". Wouldn't the sort on an already sorted set take at least O(n)? (ie: doesn't the program need to loop once and verify that x[i] <= x[i+1] for i from 1 to _N-1?)
Sarah,
Thanks for the response. However, here's an example of an unsorted list, with repeated values of the sort variable, where the final sort order is always the same after --sort x--. This seems like it contradicts the documentation's assertion that, "the ordering of observations with equal values of varlist is randomized". Perhaps "sometimes randomized" would be more appropriate.
****** Begin code ******
clear all
set obs 10
gen id = _n
gen x = 1 in 1/9
replace x = 0 in 10
sort x
****** End code ********
Output is always:
id x
10 0
9 1
8 1
7 1
6 1
5 1
4 1
3 1
2 1
1 1
Andrew Maurer
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Richard Goldstein
Sent: Thursday, March 06, 2014 2:20 PM
To: [email protected]
Subject: Re: st: RE: Stata treatment of sort order
actually, the manual specifically deals with this: "Stata may be
dumb, but it is also fast. It sorts already-sorted datasets instantly,
so Stata's ignorance costs us little." p. 603
Rich
On 3/6/14, 3:14 PM, Sarah Edgington wrote:
> Andrew,
> In the example in your second question you're asking Stata to sort the data
> on a variable on which it is already sorted. In that case I would not
> expect Stata to change the ordering of the data at all, with or without the
> stable option. Even though you're pasting in new data (so Stata has no
> knowledge of the existing sort order) I would expect that the sorting
> algorithm would do some checking of whether the data was already in the
> order you requested. Since it is already sorted in that order, I wouldn't
> expect the data to be changed. Admittedly that's just a guess since I don't
> have any information on how Stata implements sorting, but it would explain
> the behavior.
>
> However, you can see that if the data is NOT already sorted on the variable
> of interest that the sort order does change over multiple sorts. For
> example, using the auto data, try to -sort price- then -sort foreign-. If
> you do this multiple times you'll note that the ordering is different after
> -sort foreign-.
>
> -Sarah
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Andrew Maurer
> Sent: Thursday, March 06, 2014 11:36 AM
> To: [email protected]
> Subject: st: Stata treatment of sort order
>
> Hi Statalist,
>
> I'm wondering if anyone can help explain some details about Stata and
> sorting
>
> First, where does Stata hold information about current sort order? Ie, the
> extended macro function --`: sortedby'-- returns the current sort order.
> However, looking at --char dir-- and --macro dir-- I don't see the
> information there. In particular, I want to overwrite the value, so that
> --`: sortedby'-- will return the value that I insert. One use might be if I
> -infile-, and I already know the sort order of the data, but don't want to
> have to run sort just to populate `: sortedby'. (In --help dta--, I see
> where it's stored in a physical dta file [<sortlist>sortlist</sortlist>],
> but it doesn't explain where it is put in memory.
>
> Second, the help file for sort seems somewhat misleading. --help sort--
> explains, "Without the stable option, the ordering of observations with
> equal values of varlist is randomized." What does "randomized" here mean? I
> interpret it to mean that each residual observation has an equal probability
> of being in any of the slots specified by the sort list (eg that --sort
> var1-- is equivalent to --gen rand = runiform()-- --sort var1 rand-- --drop
> rand-- However, residual sort order doesn't always appear random. For
> example, if I --sysuse auto--, --sort foreign--, then copy the data to
> clipboard, --clear--, then use data editor to paste the data back, and
> finally --sort foreign--, the ordering is always the same as the original
> ordering (ie: the ordering of observations with equal values of varlist was
> /not/ randomized.
>
> Is anyone able to explain these observations?
>
> Thank you,
>
> Andrew Maurer
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/