Stata | FAQ: Sorting on categorical variables

Why does my do-file or ado-file produce different results every time I run it?

Title		Sorting on categorical variables
Author		William Gould, StataCorp

There is really no general answer to this question other than that your program has an error in it. There is, however, one common error that even experienced Stata users make:

If you sort on a variable that does not have unique values for every observation in the data and subsequently refer, implicitly or explicitly, to the order within group (say, by referring to _n or _N with by), the results will vary every time you run the file.

Consider the following dataset:

 . sort group

 . list

      +-----------------+
      | group   x1   x2 |
      |-----------------|
   1. |     1    5    7 |
   2. |     1    2    6 |
   3. |     1    3    9 |
   4. |     2    1    2 |
   5. |     2    7    4 |
      +-----------------+

The first value of x1 in the first group is 5. Now let us jumble up these data (we will sort on x1) and then sort the data again by group:

 . sort x1

 . sort group

 . list

      +-----------------+
      | group   x1   x2 |
      |-----------------|
   1. |     1    3    9 |
   2. |     1    5    7 |
   3. |     1    2    6 |
   4. |     2    1    2 |
   5. |     2    7    4 |
      +-----------------+

Before, the first value of x1 in the first group was 5; now it is 3. Why the change? Because group takes on repeated values across observations, we typed sort group, and we did not say how the data should be sorted within group. Because we did not specify, Stata chose an order at random.
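
One way to make the order reproducible is to name a tiebreaker in the sort, or to ask sort to preserve the existing order among ties with its stable option. The following is a minimal sketch using the five observations listed above; it assumes that group and x1 together identify each observation, which isid verifies before the sort.

        clear
        input group x1 x2
        1 5 7
        1 2 6
        1 3 9
        2 1 2
        2 7 4
        end

        * group and x1 jointly identify each observation (isid stops
        * with an error otherwise), so this sort has no ties to break
        isid group x1
        sort group x1
        list

        * alternative: keep the current order among tied values of group
        sort x1
        sort group, stable
        list

With either approach, the first value of x1 in the first group is 2 on every run.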

People have sent us do-files that contain

        ...
        sort patid
        quietly by patid: keep if age[1]>20
        ...

The intent of these lines was to select patients who were older than 20, but that is not what the user got; moreover, the user got a different sample every time the do-file was run. The problem was that patid took on repeated values, so saying sort patid was not enough to specify what the order should be within patid, and age[1] could refer to any one of a patient's records. The user meant to code

        ...
        sort patid age
        quietly by patid: keep if age[1]>20
        ...

or

        ...
        sort patid time
        quietly by patid: keep if age[1]>20
        ...
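
The sort and the by prefix can also be combined with bysort, putting the tiebreaker in parentheses so that the data are sorted on patid and age but grouped on patid alone. The sketch below uses made-up records (the patid and age values are purely illustrative) to show the idea.

        clear
        input patid age
        1 25
        1 19
        1 30
        2 40
        2 22
        end

        * sort on patid and, within patid, on age; form groups on patid only
        quietly bysort patid (age): keep if age[1]>20
        list

Here patient 1 is dropped because that patient's lowest age is 19, and both records for patient 2 are kept. Ties on age within patid do not matter for this condition, because age[1] has the same value whichever tied record comes first.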

Now, pretend that, rather than keeping all patient records, we wanted to just keep the first record.

        ...
        sort patid age
        quietly by patid: keep if _n==1
        ...

Sorting on both patid and age might not be sufficient because each patient might have multiple records with the same age. We would then be selecting one record at random from among each patient's records with the lowest age. If our data included a variable time, and time were unique within patient,

        ...
        sort patid time
        quietly by patid: keep if _n==1
        ...

would be better.
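
Whether time really is unique within patient can be checked rather than assumed. The following is a minimal sketch on made-up records (the variable values are illustrative only): isid stops with an error unless patid and time together identify every observation, so the selection that follows cannot depend on an arbitrary ordering of ties.

        clear
        input patid time age
        1 1 30
        1 2 30
        1 3 31
        2 1 45
        2 2 46
        end

        * stop with an error unless patid and time jointly identify each record
        isid patid time

        * keep the earliest record for each patient
        quietly bysort patid (time): keep if _n==1
        list

Note that patient 1 has two records with age 30, so sorting on patid and age alone would have left the choice between them to chance.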

In other words, be careful. There is nothing wrong with sorting on categorical variables by themselves—sort patid and sort group—just do not assume that the order within the grouping variable is unique. Be especially careful when selecting observations within groups.
