Title | True and false in Stata |
Author | Nicholas J. Cox, Durham University, UK |
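Most computer languages have some way of indicating and working with what is true and what is false, but not all languages choose exactly the same way. Stata follows two rules, the second of which may be considered a generalization of the first. First, logical or Boolean expressions evaluate to 0 if false and to 1 if true. Second, logical or Boolean arguments, such as the argument to if or while, may take any numeric value, not just 0 or 1; 0 is treated as false and any other numeric value as true. We will look at each rule in turn.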
First, consider the results of logical or Boolean expressions. (George Boole worked on logic and probability in the nineteenth century. For more about George Boole, see http://www-history.mcs.st-and.ac.uk/~history/Mathematicians/Boole.html.) In Stata, these expressions use one or more of various relational and logical operators. The operators ==, ~=, !=, >, >=, <, and <= are used to test equality or inequality; ~= and != are synonyms for "not equal to". The operators &, |, ~, and ! are used to indicate "and", "or", and "not". It is a matter of taste whether you use ~ or ! to indicate negation. In this FAQ, we use !. If you want to learn more about any of these, see help operators.
For example, in the auto dataset, the expression foreign == 1 will be true for those observations where the variable foreign is 1 and false otherwise. The double equal sign == is used whenever you wish to test for equality; compare the use of the single equal sign = for assignment. As a second example, the expression 2 == 2 is always true. That may not seem helpful or instructive, but below we will see a use for expressions that are necessarily always true. More complicated expressions can readily be constructed: foreign == 1 & rep78 == 4 will be true whenever foreign == 1 and rep78 == 4. Typing
. count if foreign == 1 & rep78 == 4
shows that there are nine such cars in the auto dataset. (Incidentally, the count command may seem trivial, yet it is a simple way of getting answers to some basic questions about your data.)
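As a quick check of the operators just described, the following do-file sketch counts the same observations in several equivalent ways. It assumes the auto dataset shipped with Stata, loaded with sysuse, and relies on count leaving its result in r(N). The local macro names (n_and, n_ne1, and so on) are illustrative only.

```stata
sysuse auto, clear

* foreign cars with a repair record of 4
count if foreign == 1 & rep78 == 4
local n_and = r(N)

* != and ~= are synonyms, as are ! and ~
count if foreign != 1
local n_ne1 = r(N)
count if foreign ~= 1
local n_ne2 = r(N)
count if !(foreign == 1)
local n_not = r(N)

* all three negated forms select the same observations
assert `n_ne1' == `n_ne2'
assert `n_ne1' == `n_not'

* "or" selects at least as many observations as "and"
count if foreign == 1 | rep78 == 4
assert r(N) >= `n_and'
```

Saving r(N) in a local macro immediately after each count keeps the result available after later commands overwrite r(N).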
Logical expressions have numerical values, which can be immensely useful. In Stata, the rule is that false logical expressions have value 0 and true logical expressions have value 1. Thus logical expressions may be used to generate indicator variables (also often called binary, dichotomous, dummy, logical, or Boolean, depending on tribal jargon), which have values 0 or 1. The command
. generate himpg = mpg > 30
will generate a new variable that is 1 whenever mpg is greater than 30, and 0 otherwise. Two wrinkles should now be mentioned. What if mpg were missing? The rule is that Stata treats numeric missing values as higher than any other numeric value, so missing would certainly qualify as greater than 30, and any observation with mpg missing would be assigned 1 for this new variable. This rule leads to the next wrinkle: typing
. generate himpg = mpg > 30 if mpg < .
would assign 1 if mpg were greater than 30 but not missing; 0 if mpg were not greater than 30; and missing if mpg were missing. The logic is that you did not say what result you wanted if mpg were missing; in the absence of instructions, Stata will shrug its shoulders in the only way it knows, assigning a result of missing. The same logic would apply if you were only interested in domestic cars:
. generate himpg = mpg > 30 if foreign == 0
If foreign were not equal to 0, then the result would be missing. Otherwise, the result would be 1 or 0 according to whether mpg was or was not greater than 30.
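The effect of missing values is easier to see with a variable that actually has some. The sketch below uses rep78 from the auto dataset, which has missing values (mpg has none), and compares an unguarded indicator with a guarded one. The variable names hirep_naive and hirep_safe and the threshold 3 are hypothetical choices for illustration.

```stata
sysuse auto, clear

* how many observations have rep78 missing?
count if missing(rep78)
local nmiss = r(N)

* unguarded: missing counts as greater than 3, so missing becomes 1
generate byte hirep_naive = rep78 > 3

* guarded: observations excluded by the if qualifier are left missing
generate byte hirep_safe = rep78 > 3 if rep78 < .

assert hirep_naive == 1 if missing(rep78)
assert missing(hirep_safe) if missing(rep78)

* the two versions differ only where rep78 is missing
count if hirep_naive != hirep_safe
assert r(N) == `nmiss'
```

Which version is right depends on what missing means in your data; the guarded form simply declines to guess.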
The numerical values of logical expressions also prove useful when we want to count something. Suppose we want to create a new variable in which we will put the frequencies of mpg being greater than 30, by categories of rep78:
. sort rep78
. by rep78: generate nhimpg = sum(mpg > 30)
. by rep78: replace nhimpg = nhimpg[_N]
In the second statement, the function sum() produces a cumulative or running sum of mpg > 30. If mpg > 30, 1 is added to the sum; otherwise, 0 is added. This statement yields a running count of the number of observations for which mpg > 30. In the third statement, we replace the running count with its last value, the total count. This process is all done within the framework of by, for which data must be sorted on rep78, which is done first. Under by:, the generate is carried out separately for each group of rep78. Similarly, the replace is done separately for each group of rep78. (You are also able to save a statement by making use of by..., sort, but that is incidental to the main idea.)
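The remark about saving a statement can be sketched as follows, as a do-file fragment on the auto dataset. bysort is the abbreviated form of by ..., sort; the final assert is only a sanity check added for illustration.

```stata
sysuse auto, clear

* bysort sorts on rep78 and then works group by group
bysort rep78: generate nhimpg = sum(mpg > 30)
by rep78: replace nhimpg = nhimpg[_N]

* every observation in a group now carries that group's total
by rep78: assert nhimpg == nhimpg[1]
```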
As it happens, there is a quicker way to do the above commands with egen:
. egen nhimpg = total(mpg > 30), by(rep78)
The built-in function sum() produces cumulative or running sums, whereas the egen function total() produces just sums.
Here we use the fact that there are no missing values of mpg in the auto dataset. Whenever you know this is true of a variable in your data, you too can ignore the possibility of missing values. A more general method for counting observations greater than some threshold is to use total(varname > threshold & varname < .). That is the better-safe-than-sorry method whenever you want to exclude missing values. (Of course, if missing means in practice "too high to be measured", then you might want to include missing.)
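To see the difference the extra condition makes, here is a sketch that counts on rep78, which does have missing values in the auto dataset, within groups of foreign. The variable names nhirep_naive, nhirep_safe, and nmiss and the threshold 3 are assumptions made for this example.

```stata
sysuse auto, clear

* unguarded count: missing rep78 is treated as greater than 3
egen nhirep_naive = total(rep78 > 3), by(foreign)

* guarded count: missing rep78 is excluded
egen nhirep_safe = total(rep78 > 3 & rep78 < .), by(foreign)

* number of missing values of rep78 in each group
egen nmiss = total(missing(rep78)), by(foreign)

* the unguarded count overstates by exactly the number of missing values
assert nhirep_naive == nhirep_safe + nmiss
```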
Now consider what happens if you type something like
. list mpg if foreign == 1
Stata lists mpg for those observations for which foreign is equal to 1 (and does not list it otherwise). A more long-winded explanation is that Stata lists mpg whenever the logical expression foreign == 1 is true, that is, evaluates to 1.
This looks like the same idea in a different form. It is, but there are extra twists. Consider now
. list mpg if foreign
There are no relational or logical operators in sight, but Stata is broad-minded here. It will still try its best to find a way of deciding on true or false; in fact, it will accept any argument that evaluates to a number not 0 as true, and any argument that evaluates to 0 as false. If the mathematical or computer jargon "argument" is new to you, think of it here as indicating whatever is fed to if.
For a numeric variable such as foreign, Stata looks at the values of that variable, and not 0 is treated as true and 0 as false. In other words,
. whatever if foreign
and
. whatever if foreign != 0
are exactly equivalent. This is true for any numeric variable. In practice, it gives you a shortcut whenever you have an indicator variable that takes only the values 0 and 1. The two statements
. list mpg if foreign == 1
. list mpg if foreign
are equivalent in practice in the auto dataset. In the first statement, Stata evaluates the expression foreign == 1 and then executes the action indicated (to list) if and only if the expression is true, that is, evaluates numerically to 1. In the second statement, Stata looks at the values of the variable foreign and then executes the action if and only if the value is a number not 0. In the auto dataset, foreign is not 0 when and only when it is equal to 1, so the two conditions are satisfied by exactly the same observations. Over time, this will save you many keystrokes when you are working with indicator variables, and it will let you type Stata syntax close to the way you are thinking, say, if female or even if !female. (The ! is a way of reversing the choice: ! flips any value not 0 to 0, and any value 0 to 1.) But remember that numeric missings count as not 0, because Stata treats them as greater than any other number, so if female would also select observations for which female is missing.
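The equivalences, and the caution about missing values, can be checked directly. This do-file sketch again assumes the auto dataset; hirep is a hypothetical indicator built from rep78 so that it contains missing values, which foreign does not.

```stata
sysuse auto, clear

* three ways of selecting the foreign cars
count if foreign == 1
local n_eq1 = r(N)
count if foreign
assert r(N) == `n_eq1'
count if foreign != 0
assert r(N) == `n_eq1'

* !foreign selects the complementary group
count if !foreign
assert r(N) == _N - `n_eq1'

* an indicator that is 0, 1, or missing
generate byte hirep = rep78 > 3 if rep78 < .

* "if hirep" also selects observations where hirep is missing
count if hirep
local n_nonzero = r(N)

* "if hirep == 1" does not
count if hirep == 1
assert `n_nonzero' > r(N)
```

With an indicator that may be missing, if hirep == 1 is the safer spelling.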
You can always check, either interactively or in a program, that a variable has only the values 0 and 1 by using assert:
. assert varname == 0 | varname == 1
If varname were equal to any other value, Stata would deny the assertion. If you typed, perhaps by accident,
. list mpg if rep78
you would get a list of mpg for all observations, because rep78 is never 0: it is either 1 to 5 or missing, and missing counts as not 0. It is the same logic.
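In a do-file, a failed assert stops execution, which is usually what you want. If you would rather test and carry on, one possible pattern is to put capture in front and inspect the return code _rc, as in this sketch on the auto dataset.

```stata
sysuse auto, clear

* foreign is a genuine 0/1 indicator, so this passes silently
assert foreign == 0 | foreign == 1

* rep78 is not; capture traps the error so that execution continues
capture assert rep78 == 0 | rep78 == 1
display "return code: " _rc

* rep78 is never 0, so "if rep78" selects every observation
count if rep78
assert r(N) == _N
```

A nonzero return code after capture assert means that the assertion was false for at least one observation.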
If the argument is just a number, the same logic still applies. This can be useful with the if command, as distinct from the if qualifier used above. For example, you could count missing values and take some action only if one or more missing values were present. It can also be useful with the while command, which is more of a programmer's command and which we will illustrate in more detail. while 1 gives you an endless loop: the 1 is arbitrary here, as any number not 0 would do. Presumably, within your otherwise endless loop, you will add some test that gets Stata out of the loop, say, with continue, break (continue on its own merely skips to the next iteration). A related technique is to set a flag and to exit the loop only if and when that flag has been changed:
. local worktodo = 1
. while `worktodo' {
        program statements including setting `worktodo' to 0 when task completed
  }
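Here is a runnable sketch of these ideas as a do-file fragment. The "task" is a stand-in (counting to 3), and the macro names i and j are hypothetical; only the control flow matters. Note that continue, break is the form that leaves a loop.

```stata
sysuse auto, clear

* if command: act only when the count of missing values is not 0
count if missing(rep78)
if r(N) {
    display "rep78 has " r(N) " missing values"
}

* flag-controlled loop: stop once the task is done
local worktodo = 1
local i = 0
while `worktodo' {
    local i = `i' + 1
    if `i' == 3 {
        local worktodo = 0
    }
}
display "flag loop ran `i' times"

* otherwise endless loop, left with continue, break
local j = 0
while 1 {
    local j = `j' + 1
    if `j' == 3 {
        continue, break
    }
}
display "endless loop ran `j' times"
```

The flag version finishes the current pass through the loop body before stopping, whereas continue, break leaves at once.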
Finally, if you were to supply, perhaps by accident, the name of a string variable or a text string as an argument to if or while, there would be an error message, as Stata cannot interpret either as a numeric argument. Only numeric arguments can be considered true or false.
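A small sketch of that last point, using the string variable make from the auto dataset. capture is used here only so that the fragment runs to the end; without it, Stata would stop with an error message.

```stata
sysuse auto, clear

* make is a string variable, so it cannot serve as a true-or-false argument
capture list mpg if make
display "return code: " _rc

* a comparison on a string variable yields a number, so this is acceptable
count if make != ""
```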