Title | Multiple operations on data records |
Author | David Kantor, Johns Hopkins University |
I'm a SAS user new to Stata. I have not been able to find any references on how to perform multiple operations on data records if a condition is met.
For example, if I want to reset var1 and var2 based on CONDITION1 and CONDITION2, I've so far only been able to use redundant code:
. replace var1 = 1 if CONDITION1 & CONDITION2
. replace var2 = "Y" if CONDITION1 & CONDITION2
In SAS I would write
if CONDITION1 and CONDITION2 then do;
    var1 = 1;
    var2 = 'Y';
end;
I'd also like to figure out how to nest IF statements. In SAS, I could write
if CONDITION1 and CONDITION2 then do;
    var1 = 1;
    var2 = 'Y';
    if CONDITION3 then var3 = 100;
end;
First, you need to understand the distinction between the if statement and the if qualifier.
The if qualifier is a clause you tack onto a statement or program call, such as in your example:
. replace var1 = 1 if CONDITION1 & CONDITION2
Assuming that CONDITION1 & CONDITION2 involve variables, this operation will apply to a subset of the observations: possibly none, some, or all of them, depending on how the conditions evaluate on the data. Think of the if qualifier as a filter that selects the observations to which the statement applies.
If, on the other hand, CONDITION1 & CONDITION2 are constant (do not depend on variables), then it is still a filter, but you are filtering in either all or none of the observations. See http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for more information. Here it might be better to use an if statement, which will be explained later.
The repetition of if qualifiers you cited is a common practice in Stata, and it is usually not considered a problem. If the condition is complex and you don't want to waste computer time recalculating it for each statement (or risk not typing it exactly the same in each statement), then you would want to capture its values in a new variable. You would do something like the following statements:
. generate byte cond7 = CONDITION1 & CONDITION2
. replace var1 = 1 if cond7
. replace var2 = "Y" if cond7
The if qualifier cannot be nested the way if statements are nested in SAS. In Stata, the equivalent of your nesting example would be, in addition to the statements above,
. replace var3 = 100 if cond7 & CONDITION3
(You may want to drop cond7 later. Or, if your code is in a program or do-file, use a tempvar, and it will be automatically dropped when the program or do-file exits.)
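As a minimal sketch of the tempvar approach in a do-file, the following uses the auto dataset that ships with Stata. The conditions (foreign cars with mpg above 25, then price above 5000) and the name flag are placeholders chosen only for illustration; they stand in for CONDITION1, CONDITION2, and CONDITION3.

```stata
* Sketch only: the dataset, the conditions, and the name flag are illustrative assumptions.
sysuse auto, clear

* Evaluate the shared condition once and store it in a temporary variable
tempvar flag
generate byte `flag' = foreign == 1 & mpg > 25 & !missing(mpg)

* Starting values for the variables to be reset
generate byte var1 = 0
generate str1 var2 = "N"
generate int  var3 = 0

* Each replace reuses the stored condition instead of retyping it
replace var1 = 1   if `flag'
replace var2 = "Y" if `flag'

* The "nested" condition is written as a conjunction
replace var3 = 100 if `flag' & price > 5000 & !missing(price)
```

The temporary variable disappears when the do-file or program finishes, so no drop command is needed.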
The if statement is something entirely different. It controls whether a statement or block of statements gets executed. In this situation, the if keyword is at the beginning of the statement:
if CONDITION4 {
    ... OTHER STATEMENTS ...
}
The condition controlling it usually does not involve variables. (If it does and the variable is not subscripted, then the value in the first observation is taken. It is unlikely that you would really want to do such a thing, though one might code it by mistake.) You can combine several statements under an if statement, but the whole block will either be executed or skipped. I recommend that you see [P] if or help ifcmd. if statements can be nested.
An if statement can optionally be followed by an else statement. (The if qualifier has no corresponding else part, although for assigning values there is something analogous in the cond() function, which will be described below.)
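To make the if statement concrete, here is a small sketch that could be run from a do-file. It assumes the auto dataset that ships with Stata; the local macro nmiss and the variable rep78_miss are names made up for this illustration. The controlling condition is a single number held in a local macro, so it does not depend on any particular observation.

```stata
* Sketch only: the auto dataset and the names nmiss and rep78_miss are illustrative.
sysuse auto, clear

* A condition that does not vary by observation: a count saved in a local macro
quietly count if missing(rep78)
local nmiss = r(N)

if `nmiss' > 0 {
    * Every command in this block runs once, each on the whole dataset
    display "rep78 has `nmiss' missing values"
    generate byte rep78_miss = missing(rep78)
    if `nmiss' > 10 {
        * if statements can be nested
        display "more than 10 values are missing"
    }
}
else {
    display "rep78 has no missing values"
}

* Pitfall: with an unsubscripted variable, only the first observation is tested,
* so this is the same as writing: if foreign[1] == 1
if foreign == 1 {
    display "the first observation is a foreign car"
}
```

The whole block after if is either executed or skipped; nothing here is applied observation by observation.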
Finally, and this is key to understanding the distinction between the if statement and the if qualifier, as well as to the difference between SAS and Stata, be aware that Stata applies each operation, in turn, to the whole dataset, subject to filtering by if qualifiers. Thus in the example above involving the if qualifier, the first replace command is applied to all observations (subject to the filtering imposed by its if qualifier); then the second replace command is applied to all observations (subject to the filtering imposed by its if qualifier). That is why you can't combine several commands under one if qualifier. SAS does it the other way: the whole sequence of statements is executed for the first observation, then the second observation, etc. This is a significant difference.
(Another way to look at this is to note that any statement that applies to the whole set of observations involves an implicit loop that steps through all the observations. In Stata, that loop occurs separately for each statement. In SAS, it surrounds the whole sequence of statements.)
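The difference between the two loop structures can be sketched by writing the SAS-style loop out explicitly in Stata. This is only an illustration, not a recommended way to code: it assumes the auto dataset, placeholder conditions, and made-up variable names, and the explicit loop will be much slower than the ordinary replace commands on large datasets.

```stata
* Sketch only: dataset, conditions, and variable names are illustrative assumptions.
sysuse auto, clear

* Stata style: each command makes its own pass over all observations
generate byte var1 = 0
generate str1 var2 = "N"
replace var1 = 1   if foreign == 1 & mpg > 25
replace var2 = "Y" if foreign == 1 & mpg > 25

* SAS style, spelled out: one pass over the observations, with the
* whole block of commands applied to observation i before moving on
generate byte var1_loop = 0
generate str1 var2_loop = "N"
forvalues i = 1/`=_N' {
    if foreign[`i'] == 1 & mpg[`i'] > 25 {
        quietly replace var1_loop = 1   in `i'
        quietly replace var2_loop = "Y" in `i'
    }
}

* Both approaches give the same result
assert var1 == var1_loop & var2 == var2_loop
```

Note that inside the loop the if statement tests subscripted values, so it refers to one specific observation at a time.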
I also recommend that you look up the cond() function, which can make certain constructs much more compact. As a novice, I would have written the following code:
. generate byte a = 1 if y <= 20
. replace a = 2 if y > 20 & y <= 30
. replace a = 3 if y > 30 & y <= 40
. replace a = 4 if y > 40 & y <.
I now do it this way (though some people would debate whether this is better):
. #delim ;
. generate byte a = cond(y<=20, 1,
                    cond(y<=30, 2,
                    cond(y<=40, 3,
                    cond(y<., 4, . ))));
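As a quick check that the two styles agree, the sketch below builds both versions and compares them. It assumes the auto dataset, uses mpg as a stand-in for y, and sets a few values to missing purely for illustration; the names a1 and a2 are made up. It also uses /// line continuation in a do-file instead of changing the delimiter, which is a stylistic substitution on my part.

```stata
* Sketch only: y is built from mpg in the auto dataset; a1 and a2 are illustrative names.
sysuse auto, clear
generate y = mpg
replace y = . in 1/3          // introduce some missing values on purpose

* Version 1: a sequence of replace commands
generate byte a1 = 1 if y <= 20
replace a1 = 2 if y > 20 & y <= 30
replace a1 = 3 if y > 30 & y <= 40
replace a1 = 4 if y > 40 & y < .

* Version 2: nested cond(), continued across lines with ///
generate byte a2 = cond(y <= 20, 1, ///
                   cond(y <= 30, 2, ///
                   cond(y <= 40, 3, ///
                   cond(y < ., 4, .))))

* The two versions agree, including on observations where y is missing
assert a1 == a2
```

The innermost cond(y < ., 4, .) is what keeps missing values of y from being coded as 4.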
One more thing: beware of missing values in conditions. A missing value used as a condition is taken as true, because any nonzero value counts as true. Missing values also compare as greater than any nonmissing number in comparison operations. See the FAQ: "Why is x > 1000 true when x contains missing value?" for details.
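The following sketch shows the effect, assuming the auto dataset, in which rep78 has some missing values. The cutoff of 4 is arbitrary.

```stata
* Sketch only: the auto dataset and the cutoff 4 are illustrative choices.
sysuse auto, clear

* Missing values of rep78 are counted here, because missing > 4 is true
count if rep78 > 4

* Two equivalent ways to exclude them
count if rep78 > 4 & rep78 < .
count if rep78 > 4 & !missing(rep78)

* A variable used directly as a condition is true when nonzero,
* so missing values pass this filter as well
count if rep78
```

Comparing the first count with the second shows how many observations were included only because rep78 was missing.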