Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: ambiguity in -if- qualifier
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: ambiguity in -if- qualifier
Date: Mon, 24 Mar 2014 17:06:34 +0000
The reason for your puzzlement is becoming much clearer, so thanks for
providing an example that can be discussed.
Note, however, that your initial word description -- in your first
paragraph -- does not fully match your code example, as your code
example bites for a quite specific reason, which only the code makes
clear.
Naturally, Stata can calculate the previous value of a time series if
the previous observation is present in the dataset, but not otherwise.
(Similar remarks apply to the effects of any time series operator or
subscripting where such imply reaching outside the observations
selected by -if-.)
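A minimal sketch of that point, using the auto data in its shipped sort order. The variable names n, lag1 and lag2 are just for illustration and are not from your code:

```stata
* Lags and subscripts may reach outside the observations selected by -if-.
sysuse auto, clear
* n is an illustrative time index, as in the quoted example
gen n = _n
tsset n
* time-series operator
gen lag1 = L.mpg if foreign
* explicit subscripting
gen lag2 = mpg[_n-1] if foreign
* observation 52 is the last domestic car, 53 the first foreign car
list n foreign mpg lag1 lag2 in 52/54
```

Both new variables are missing for domestic cars, yet the first foreign car still picks up -mpg- from the last domestic car, because that observation remains in the dataset.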
Said differently, -if- selects the observations to be used, but neither
the -if- qualifier nor any other part of the syntax is thereby
prohibited from drawing on information in the rest of the dataset
whenever -if- selects a strict subset.
But the problem here is not that Stata is being ambiguous, or
inconsistent, or incorrect, but that users need to ask for what they
want and want what they ask for.
In your example, which we can all agree to be frivolous, you in effect
carry out a regression on part of a panel and **part of what you
calculate depends on values outside the data used**. That's at best
dubious and at worst meaningless, but either way the decision to do
that is yours, not Stata's.
Otherwise put, it's your code that says "use lagged values for part of
the data" and Stata does what it is told to the best of its ability.
It's a robot and you are its instructor, in this example at least.
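One way to see where your two methods part company is to compare the number of observations each regression uses. This is only a sketch; the scalar names N_if and N_drop are mine, and it assumes auto.dta in its shipped order:

```stata
sysuse auto, clear
gen n = _n
tsset n
* Method (2): -if- qualifier. The lag for the first foreign car
* is taken from the last domestic car, which is still in memory.
regress price L.mpg headroom if foreign == 1
scalar N_if = e(N)
* Method (1): -drop- first. That lag no longer exists, so the
* first foreign car is lost to the regression.
drop if foreign == 0
regress price L.mpg headroom
scalar N_drop = e(N)
display "N with -if-: " N_if "   N after -drop-: " N_drop
```

The -if- version should use all 22 foreign cars and the -drop- version 21, so the difference in results comes from one observation whose lagged value lies outside the selected data.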
I agree with you that people need to think about cases like this.
Indeed, if you look at the help file for -mvsumm- (SSC) you will see
"Remarks" written (by me, as it happens) on this very point in 2005.
There are many other examples. Here is another.
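sysuse auto, clear
* -if- restricts which observations get values; _N still counts every observation in memory
gen mpg2 = mpg/_N if foreign
keep if foreign
* only foreign cars remain, so _N now counts only those
gen mpg3 = mpg/_N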
-mpg2- and -mpg3- are quite different, as _N is the number of
observations in the current dataset: 74 when -mpg2- is created, but 22
once -keep if foreign- has been run.
The only clear rule needed here is to ask for exactly what you want.
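For instance, if what you want is -mpg- divided by the number of foreign cars while the whole dataset stays in memory, you can ask for that count directly. A sketch only; the names nf and mpg4 are illustrative:

```stata
sysuse auto, clear
* count the selected observations explicitly rather than relying on _N
count if foreign
scalar nf = r(N)
gen mpg4 = mpg / nf if foreign
```

This should give the same values as -mpg3- above, without dropping anything.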
Nick
[email protected]
On 24 March 2014 16:36, Yu Chen, PhD <[email protected]> wrote:
> Hi, Nick,
> Suppose I want to do a regression only on foreign cars, using the
> auto.dta data set. I have two possible ways to do that. (1). I can
> -drop- the domestic cars at the beginning and then do the regression.
> This way the regression is performed only on the foreign cars. (2) I
> can use an -if- qualifier in the regression command to restrict the
> sample to foreign cars.
> Do you think these two methods produce the same results?
>
> Try the code below, and you will see that results differ.
>
> Code for method (1).
> sysuse auto,clear
> gen n=_n
> tsset n
> drop if foreign==0
> reg price L.mpg headroom
>
>
> Code for method (2).
> sysuse auto,clear
> gen n=_n
> tsset n
> reg price L.mpg headroom if foreign==1
>
>
> I don't think many people are aware of this issue. So it is important
> to make clear rules for the usage of -if- qualifier.
> I also thank Joe for his help.
>
>
>
>
>
> On Sat, Mar 22, 2014 at 8:09 PM, Nick Cox <[email protected]> wrote:
>> Comments below.
>>
>> Nick
>> [email protected]
>>
>>
>> On 23 March 2014 00:44, Yu Chen, PhD <[email protected]> wrote:
>>> Hi, Nick,
>>> Let me clarify. For any assignment to a new variable, there are two
>>> steps. Step 1, the expression should be evaluated; and Step 2, the
>>> result of the evaluation is assigned to the new variable. My question
>>> is, what is the sample used in each step?
>>> For -generate-, Step 1 uses the full sample. In other words, all
>>> observations, regardless of whether they meet the -if- condition, can be
>>> used. But in Step 2, -generate- uses the subsample that meets the -if-
>>> condition.
>>
>> I don't think this word treatment helps understanding. In your
>> -generate- example two things are happening simultaneously:
>>
>> A. Stata is being instructed to put previous values of -mpg- in a new variable.
>>
>> B. Stata is being instructed to do that only if -foreign- is 1.
>>
>> You are surmising that A is done in a Step 1, which is followed by B
>> in a Step 2. But it makes just as much sense to imagine that Stata
>> works out that the variable should receive non-missing values only
>> when -foreign- is 1 and then works out what they should be. Either
>> way, the result is the same.
>>
>>> However, there may exist such commands that use a subsample in Step 1.
>>> In other words, before the command does anything, the sample is
>>> reduced according to the -if- condition, so all other activities that
>>> the command is going to do are on this reduced sample. It seems to me
>>> that most commands work this way. But I found that -generate- is an
>>> exception. It does not restrict the sample until the last step.
>>> I think this is a little confusing. At least, there is no consistency
>>> in when to restrict the sample.
>>> Thank you.
>>
>> Sorry, but I don't catch your meaning here at all. You've presumably
>> withdrawn your claim about -egen-, so you seem to be offering
>> speculation, but no examples that anyone else can discuss.
>>
>>> On Sat, Mar 22, 2014 at 6:45 PM, Nick Cox <[email protected]> wrote:
>>>> I don't think the one precise example here is puzzling in any sense.
>>>> Previous values of -mpg- are put in a new variable if and only if
>>>> -foreign- is 1. This is calculated observation by observation.
>>>>
>>>> You allude to different behaviour with -egen-. But the help for -egen- explains
>>>>
>>>> "Explicit subscripting (using _N and _n), which is commonly used with
>>>> generate, should not be used with egen; see subscripting."
>>>>
>>>> That may illuminate your puzzlement.
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 22 March 2014 21:26, Yu Chen, PhD <[email protected]> wrote:
>>>>> I think there is some ambiguity in the meaning and usage of the -if-
>>>>> qualifier. Generally, the command is performed on a subset that meets
>>>>> the -if- condition. However, a command may perform many tasks, and the
>>>>> subset for each task is not clear sometimes. For example, for the
>>>>> -generate- command, it seems to calculate the result of the expression
>>>>> on the full sample first, and then that result is assigned to a
>>>>> subsample that meets the -if- condition. However, for the -egen-
>>>>> command, the calculation is performed on a subset that meets the -if-
>>>>> condition, not the full sample, and then that result is assigned to
>>>>> the new variable on that subsample.
>>>>>
>>>>> For example, see the code below.
>>>>>
>>>>> sysuse auto
>>>>> gen mpg2=mpg[_n-1] if foreign==1
>>>>>
>>>>> Notice that observation number 53 has a value of 24 for mpg2. This
>>>>> indicates that the task of taking a lagged value is performed on the
>>>>> full sample first. Otherwise, this value should be missing. But -egen-
>>>>> works differently.
>>>>>
>>>>> There may exist other cases that have similar ambiguities. I would
>>>>> suggest that Stata have a clear rule to address this issue. If the
>>>>> rule is already out there, please tell me.
>>>>> Thank you very much.
>>>>>
>>>>>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/