Re: st: Observations that keep a feature...
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Date: Thu, 23 May 2013 19:24:57 +0100

This is getting very intricate to follow.
As Sarah posted yesterday, more or less, we need examples.
I worry on your behalf that you will have to explain your rules to
somebody reviewing your thesis/dissertation/report/paper and they are
going to ask you why you couldn't use much simpler rules.
Nick
[email protected]
On 23 May 2013 18:43, Miguel Angel Duran Munoz <[email protected]> wrote:
> Nick and Sarah, thanks to your help I've been able to solve all but one of
> my problems. To select agents that are above the threshold after period 2,
> I've finally used:
>
> egen firstperiod = min(period), by(agent)
> drop if firstperiod > 2
> bysort agent (period): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if min_rest >= 0.9
>
> (the max condition that Nick suggested to me is, I think, unnecessary)
>
> Nevertheless, I am not sure how to select agents that cross the
> threshold in the final periods (say at or after t3) and stay above it.
> In principle, based on your suggestions, I thought of this:
>
> bysort agent (period): gen last=score[_N]
> bysort agent (period): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if last>=0.9 & min_rest<=0.9
>
> Nevertheless, this implies that I am excluding agents that satisfy the
> criterion (crossing the threshold at or after t3) but enter the
> sample at an intermediate period.
>
> Could someone please help me solve this? Thanks in advance.
>
> Miguel.
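
One literal reading of the rule "crosses the threshold at or after period p and stays at or above it to the end, whenever the agent enters the sample" can be sketched as follows. This is only a sketch under stated assumptions: the data are long with numeric agent, period and score, there are no missing scores, gaps in an agent's periods are ignored, and lastbelow and start are hypothetical helper names. It relies on the same division trick: dividing by a condition that is 0 yields missing, which -egen- ignores.

```stata
* Sketch only: assumes numeric agent, period, score and no missing scores
local p = 5          // earliest period at which the final crossing may occur
* last period in which the agent was below the threshold (missing if never)
egen lastbelow = max(period / (score < 0.9)), by(agent)
* first observed period of the final run at or above the threshold
egen start = min(period / (period > lastbelow | missing(lastbelow))), by(agent)
* start is missing when the agent ends below the threshold, so exclude it
keep if start >= `p' & !missing(start)
```

Applied to the seven example agents with periods coded 1 to 6, this reading keeps agents 4, 5, 6 and 7, not just 5 and 6, so the rule itself still needs to be stated more precisely.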
>
>> Sarah, thank you for your help. I am very sorry for not having stated my
>> questions clearly enough. And given what you say about the way
>> data is stored I have realized that there might be other problems around.
>> I will try to be as clear as possible.
>>
>> My data is in panel data form. I write the example down again in the way
>> my data is stored. As regards the example in my previous messages, I add
>> two agents (6 and 7). Please note also that data for agent 5
>> is missing in some periods, but there is no line corresponding to those
>> periods (this is what I had not taken into account so far):
>>
>> time agent score
>> t1 1 0.8
>> t2 1 1
>> t3 1 1
>> t4 1 1
>> t5 1 1
>> t6 1 1
>>
>> t1 2 0.8
>> t2 2 0.8
>> t3 2 1
>> t4 2 1
>> t5 2 1
>> t6 2 1
>>
>> t1 3 0.8
>> t2 3 0.8
>> t3 3 0.8
>> t4 3 1
>> t5 3 1
>> t6 3 1
>>
>> t1 4 0.8
>> t2 4 0.8
>> t3 4 0.8
>> t4 4 0.8
>> t5 4 1
>> t6 4 1
>>
>> t6 5 1
>>
>> t1 6 0.8
>> t2 6 0.8
>> t3 6 0.8
>> t4 6 0.8
>> t5 6 1
>> t6 6 1
>>
>> t1 7 0.8
>> t2 7 1
>> t3 7 1
>> t4 7 0.8
>> t5 7 0.8
>> t6 7 1
>>
>> Having said that, I want to split the sample in different ways. First, I
>> want to focus on agents that are above a threshold (eg, 0.9) from the
>> first period and always stay above the threshold (ie, agent 1). Second, I
>> want to take agents that cross the threshold at or before a particular
>> period (eg, t3) and stay above the threshold from then on (ie, agents
>> 1-4). Third, agents that cross the threshold at or after a particular
>> period (eg, t5) and stay above the threshold from then on (ie, agents 5
>> and 6). Please note that agent 7 is not included in any of the previous
>> subsamples.
>>
>> Thank you very much for your help. And once again, I am sorry for not
>> having been clear enough.
>>
>> Miguel.
>>
>>
>>
>>
>>> Miguel,
>>> This discussion would be clearer if your examples actually made it clear
>>> exactly what your data looks like.
>>>
>>> Your example below looks like you have data in wide form. The solution
>>> that Nick suggested is for data in long form. It's easy enough to move
>>> between the two, but it's hard to make concrete suggestions about how to
>>> proceed when we don't know what the actual data looks like.
>>>
>>> I'll start by assuming, as Nick does, that your data is actually in long
>>> form and you have three variables: agent, period, score. I'll further
>>> assume that for agent 5 you simply have no records for periods 1-5 (that
>>> is, you do not have records for those periods with missing values for
>>> score). If that's true, you can simply calculate the first period that
>>> appears in the data and use that as part of your inclusion criteria.
>>> Something like the following will keep only those agents who first
>>> appear in the data at or before period 4:
>>> egen firstperiod=min(period), by(agent)
>>> drop if firstperiod>4
>>>
>>> Or maybe you only want to include agents who start in period 1? It's
>>> unclear from your question. In that case you'd -drop if firstperiod>1-
>>>
>>> For your second example, trying to look at the last time periods, I
>>> think you need to clarify what your actual criteria are. You say "I would
>>> like to select those agents that cross the threshold of 0.9 in any of the
>>> last two periods and are over the threshold until the end of the sample
>>> period (ie, agents 4 and 5)." To my eye, those criteria include all agents
>>> except agent 6. You're unlikely to get the results you hope for unless
>>> you are precise in the criteria you're using.
>>>
>>> Hope that helps.
>>>
>>> -Sarah
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Miguel Angel Duran Munoz
>>> Sent: Wednesday, May 22, 2013 11:00 AM
>>> To: [email protected]
>>> Subject: Re: st: Observations that keep a feature... an additional problem
>>>
>>> I use the same example as in a previous message, but I add a fifth
>>> agent that joins in period six:
>>>
>>>
>>> Agent 1: 1 1 1 1 1 1...
>>> Agent 2: 0.8 1 1 1 1 1...
>>> Agent 3: 0.8 0.8 0.8 1 1 1...
>>> Agent 4: 0.8 0.8 0.8 0.8 1 1...
>>> Agent 5: . . . . . 1...
>>>
>>> I want to keep just the first three agents.
>>>
>>>
>>> If you don't mind, Nick, I would also like to ask you the following. I
>>> take the same example, but I focus on the last periods.
>>>
>>> Agent 1: ...1 1 1 1 1 1
>>> Agent 2: ...0.8 1 1 1 1 1
>>> Agent 3: ...0.8 0.8 0.8 1 1 1
>>> Agent 4: ...0.8 0.8 0.8 0.8 1 1
>>> Agent 5: ... . . . . . 1
>>> Agent 6: ...0.8 0.8 0.8 0.8 1 0.8
>>>
>>> I would like to select those agents that cross the threshold of 0.9 in
>>> any of the last two periods and are over the threshold until the end of
>>> the sample period (ie, agents 4 and 5).
>>> I have tried to modify the commands that you suggested before, but
>>> I have not been able to get the right selection. Would you mind helping
>>> me with this? Thank you very much.
>>>
>>>> I can't follow this. I see only "the rules select too many agents".
>>>>
>>>> You tell me your precise rules and I will try to think of code to
>>>> implement them.
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 22 May 2013 18:16, Miguel Angel Duran Munoz <[email protected]> wrote:
>>>>> Nick, after reducing the sample using your suggestion, I have checked
>>>>> the number of agents that there are per period. And the number is
>>>>> increasing over time. I guess this is due to the fact that agents
>>>>> joining the sample as time goes by and satisfying the requirement of
>>>>> being above the threshold are not excluded. Is there any trick to
>>>>> avoid including them? Thanks again.
>>>>>
>>>>>> Assuming variable names
>>>>>>
>>>>>> agent period score
>>>>>>
>>>>>> it seems that you want something like
>>>>>>
>>>>>> bysort agent (period) : gen first3 = _n < 4
>>>>>>
>>>>>> egen max_first3 = max(score / first3), by(agent)
>>>>>>
>>>>>> egen min_rest = min(score / !first3), by(agent)
>>>>>>
>>>>>> keep if max_first3 > 0.9 & min_rest > 0.9
>>>>>>
>>>>>> For the division trick in the -egen- call see e.g.
>>>>>>
>>>>>> http://www.stata.com/statalist/archive/2013-03/msg00917.html
>>>>>>
>>>>>> (reference included therein).
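
The division trick can be seen on a small example. The following is a sketch only: the toy data are typed in purely for illustration and are not from the thread, while the variable names follow the code above. Dividing by an indicator leaves the value unchanged where the indicator is 1 and yields missing where it is 0, and -egen-'s max() and min() ignore missing values.

```stata
* Sketch only: toy data typed in for illustration, not from the thread
clear
input agent period score
1 1 0.8
1 2 1
1 3 1
1 4 1
2 1 0.8
2 2 0.8
2 3 0.8
2 4 1
end
* indicator for each agent's first three observations
bysort agent (period): gen byte first3 = _n < 4
* score / first3 is score where first3 == 1 and missing where first3 == 0
egen max_first3 = max(score / first3), by(agent)
* !first3 flips the indicator, so only the later observations contribute
egen min_rest = min(score / !first3), by(agent)
list, sepby(agent)
```

Note that an agent with three or fewer observations gets a missing min_rest, and missing counts as larger than any number in a comparison such as min_rest > 0.9.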
>>>>>>
>>>>>> Nick
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>> On 22 May 2013 15:03, Miguel Angel Duran Munoz <[email protected]>
>>>>>> wrote:
>>>>>>> Nick, thanks for your help. I hope you can help me with another
>>>>>>> question. For a similar analysis to that of my first message, assume I
>>>>>>> want to keep those agents that have crossed the threshold before a
>>>>>>> certain period and then have stayed over it for the rest of the sample
>>>>>>> period.
>>>>>>>
>>>>>>> To illustrate the idea, consider the following (data refer to
>>>>>>> consecutive periods and the threshold is, eg, 0.9):
>>>>>>>
>>>>>>> Agent 1: 1 1 1 1 1...
>>>>>>> Agent 2: 0.8 1 1 1 1...
>>>>>>> Agent 3: 0.8 0.8 0.8 1 1...
>>>>>>> Agent 4: 0.8 0.8 0.8 0.8 1...
>>>>>>>
>>>>>>> I want to keep the first three agents because they have crossed
>>>>>>> the threshold before period 4 and then have been over the threshold
>>>>>>> in the rest of the sample period, but I do not want to keep agent 4.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Miguel.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Correct on -keep-. Sorry about that.
>>>>>>>>
>>>>>>>> The -sort- order
>>>>>>>>
>>>>>>>> bysort entity (const_a) :
>>>>>>>>
>>>>>>>> ensures that -const_a[1]- is the lowest for each agent, not the first.
>>>>>>>> If the lowest value for each agent is above the threshold, then
>>>>>>>> all the observations for that agent are above.
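
With the -keep- correction applied, the one-liner can be sketched as below. It assumes the variable names entity and const_a and the threshold 0.097116 from the original post. Missing values of const_a sort last, so they do not become const_a[1] unless every value for that entity is missing, in which case the comparison is true and the entity is kept.

```stata
* Sketch only: entity and const_a are the names used in the original post
* Within each entity, sort by const_a so that const_a[1] is the minimum,
* then keep the whole entity only if that minimum exceeds the threshold
bysort entity (const_a): keep if const_a[1] > 0.097116
```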
>>>>>>>> Nick
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21 May 2013 23:16, Miguel Angel Duran Munoz <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> Thanks, Nick. I guess you mean -keep- instead of -drop-.
>>>>>>>>> Nevertheless, the command that you suggest would not guarantee that I
>>>>>>>>> keep the agents that have been above the threshold for the whole sample
>>>>>>>>> period (ie, I would be including agents that were above the
>>>>>>>>> threshold in the first period and then might have been above or
>>>>>>>>> below it).
>>>>>>>>>
>>>>>>>>>> Sounds like
>>>>>>>>>>
>>>>>>>>>> bysort entity (const_a) : drop if const_a[1] > 0.097116
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>> On 21 May 2013 23:01, Miguel Angel Duran Munoz <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi, Statalisters. I want to focus on agents in my dataset that
>>>>>>>>>>> have a particular feature; specifically, for those agents, and
>>>>>>>>>>> for each and every period (out of 64), the value of a variable
>>>>>>>>>>> (const_a) is larger than a particular threshold (0.097116). I
>>>>>>>>>>> have done what I show below.
>>>>>>>>>>> Nevertheless, I have realized that some of my agents are not in
>>>>>>>>>>> the sample from the first period, so what I am doing would
>>>>>>>>>>> mistakenly eliminate them. Will anyone help to solve this
>>>>>>>>>>> problem? Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> bysort entity (date2): gen obs=_n
>>>>>>>>>>> drop if const_a<0.097116
>>>>>>>>>>> by entity: drop if obs[_N]<64
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> * http://www.ats.ucla.edu/stat/stata/