Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: bysort, forvalues and _N

From	KOTa <[email protected]>
To	[email protected]
Subject	Re: st: bysort, forvalues and _N
Date	Sat, 26 Jan 2013 11:52:05 +0200

thank you

C.

On Sat, Jan 26, 2013 at 2:39 AM, Nick Cox <[email protected]> wrote:
> This is something I have in draft form:
>
> Often dates for transactions or events come in pairs, say start and
> finish or open and close. People arrive for an appointment at a clinic
> and later leave; people order goods, which are later delivered; people
> start work for an employer, and later leave.  In all cases, when we take
> a snapshot, we may find people still in the clinic, goods not yet
> delivered, or (it is to be hoped) people still working for the employer.
>
> With such events, it is natural that each pair of events is often
> recorded as an observation (case, row, or record) in a dataset. Such a
> structure makes some calculations easy. For exanple, the differences
> between arrivals and departures or orders and deliveries are key to
> system performance, although the ideal will be short delays for seeing
> patients or selling goods and long delays for periods of employment or
> lifetimes. In Stata with two variables for (say) -arrival- and -depart-,
> the delay is just
>
> . gen lapse = depart - arrival
>
> Precisely how we record time will depend on the problem, but here I am
> imagining a date or date-time or time variable.
>
> If the closing events have yet to happen, then -depart- may need to
> be recorded as missing. If so, -lapse- will in turn be missing.
> Notice, by the way, a simple data quality check that the time lapse can
> never be negative. Time lapses recorded as zeros might also need
> checking in some situations; was the system really that fast and
> efficient (or ruthless)?
>
> Such a simple data structure -- one observation for each transaction --
> may also be awkward, which leads to the main reason for this tip.
> Experience with data structures in Stata might lead readers to suggest a
> -reshape long-, which could be a good idea, but there is an easier
> alternative, to use -expand-.
>
> We need first a unique or distinct identifier for each transaction,
> which may already exist. The command -isid- allows a simple check of
> whether an identifier variable indeed matches its purpose. If the
> identifier variable is broken or non-existent, then something like
>
> . gen long id = _n
>
> creates a new identifier fit for purpose. Specifying a -long-
> variable type allows over 2 billion distinct positive identifiers,
> should they be needed. Otherwise we use the existing identifier. Then
>
> . expand 2
>
> is the abracadabra needed. We turn each observation into two. The new
> observations are added at the end of the dataset, so we need to sort
> them before we can create two new variables that are the keys to other
> calculations.
>
> . bysort id: gen time = cond(_n == 1, arrival, depart)
>
> Each distinct value of the identifier now occurs precisely twice. We can
> therefore use the framework provided by -by:-. See Cox (2002) for a
> tutorial if desired. Under -by:-, the observation number _n is
> interpreted within groups (here all pairs) and we assign -arrival-
> to the first observation of two and -depart- to the second.
>
> Let's imagine a small section of a toy dataset and apply our expansion
> method.
>
> . list
>
>      +-----------------------+
>      | id   arrival   depart |
>      |-----------------------|
>   1. |  1      1000     1100 |
>   2. |  2      1100     1300 |
>   3. |  3      1200     1400 |
>      +-----------------------+
>
> . expand 2
> (3 observations created)
>
> . bysort id : gen time = cond(_n == 1, arrival, depart)
>
> . by id : gen inout = cond(_n == 1, 1, -1)
>
> . sort time
>
> . list, sep(0)
>
>      +--------------------------------------+
>      | id   arrival   depart   inout   time |
>      |--------------------------------------|
>   1. |  1      1000     1100       1   1000 |
>   2. |  1      1000     1100      -1   1100 |
>   3. |  2      1100     1300       1   1100 |
>   4. |  3      1200     1400       1   1200 |
>   5. |  2      1100     1300      -1   1300 |
>   6. |  3      1200     1400      -1   1400 |
>      +--------------------------------------+
>
> The flag variable -inout- records additions and subtractions, so
> that the cumulative or running sum keeps track of the number inside the
> system.  In a jargon common in economics, flows are used to calculate
> stocks, it being understood that any stock from before the start of
> records would need to be added.
>
> . gen present = sum(inout)
>
> . list, sep(0)
>
>      +------------------------------------------------+
>      | id   arrival   depart   time   inout   present |
>      |------------------------------------------------|
>   1. |  1      1000     1100   1000       1         1 |
>   2. |  1      1000     1100   1100      -1         0 |
>   3. |  2      1100     1300   1100       1         1 |
>   4. |  3      1200     1400   1200       1         2 |
>   5. |  2      1100     1300   1300      -1         1 |
>   6. |  3      1200     1400   1400      -1         0 |
>      +------------------------------------------------+
>
> This is only one trick, and others will depend on your problem. For
> example, if a clinic is only open daily, the number present should drop
> to zero at the end of each day. More generally, stocks can not be
> negative. The logic of how your system operates provides a logic for your
> code and checks on data quality.
>
> For a different problem arising with paired data, see Cox (2008).
>
> Cox, N. J. 2002.
> Speaking Stata:  How to move step by: step.
> Stata Journal 2: 86--102.
>
> Cox, N. J. 2008.
> Tip 71: The problem of split identity, or how to group dyads.
> Stata Journal 8: 588--591.
>
> On Fri, Jan 25, 2013 at 11:26 PM, Constantin Alba
> <[email protected]> wrote:
>> thanks Nick,
>>
>> that is what i was looking for exactly. and the data actually was like
>> that (start/end date) originally. so i just used the trick on it. the
>> results are correct.
>>
>> I did understand how this trick worked in the example after the link,
>> but in current example i am missing it. can you, please, explain the
>> use of "bysort obsno" here?
>>
>> thanks
>>
>> C.
>>
>> On Fri, Jan 25, 2013 at 10:10 PM, Nick Cox <[email protected]> wrote:
>>> I wouldn't approach it like that. I don't think a loop is required,
>>> but rather a different data structure. Also, holding Duration may be
>>> interesting and useful, but it is just awkward for this problem.
>>>
>>> First save the dataset if not already -save-d. Then
>>>
>>> gen long obsno = _n
>>> gen endDate = startDate + Duration
>>> expand  2
>>> bysort obsno : gen Date = cond(_n == 1, startDate, endDate)
>>> bysort obsno : gen inOut = cond(_n == 1, 1, -1)
>>> bysort PersonID (Date) : gen noActivities = sum(inOut)
>>>
>>> Compare e.g. thread starting at
>>>
>>> http://www.stata.com/statalist/archive/2012-11/msg01163.html
>>>
>>> Nick
>>>
>>> On Fri, Jan 25, 2013 at 7:38 PM, Constantin Alba
>>> <[email protected]> wrote:
>>>
>>>> I have a dataset organized by personID, for each personID there are
>>>> startDate and Duration for various activities.
>>>>
>>>> I want to find out how many activities a person performs at the same
>>>> time (e.g. something like workload)
>>>>
>>>> logically i see it like this:
>>>>
>>>> sort personID startDate Duration
>>>>
>>>> by personID: forvalues i=1/`=_N' {
>>>>                                    "compare current date with all
>>>> other prior dates+duration
>>>>                                        and update counter accordingly"
>>>>                                       }
>>>>
>>>> however, by and forvalues do not work together, so i am trying to find
>>>> an alternative way to do so.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: bysort, forvalues and _N
  - From: Constantin Alba <[email protected]>
- Re: st: bysort, forvalues and _N
  - From: Nick Cox <[email protected]>
- Re: st: bysort, forvalues and _N
  - From: Constantin Alba <[email protected]>
- Re: st: bysort, forvalues and _N
  - From: Nick Cox <[email protected]>

Prev by Date: Re: st: filling missing values
Next by Date: Re: st: advice on using Stata in an undergraduate Intro Stats course
Previous by thread: Re: st: bysort, forvalues and _N
Next by thread: st: RE: capture xtmixed
Index(es):
- Date
- Thread