Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: bysort, forvalues and _N
From
KOTa <[email protected]>
To
[email protected]
Subject
Re: st: bysort, forvalues and _N
Date
Sat, 26 Jan 2013 11:52:05 +0200
thank you
C.
On Sat, Jan 26, 2013 at 2:39 AM, Nick Cox <[email protected]> wrote:
> This is something I have in draft form:
>
> Often dates for transactions or events come in pairs, say start and
> finish or open and close. People arrive for an appointment at a clinic
> and later leave; people order goods, which are later delivered; people
> start work for an employer, and later leave. In all cases, when we take
> a snapshot, we may find people still in the clinic, goods not yet
> delivered, or (it is to be hoped) people still working for the employer.
>
> With such events, it is natural that each pair of events is often
> recorded as an observation (case, row, or record) in a dataset. Such a
> structure makes some calculations easy. For exanple, the differences
> between arrivals and departures or orders and deliveries are key to
> system performance, although the ideal will be short delays for seeing
> patients or selling goods and long delays for periods of employment or
> lifetimes. In Stata with two variables for (say) -arrival- and -depart-,
> the delay is just
>
> . gen lapse = depart - arrival
>
> Precisely how we record time will depend on the problem, but here I am
> imagining a date or date-time or time variable.
>
> If the closing events have yet to happen, then -depart- may need to
> be recorded as missing. If so, -lapse- will in turn be missing.
> Notice, by the way, a simple data quality check that the time lapse can
> never be negative. Time lapses recorded as zeros might also need
> checking in some situations; was the system really that fast and
> efficient (or ruthless)?
>
> Such a simple data structure -- one observation for each transaction --
> may also be awkward, which leads to the main reason for this tip.
> Experience with data structures in Stata might lead readers to suggest a
> -reshape long-, which could be a good idea, but there is an easier
> alternative, to use -expand-.
>
> We need first a unique or distinct identifier for each transaction,
> which may already exist. The command -isid- allows a simple check of
> whether an identifier variable indeed matches its purpose. If the
> identifier variable is broken or non-existent, then something like
>
> . gen long id = _n
>
> creates a new identifier fit for purpose. Specifying a -long-
> variable type allows over 2 billion distinct positive identifiers,
> should they be needed. Otherwise we use the existing identifier. Then
>
> . expand 2
>
> is the abracadabra needed. We turn each observation into two. The new
> observations are added at the end of the dataset, so we need to sort
> them before we can create two new variables that are the keys to other
> calculations.
>
> . bysort id: gen time = cond(_n == 1, arrival, depart)
>
> Each distinct value of the identifier now occurs precisely twice. We can
> therefore use the framework provided by -by:-. See Cox (2002) for a
> tutorial if desired. Under -by:-, the observation number _n is
> interpreted within groups (here all pairs) and we assign -arrival-
> to the first observation of two and -depart- to the second.
>
> Let's imagine a small section of a toy dataset and apply our expansion
> method.
>
> . list
>
> +-----------------------+
> | id arrival depart |
> |-----------------------|
> 1. | 1 1000 1100 |
> 2. | 2 1100 1300 |
> 3. | 3 1200 1400 |
> +-----------------------+
>
> . expand 2
> (3 observations created)
>
> . bysort id : gen time = cond(_n == 1, arrival, depart)
>
> . by id : gen inout = cond(_n == 1, 1, -1)
>
> . sort time
>
> . list, sep(0)
>
> +--------------------------------------+
> | id arrival depart inout time |
> |--------------------------------------|
> 1. | 1 1000 1100 1 1000 |
> 2. | 1 1000 1100 -1 1100 |
> 3. | 2 1100 1300 1 1100 |
> 4. | 3 1200 1400 1 1200 |
> 5. | 2 1100 1300 -1 1300 |
> 6. | 3 1200 1400 -1 1400 |
> +--------------------------------------+
>
> The flag variable -inout- records additions and subtractions, so
> that the cumulative or running sum keeps track of the number inside the
> system. In a jargon common in economics, flows are used to calculate
> stocks, it being understood that any stock from before the start of
> records would need to be added.
>
> . gen present = sum(inout)
>
> . list, sep(0)
>
> +------------------------------------------------+
> | id arrival depart time inout present |
> |------------------------------------------------|
> 1. | 1 1000 1100 1000 1 1 |
> 2. | 1 1000 1100 1100 -1 0 |
> 3. | 2 1100 1300 1100 1 1 |
> 4. | 3 1200 1400 1200 1 2 |
> 5. | 2 1100 1300 1300 -1 1 |
> 6. | 3 1200 1400 1400 -1 0 |
> +------------------------------------------------+
>
> This is only one trick, and others will depend on your problem. For
> example, if a clinic is only open daily, the number present should drop
> to zero at the end of each day. More generally, stocks can not be
> negative. The logic of how your system operates provides a logic for your
> code and checks on data quality.
>
> For a different problem arising with paired data, see Cox (2008).
>
> Cox, N. J. 2002.
> Speaking Stata: How to move step by: step.
> Stata Journal 2: 86--102.
>
> Cox, N. J. 2008.
> Tip 71: The problem of split identity, or how to group dyads.
> Stata Journal 8: 588--591.
>
> On Fri, Jan 25, 2013 at 11:26 PM, Constantin Alba
> <[email protected]> wrote:
>> thanks Nick,
>>
>> that is what i was looking for exactly. and the data actually was like
>> that (start/end date) originally. so i just used the trick on it. the
>> results are correct.
>>
>> I did understand how this trick worked in the example after the link,
>> but in current example i am missing it. can you, please, explain the
>> use of "bysort obsno" here?
>>
>> thanks
>>
>> C.
>>
>> On Fri, Jan 25, 2013 at 10:10 PM, Nick Cox <[email protected]> wrote:
>>> I wouldn't approach it like that. I don't think a loop is required,
>>> but rather a different data structure. Also, holding Duration may be
>>> interesting and useful, but it is just awkward for this problem.
>>>
>>> First save the dataset if not already -save-d. Then
>>>
>>> gen long obsno = _n
>>> gen endDate = startDate + Duration
>>> expand 2
>>> bysort obsno : gen Date = cond(_n == 1, startDate, endDate)
>>> bysort obsno : gen inOut = cond(_n == 1, 1, -1)
>>> bysort PersonID (Date) : gen noActivities = sum(inOut)
>>>
>>> Compare e.g. thread starting at
>>>
>>> http://www.stata.com/statalist/archive/2012-11/msg01163.html
>>>
>>> Nick
>>>
>>> On Fri, Jan 25, 2013 at 7:38 PM, Constantin Alba
>>> <[email protected]> wrote:
>>>
>>>> I have a dataset organized by personID, for each personID there are
>>>> startDate and Duration for various activities.
>>>>
>>>> I want to find out how many activities a person performs at the same
>>>> time (e.g. something like workload)
>>>>
>>>> logically i see it like this:
>>>>
>>>> sort personID startDate Duration
>>>>
>>>> by personID: forvalues i=1/`=_N' {
>>>> "compare current date with all
>>>> other prior dates+duration
>>>> and update counter accordingly"
>>>> }
>>>>
>>>> however, by and forvalues do not work together, so i am trying to find
>>>> an alternative way to do so.
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/