Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: bysort, forvalues and _N
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: bysort, forvalues and _N
Date
Sat, 26 Jan 2013 00:39:20 +0000
This is something I have in draft form:
Often dates for transactions or events come in pairs, say start and
finish or open and close. People arrive for an appointment at a clinic
and later leave; people order goods, which are later delivered; people
start work for an employer, and later leave. In all cases, when we take
a snapshot, we may find people still in the clinic, goods not yet
delivered, or (it is to be hoped) people still working for the employer.
With such events, it is natural that each pair of events is often
recorded as an observation (case, row, or record) in a dataset. Such a
structure makes some calculations easy. For exanple, the differences
between arrivals and departures or orders and deliveries are key to
system performance, although the ideal will be short delays for seeing
patients or selling goods and long delays for periods of employment or
lifetimes. In Stata with two variables for (say) -arrival- and -depart-,
the delay is just
. gen lapse = depart - arrival
Precisely how we record time will depend on the problem, but here I am
imagining a date or date-time or time variable.
If the closing events have yet to happen, then -depart- may need to
be recorded as missing. If so, -lapse- will in turn be missing.
Notice, by the way, a simple data quality check that the time lapse can
never be negative. Time lapses recorded as zeros might also need
checking in some situations; was the system really that fast and
efficient (or ruthless)?
Such a simple data structure -- one observation for each transaction --
may also be awkward, which leads to the main reason for this tip.
Experience with data structures in Stata might lead readers to suggest a
-reshape long-, which could be a good idea, but there is an easier
alternative, to use -expand-.
We need first a unique or distinct identifier for each transaction,
which may already exist. The command -isid- allows a simple check of
whether an identifier variable indeed matches its purpose. If the
identifier variable is broken or non-existent, then something like
. gen long id = _n
creates a new identifier fit for purpose. Specifying a -long-
variable type allows over 2 billion distinct positive identifiers,
should they be needed. Otherwise we use the existing identifier. Then
. expand 2
is the abracadabra needed. We turn each observation into two. The new
observations are added at the end of the dataset, so we need to sort
them before we can create two new variables that are the keys to other
calculations.
. bysort id: gen time = cond(_n == 1, arrival, depart)
Each distinct value of the identifier now occurs precisely twice. We can
therefore use the framework provided by -by:-. See Cox (2002) for a
tutorial if desired. Under -by:-, the observation number _n is
interpreted within groups (here all pairs) and we assign -arrival-
to the first observation of two and -depart- to the second.
Let's imagine a small section of a toy dataset and apply our expansion
method.
. list
+-----------------------+
| id arrival depart |
|-----------------------|
1. | 1 1000 1100 |
2. | 2 1100 1300 |
3. | 3 1200 1400 |
+-----------------------+
. expand 2
(3 observations created)
. bysort id : gen time = cond(_n == 1, arrival, depart)
. by id : gen inout = cond(_n == 1, 1, -1)
. sort time
. list, sep(0)
+--------------------------------------+
| id arrival depart inout time |
|--------------------------------------|
1. | 1 1000 1100 1 1000 |
2. | 1 1000 1100 -1 1100 |
3. | 2 1100 1300 1 1100 |
4. | 3 1200 1400 1 1200 |
5. | 2 1100 1300 -1 1300 |
6. | 3 1200 1400 -1 1400 |
+--------------------------------------+
The flag variable -inout- records additions and subtractions, so
that the cumulative or running sum keeps track of the number inside the
system. In a jargon common in economics, flows are used to calculate
stocks, it being understood that any stock from before the start of
records would need to be added.
. gen present = sum(inout)
. list, sep(0)
+------------------------------------------------+
| id arrival depart time inout present |
|------------------------------------------------|
1. | 1 1000 1100 1000 1 1 |
2. | 1 1000 1100 1100 -1 0 |
3. | 2 1100 1300 1100 1 1 |
4. | 3 1200 1400 1200 1 2 |
5. | 2 1100 1300 1300 -1 1 |
6. | 3 1200 1400 1400 -1 0 |
+------------------------------------------------+
This is only one trick, and others will depend on your problem. For
example, if a clinic is only open daily, the number present should drop
to zero at the end of each day. More generally, stocks can not be
negative. The logic of how your system operates provides a logic for your
code and checks on data quality.
For a different problem arising with paired data, see Cox (2008).
Cox, N. J. 2002.
Speaking Stata: How to move step by: step.
Stata Journal 2: 86--102.
Cox, N. J. 2008.
Tip 71: The problem of split identity, or how to group dyads.
Stata Journal 8: 588--591.
On Fri, Jan 25, 2013 at 11:26 PM, Constantin Alba
<[email protected]> wrote:
> thanks Nick,
>
> that is what i was looking for exactly. and the data actually was like
> that (start/end date) originally. so i just used the trick on it. the
> results are correct.
>
> I did understand how this trick worked in the example after the link,
> but in current example i am missing it. can you, please, explain the
> use of "bysort obsno" here?
>
> thanks
>
> C.
>
> On Fri, Jan 25, 2013 at 10:10 PM, Nick Cox <[email protected]> wrote:
>> I wouldn't approach it like that. I don't think a loop is required,
>> but rather a different data structure. Also, holding Duration may be
>> interesting and useful, but it is just awkward for this problem.
>>
>> First save the dataset if not already -save-d. Then
>>
>> gen long obsno = _n
>> gen endDate = startDate + Duration
>> expand 2
>> bysort obsno : gen Date = cond(_n == 1, startDate, endDate)
>> bysort obsno : gen inOut = cond(_n == 1, 1, -1)
>> bysort PersonID (Date) : gen noActivities = sum(inOut)
>>
>> Compare e.g. thread starting at
>>
>> http://www.stata.com/statalist/archive/2012-11/msg01163.html
>>
>> Nick
>>
>> On Fri, Jan 25, 2013 at 7:38 PM, Constantin Alba
>> <[email protected]> wrote:
>>
>>> I have a dataset organized by personID, for each personID there are
>>> startDate and Duration for various activities.
>>>
>>> I want to find out how many activities a person performs at the same
>>> time (e.g. something like workload)
>>>
>>> logically i see it like this:
>>>
>>> sort personID startDate Duration
>>>
>>> by personID: forvalues i=1/`=_N' {
>>> "compare current date with all
>>> other prior dates+duration
>>> and update counter accordingly"
>>> }
>>>
>>> however, by and forvalues do not work together, so i am trying to find
>>> an alternative way to do so.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/