In problems like this, there is often a clash of
experts, between the by-pilots and the merge mavens.
The by-pilots love to show how a quick -sort- and an
application of -by:- can get you where you want to be.
The merge mavens set up multiple files and then put them
together with some clever choreography, -merge- magic
making light work of a mess.
You have found a way to a solution that culminates in
a -merge-, but understandably want something simpler.
If I understand this correctly, there is no need
for a hit on -collapse-. You just need some prior
surgery on -org_name- using -by:-.
In case of name conflict, you want the first -org_name-
used for each -id- to be definitive. Spreading this to
all occurrences of a given -id- is a standard problem
for -by:-. (There is some database terminology
for this; is it indeed "spreading"?)
sort id, stable
by id : replace org_name = org_name[1]
This works with both strings and numerics.
Now your data are ready for -collapse-.
Recall that under the aegis of -by:- subscripts are
reckoned within each group of observations defined by
-by:-. The -stable- option takes account of observation
order, but in most such problems people will have a time
or sequence identifier.
For a discursive tutorial on -by:-, see
SJ-2-1 pr0004 . . . . . . . . . . Speaking Stata: How to move step by: step
Q1/02 SJ 2(1):86-102 (no commands)
explains the use of the by varlist : construct to tackle
a variety of problems with group structure, ranging from
simple calculations for each of several groups to more
advanced manipulations that use the built-in _n and _N
Alternatively, use the manual index to identify various sections
on -by:-, and then read them all to consolidate your by-pilot
skills.
Nick
[email protected]
Caleb Southworth
> A problem that comes up occasionally when collapsing data on
> organizations
> is the inability to take the first instance of a variable in
> the set being
> collapsed. This would be most useful when that variable is a
> string, but I
> believe is also of general use. Below is my work around, but perhaps
> someone has a better one?
>
> Imagine data on organizations:
>
> id org_name _1935 _1936
> 1 foo 10 .
> 1 foo_blah . 12
> 2 noo 54 55
>
> I have duplicate ids and I want to collapse the data, but I
> don't want to
> lose the name. The non-existent command would be
>
> -collapse (first) org_name (sum) _*, by(id)-
>
> instead
>
> save orig
> duplicates drop id, force
> keep id org_name
> save temp_name, replace
> use orig, clear
> collapse (sum) _*, by(id)
> merge id using temp_name, sort
>
> (first) in collapse could also be used to priviledge the data
> on one type
> of case, i.e. the start date of an organization: - sort id start- then
> collapse keeping earliest start date.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/