Radu Ban
> The data is organized like this (the numbers are made up for this
> description):
>
> id dummy descriptor
> 13 1 <blank>
> 13 0 abc
> 13 1 <blank>
> 14 0 <blank>
> 14 0 def
> 14 0 def
>
> The idea is that the id variable should be unique, but for some
> reason it is not. This means that both the dummy and descriptor
> should have the same values across the id groups. A complication
> is that for the dummy, if there's a "1" in a group all the group
> should be "1".
>
> I want to reduce this to a clean version which looks like this:
>
> id dummy descriptor
> 13 1 abc
> 14 0 def
>
> For the dummy part I dealt with it like this (probably a convoluted
> method):
> bysort id: egen maxdummy = max(dummy)
> replace dummy = maxdummy
> bysort id: keep if _n == 1
>
> But I am a bit stuck on how to deal with the string descriptor. I
> mean I know one way of doing it, by splitting the data and then
> merging it back but there has to be a more efficient way.
I think you are right: you can do all you want in one place.
The dummy can be sorted out your way, or this way:
bysort id (dummy) : replace dummy = dummy[_N]
as 1s will get sorted to the end.
If I understand correctly, the descriptor can be
sorted out similarly
bysort id (descriptor) : replace descriptor = descriptor[_N]
as the empty strings will get sorted to the beginning.
However, before you do that you should test the
assumption that all (non-empty) descriptors are
identical within -id-:
gen empty = mi(descriptor)
bysort id empty (descriptor) : assert descriptor[1] == descriptor[_N]
On the last, see also
http://www.stata.com/support/faqs/data/diff.html
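
Putting the pieces together, here is a minimal sketch run on the toy data from the question. It assumes that <blank> means an empty string, that descriptor fits in a str3, and that dummy has no missing values (numeric missings sort after 1, so dummy[_N] would pick them up). The -input- block and the -list- at the end are only there for illustration.

```stata
* Toy data from the question; "" stands in for <blank>
clear
input id dummy str3 descriptor
13 1 ""
13 0 "abc"
13 1 ""
14 0 ""
14 0 "def"
14 0 "def"
end

* Check that the non-empty descriptors agree within each id;
* -assert- stops with an error if any group disagrees
gen byte empty = mi(descriptor)
bysort id empty (descriptor) : assert descriptor[1] == descriptor[_N]
drop empty

* 1s sort to the end within id, so the last value is the group maximum
bysort id (dummy) : replace dummy = dummy[_N]

* Empty strings sort to the beginning, so the last value is non-empty
* whenever the group has any non-empty descriptor
bysort id (descriptor) : replace descriptor = descriptor[_N]

* Keep one observation per id
bysort id : keep if _n == 1
list, noobs
```

If the -assert- passes, this should leave one observation per id, matching the clean version in the question. Running the check before the -replace- matters, because afterwards every descriptor within an id is identical by construction.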
Nick