st: RE: Replacing duplicate values
From: "Nick Cox" <[email protected]>
To: <[email protected]>
Date: Thu, 1 Apr 2010 16:00:17 +0100
It's a Stata two-step: reshape, drop duplicates, reshape back. Something like
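* warning: untested code
* pos records each IPC's original column, so the original order can be kept
reshape long ipc_, i(id) j(pos)
* empty slots would otherwise sort first and end up in ipc_1
drop if missing(ipc_)
bysort id ipc_ (pos): gen superfluousandredundant = _n > 1
drop if superfluousandredundant
bysort id (pos): gen j = _n
drop pos superfluousandredundant
reshape wide ipc_, i(id) j(j)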
Actually, the last -reshape- might not be a good idea. The long structure might be more useful.
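To illustrate why the long structure might be handier, here is a minimal sketch that stops after the first -reshape-. It assumes the example data from the question quoted below, retyped by hand; the names pos and n_ipc are chosen here for illustration and are not in the original post. Like the code above, it has not been run against the real dataset.

```stata
* sketch only: example data retyped from the question
clear
input id str4 ipc_1 str4 ipc_2 str4 ipc_3 str4 ipc_4
1 "A44B" "G09F" "H04N" ""
2 "A47B" "G06F" "H05K" "E05D"
3 "A47B" "G06F" ""     ""
4 "A47B" "H04N" "H05K" ""
5 "A47B" ""     ""     ""
6 "A47B" "F16M" "F16M" "H05K"
7 "A47B" "A47B" "F16M" "A47B"
end

* pos (assumed name) holds the original column number of each IPC
reshape long ipc_, i(id) j(pos)
drop if missing(ipc_)

* keep only the first occurrence of each IPC within a patent
bysort id ipc_ (pos): keep if _n == 1

* check: each patent-IPC pair now appears exactly once
isid id ipc_

* n_ipc (assumed name): number of distinct IPCs per patent
bysort id: gen n_ipc = _N
list id pos ipc_ n_ipc, sepby(id)
```

In this layout a count of distinct IPCs per patent is a one-line -by- operation, and no empty ipc_ slots need to be carried around.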
Nick
Pavlos C. Symeou wrote:
I have a dataset on patents. Every patent is assigned a number of
International Patent Classifications (IPCs). However, there are
mistakes in the database, and certain IPCs appear more than once for a
single patent, which is meaningless. Examples are the patents with id 6
and id 7 (ipc_1, ipc_2, etc. hold the IPCs assigned to a single
patent). For the patent with id 6, ipc_2 and ipc_3 are the same. Id 7
illustrates a more general issue: duplicate values may not appear
sequentially and may appear more than twice.
id  ipc_1  ipc_2  ipc_3  ipc_4
1   A44B   G09F   H04N
2   A47B   G06F   H05K   E05D
3   A47B   G06F
4   A47B   H04N   H05K
5   A47B
6   A47B   F16M   F16M   H05K
7   A47B   A47B   F16M   A47B
Can you suggest a way to delete the duplicate values, of which there
may be more than two, and move the remaining values to the left? For
example, the patents with id 6 and id 7 would look like this:
id  ipc_1  ipc_2  ipc_3  ipc_4
6   A47B   F16M   H05K
7   A47B   F16M