Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: Replacing duplicate values
From
"Pavlos C. Symeou" <[email protected]>
To
"[email protected]" <[email protected]>
Subject
st: Replacing duplicate values
Date
Tue, 06 Apr 2010 15:21:29 +0200
Dear Martin and Nick,
thank you for your input on my previous inquiry, which has worked in many
instances. However, I am experiencing problems with reshaping the data
between long and wide formats. The datasets I am working on are big
(more than 1 GB) and consist of about 600 string variables. Reshaping
back and forth not only takes ages to complete (I am working on a
powerful quad-core Windows Vista 64 PC with Stata 11 MP) but also
creates enormous files which I can't handle. I would like to ask whether
you have any alternatives to the ones below. Allow me first to explain
the task again.
I have a dataset of patents. Every patent cites other patents, and a
patent may cite multiple existing patents. The dataset is in wide
format, with one row per patent holding the patent's id and its
citations (citation_1, citation_2, etc.). However, there are mistakes in
the dataset: certain citations appear more than once for a single
patent, which is meaningless. Examples are the patents with id 1 and
id 2, where citations AAAA and NICK, respectively, are repeated. The
patent with id 3 has three citations, but they appear in places 2, 3,
and 4 rather than 1, 2, and 3 (a similar issue affects the patent with
id 4).
id citation_1 citation_2 citation_3 citation_4 citation_5
1 AAAA BBBB CCCC AAAA
2 NICK NICK MARTIN NICK
3 YYYY NNNN PAVLOS
4 ZZZZ FFFF TRDFF
5
.
The task is to delete duplicate values within each observation and then move the remaining values to the left, towards citation_1. For example, the patents with id 2 and id 3 would look like this:
id citation_1 citation_2 citation_3 citation_4 citation_5
2 NICK MARTIN
3 YYYY NNNN PAVLOS
.
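To make the target concrete, here is a minimal sketch of the row-wise operation on the toy example, done directly on the wide data without -reshape-. It is only an illustration under assumptions: the blank cell positions for ids 1, 2 and 4 are guessed, the local macros k, km1, last, next and the loop indices are names introduced here, and the number of -replace- statements grows with the square of the number of columns, so whether this is practical for roughly 600 variables is untested.

```stata
* Toy data from the example above (blank positions for ids 1, 2, 4 are assumed)
clear
input id str6 citation_1 str6 citation_2 str6 citation_3 str6 citation_4 str6 citation_5
1 "AAAA" "BBBB" "CCCC"   "AAAA"   ""
2 "NICK" "NICK" "MARTIN" "NICK"   ""
3 ""     "YYYY" "NNNN"   "PAVLOS" ""
4 ""     "ZZZZ" "FFFF"   "TRDFF"  ""
5 ""     ""     ""       ""       ""
end

local k 5                      // number of citation columns (assumed)
local km1 = `k' - 1

* Step 1: blank any value that already appears in an earlier column of the row
forvalues i = 2/`k' {
    local last = `i' - 1
    forvalues j = 1/`last' {
        quietly replace citation_`i' = "" if citation_`i' == citation_`j'
    }
}

* Step 2: slide values left into empty cells. Each pass moves a value at most
* one column, so k-1 passes cover the worst case (one value in the last column).
forvalues pass = 1/`km1' {
    forvalues i = 1/`km1' {
        local next = `i' + 1
        * pull the right-hand neighbour into an empty cell ...
        quietly replace citation_`i' = citation_`next' if citation_`i' == ""
        * ... then blank the neighbour; after step 1, equal non-empty
        * neighbours can only be the copy just made
        quietly replace citation_`next' = "" if citation_`i' == citation_`next'
    }
}

* Check against the desired result for ids 2 and 3
assert citation_1 == "NICK" & citation_2 == "MARTIN" & citation_3 == "" if id == 2
assert citation_1 == "YYYY" & citation_2 == "NNNN" & citation_3 == "PAVLOS" if id == 3
list, noobs
```

Unlike the -bysort- approach, this keeps the first occurrence of each citation in its original left-to-right order rather than alphabetical order.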
You suggested I use the following code, which simply blanks out the duplicates:
***********************************************************************
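* one observation per (id, citation slot); the slot number is stored in _j
reshape long ipc_, i(id)
* within each id, flag every repeat of a value after its first occurrence
bysort id ipc_: gen superfluousandredundant = _n> 1
* blank out the flagged repeats
replace ipc_="" if superfluousandredundant==1
drop superfluousandredundant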
***********************************************************************
Further, I have used the following code to shift the remaining values of each observation to the left:
******************************************************************************************************
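g unit=1
* number the non-empty citations within each id: 1, 2, ...
bysort id: generate runsum = sum(unit) if ipc_!=""
rename runsum _runsum
* missing _runsum sorts last, so the blanked rows follow the numbered ones
sort id _runsum
bysort id: g n=_n
* give the blank rows the remaining slot numbers so j is unique within id
replace _runsum=n if _runsum==.
drop _j unit n
* back to wide; the non-empty citations now occupy the leftmost columns
reshape wide ipc, i(id) j(_runsum)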
****************************************************************************************************
The problem is that both pieces of code use -reshape-, which only works when a dataset is very small (I have one dataset for each of a sample of 300 companies). Can you suggest another way to achieve the above task?
Best wishes,
Pavlos