Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: Fwd: identifying duplicate entry errors
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: RE: Fwd: identifying duplicate entry errors
Date: Tue, 25 Feb 2014 17:56:31 +0000
The aim of -duplicates- is, hmm, to identify duplicates. But it is not
the only tool to identify duplicates. Let's suppose first that you
want to identify duplicates based on -a b c-. Then
egen group = group(a b c), label
groups observations identical on -a b c-.
su group
tells you how many groups that means: the reported maximum of -group-
is the number of distinct groups. Now suppose you have further
interest in -d e f-. Consider
bysort group (d) :
If you sort on -d- within distinct groups of -group- then any
different values are shaken apart. (In fact, you don't need to create
-group- to do this; -bysort a b c (d) :- does the same.) Let's take
this further to identify what varies within distinct values of
-group-.
gen whatvaries = ""
foreach v in d e f {
    * after sorting on `v' within group, first and last differ only if `v' varies
    bysort group (`v') : replace whatvaries = ///
        whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}
The analysis of -whatvaries- might not be easy, but it's what you seem
to be asking for.
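
To make that concrete, here is a minimal sketch on made-up data. The values and the extra variable -ndiff- are assumptions for illustration only, not anything from the original question. It skips -group- and uses -a b c- directly in -bysort-, then counts the names collected in -whatvaries- to get the number of variables that differ within each group.

```stata
clear
input a b c d e f
1 1 1 10 20 30
1 1 1 10 21 30
2 2 2  5  6  7
2 2 2  5  6  7
3 3 3  1  2  3
3 3 3  9  2  4
end

* collect the names of variables that vary within groups of a b c
gen whatvaries = ""
foreach v in d e f {
    bysort a b c (`v') : replace whatvaries = ///
        whatvaries + cond(`v'[_N] != `v'[1], "`v' ", "")
}

* number of variables that differ within each group (hypothetical name)
gen ndiff = wordcount(whatvaries)

sort a b c
list a b c d e f whatvaries ndiff, sepby(a b c)
```

In this toy data only -e- varies in the first group, nothing varies in the second, and -d- and -f- vary in the third. A missing value counts as a different value here, which may or may not be what you want with messy data.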
See also
How do I list observations in a group that differ on a variable?
http://www.stata.com/support/faqs/data-management/listing-observations-in-group/
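
In the same spirit, here is a sketch of an alternative that may be easier to analyse than a string variable: one 0/1 indicator per checked variable, a row total, and a listing restricted to groups that disagree on something. The data and the names -diff_*- and -ndiff- are made up for illustration; this is not taken from the FAQ itself.

```stata
clear
input a b c d e f
1 1 1 10 20 30
1 1 1 10 21 30
2 2 2  5  6  7
2 2 2  5  6  7
3 3 3  1  2  3
3 3 3  9  2  4
end

* one 0/1 indicator per checked variable (diff_* names are hypothetical)
foreach v in d e f {
    bysort a b c (`v') : gen byte diff_`v' = `v'[1] != `v'[_N]
}

* how many of d e f differ within each group of a b c
egen ndiff = rowtotal(diff_d diff_e diff_f)

* show only groups with at least one disagreement
sort a b c
list a b c d e f ndiff if ndiff > 0, sepby(a b c)
```

With indicators you can also -tabulate- each -diff_*- variable to see which variables are most often entered inconsistently.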
Nick
[email protected]
On 25 February 2014 17:40, Alison El Ayadi <[email protected]> wrote:
> Thanks for your suggestion. I have run through a number of different
> combinations of listing the duplicates, but I suspect that there are
> duplicates that I can identify when limiting to certain variables but
> that I do not obtain when including all variables due to data entry
> errors. That's why it would be so great to have something that says
> these are duplicate groups based on var x, var y, and var z, and here is
> the total number of variables that differ between the duplicate
> observations.
>
> Best,
> Alison
>
> On Tue, Feb 25, 2014 at 9:15 AM, Joe Canner <[email protected]> wrote:
>> Alison,
>>
>> Have you tried -duplicates list-? This is probably not as helpful as you would like, but it's a start. I have had similar wishes for the -duplicates- command. If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Alison El Ayadi
>> Sent: Tuesday, February 25, 2014 11:07 AM
>> To: [email protected]
>> Subject: st: Fwd: identifying duplicate entry errors
>>
>> Dear Statalisters,
>>
>> I am working to identify duplicates within a very messy dataset and would love to be able to identify among those observations which have the same values for a set of variables (that I define) what are the variables where their values differ (how are they not true duplicates).