Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: Fwd: identifying duplicate entry errors
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: RE: Fwd: identifying duplicate entry errors
Date: Tue, 25 Feb 2014 12:40:53 -0500
Joe, programming this is perhaps the easiest part. But how do you
imagine the output of this command?
I hope you don't want to list the values.
For example, take the case where you have 1 mln obs, with roughly 20%
duplicates on 10 vars, differences in any of the other 30 vars, and
multiple duplications (not just 2, but say 22 with one id):
diff of first vs second
diff of first vs third
...
diff of first vs twenty-second
diff of second vs third
...
diff of twenty-first vs twenty-second?
How big would the output be for the case of N obs, with VAR1=const
and VAR2=_n? I am thinking N*(N-1)/2, one diff per pair of observations.
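
As a rough check on the size question: the listing scheme above (first vs second, ..., twenty-first vs twenty-second) makes one comparison per unordered pair of observations, so a group of k duplicates yields k*(k-1)/2 diffs. A minimal Python sketch of that count follows; the function name pairwise_diff_count is made up for illustration, and the group sizes are the ones mentioned in this message plus one hypothetical N.

```python
from math import comb


def pairwise_diff_count(group_sizes):
    """Total number of 'obs i vs obs j' comparisons across duplicate groups.

    Each group of k observations contributes comb(k, 2) = k*(k-1)/2 pairs.
    """
    return sum(comb(k, 2) for k in group_sizes)


# One id that occurs 22 times, as in the example above
print(pairwise_diff_count([22]))    # 231

# Hypothetical: N = 1000 obs that all share one id (VAR1=const)
print(pairwise_diff_count([1000]))  # 499500
```

The count grows quadratically with the group size, so a full pairwise listing is only practical for small groups.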
If you only want to list the vars that are different, then the answer
might not be the same for all the duplication groups, or even for all
pairs within one group.
id age income
1 33 1
1 33 2
1 32 3
2 45 70
2 46 71
2 47 77
dups id, diff()
Should the output be: income? (all obs differ in income), or
age+income? (some observations also differ by age). Now imagine you
have a 1 mln-obs dataset.
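
One way to make the two readings concrete is to report both per group: the variables that are not constant within the group ("any"), and the variables on which every observation in the group has a distinct value ("all"). The sketch below is an illustration in Python, not an existing Stata command; the function name differing_vars and the "any"/"all" labels are assumptions introduced here, and the data are the six rows from the table above.

```python
from collections import defaultdict


def differing_vars(rows, key, candidates):
    """For each duplicated key value, report which candidate variables differ.

    'any': the variable is not constant within the group.
    'all': every observation in the group has a distinct value.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    report = {}
    for k, members in groups.items():
        if len(members) < 2:
            continue  # key value occurs once, so it is not a duplicate
        any_diff, all_diff = [], []
        for var in candidates:
            distinct = len({m[var] for m in members})
            if distinct > 1:
                any_diff.append(var)
            if distinct == len(members):
                all_diff.append(var)
        report[k] = {"any": any_diff, "all": all_diff}
    return report


rows = [
    {"id": 1, "age": 33, "income": 1},
    {"id": 1, "age": 33, "income": 2},
    {"id": 1, "age": 32, "income": 3},
    {"id": 2, "age": 45, "income": 70},
    {"id": 2, "age": 46, "income": 71},
    {"id": 2, "age": 47, "income": 77},
]

report = differing_vars(rows, "id", ["age", "income"])
for k, summary in report.items():
    print(k, summary)
```

For id 1 this gives age and income under "any" but only income under "all"; for id 2 both variables appear under both. The output is one summary per duplicated id rather than one line per pair, so it stays manageable on a large dataset.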
Best, Sergiy Radyakin
On Tue, Feb 25, 2014 at 12:15 PM, Joe Canner <[email protected]> wrote:
> Alison,
>
> Have you tried -duplicates list-? This is probably not as helpful as you would like, but it's a start. I have had similar wishes for the -duplicates- command. If no ideas are forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>
> Regards,
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Alison El Ayadi
> Sent: Tuesday, February 25, 2014 11:07 AM
> To: [email protected]
> Subject: st: Fwd: identifying duplicate entry errors
>
> Dear Statalisters,
>
> I am working to identify duplicates within a very messy dataset. Among observations that have the same values for a set of variables (that I define), I would love to be able to identify the variables whose values differ (that is, how they are not true duplicates).
>
> Does anyone have any ideas about this?
> Thanks so much,
> Alison
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/