Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: RE: Fwd: identifying duplicate entry errors
From: Joe Canner <[email protected]>
To: "[email protected]" <[email protected]>
Subject: RE: st: RE: Fwd: identifying duplicate entry errors
Date: Tue, 25 Feb 2014 18:56:31 +0000
Sergiy,
Thanks for the caveats. I can't say that I've thought this through very much, nor do I have any immediate plans in this direction. Although I work a lot with big data sets, the circumstances where I am most often looking for duplicates tend to involve smaller, locally collected data sets where one can actually investigate and fix duplicates. But I take your point that summarizing how two records differ can be problematic. Perhaps one would need to specify the level of dissimilarity of interest, e.g., if I am looking at 10 variables, list the records that differ on at most 2 variables, and list those variables. If the latter number is small, the output shouldn't be too bad.
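As a rough illustration of that idea, here is a minimal sketch in Python rather than Stata. The records, the variable names, and the function name `near_duplicates` are all made up for the example. It compares every pair of records, so it only suits the smaller data sets mentioned above.

```python
from itertools import combinations

def near_duplicates(records, variables, max_diff=2):
    """Return (i, j, differing_variables) for each pair of records
    that differs on at most max_diff of the listed variables."""
    found = []
    # Compare each pair of records exactly once (quadratic in the number of records).
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        differing = [v for v in variables if a[v] != b[v]]
        if len(differing) <= max_diff:
            found.append((i, j, differing))
    return found

# Hypothetical records, invented for this illustration only.
records = [
    {"name": "Ann", "dob": "1980-01-02", "sex": "F", "site": "A"},
    {"name": "Ann", "dob": "1980-01-02", "sex": "F", "site": "B"},
    {"name": "Bob", "dob": "1975-05-09", "sex": "M", "site": "A"},
    {"name": "Ann", "dob": "1980-02-01", "sex": "F", "site": "B"},
]
variables = ["name", "dob", "sex", "site"]

for i, j, differing in near_duplicates(records, variables, max_diff=1):
    print(i, j, differing)
```

With `max_diff=1`, only the pairs (0, 1) and (1, 3) are reported, differing on `site` and `dob` respectively. Exact duplicates would also be reported, with an empty list of differing variables.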
My original interest in modifying the -duplicates- command stems from a desire to have a more informative -duplicates list- function, i.e., to be able to list other variables besides the ones that match. That, to me, would be very useful in determining how similar the so-called duplicates actually are. However, -duplicates list- will only list the variables that are used to determine duplication.
Perhaps Nick's answer to this question will deal with my original question as well.
Regards,
Joe
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Sergiy Radyakin
Sent: Tuesday, February 25, 2014 12:41 PM
To: [email protected]
Subject: Re: st: RE: Fwd: identifying duplicate entry errors
Joe, programming this is perhaps the easiest part. But, how do you imagine the output of this command?
I hope you don't want to list the values.
For example, take the case where you have 1 million observations, with roughly 20% duplicates on 10 variables, differences in any of the other 30 variables, and multiple duplications (not just 2, but say 22 with one id):
diff of first vs second
diff of first vs third
...
diff of first vs twenty-second
diff of second vs third
...
diff of twenty-first vs twenty-second?
How big would the output be for the case of N obs, with VAR1=const, and VAR2=_n? I am thinking factorial(N-1)?
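For what it is worth, if every observation in a group is compared once with every other observation in that group, the number of comparisons is the number of unordered pairs, N(N-1)/2, which is large but well short of a factorial. A quick check, sketched in Python with a hypothetical helper name `pairwise_comparisons`:

```python
from math import comb

def pairwise_comparisons(group_size):
    """Number of unordered pairs among group_size observations."""
    return comb(group_size, 2)

print(pairwise_comparisons(22))         # 231 diffs for 22 records sharing one id
print(pairwise_comparisons(1_000_000))  # 499999500000 if all N = 1mln obs share one id
```

Even at quadratic growth, a listing of every pairwise diff is unmanageable for large groups, which is the point of the question.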
If you only want to list the vars that are different, then it might not hold for all the duplication groups.
id  age  income
 1   33       1
 1   33       2
 1   32       3
 2   45      70
 2   46      71
 2   47      77
dups id, diff()
Should the output be: income? (all obs differ in income), or
age+income? (some observations also differ by age). Now imagine you
have a 1mln-obs dataset.
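One possible answer, sketched here in Python rather than Stata, is to report one line per duplicate group instead of one per pair, and to mark each variable as differing for "all" or only "some" observations in the group. The function name `differing_vars` and the "all"/"some" labels are assumptions made for this illustration, not an existing command or option.

```python
from collections import defaultdict

# The example data from above: id is the duplicate key, age and income are checked.
rows = [
    {"id": 1, "age": 33, "income": 1},
    {"id": 1, "age": 33, "income": 2},
    {"id": 1, "age": 32, "income": 3},
    {"id": 2, "age": 45, "income": 70},
    {"id": 2, "age": 46, "income": 71},
    {"id": 2, "age": 47, "income": 77},
]

def differing_vars(rows, key_vars, other_vars):
    """For each group of rows sharing the same key, report which of the
    other variables vary, and whether all values are distinct or only some."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key_vars)].append(row)

    report = {}
    for key, members in groups.items():
        if len(members) < 2:
            continue  # a single observation is not a duplicate
        summary = {}
        for var in other_vars:
            n_distinct = len({m[var] for m in members})
            if n_distinct == 1:
                continue  # constant within the group, nothing to report
            summary[var] = "all" if n_distinct == len(members) else "some"
        report[key] = summary
    return report

for key, summary in differing_vars(rows, ["id"], ["age", "income"]).items():
    print(key, summary)
```

For id 1 this reports age as "some" and income as "all"; for id 2 both variables are "all". The output has at most one entry per duplicate group, so its size does not grow with the number of pairs.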
Best, Sergiy Radyakin
On Tue, Feb 25, 2014 at 12:15 PM, Joe Canner <[email protected]> wrote:
> Alison,
>
> Have you tried -duplicates list-? This is probably not as helpful as you would like, but it's a start. I have had similar wishes for the -duplicates- command. If there are no ideas forthcoming in response to your question, perhaps it is time to write an enhancement of the -duplicates- command.
>
> Regards,
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Alison El
> Ayadi
> Sent: Tuesday, February 25, 2014 11:07 AM
> To: [email protected]
> Subject: st: Fwd: identifying duplicate entry errors
>
> Dear Statalisters,
>
> I am working to identify duplicates within a very messy dataset. Among observations that have the same values for a set of variables (that I define), I would love to be able to identify the variables on which their values differ (that is, how they are not true duplicates).
>
> Does anyone have any ideas about this?
> Thanks so much,
> Alison
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/