From: Robert Picard <picard@netbox.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Data cleaning challenge
Date: Tue, 6 Aug 2013 11:53:07 -0400
Here's a working example that illustrates techniques to identify your bad duplicates. Because base-10 fractions cannot be added exactly in binary floating point, all prices are in cents. I assume that differences in totals are a consequence of duplicate items, so I discard all items that appear only once within a transaction. I also assume that a bad duplicate can occur at most once per transaction item group. The bad duplicates are easy to identify when the difference in totals equals a single item price or the sum of the duplicates. The harder cases, not resolved by those techniques, involve differences equal to the sum of two or more prices. I show a way to handle a combination of two prices; if needed, the same idea can be extended to look for differences explained by the sum of 3 or more separate duplicates (a rough sketch follows the example). Since this is already long, I skip the extra overhead needed to merge the identified bad duplicates back into the original dataset (but see the last sketch below).

* --------------- begin example ---------------------------
clear
set obs 100000
gen transaction = _n

* create up to 10 items per transaction, set prices in
* cents to avoid rounding errors when generating totals
set seed 123
expand round(runiform() * 10)
gen itemid = _n
gen itemprice = int(runiform() * 500)

* add some real duplicates
expand round(runiform() * 3) if runiform() < .1

* generate the gold standard total price
sort transaction itemid
by transaction: egen totalgold = total(itemprice)

* add duplicates generated by error, do not allow
* more than one bad duplicate per itemid
by transaction itemid: gen one = _n == 1
expand 2 if (runiform() < .1) & one
sort transaction itemid
by transaction: egen totalbad = total(itemprice)
gen diff = totalbad - totalgold
drop one
gen obs = _n

* drop all cases where there is a match
drop if diff == 0

* reduce to real & bad duplicates;
* target one observation per transaction item group
by transaction itemid: keep if _N > 1
by transaction itemid: keep if _n == _N
by transaction: egen totaldup = total(itemprice)

* easy case: the difference equals a single item price
gen isbad = itemprice == diff
by transaction: egen nbad = total(isbad)

* but no way to decide if more than one match
tab nbad
list if nbad > 1, sepby(transaction)
drop if nbad
drop nbad

* easy case: the difference equals the sum of all duplicates
replace isbad = totaldup == diff
tab isbad
drop if isbad
drop isbad

* easy case; bad duplicates are == !isgood
gen isgood = (totaldup - itemprice) == diff
by transaction: egen nisgood = total(isgood)

* but no way to decide if more than one match
tab nisgood
list if nisgood > 1, sepby(transaction)
drop if nisgood

* harder case; create all pairwise combinations of
* duplicate items
keep transaction itemprice diff obs
tempfile hard
save "`hard'"
rename itemprice itemprice1
rename obs obs1
joinby transaction using "`hard'"
drop if obs == obs1
sort transaction obs obs1

* identify bad duplicates
gen isbad = diff == (itemprice + itemprice1)
by transaction: egen nbad = total(isbad)
tab nbad
list if nbad > 2, sepby(transaction)
drop if nbad

* what's left requires combinations of more than 2
* duplicate items to explain diff
sort transaction obs
by transaction obs: keep if _n == 1
keep transaction obs itemprice diff
* --------------- end example -----------------------------
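As an aside on the choice of cents: base-10 fractions like dollar prices generally have no exact binary representation, so equality tests on sums of dollar amounts can fail even when the arithmetic "should" balance (Stata's default float storage type only makes this worse). A quick check, separate from the example above:

* a 10-cent item plus a 20-cent item does not compare equal to 30
* cents when expressed in dollars ...
display .1 + .2 == .3
display %21.18f .1 + .2
* ... but integer cents compare exactly
display 10 + 20 == 30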
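For what it's worth, here is a rough, untested sketch of how the pairwise -joinby- step could be extended to combinations of three duplicates. It assumes you start from the reduced dataset left at the end of the example (one observation per remaining duplicate, with transaction, obs, itemprice, and diff); the names itemprice2, obs2, and the tempfile hard3 are new.

* --------------- sketch: 3-item combinations (untested) ---
tempfile hard3
save "`hard3'"

* first copy of the candidate duplicates
rename itemprice itemprice1
rename obs obs1
joinby transaction using "`hard3'"

* second copy
rename itemprice itemprice2
rename obs obs2
joinby transaction using "`hard3'"

* keep only combinations of three distinct observations
drop if obs == obs1 | obs == obs2 | obs1 == obs2
sort transaction obs1 obs2 obs

* flag triples whose prices sum to the difference; each valid
* triple shows up 3! = 6 times, so more than 6 is ambiguous
gen isbad = diff == (itemprice + itemprice1 + itemprice2)
by transaction: egen nbad = total(isbad)
tab nbad
list if nbad > 6, sepby(transaction)
* --------------- end sketch -------------------------------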
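Also for what it's worth, a rough sketch of the merge-back step I skipped. It assumes a copy of the full dataset is saved right after -gen obs = _n- and that the obs numbers of the flagged bad duplicates are collected into a second file; the file names full_with_obs.dta and badobs.dta are only placeholders.

* --------------- sketch: merging flagged duplicates back --
* right after -gen obs = _n- in the example, keep a copy
save "full_with_obs.dta", replace

* ... at each step that isolates unambiguous bad duplicates,
* collect their obs numbers, e.g. for the first easy case:
preserve
keep if isbad & nbad == 1
keep obs
save "badobs.dta", replace   // -append- to this file on later passes
restore

* finally, drop exactly those observations from the saved copy
use "full_with_obs.dta", clear
merge 1:1 obs using "badobs.dta", keep(master) nogenerate
* --------------- end sketch -------------------------------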
On Mon, Aug 5, 2013 at 9:47 AM, Douglas Levy <douglas_levy@post.harvard.edu> wrote:
> Hi All,
> I have a data cleaning challenge where data files have some duplicate
> lines and I'm trying to automate the elimination of those duplicates.
> Consider the following. File A has item-level cash register data and
> some of the items are duplicated in error, though some items are
> duplicated legitimately (e.g. someone bought more than one of an
> item). Therefore, I can't just eliminate all duplicates. File B has
> total transaction amounts (a gold standard), but does not have
> individual item-level data. See below for example data from a merged
> file.
>
> For transaction #29, I can see that File A's transaction total ($2.50)
> is $0.50 over File B's total ($2.00). It is clear that item #50003 was
> entered twice in error. I tagged all items whose price was equal to
> the difference between the cash register total and the gold standard
> total using the variable "marker." Now I can eliminate the first item
> in the transaction that is tagged with "marker." I can do similarly
> with transaction #32. However, consider transaction #207. There, it is
> clear that there are *two* duplicates in the transaction (an extra
> item #56879 at $1.18 and an extra item #50005 at $0.50, totaling a
> difference of $1.68). Here the "marker" strategy does not work.
>
> Does anyone have suggestions on how to create an algorithm in Stata
> that would tag excess items for deletion in this more complicated
> example? Many thanks!
> -Doug
>
> --TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
>   29   50044    1.5     2.5     2        .5     0
>   29   50003     .5     2.5     2        .5     1
>   29   50003     .5     2.5     2        .5     1
> ------------------------------------------------
>   32   15001    1.25    3.2     2.7      .5     0
>   32   50005     .5     3.2     2.7      .5     1
>   32   50005     .5     3.2     2.7      .5     1
>   32   50035     .95    3.2     2.7      .5     0
> ------------------------------------------------
>  207   56879    1.18    5.11    3.43    1.68    0
>  207   56879    1.18    5.11    3.43    1.68    0
>  207   50005     .5     5.11    3.43    1.68    0
>  207   50035     .5     5.11    3.43    1.68    0
>  207   50168    1.75    5.11    3.43    1.68    0
>
> Tx = transaction number - a unique transaction identifier (File A)
> Item = item number - a unique identifier of specific types of items sold (File A)
> Item$ = the item's price (File A)
> Total$ = the transaction total based on cash register data (File A)
> Total$GS = an external gold standard for the transaction total (File B)
> Dif = the difference between the cash register total and the gold standard total
> Marker = a marker for items that could explain the difference

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/