From: Robert Picard <picard@netbox.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Data cleaning challenge
Date: Tue, 6 Aug 2013 11:53:07 -0400
Here's a working example that illustrates techniques to identify your bad duplicates. Because base-10 fractions cannot be added exactly in binary floating point, all prices are in cents. I assume that differences in totals are a consequence of duplicate items, so I discard all items that appear only once within a transaction. I also assume that a bad duplicate can occur at most once per transaction item group. The bad duplicates are easy to identify when the difference in totals equals a single item price or the sum of the duplicates. The harder cases, not resolved by those techniques, involve differences equal to the sum of two or more prices. I show a way to handle a combination of two prices; if needed, the same idea can be extended to look for differences explained by the sum of 3 or more separate duplicates (a rough sketch follows the example). Since this is already long, I skip the extra overhead needed to merge the identified bad duplicates back into the original dataset (but see the last sketch below).

* --------------- begin example ---------------------------
clear
set obs 100000
gen transaction = _n

* create up to 10 items per transaction, set prices in
* cents to avoid rounding errors when generating totals
set seed 123
expand round(runiform() * 10)
gen itemid = _n
gen itemprice = int(runiform() * 500)

* add some real duplicates
expand round(runiform() * 3) if runiform() < .1

* generate the gold standard total price
sort transaction itemid
by transaction: egen totalgold = total(itemprice)

* add duplicates generated by error, do not allow
* more than one bad duplicate per itemid
by transaction itemid: gen one = _n == 1
expand 2 if (runiform() < .1) & one
sort transaction itemid
by transaction: egen totalbad = total(itemprice)
gen diff = totalbad - totalgold
drop one
gen obs = _n

* drop all cases where there is a match
drop if diff == 0

* reduce to real & bad duplicates;
* target one observation per transaction item group
by transaction itemid: keep if _N > 1
by transaction itemid: keep if _n == _N
by transaction: egen totaldup = total(itemprice)

* easy case: the difference equals a single item price
gen isbad = itemprice == diff
by transaction: egen nbad = total(isbad)

* but no way to decide if more than one match
tab nbad
list if nbad > 1, sepby(transaction)
drop if nbad
drop nbad

* easy case: the difference equals the sum of all duplicates
replace isbad = totaldup == diff
tab isbad
drop if isbad
drop isbad

* easy case; bad duplicates are == !isgood
gen isgood = (totaldup - itemprice) == diff
by transaction: egen nisgood = total(isgood)

* but no way to decide if more than one match
tab nisgood
list if nisgood > 1, sepby(transaction)
drop if nisgood

* harder case; create all pairwise combinations of
* duplicate items
keep transaction itemprice diff obs
tempfile hard
save "`hard'"
rename itemprice itemprice1
rename obs obs1
joinby transaction using "`hard'"
drop if obs == obs1
sort transaction obs obs1

* identify bad duplicates
gen isbad = diff == (itemprice + itemprice1)
by transaction: egen nbad = total(isbad)
tab nbad
list if nbad > 2, sepby(transaction)
drop if nbad

* what's left requires combinations of more than 2
* duplicate items to explain diff
sort transaction obs
by transaction obs: keep if _n == 1
keep transaction obs itemprice diff
* --------------- end example -----------------------------
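As an aside on the choice of cents: base-10 fractions like dollar prices generally have no exact binary representation, so equality tests on sums of dollar amounts can fail even when the arithmetic "should" balance (Stata's default float storage type only makes this worse). A quick check, separate from the example above:

* a 10-cent item plus a 20-cent item does not compare equal to 30
* cents when expressed in dollars ...
display .1 + .2 == .3
display %21.18f .1 + .2
* ... but integer cents compare exactly
display 10 + 20 == 30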
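For what it's worth, here is a rough, untested sketch of how the pairwise -joinby- step could be extended to combinations of three duplicates. It assumes you start from the reduced dataset left at the end of the example (one observation per remaining duplicate, with transaction, obs, itemprice, and diff); the names itemprice2, obs2, and the tempfile hard3 are new.

* --------------- sketch: 3-item combinations (untested) ---
tempfile hard3
save "`hard3'"

* first copy of the candidate duplicates
rename itemprice itemprice1
rename obs obs1
joinby transaction using "`hard3'"

* second copy
rename itemprice itemprice2
rename obs obs2
joinby transaction using "`hard3'"

* keep only combinations of three distinct observations
drop if obs == obs1 | obs == obs2 | obs1 == obs2
sort transaction obs1 obs2 obs

* flag triples whose prices sum to the difference; each valid
* triple shows up 3! = 6 times, so more than 6 is ambiguous
gen isbad = diff == (itemprice + itemprice1 + itemprice2)
by transaction: egen nbad = total(isbad)
tab nbad
list if nbad > 6, sepby(transaction)
* --------------- end sketch -------------------------------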
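Also for what it's worth, a rough sketch of the merge-back step I skipped. It assumes a copy of the full dataset is saved right after -gen obs = _n- and that the obs numbers of the flagged bad duplicates are collected into a second file; the file names full_with_obs.dta and badobs.dta are only placeholders.

* --------------- sketch: merging flagged duplicates back --
* right after -gen obs = _n- in the example, keep a copy
save "full_with_obs.dta", replace

* ... at each step that isolates unambiguous bad duplicates,
* collect their obs numbers, e.g. for the first easy case:
preserve
keep if isbad & nbad == 1
keep obs
save "badobs.dta", replace   // -append- to this file on later passes
restore

* finally, drop exactly those observations from the saved copy
use "full_with_obs.dta", clear
merge 1:1 obs using "badobs.dta", keep(master) nogenerate
* --------------- end sketch -------------------------------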
On Mon, Aug 5, 2013 at 9:47 AM, Douglas Levy <douglas_levy@post.harvard.edu> wrote:
> Hi All,
> I have a data cleaning challenge where data files have some duplicate
> lines and I'm trying to automate the elimination of those duplicates.
> Consider the following. File A has item-level cash register data and
> some of the items are duplicated in error, though some items are
> duplicated legitimately (e.g. someone bought more than one of an
> item). Therefore, I can't just eliminate all duplicates. File B has
> total transaction amounts (a gold standard), but does not have
> individual item-level data. See below for example data from a merged
> file.
>
> For transaction #29, I can see that File A's transaction total ($2.50)
> is $0.50 over File B's total ($2.00). It is clear that item #50003 was
> entered twice in error. I tagged all items whose price was equal to
> the difference between the cash register total and the gold standard
> total using the variable "marker." Now I can eliminate the first item
> in the transaction that is tagged with "marker." I can do similarly
> with transaction #32. However, consider transaction #207. There, it is
> clear that there are *two* duplicates in the transaction (an extra
> item #56879 at $1.18 and an extra item #50005 at $0.50, totaling a
> difference of $1.68). Here the "marker" strategy does not work.
>
> Does anyone have suggestions on how to create an algorithm in Stata
> that would tag excess items for deletion in this more complicated
> example? Many thanks!
> -Doug
>
> --TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
>   29   50044    1.5     2.5     2        .5     0
>   29   50003     .5     2.5     2        .5     1
>   29   50003     .5     2.5     2        .5     1
> ------------------------------------------------
>   32   15001    1.25    3.2     2.7      .5     0
>   32   50005     .5     3.2     2.7      .5     1
>   32   50005     .5     3.2     2.7      .5     1
>   32   50035     .95    3.2     2.7      .5     0
> ------------------------------------------------
>  207   56879    1.18    5.11    3.43    1.68    0
>  207   56879    1.18    5.11    3.43    1.68    0
>  207   50005     .5     5.11    3.43    1.68    0
>  207   50035     .5     5.11    3.43    1.68    0
>  207   50168    1.75    5.11    3.43    1.68    0
>
> Tx = transaction number - a unique transaction identifier (File A)
> Item = item number - a unique identifier of specific types of items sold (File A)
> Item$ = the item's price (File A)
> Total$ = the transaction total based on cash register data (File A)
> Total$GS = an external gold standard for the transaction total (File B)
> Dif = the difference between the cash register total and the gold standard total
> Marker = a marker for items that could explain the difference

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/