Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Data cleaning challenge

From	Michael Barker <[email protected]>
To	[email protected]
Subject	Re: st: Data cleaning challenge
Date	Mon, 5 Aug 2013 15:10:38 -0400

Hi Doug,

You could start by dropping items in transactions where all duplicates
are invalid. This would cover all of the examples that you gave, but
it would not cover transactions with some valid and some invalid
duplicates.

Assuming you have a single data set like the examples, the code would
look something like this:

* number all duplicate items, 2 and greater are duplicates
sort TX ITEM
by TX ITEM : gen dupitem==_n

* generate total dollar amount of  excess duplicates
bys TX: egen duptotal = total(ITEM$) if dupitem>=2

* check if removing duplicates would make TOTAL$ equal to TOTAL$GS
gen invalid_dup = 0
replace invalid_dup = 1 if dupitem>=2 & TOTAL$-duptotal==TOTAL$GS

* drop all invalid duplicates
drop if invalid_dup==1

That would get the easy ones, and would not drop any valid duplicates.
It would leave the more difficult task of finding duplicates with
valid and invalid duplicates in the same transaction. But I would do
the easy part first, then see what problems are still remaining.

Mike



On Mon, Aug 5, 2013 at 9:47 AM, Douglas Levy
<[email protected]> wrote:
> Hi All,
> I have a data cleaning challenge where data files have some duplicate
> lines and I'm trying to automate the elimination of those duplicates.
> Consider the following. File A has item-level cash register data and
> some of the items are duplicated in error, though some items are
> duplicated legitimately (e.g. someone bought more than one of an
> item). Therefore, I can't just eliminate all duplicates. File B has
> total transactions amounts (a gold standard), but does not have
> individual item-level data. See below for example data from a merged
> file.
>
> For transaction #29, I can see that File A's transaction total ($2.50)
> is $0.50 over File B's total ($2.00). It is clear that item #50003 was
> entered twice in error. I tagged all items whose price was equal to
> the difference between cash register total and the gold standard total
> using the variable "marker." Now I can eliminate the first item in the
> transaction that is tagged with "marker." I can do similarly with
> transaction #32. However, consider transaction #207. There, it is
> clear that there are *two* duplicates in the transaction (an extra
> item #56789 at $1.18 and an extra item #50005 at $0.50 totaling a
> difference of $1.68). Here the "marker" strategy does not work.
>
> Does anyone have suggestions on how to create an algorithm in Stata
> would tag excess items for deletion in this more complicated example?
> Many thanks!
> -Doug
>
> --TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
>   29   50044     1.5     2.5       2     .5    0
>   29   50003      .5     2.5       2     .5    1
>   29   50003      .5     2.5       2     .5    1
> ------------------------------------------------
>   32   15001    1.25     3.2     2.7     .5    0
>   32   50005      .5     3.2     2.7     .5    1
>   32   50005      .5     3.2     2.7     .5    1
>   32   50035     .95     3.2     2.7     .5    0
> -------------------------------------------------
>   207   56879   1.18    5.11    3.43   1.68    0
>   207   56879   1.18    5.11    3.43   1.68    0
>   207   50005     .5    5.11    3.43   1.68    0
>   207   50035     .5    5.11    3.43   1.68    0
>   207   50168   1.75    5.11    3.43   1.68    0
> Tx = transaction number - a unique transaction identifier (File A)
>
> Item = item number - a unique identifier of specific types of items
> sold (File A)
> Item$ = the item's price (File A)
> Total$ = the transaction total based on cash register data (File A)
> Total$GS = an external gold standard for the transaction total (File B)
> Dif = the difference between the cash register total and the gold standard total
> marker = a marker for items that could explain the difference
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Data cleaning challenge
  - From: Douglas Levy <[email protected]>

Prev by Date: Re: st: Problem with Stata handling of large dataset
Next by Date: Re: st: Odd SEM Results
Previous by thread: st: Data cleaning challenge
Next by thread: Re: st: Data cleaning challenge
Index(es):
- Date
- Thread