Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: Data cleaning challenge
From
Douglas Levy <[email protected]>
To
[email protected]
Subject
st: Data cleaning challenge
Date
Mon, 5 Aug 2013 09:47:54 -0400
Hi All,
I have a data cleaning challenge where data files have some duplicate
lines and I'm trying to automate the elimination of those duplicates.
Consider the following. File A has item-level cash register data and
some of the items are duplicated in error, though some items are
duplicated legitimately (e.g. someone bought more than one of an
item). Therefore, I can't just eliminate all duplicates. File B has
total transactions amounts (a gold standard), but does not have
individual item-level data. See below for example data from a merged
file.
For transaction #29, I can see that File A's transaction total ($2.50)
is $0.50 over File B's total ($2.00). It is clear that item #50003 was
entered twice in error. I tagged all items whose price was equal to
the difference between cash register total and the gold standard total
using the variable "marker." Now I can eliminate the first item in the
transaction that is tagged with "marker." I can do similarly with
transaction #32. However, consider transaction #207. There, it is
clear that there are *two* duplicates in the transaction (an extra
item #56789 at $1.18 and an extra item #50005 at $0.50 totaling a
difference of $1.68). Here the "marker" strategy does not work.
Does anyone have suggestions on how to create an algorithm in Stata
would tag excess items for deletion in this more complicated example?
Many thanks!
-Doug
--TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
29 50044 1.5 2.5 2 .5 0
29 50003 .5 2.5 2 .5 1
29 50003 .5 2.5 2 .5 1
------------------------------------------------
32 15001 1.25 3.2 2.7 .5 0
32 50005 .5 3.2 2.7 .5 1
32 50005 .5 3.2 2.7 .5 1
32 50035 .95 3.2 2.7 .5 0
-------------------------------------------------
207 56879 1.18 5.11 3.43 1.68 0
207 56879 1.18 5.11 3.43 1.68 0
207 50005 .5 5.11 3.43 1.68 0
207 50035 .5 5.11 3.43 1.68 0
207 50168 1.75 5.11 3.43 1.68 0
Tx = transaction number - a unique transaction identifier (File A)
Item = item number - a unique identifier of specific types of items
sold (File A)
Item$ = the item's price (File A)
Total$ = the transaction total based on cash register data (File A)
Total$GS = an external gold standard for the transaction total (File B)
Dif = the difference between the cash register total and the gold standard total
marker = a marker for items that could explain the difference
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/