Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Data cleaning challenge

From	Douglas Levy <[email protected]>
To	[email protected]
Subject	st: Data cleaning challenge
Date	Mon, 5 Aug 2013 09:47:54 -0400

Hi All,
I have a data cleaning challenge where data files have some duplicate
lines and I'm trying to automate the elimination of those duplicates.
Consider the following. File A has item-level cash register data and
some of the items are duplicated in error, though some items are
duplicated legitimately (e.g. someone bought more than one of an
item). Therefore, I can't just eliminate all duplicates. File B has
total transactions amounts (a gold standard), but does not have
individual item-level data. See below for example data from a merged
file.

For transaction #29, I can see that File A's transaction total ($2.50)
is $0.50 over File B's total ($2.00). It is clear that item #50003 was
entered twice in error. I tagged all items whose price was equal to
the difference between cash register total and the gold standard total
using the variable "marker." Now I can eliminate the first item in the
transaction that is tagged with "marker." I can do similarly with
transaction #32. However, consider transaction #207. There, it is
clear that there are *two* duplicates in the transaction (an extra
item #56789 at $1.18 and an extra item #50005 at $0.50 totaling a
difference of $1.68). Here the "marker" strategy does not work.

Does anyone have suggestions on how to create an algorithm in Stata
would tag excess items for deletion in this more complicated example?
Many thanks!
-Doug

--TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
  29   50044     1.5     2.5       2     .5    0
  29   50003      .5     2.5       2     .5    1
  29   50003      .5     2.5       2     .5    1
------------------------------------------------
  32   15001    1.25     3.2     2.7     .5    0
  32   50005      .5     3.2     2.7     .5    1
  32   50005      .5     3.2     2.7     .5    1
  32   50035     .95     3.2     2.7     .5    0
-------------------------------------------------
  207   56879   1.18    5.11    3.43   1.68    0
  207   56879   1.18    5.11    3.43   1.68    0
  207   50005     .5    5.11    3.43   1.68    0
  207   50035     .5    5.11    3.43   1.68    0
  207   50168   1.75    5.11    3.43   1.68    0
Tx = transaction number - a unique transaction identifier (File A)

Item = item number - a unique identifier of specific types of items
sold (File A)
Item$ = the item's price (File A)
Total$ = the transaction total based on cash register data (File A)
Total$GS = an external gold standard for the transaction total (File B)
Dif = the difference between the cash register total and the gold standard total
marker = a marker for items that could explain the difference
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: Data cleaning challenge
  - From: Robert Picard <[email protected]>
- Re: st: Data cleaning challenge
  - From: Michael Barker <[email protected]>

Prev by Date: st: Problem with Stata handling of large dataset
Next by Date: st: Re: Bootstrapping with unbalanced panel
Previous by thread: st: Problem with Stata handling of large dataset
Next by thread: Re: st: Data cleaning challenge
Index(es):
- Date
- Thread