
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Comparing two data set

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: Comparing two data set
Date	Wed, 2 Mar 2011 09:43:57 +0000

The answer is Yes, and follows from looking at the help for -duplicates-.

Following the example in my previous post, let's introduce an oddity and
then show how you find it.

. replace mpg = 42 in 42
(1 real change made)

. duplicates report make-foreign

Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
    gear_ratio foreign

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            2             0
        2 |          146            73
--------------------------------------

-duplicates- reports two observations that are singletons, i.e. occur
precisely once. We create a tag variable (which will be 0 for the
singletons).

. duplicates tag make-foreign, gen(tag)

Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
    gear_ratio foreign

. l if tag == 0

     +------------------------------------------------------------------------------------------+
 42. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
     | Plym. Arrow | 4,647 |  42 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
     |------------------------------------------------------------------------------------------|
     |        gear_r~o        |         foreign        |        ds         |        tag         |
     |            3.05        |        Domestic        |         2         |          0         |
     +------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------+
116. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
     | Plym. Arrow | 4,647 |  28 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
     |------------------------------------------------------------------------------------------|
     |        gear_r~o        |         foreign        |        ds         |        tag         |
     |            3.05        |        Domestic        |         1         |          0         |
     +------------------------------------------------------------------------------------------+

So, you can home in on anomalies in any standard way.
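
To see the anomalies by identifier rather than by observation number,
one possibility is to sort on the identifier before listing. What
follows is only a minimal sketch. It assumes that -make- stands in for
the questionnaire ID (in real data, substitute your own ID variable,
which should be unique within each file), and it plants the error by
name rather than by observation number. The temporary file name
`first' is also just an illustrative choice.

```stata
* Sketch only: make plays the role of the ID variable (an assumption).
sysuse auto, clear
gen ds = 1
tempfile first
save `first'

sysuse auto, clear
gen ds = 2
* Deliberate data entry error in the second file
replace mpg = 42 if make == "Plym. Arrow"
append using `first'

* tag == 0 marks observations with no exact copy in the other file
duplicates tag make-foreign, gen(tag)

* Put the two versions of each problem record on adjacent lines
sort make ds
list make ds price-foreign if tag == 0, sepby(make) noobs
```

Each problem ID then appears once per file, so the two versions of a
record can be compared directly. An ID present in only one file would
also show up with tag equal to 0.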

Nick

On Wed, Mar 2, 2011 at 9:25 AM, Rajaram Subramanian Potty
<[email protected]> wrote:
> Dear Nick,
>
> Thanks for the information. Two or three times I have used the -cf-
> command to identify the errors in two data files. But I want the errors
> to be displayed according to the ID variable. At present, the
> -cf- command reports errors by observation number in the Stata data set
> and not by the ID variable. If I am able to list the errors
> according to the ID variable, it will be easy for us to trace the
> questionnaire and find the error in the data entry. So, I just want to
> know whether it is possible to get the errors listed by the ID variable.
>
> Thanks and regards,
>
> RAJARAM. S
>
> On Wed, Mar 2, 2011 at 2:44 PM, Nick Cox <[email protected]> wrote:
>> One way is to check that the .dta or other data files are identical
>> using your operating system.
>>
>> Also, check out -cf- and -dta_equal-.
>>
>> Another way to approach this is to -append- the datasets and look for
>> -duplicates-. However, -duplicates- just looks for duplicate
>> observations. In principle, the variable names, variable labels, value
>> labels, formats and characteristics must also be shown to be
>> identical.
>>
>> To do this last, you will need to create a dataset identifier so that
>> you can work out where any anomalies are.
>>
>> Here is an example where by construction the interesting part of the
>> data is identical. So, -duplicates- confirms that everything occurs
>> twice. Conversely, mismatches would imply singletons, triplicates,
>> etc.
>>
>> . sysuse auto
>> (1978 Automobile Data)
>>
>> . gen ds = 1
>>
>> . save auto1
>> file auto1.dta saved
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . gen ds = 2
>>
>> . append using auto1
>> (label origin already defined)
>>
>>
>> . tab ds
>>
>>         ds |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>          1 |         74       50.00       50.00
>>          2 |         74       50.00      100.00
>> ------------+-----------------------------------
>>      Total |        148      100.00
>>
>> . duplicates report make-foreign
>>
>> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>>    gear_ratio foreign
>>
>> --------------------------------------
>>   copies | observations       surplus
>> ----------+---------------------------
>>        2 |          148            74
>> --------------------------------------
>>
>> Nick
>>
>> On Wed, Mar 2, 2011 at 9:01 AM, Rajaram Subramanian Potty
>> <[email protected]> wrote:
>>
>>> We carried out a survey and the data from the survey were entered
>>> twice. Now, we want to compare these two data files for possible
>>> data entry errors. Please advise how to compare the two data files
>>> and identify the data entry errors using Stata.