Re: st: Comparing two data set
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: Comparing two data set
Date: Wed, 2 Mar 2011 09:43:57 +0000

The answer is Yes, and follows from looking at the help for -duplicates-.
Following the example in my previous post, let's introduce an oddity and
then show how you find it.
. replace mpg = 42 in 42
(1 real change made)
. duplicates report make-foreign
Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            2             0
        2 |          146            73
--------------------------------------
-duplicates- reports two observations that are singletons, i.e. occur
precisely once. We create a tag variable (which will be 0 for the
singletons).
. duplicates tag make-foreign, gen(tag)
Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
. l if tag == 0
     +------------------------------------------------------------------------------------------+
 42. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
     | Plym. Arrow | 4,647 |  42 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
     |------------------------------------------------------------------------------------------|
     |             gear_r~o |              foreign |                  ds |                  tag |
     |                 3.05 |             Domestic |                   2 |                    0 |
     +------------------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------------------+
116. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
     | Plym. Arrow | 4,647 |  28 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
     |------------------------------------------------------------------------------------------|
     |             gear_r~o |              foreign |                  ds |                  tag |
     |                 3.05 |             Domestic |                   1 |                    0 |
     +------------------------------------------------------------------------------------------+
So, you can home in on anomalies in any standard way.
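For instance, to see the mismatches grouped by an identifier rather than by observation number, something like the following should work. This is a minimal sketch, assuming the appended data and the -tag- variable from above are in memory; -make- stands in here for your questionnaire ID variable, and -ds- is the dataset identifier.

```stata
* Sketch only: -make- plays the role of the ID variable,
* -ds- says which of the two data files each record came from.
sort make ds
* List only the unmatched records, one block per ID,
* so the two versions of each record sit next to each other.
list make ds price-foreign if tag == 0, sepby(make) noobs
```

Each block then shows the two entries for one ID side by side, and the variable on which they differ can be read off directly. Note that if the ID itself was mistyped in one file, its two records will not be grouped together.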
Nick
On Wed, Mar 2, 2011 at 9:25 AM, Rajaram Subramanian Potty
<[email protected]> wrote:
> Dear Nick,
>
> Thanks for the information. Two or three times I have used the -cf-
> command to identify errors in two data files. But I want the errors
> to be displayed according to the ID variable. At present, the
> -cf- command reports errors by observation number in the Stata dataset
> and not by the ID variable. If I can list the errors
> according to the ID variable, it will be easy for us to trace the
> questionnaire and find the data entry error. So, I just want to
> know whether it is possible to get the errors listed by the ID variable.
>
> Thanks and regards,
>
> RAJARAM. S
>
> On Wed, Mar 2, 2011 at 2:44 PM, Nick Cox <[email protected]> wrote:
>> One way is to check that the .dta or other data files are identical
>> using your operating system.
>>
>> Also, check out -cf- and -dta_equal-.
>>
>> Another way to approach this is to -append- the datasets and look for
>> -duplicates-. However, -duplicates- just looks for duplicate
>> observations. In principle, the variable names, variable labels, value
>> labels, formats and characteristics must also be shown to be
>> identical.
>>
>> To do this last, you will need to create a dataset identifier so that
>> you can work out where any anomalies are.
>>
>> Here is an example where by construction the interesting part of the
>> data is identical. So, -duplicates- confirms that everything occurs
>> twice. Conversely, mismatches would imply singletons, triplicates,
>> etc.
>>
>> . sysuse auto
>> (1978 Automobile Data)
>>
>> . gen ds = 1
>>
>> . save auto1
>> file auto1.dta saved
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . gen ds = 2
>>
>> . append using auto1
>> (label origin already defined)
>>
>>
>> . tab ds
>>
>>          ds |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           1 |         74       50.00       50.00
>>           2 |         74       50.00      100.00
>> ------------+-----------------------------------
>>       Total |        148      100.00
>>
>> . duplicates report make-foreign
>>
>> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign
>>
>> --------------------------------------
>>    copies | observations       surplus
>> ----------+---------------------------
>>         2 |          148            74
>> --------------------------------------
>>
>> Nick
>>
>> On Wed, Mar 2, 2011 at 9:01 AM, Rajaram Subramanian Potty
>> <[email protected]> wrote:
>>
>>> We carried out a survey, and the data from the survey were entered
>>> twice. Now we want to compare the two data files for possible
>>> data entry errors. Please advise how to compare the two data files
>>> and identify the data entry errors using Stata.