Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Comparing two data set
From: Rajaram Subramanian Potty <[email protected]>
To: [email protected]
Subject: Re: st: Comparing two data set
Date: Wed, 2 Mar 2011 17:12:37 +0530
Thank you very much for the information. I installed -cf3- and was able
to generate the error list by ID.
RAJARAM. S
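
For readers without -cf3-, a similar ID-keyed error list can be sketched with official commands only. This is a minimal illustration, not the -cf3- method: it uses auto.dta with make standing in for the ID variable, and the simulated entry error and the _2 suffix are assumptions made for the example. It assumes the ID is unique within each file.

```stata
* Sketch only: auto.dta stands in for the two data-entry files,
* and make stands in for the questionnaire ID.
sysuse auto, clear
keep make price mpg
replace mpg = 42 if make == "Plym. Arrow"   // simulated entry error
rename price price_2
rename mpg mpg_2
tempfile second
save `second'

sysuse auto, clear
keep make price mpg
merge 1:1 make using `second'

* IDs found in only one of the two files
list make _merge if _merge != 3, noobs

* IDs found in both files whose values differ
foreach v of varlist price mpg {
    list make `v' `v'_2 if _merge == 3 & `v' != `v'_2, noobs
}
```

Run as a do-file. With real double-entry data, replace the two -sysuse- calls with -use- of each entry file and list every variable to be checked in the -foreach- loop.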
On Wed, Mar 2, 2011 at 3:33 PM, Kevin Owuor <[email protected]> wrote:
> Maybe you can try out the cf3 package: type -findit cf3-. It lists errors by ID.
> ----------
> Kevin Owuor
> Kemri/ucsf
> Kenya
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: Wednesday, March 02, 2011 12:44 PM
> To: [email protected]
> Subject: Re: st: Comparing two data set
>
> The answer is Yes, and follows from looking at the help for -duplicates-.
>
> Following the example in my previous post, let's introduce an oddity and
> then show how you find it.
>
> . replace mpg = 42 in 42
> (1 real change made)
>
> . duplicates report make-foreign
>
> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>      gear_ratio foreign
>
> --------------------------------------
>    copies | observations       surplus
> ----------+---------------------------
>         1 |            2             0
>         2 |          146            73
> --------------------------------------
>
> -duplicates- reports two observations that are singletons, i.e. occur
> precisely once. We create a tag variable (which will be 0 for the
> singletons).
>
> . duplicates tag make-foreign, gen(tag)
>
> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>      gear_ratio foreign
>
> . l if tag == 0
>
>
>      +------------------------------------------------------------------------------------------+
>  42. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
>      | Plym. Arrow | 4,647 |  42 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
>      |------------------------------------------------------------------------------------------|
>      | gear_r~o | foreign  | ds | tag                                                           |
>      |     3.05 | Domestic |  2 |   0                                                           |
>      +------------------------------------------------------------------------------------------+
>
>      +------------------------------------------------------------------------------------------+
> 116. | make        | price | mpg | rep78 | headroom | trunk | weight | length | turn | displa~t |
>      | Plym. Arrow | 4,647 |  28 |     3 |      2.0 |    11 |  3,260 |    170 |   37 |      156 |
>      |------------------------------------------------------------------------------------------|
>      | gear_r~o | foreign  | ds | tag                                                           |
>      |     3.05 | Domestic |  1 |   0                                                           |
>      +------------------------------------------------------------------------------------------+
>
> So, you can home in on anomalies in any standard way.
>
> Nick
>
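
The tagging step above can also be followed by a compact listing keyed on an identifying variable, which puts the two conflicting versions of a record side by side. This is a minimal sketch under the same assumptions as the example (auto.dta, a deliberate change to mpg in observation 42 of the second copy); the choice of make as the key and of mpg as the displayed variable is an assumption for illustration.

```stata
* Sketch only: rebuild the appended example, then list mismatches by key.
sysuse auto, clear
gen ds = 1
tempfile auto1
save `auto1'

sysuse auto, clear
gen ds = 2
replace mpg = 42 in 42          // introduce the deliberate mismatch
append using `auto1'

* tag is 0 for observations that occur only once
duplicates tag make-foreign, gen(tag)

* show the conflicting versions together, grouped by the key
sort make ds
list make ds mpg if tag == 0, sepby(make) noobs
```

Run as a do-file. Sorting on the key and the dataset identifier shows which file holds which value, so the questionnaire can be traced from the key alone.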
> On Wed, Mar 2, 2011 at 9:25 AM, Rajaram Subramanian Potty
> <[email protected]> wrote:
>> Dear Nick,
>>
>> Thanks for the information. Two or three times I have used the -cf-
>> command to identify the errors in two data files. But I want the errors
>> to be displayed according to the ID variable. At present, the
>> -cf- command reports errors by observation number in the Stata data set
>> and not by the ID variable. If I can generate the errors
>> according to the ID variable, it will be easy for us to trace the
>> questionnaire and find the error in the data entry. So, I just want to
>> know whether it is possible to get the errors listed by the ID variable.
>>
>> Thanks and regards,
>>
>> RAJARAM. S
>>
>> On Wed, Mar 2, 2011 at 2:44 PM, Nick Cox <[email protected]> wrote:
>>> One way is to check that the .dta or other data files are identical
>>> using your operating system.
>>>
>>> Also, check out -cf- and -dta_equal-.
>>>
>>> Another way to approach this is to -append- the datasets and look for
>>> -duplicates-. However, -duplicates- just looks for duplicate
>>> observations. In principle, the variable names, variable labels, value
>>> labels, formats and characteristics must also be shown to be
>>> identical.
>>>
>>> To do this last, you will need to create a dataset identifier so that
>>> you can work out where any anomalies are.
>>>
>>> Here is an example where by construction the interesting part of the
>>> data is identical. So, -duplicates- confirms that everything occurs
>>> twice. Conversely, mismatches would imply singletons, triplicates,
>>> etc.
>>>
>>> . sysuse auto
>>> (1978 Automobile Data)
>>>
>>> . gen ds = 1
>>>
>>> . save auto1
>>> file auto1.dta saved
>>>
>>> . sysuse auto, clear
>>> (1978 Automobile Data)
>>>
>>> . gen ds = 2
>>>
>>> . append using auto1
>>> (label origin already defined)
>>>
>>>
>>> . tab ds
>>>
>>>          ds |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>           1 |         74       50.00       50.00
>>>           2 |         74       50.00      100.00
>>> ------------+-----------------------------------
>>>       Total |        148      100.00
>>>
>>> . duplicates report make-foreign
>>>
>>> Duplicates in terms of make price mpg rep78 headroom trunk weight length turn displacement
>>>      gear_ratio foreign
>>>
>>> --------------------------------------
>>>    copies | observations       surplus
>>> ----------+---------------------------
>>>         2 |          148            74
>>> --------------------------------------
>>>
>>> Nick
>>>
>>> On Wed, Mar 2, 2011 at 9:01 AM, Rajaram Subramanian Potty
>>> <[email protected]> wrote:
>>>
>>>> We carried out a survey, and the data from the survey were entered
>>>> twice. Now we want to compare these two data files for possible
>>>> data entry errors. Please let us know how to compare the two data files
>>>> and identify the data entry errors using Stata.
>>> *
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/