RE: st: RE: testing -duplicates tag-
Dear Martin,
This works with perfect sensitivity and
specificity. For those interested, the original
question was:
for each record in one group (foreign), tag for
deletion all records in another group (domestic),
which are duplicates on a set of specified
variables (headroom and trunk).
My goal in asking for help with this argument was
to have a method for removing potential
duplicates between two overlapping data sets,
where individual identifiers are not available.
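A minimal sketch of that two-file workflow, assuming the two sources are stacked into one data set with an indicator for where each record came from. Splitting auto.dta by foreign merely stands in for two separate files, and the names setA, fromB and inA are made up for illustration:

```stata
* stand-in for the first data set: the foreign cars from auto.dta
sysuse auto, clear
keep if foreign == 1
tempfile setA
save `setA'

* stand-in for the second data set: the domestic cars
sysuse auto, clear
keep if foreign == 0
gen byte fromB = 1

* stack the two sets; records coming from setA have fromB missing
append using `setA'
replace fromB = 0 if missing(fromB)

* flag every headroom-trunk pattern that occurs in set A
bysort headroom trunk: egen byte inA = max(fromB == 0)

* drop set-B records that duplicate a set-A record on both variables
drop if fromB == 1 & inA == 1
```

Records are dropped from one side only, so every set-A record survives even when it matches several set-B records.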
My sincere thanks to Martin, Nick, Eva and
Emmanouil for their very kind, and helpful, input.
Michael
OK, so let's try that again. The tag should now reliably indicate that an
observation is duplicated more times overall than in the domestic subgroup,
implying that it must have at least one match in the foreign group...
*********
sysuse auto, clear
g id=_n
duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)
*tag to indicate domestic obs with at least one match in foreign
g byte tag = for==0 & dupall>dupdom
*let's see
l tag id f if for==0, noo h(25)
*********
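As a possible cross-check on the tag above, the same domestic observations can be found by counting the foreign cars in each headroom-trunk pattern directly. This is only a sketch: the variable nfor is an illustrative name, and the final -assert- simply verifies that both routes agree on auto.dta.

```stata
sysuse auto, clear
duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)
gen byte tag = foreign==0 & dupall>dupdom

* count the foreign cars within each headroom-trunk pattern
bysort headroom trunk: egen byte nfor = total(foreign==1)

* a domestic car is tagged exactly when its pattern contains a foreign car
assert tag == (foreign==0 & nfor>0)
```

The two agree because, for a domestic car, dupall minus dupdom is the number of foreign cars sharing its pattern.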
HTH
Martin
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael McCulloch
Sent: Thursday, September 04, 2008 6:30 AM
To: [email protected]
Subject: Re: st: RE: testing -duplicates tag-
The code suggested by Martin gets me closer, but the pattern is still
not exclusive. I'm trying to identify observations in DOMESTIC, which
are duplicates (in headroom & trunk) of observations in FOREIGN. Here
are two sets of those duplicates. Note how 20 is a duplicate of 57,
where the patterns of missing and 0 in dupfor and dupdom seem to form
a pattern; that pattern, however is contradicted in the next set,
where 53 71 and 72 are duplicates of 32.
Any ideas would be appreciated!
id   foreign    headroom   trunk   dupall   dupfor   dupdom
20   Domestic   2          8       1        .        0
57   Foreign    2          8       1        0        .
*    *          *          *       *        *        *
32   Domestic   3          15      3        .        0
53   Foreign    3          15      3        2        .
71   Foreign    3          15      3        2        .
72   Foreign    3          15      3        2        .
Try this:
sysuse auto, clear
duplicates tag headroom trunk if foreign==1, generate(dupfor)
*duplicates tag headroom trunk if foreign==0, generate(dupdom)
duplicates tag headroom trunk, generate(dupall)
l if dupfor==0 & dupall>0
HTH
Martin
Quoting Michael McCulloch <[email protected]>:
One other question, if I may:
How would I modify the list command as re-written below, to identify
only those duplicates where:
headroom and trunks are duplicated, but
foreign is not,
so that I could find only those Foreign cars that have duplicates in the
set of Domestic cars (in this case observations #7 and #8)?
clear
sysuse auto
list foreign headroom trunk
duplicates tag headroom trunk, generate(dup)
sort headroom trunk
list foreign headroom trunk dup if dup>0 & trunk==8, clean noobs
Well, as -help duplicates- shows, a -varlist- is allowed with all
of the five commands. If the variables were combined with *OR*,
specifying several of them would be pointless. -duplicates tag- looks
for distinct combinations of the variables in your -varlist- and tags
each observation with the number of other observations sharing its
combination.
sysuse auto, clear
duplicates tag head mpg, gen(dup)
duplicates report headroom mpg
ta dup
duplicates tag head mpg tru, gen(dup1)
duplicates report headroom mpg tru
ta dup1
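To make the meaning of the tag value concrete, here is a small sketch that rebuilds it by hand: within each headroom-trunk combination, the number of other observations is the group size minus one. The variable name others is illustrative only.

```stata
sysuse auto, clear
duplicates tag headroom trunk, generate(dup)

* group size minus one = number of other observations with the same combination
bysort headroom trunk: gen long others = _N - 1

* the hand-made count matches what -duplicates tag- generated
assert dup == others
```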
HTH
Martin
Quoting Michael McCulloch <[email protected]>:
Thanks Martin. Am I correct in understanding that, in this revised
example immediately below, the command:
. duplicates tag headroom trunk, generate(dup)
would tag as dup>0 all sets of observations for which there are
duplicates of:
headroom *AND* trunk
and not just those for which there are duplicates of:
headroom *OR* trunk
?
It looks that way on visual inspection of this example's output, but I
want to make sure before applying it to my much larger dataset.
clear
sysuse auto
list foreign headroom trunk
duplicates tag headroom trunk, generate(dup)
sort headroom trunk
list foreign headroom trunk dup if dup>0, clean
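One way to see the *AND* versus *OR* distinction is to build an *OR*-style flag by hand and compare it with the tag. This is a sketch under the assumption that *OR* means "duplicated on headroom alone or on trunk alone"; the variable names dup_and, dup_head, dup_trunk and dup_or are made up for illustration.

```stata
sysuse auto, clear
gen id = _n

* AND: both variables must agree for two observations to count as duplicates
duplicates tag headroom trunk, generate(dup_and)

* OR, built by hand: duplicated on headroom alone or on trunk alone
duplicates tag headroom, generate(dup_head)
duplicates tag trunk, generate(dup_trunk)
gen byte dup_or = dup_head > 0 | dup_trunk > 0

* every AND-duplicate is also an OR-duplicate, but not the other way round
assert dup_or == 1 if dup_and > 0
list id foreign headroom trunk if dup_or == 1 & dup_and == 0 & trunk == 8, clean noobs
```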
Michael
Well, the question is not much clearer now, at least to me. I
suspect you want something like
count if duptag > 0
after your commands. Just replace duptag with the name of the tag
variable you generated and be aware that two observations sharing the
same covariate pattern would both be counted, so each pair adds two
to the total (58 and 59 would both count under this rule). If that is
not what you want, clarify!
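If the number of distinct duplicated patterns is wanted instead of the number of observations, a sketch along the following lines might help. It relies on egen's tag() function, which marks one observation per distinct combination; the names first, nobs and nsets are illustrative.

```stata
sysuse auto, clear
duplicates tag headroom trunk, generate(dup)

* observations that share their headroom-trunk pattern with another one
count if dup > 0
local nobs = r(N)

* egen's tag() marks exactly one observation per distinct pattern
egen byte first = tag(headroom trunk)

* number of distinct duplicated patterns
count if first == 1 & dup > 0
local nsets = r(N)

display "`nobs' observations in `nsets' duplicated patterns"
```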
HTH
Martin
Quoting Michael McCulloch <[email protected]>:
Apologies, I wasn't clear in my question. What I want to do is find
records for which *both* trunk and headroom are duplicates. So
following the command suggested by Martin and Nick, I get:
. list foreign headroom trunk if trunk==8, clean
       foreign   headroom   trunk
 20.  Domestic        2.0       8
 45.  Domestic        1.5       8
 57.   Foreign        2.0       8
 58.   Foreign        2.5       8
 59.   Foreign        2.5       8
Note that:
observations 20 and 57 both have headroom==2.0, trunk==8
observations 58 and 59 both have headroom==2.5, trunk==8
Since I'm developing this command for use in a large dataset, how would
I follow up -duplicates tag- to identify those unique sets of records,
where two variables are duplicates simultaneously, without having to
search manually?
I cannot see your point. Stata does tag these observations with tag 1.
Just -list- after -duplicates tag-.
**********
clear
sysuse auto
list foreign headroom trunk if trunk==8
duplicates tag headroom trunk, generate(dup_admission_id)
*Let's see...
list dup_* foreign headroom trunk if trunk==8
**********
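For a large dataset, one possible follow-up is to number the duplicated combinations and list them set by set, so that no manual search is needed. This is a sketch only; the variable names id and dupset are made up for illustration.

```stata
sysuse auto, clear
* keep the original observation number
gen id = _n
duplicates tag headroom trunk, generate(dup)

* number the duplicated headroom-trunk combinations 1, 2, ...
egen long dupset = group(headroom trunk) if dup > 0

* show each set of duplicates as its own block
sort dupset id
list dupset id foreign headroom trunk if dup > 0, sepby(dupset) noobs
```

Observations with a unique combination get a missing dupset and are left out of the listing.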
HTH
Martin
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of
Michael McCulloch
Sent: Wednesday, September 03, 2008 6:29 PM
To: Statalist
Subject: st: testing -duplicates tag-
Hello,
I'm testing -duplicates tag-, and puzzled as to why it won't show the
two observations where headroom==2.0 and trunk==8.
clear
sysuse auto
list foreign headroom trunk if trunk==8
duplicates tag headroom trunk, generate(dup_admission_id)
--
Best wishes,
Michael McCulloch
Pine Street Foundation
124 Pine St., San Anselmo, CA 94960-2674
Tel: (415) 407-1357
Fax: (415) 485-1065
[email protected]
www.pinestreetfoundation.org
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/