In addition, consider the following trick:
gen first = cond(name1 < name2, name1, name2)
gen second = cond(name1 < name2, name2, name1)
duplicates <whatever> first second
That doesn't fix any inconsistencies in spelling (in the wide sense,
i.e. case, or leading, trailing or embedded spaces), but it does
handle the detail that A and B is the same pair as B and A.
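
If spelling in the wide sense is a concern, one possible approach is to
normalize the names before ordering them. What follows is only a sketch
on made-up data: the variables n1 and n2 are hypothetical helpers, and
it assumes that differences in case and spacing are the only
inconsistencies. (In newer releases -lower()-, -trim()- and -itrim()-
are documented as -strlower()-, -strtrim()- and -stritrim()-.)

```stata
clear
input str20 name1 str20 name2
"Smith, John"  "Jones, Abby"
"Jones, Abby"  "Smith, John"
" jones, abby" "SMITH,  JOHN "
end

* n1 and n2 are hypothetical helper variables: lower case, no leading
* or trailing blanks, runs of embedded blanks collapsed to one
generate n1 = lower(itrim(trim(name1)))
generate n2 = lower(itrim(trim(name2)))

* Put each pair in alphabetical order so that A-B and B-A agree
generate first  = cond(n1 < n2, n1, n2)
generate second = cond(n1 < n2, n2, n1)

* Keep one observation per unordered pair
duplicates drop first second, force
list name1 name2 first second, noobs
```

All three example rows describe the same pair, so only one observation
should survive.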
Nick
Joseph Coveney
Jennifer Nicoll Victor wrote:
Thank you Nick, for recommending the reshape command to me last week. I
now have converted my UCINET relational dataset into dyads in Stata.
However, I now have the problem of duplicate observations. My data are
non-directional so the pair A-B is the same as the pair B-A. I need to
efficiently delete the duplicates. I need only the unique observations,
where the unit of analysis is a pair. Can someone help?
Essentially, I have...
ID1  ID2  name1         name2         ...
1    2    Smith, John   Jones, Abby
1    3    Smith, John   White, Rich
1    4    Smith, John   Black, Kelly
2    1    Jones, Abby   Smith, John
2    3    Jones, Abby   White, Rich
2    4    Jones, Abby   Black, Kelly
3    1    White, Rich   Smith, John
3    2    White, Rich   Jones, Abby
3    4    White, Rich   Black, Kelly
4    1    Black, Kelly  Smith, John
4    2    Black, Kelly  Jones, Abby
4    3    Black, Kelly  White, Rich
And I need to have....
ID1  ID2  name1         name2         ...
1    2    Smith, John   Jones, Abby
1    3    Smith, John   White, Rich
1    4    Smith, John   Black, Kelly
2    3    Jones, Abby   White, Rich
2    4    Jones, Abby   Black, Kelly
3    4    White, Rich   Black, Kelly
But I have 191,406 pairs.
--------------------------------------------------------------------------------
The do-file below gets what you want. Sorting 200 000 observations took
1.01 seconds on my laptop, so if the approach below takes a few moments
on your dataset, then it's probably to do with the -min()- and -max()-.
You also might be able to avoid the situation by doing something
pre-emptively upstream.
Joseph Coveney
clear *
set more off
input byte ID1 byte ID2 str10 name1 str1 comma1 str10 name2 str10 name3 str1 comma2 str10 name4
1 2 Smith, John Jones, Abby
1 3 Smith, John White, Rich
1 4 Smith, John Black, Kelly
2 1 Jones, Abby Smith, John
2 3 Jones, Abby White, Rich
2 4 Jones, Abby Black, Kelly
3 1 White, Rich Smith, John
3 2 White, Rich Jones, Abby
3 4 White, Rich Black, Kelly
4 1 Black, Kelly Smith, John
4 2 Black, Kelly Jones, Abby
4 3 Black, Kelly White, Rich
end
replace name1 = name1 + ", " + name2
replace name2 = name3 + ", " + name4
keep ID* name1 name2
format name* %-`=max(length(name1), length(name2))'s
*
* Begin here
*
generate str dyad_id = string(min(ID1, ID2)) + "-" + string(max(ID1, ID2))
bysort dyad_id: keep if _n == 1
list, noobs separator(0)
exit
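
As a variation on the do-file above, the pair can also be keyed on two
numeric variables rather than on a concatenated string. This is only a
sketch on a small made-up dataset: the variable names lo and hi are
hypothetical, and it assumes that ID1 and ID2 are integer codes that
identify the same people in both columns.

```stata
clear
input byte ID1 byte ID2
1 2
1 3
2 1
2 3
3 1
3 2
end

* lo and hi are hypothetical names: the smaller and larger ID in each
* pair, so that A-B and B-A share the same key
generate long lo = min(ID1, ID2)
generate long hi = max(ID1, ID2)

* Keep one observation per unordered pair
bysort lo hi: keep if _n == 1
list, noobs separator(0)
```

If every pair is known to appear in both orders and there are no
self-pairs, then -keep if ID1 < ID2- by itself should give the same set
of pairs without any sorting.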