Jennifer Nicoll Victor wrote:
Thank you, Nick, for recommending the reshape command to me last week. I now
have converted my UCINET relational dataset into dyads in Stata. However, I
now have the problem of duplicate observations. My data are non-directional
so the pair A-B is the same as the pair B-A. I need to efficiently delete
the duplicates. I need only the unique observations, where the unit of
analysis is a pair. Can someone help?
Essentially, I have...
ID1 ID2 name1 name2 ...
1 2 Smith, John Jones, Abby
1 3 Smith, John White, Rich
1 4 Smith, John Black, Kelly
2 1 Jones, Abby Smith, John
2 3 Jones, Abby White, Rich
2 4 Jones, Abby Black, Kelly
3 1 White, Rich Smith, John
3 2 White, Rich Jones, Abby
3 4 White, Rich Black, Kelly
4 1 Black, Kelly Smith, John
4 2 Black, Kelly Jones, Abby
4 3 Black, Kelly White, Rich
And I need to have....
ID1 ID2 name1 name2 ...
1 2 Smith, John Jones, Abby
1 3 Smith, John White, Rich
1 4 Smith, John Black, Kelly
2 3 Jones, Abby White, Rich
2 4 Jones, Abby Black, Kelly
3 4 White, Rich Black, Kelly
But I have 191,406 pairs.
--------------------------------------------------------------------------------
The do-file below gets what you want. Sorting 200 000 observations took
1.01 seconds on my laptop, so if the approach below takes more than a few
moments on your dataset, the delay is probably due to the -min()- and
-max()- calls rather than the sort. You might also be able to avoid the
duplicates altogether by dealing with them pre-emptively upstream.
Joseph Coveney
clear *
set more off
input byte ID1 byte ID2 str10 name1 str1 comma1 str10 name2 str10 name3 str1 comma2 str10 name4
1 2 Smith, John Jones, Abby
1 3 Smith, John White, Rich
1 4 Smith, John Black, Kelly
2 1 Jones, Abby Smith, John
2 3 Jones, Abby White, Rich
2 4 Jones, Abby Black, Kelly
3 1 White, Rich Smith, John
3 2 White, Rich Jones, Abby
3 4 White, Rich Black, Kelly
4 1 Black, Kelly Smith, John
4 2 Black, Kelly Jones, Abby
4 3 Black, Kelly White, Rich
end
replace name1 = name1 + ", " + name2
replace name2 = name3 + ", " + name4
keep ID* name1 name2
format name* %-`=max(length(name1), length(name2))'s
*
* Begin here
*
generate str dyad_id = string(min(ID1, ID2)) + "-" + string(max(ID1, ID2))
bysort dyad_id: keep if _n == 1
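*
* Optional check, not part of the original post: -isid- exits with an
* error unless dyad_id now uniquely identifies the observations, so the
* do-file stops here if any duplicate pair survived.
*
isid dyad_id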
list, noobs separator(0)
exit
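*
* Alternative sketch, not part of the original post and never run here
* because it follows -exit-. It assumes that every pair appears in both
* orders and that no observation has ID1 == ID2, as in the example data.
* Under those assumptions it could replace the two commands after
* "Begin here", with no need for dyad_id:
*
keep if ID1 < ID2
list, noobs separator(0)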