Friedrich,
my idea would be to first group the text entries while ignoring
capitalization, then count how often each exact spelling occurs
within its group, and finally sort within each group by that count
and create a new variable that holds the most common spelling. In
case of a tie, which spelling gets chosen is essentially arbitrary;
it would be up to you to introduce a further sort criterion.
My Stata solution would look like the following:
clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
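* temporary variable names
tempvar lotext
tempvar textgrp
tempvar comspelling
* lowercase copy, used to group spellings regardless of capitalization
gen `lotext'=lower(text)
* running group number (informational, not needed for the final step)
bys `lotext': gen `textgrp'=1 if _n==1
replace `textgrp'=sum(`textgrp')
* number of occurrences of each exact spelling
bys `lotext' text: gen `comspelling'=_N
* sort by count within each group; the last observation is the most common
bys `lotext' (`comspelling'): gen newtext=text[_N]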
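As for ties, here is a minimal sketch of one possible further sort
criterion, certainly not the only sensible one: add text itself as a
second sort key, so that among equally frequent spellings the one that
sorts last is chosen. The variable names lotext and freq and the
shortened example data are only for illustration, and I assume Stata's
usual string sort order, in which uppercase letters come before
lowercase ones.

```stata
clear
input str15 text
"some text"
"SoMe TeXt"
"SoMe TeXt"
"some other text"
"Some other text"
end
* lowercase key that ignores capitalization (illustrative name)
gen lotext = lower(text)
* frequency of each exact spelling within its group (illustrative name)
bysort lotext text: gen freq = _N
* sort by frequency, then by spelling, and take the last observation,
* so ties no longer depend on the incoming order of the data
bysort lotext (freq text): gen newtext = text[_N]
drop lotext freq
list text newtext
```

Here "SoMe TeXt" wins its group on frequency, while the tie between
"some other text" and "Some other text" should go to the all-lowercase
spelling under that sort order. Any other variable could serve as the
second key instead.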
I bet there are more elegant ways out in the wild, and I am
looking forward to learning about them.
Regards
Sebastian
On 2/22/07, Friedrich Huebler <[email protected]> wrote:
My data has string variables with text in uppercase or lowercase
letters. I would like to replace observations that are identical once
capitalization is ignored (e.g., "TEXT" and "text") by the most
common spelling. In some cases there are ties. So far I have only
managed to replace all such observations by their lowercase variant,
as in the example below. I am stumped and would appreciate any advice
on how I should proceed. I use Stata 8.2.
Friedrich Huebler
clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
count
local n = r(N)
forvalues i = 1/`n' {
local t = lower(text[`i'])
replace text = "`t'" if lower(text) == "`t'"
}
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/