Re: st: Combine uppercase and lowercase text

From	"Sebastian F. Büchte" <[email protected]>
To	[email protected]
Subject	Re: st: Combine uppercase and lowercase text
Date	Thu, 22 Feb 2007 08:27:10 +0100

Friedrich,

my idea would be to first group the text entries while ignoring
capitalization, then count how often each exact spelling occurs within
its group, and finally sort within each group by that count and create
a new variable that holds the most common spelling. In case of a tie,
which spelling is chosen is somewhat random; it would be up to you to
introduce some further sort criterion.

My Stata solution would look like the following:

clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
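tempvar lotext
tempvar textgrp
tempvar comspelling

* lowercase copy, used to group spellings regardless of capitalization
gen `lotext'=lower(text)
* running group number, one per case-insensitive group
bys `lotext': gen `textgrp'=1 if _n==1
replace `textgrp'=sum(`textgrp')

* number of observations sharing each exact spelling
bys `lotext' text: gen `comspelling'=_N
* sort by that count within each group and take the last (most frequent) spelling
bys `lotext' (`comspelling'): gen newtext=text[_N]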
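
For the tie case, one possible further sort criterion is the spelling itself. The following is only a sketch under that assumption: it uses plain variable names (lotext, freq, newtext) chosen here for illustration instead of tempvars, and it relies on Stata sorting strings in byte order, so that among equally frequent spellings the one that sorts last is picked.

```stata
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* lowercase key that defines the case-insensitive groups
gen lotext = lower(text)

* frequency of each exact spelling within its group
bysort lotext text: gen freq = _N

* within each group, sort by frequency and then by spelling, so that
* ties are always broken the same way; the last observation wins
bysort lotext (freq text): gen newtext = text[_N]

list text newtext, sepby(lotext)
```

Any other rule (for example, preferring the spelling that appears first in the data) could be substituted by adding a different variable inside the parentheses.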

I bet there are more elegant ways out in the wild, and I am
looking forward to learning about them.

Regards
Sebastian


On 2/22/07, Friedrich Huebler <[email protected]> wrote:

My data has string variables with text in uppercase or lowercase
letters. I would like to replace observations that are identical once
capitalization is ignored (e.g., "TEXT" and "text") by the most
common spelling. In some cases there are ties. So far I have only
managed to replace all such observations by their lowercase variant,
as in the example below. I am stumped and would appreciate any advice
on how I should proceed. I use Stata 8.2.

Friedrich Huebler

clear
gen str15 text = ""
input
 "some text"
 "Some Text"
 "SOME TEXT"
 "some other text"
 "some other text"
 "Some other text"
 "Some other text"
 "SoMe TeXt"
 "SoMe TeXt"
 "Some Other Text"
end
count
local n = r(N)
forvalues i = 1/`n' {
 local t = lower(text[`i'])
 replace text = "`t'" if lower(text) == "`t'"
}

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

