Sebastian's approach is mine too, but it can be
done a little more directly.
We agree that two spellings count as the same text
if their lowercase versions are identical.
gen lowercase = lower(text)
The frequency of each distinct spelling is
calculated by
bysort lowercase text : gen freq = _N
The most frequent version has the highest
value of -freq-. If you -sort- within
values of -lowercase- by -freq-, then the
most frequent value of -text- is at the end.
bysort lowercase (freq text) : gen mostfrequent = text[_N]
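Here I am rather arbitrarily splitting ties:
because -text- is also in the sort order, the
spelling that sorts last among the most frequent
ones is the one used.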
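As a sketch, here are the same three lines run on the example data from the original post. The final -replace- and -drop- are my addition, assuming the aim is to overwrite -text- itself. The results noted in the comments assume that Stata sorts strings in byte order, so that uppercase letters sort before lowercase ones.

```stata
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* spellings belong together if their lowercase versions match
gen lowercase = lower(text)

* frequency of each distinct spelling within its group
bysort lowercase text : gen freq = _N

* sort each group by frequency, then spelling; take the last value
bysort lowercase (freq text) : gen mostfrequent = text[_N]

list text freq mostfrequent

* assumption: the goal is to overwrite the original variable
replace text = mostfrequent
drop lowercase freq mostfrequent
```

In the "some text" group, "SoMe TeXt" occurs twice and is used. In the "some other text" group, "some other text" and "Some other text" tie with two occurrences each, and "some other text" is used because it sorts last.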
Nick
Sebastian F. Büchte
> my idea would be to first group text entries while ignoring the
> capitalization, then count the occurrence within these groups of each
> entry with respect to capitalization and finally sort within each
> group by occurrence count and create a new variable which holds the
> most common spelling. In case of a tie it's somewhat random which
> spelling will be chosen; it would be up to you to introduce some
> further sort criterion.
>
> My Stata solution would look like the following:
>
> clear
> gen str15 text = ""
> input
> "some text"
> "Some Text"
> "SOME TEXT"
> "some other text"
> "some other text"
> "Some other text"
> "Some other text"
> "SoMe TeXt"
> "SoMe TeXt"
> "Some Other Text"
> end
> tempvar lotext
> tempvar textgrp
> tempvar comspelling
>
> gen `lotext'=lower(text)
> bys `lotext': gen `textgrp'=1 if _n==1
> replace `textgrp'=sum(`textgrp')
>
> bys `lotext' text: gen `comspelling'=_N
> bys `lotext' (`comspelling'): gen newtext=text[_N]
>
> I bet there are more elegant ways out in the wild and I am just
> looking forward to learning about them.
>
> Regards
> Sebastian
>
>
> On 2/22/07, Friedrich Huebler wrote:
> > My data has string variables with text in uppercase or lowercase
> > letters. I would like to replace observations that are identical once
> > capitalization is ignored (e.g., "TEXT" and "text") by the most
> > common spelling. In some cases there are ties. So far I have only
> > managed to replace all such observations by their lowercase variant,
> > as in the example below. I am stumped and would appreciate any advice
> > on how I should proceed. I use Stata 8.2.
> >
> > Friedrich Huebler
> >
> > clear
> > gen str15 text = ""
> > input
> > "some text"
> > "Some Text"
> > "SOME TEXT"
> > "some other text"
> > "some other text"
> > "Some other text"
> > "Some other text"
> > "SoMe TeXt"
> > "SoMe TeXt"
> > "Some Other Text"
> > end
> > count
> > local n = r(N)
> > forvalues i = 1/`n' {
> > local t = lower(text[`i'])
> > replace text = "`t'" if lower(text) == "`t'"
> > }
> >