st: Re: Combine uppercase and lowercase text
Hi,
If you want to replace whole observations with a common spelling, it takes a
couple of minutes to implement (things are a bit tougher if you have to go
word by word in every observation).
The general schema is:
1. Backup your data
2. Keep only this variable of interest
3. Create a frequency list. Use -contract- with the option -freq(n)- to
create a new variable n, which shows how often each spelling occurs. E.g.
(the first listing below uses a variable called var1 in place of text):
contract text,freq(n)
     +------+
     | var1 |
     |------|
  1. | Text |
  2. | Text |
  3. | text |
  4. | text |
  5. | TEXT |
     +------+
. contract var1,freq(n)
     +----------+
     | var1   n |
     |----------|
  1. | TEXT   1 |
  2. | Text   2 |
  3. | text   2 |
     +----------+
4. Now you have a frequency list. Create a new variable with the lowercase
spelling:
gen text_low=lower(text)
This variable identifies a dictionary-entry group that holds the different
spellings of the same word (in your case it can be several words).
5. Sort by group and frequency:
sort text_low n
     +---------------------+
     | text   n   text_low |
     |---------------------|
  1. | TEXT   1       text |
  2. | Text   2       text |
  3. | text   2       text |
     +---------------------+
6. Now you can go by group and assign a representative spelling in each
group (in my listing there is only 1 group "text"):
. by text_low : gen spelling=text[_N]
. l
     +--------------------------------+
     | text   n   text_low   spelling |
     |--------------------------------|
  1. | TEXT   1       text       text |
  2. | Text   2       text       text |
  3. | text   2       text       text |
     +--------------------------------+
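7. It just happens that spelling is equal to text_low here; it need not
always be like that. Where two spellings tie on frequency (as "Text" and
"text" do above), the one that happens to be sorted last in the group wins.
Drop n and text_low.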
8. Now you have a dictionary which translates "TEXT", "Text", and "text" into
"text".
9. Sort this data by text and save.
10. Get your original data back.
11. Merge the two datasets by the text variable.
12. Done.
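Put together, the steps above might look like the do-file sketch below. It is
only an illustration under a few assumptions of mine: the variable is called
text, the example rows are the ones from the original question, -preserve- and
a tempfile stand in for "backup" and "save", and text is added as a last sort
key so that ties are broken the same way on every run.

```stata
* Sketch of steps 1-12 for whole-observation replacement.
* Assumption: the string variable is named text.
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* Steps 1-2: set the full data aside, keep only the variable of interest
preserve
keep text

* Step 3: one row per distinct spelling, n = number of occurrences
contract text, freq(n)

* Step 4: the lowercase form identifies each group of spellings
gen text_low = lower(text)

* Steps 5-6: the most frequent spelling sorts last within each group.
* Sorting on text as well is an addition of this sketch (tie-breaking).
sort text_low n text
by text_low: gen spelling = text[_N]

* Steps 7-9: reduce to the dictionary, sort by the merge key, save
keep text spelling
sort text
tempfile dict
save "`dict'"

* Steps 10-11: bring the original data back and attach the dictionary.
* Old-style merge syntax, which needs both datasets sorted by text;
* in Stata 11 or later the m:1 form of merge does the same job.
restore
sort text
merge text using "`dict'"
assert _merge == 3
drop _merge

* Step 12: overwrite each observation with its representative spelling
replace text = spelling
drop spelling
```

Note that the merge leaves the data sorted by text; if the original order
matters, generate an identifier (for example gen long id = _n) before
-preserve- and sort on it at the end.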
Things get a bit more complicated if you want to go word by word. Then you
create a full list of all words with a double loop: an outer loop over
observations and an inner loop over the words in each observation. Then you
process this list as above to get a translation dictionary. You can't merge
the two datasets anymore (unless you have a very limited dictionary, where you
can create all possible "sentences" first). So you will have to run the double
loop again (obs by obs, and word by word), looking up each word in the
dictionary.
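For what it is worth, here is one possible word-by-word sketch. Note that it
is not the double loop described above: it swaps the loops for -split- and
-reshape long-, so that each word sits in its own observation and the
dictionary can be merged at the word level. The names id, pos, w, w_low, w_new
and newtext are made up for the example, words are assumed to be separated by
blanks, and repeated blanks are collapsed when the text is put back together.

```stata
* Word-by-word variant using split and reshape instead of explicit loops.
* Assumption: the string variable is named text; all other names are
* hypothetical and must not already exist in the data.
clear
input str15 text
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end

* One observation per word: id marks the original row, pos the word slot
gen long id = _n
split text, generate(w)
reshape long w, i(id) j(pos)
drop if w == ""

* Dictionary of words, built the same way as for whole observations
preserve
contract w, freq(n)
gen w_low = lower(w)
sort w_low n w
by w_low: gen w_new = w[_N]
keep w w_new
sort w
tempfile dict
save "`dict'"
restore

* Attach the representative spelling to every word (old-style merge)
sort w
merge w using "`dict'"
drop _merge

* Glue the words of each original row back together, in order
sort id pos
by id: gen newtext = w_new if _n == 1
by id: replace newtext = newtext[_n-1] + " " + w_new if _n > 1
by id: keep if _n == _N
replace text = newtext
keep id text
sort id
```

With these example rows the most common word spellings are "Some", "other"
and "text", so the sentences become "Some text" and "Some other text". The
reshape multiplies the number of observations by the maximum word count, which
may matter for a large dataset.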
If the results are to be displayed to a human reader, it can be irritating
to see Tokyo, new york, MOSCOW.
So even if these were the most common spellings in the original data, one
would still prefer: Tokyo, New York, Moscow.
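For simple cases like these city names, a cheap first attempt is Stata's
proper() string function, which capitalizes the first letter of each word and
lowercases the rest. This is a suggestion of mine rather than part of the
recipe above; the variable names city and city_display are hypothetical, newer
Stata releases document the function as strproper(), and you should check
-help functions- to see what your version offers.

```stata
* Assumption: city and city_display are hypothetical variable names.
clear
input str10 city
"Tokyo"
"new york"
"MOSCOW"
end

* proper() upper-cases the first letter of each word, lower-cases the rest
gen city_display = proper(city)
list
```

This yields Tokyo, New York, Moscow, but it will also turn a name such as
"McDonald" into "Mcdonald", so it is no substitute for a real dictionary.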
You might want to interface with Google or another online reference to try
to guess what the correct spelling is (this will take an incredible amount of
time for a large dataset). Alternatively, get a large local dictionary file
and search there. A Google search turns up plenty. One that is easy to obtain
is here:
http://wordlist.sourceforge.net/
Best regards, Sergiy
----- Original Message -----
From: "Friedrich Huebler" <[email protected]>
To: <[email protected]>
Sent: Thursday, February 22, 2007 1:15 AM
Subject: st: Combine uppercase and lowercase text
My data has string variables with text in uppercase or lowercase
letters. I would like to replace observations that are identical once
capitalization is ignored (e.g., "TEXT" and "text") by the most
common spelling. In some cases there are ties. So far I have only
managed to replace all such observations by their lowercase variant,
as in the example below. I am stumped and would appreciate any advice
on how I should proceed. I use Stata 8.2.
Friedrich Huebler
clear
gen str15 text = ""
input
"some text"
"Some Text"
"SOME TEXT"
"some other text"
"some other text"
"Some other text"
"Some other text"
"SoMe TeXt"
"SoMe TeXt"
"Some Other Text"
end
count
local n = r(N)
forvalues i = 1/`n' {
local t = lower(text[`i'])
replace text = "`t'" if lower(text) == "`t'"
}
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/