Re: st: data management issue (names listed differently)
Rufus,
I like Eva's first advice. It isn't a matter of how many observations
you have; it's a matter of how many spellings there are. The
advantage of -inlist- is that when you discover a new spelling, you
can correct it in your code, whereas if you use a separate file you
have to open, change, and close the file before -merge-ing. In my own
experience, I find that a do-file provides better documentation than a
separate file.
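
A minimal sketch of what such a do-file fragment might look like, assuming the string variable is named school as in Eva's example. The choice of "Miami(FL)" as the spelling to keep is arbitrary and only for illustration.

```stata
* One -replace- per school; add a string here whenever a new spelling turns up.
* The strings must match the data exactly, including case and blanks.
replace school = "USC" if inlist(school, "Southern Cal", "SouthCal")

* "Miami(FL)" as the spelling to keep is an assumption for this sketch.
replace school = "Miami(FL)" if school == "Miami Florida"
```

Note that -inlist()- limits the number of string arguments (10, per the function's documentation; worth checking -help inlist- for your version), so a school with many spellings may need two conditions joined with |.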
Dave
====================================
David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336
====================================
On Jul 2, 2008, at 12:09 PM, Rufus Peabody wrote:
Eva,
Many thanks for the advice. I am still wondering how I can merge with
a variable that has a mixture of CorrectSpelling and WrongSpelling.
Cleaning it up manually is extremely time-consuming since there are
thousands of observations.
Thanks,
Rufus
On Jul 2, 2008, at 8:42 AM, Eva Poen wrote:
Rufus,
are there too many schools/spellings to do it manually (e.g. -replace
school = "USC" if inlist(school, "Southern Cal","SouthCal")- )?
In any case, I would recommend that you clean up your school variable
to make your task as easy as possible. That includes stripping
leading/trailing blanks using -trim()-, and converting everything to
lower case (-lower()-). -itrim()- will reduce multiple, consecutive
internal blanks to one for you. All of this will help in reducing the
number of replacements you have to do.
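
The cleanup described above can be sketched in a single line, assuming the string variable is named school. In Stata 14 and later the same functions are also available under the names -strtrim()-, -stritrim()- and -strlower()-.

```stata
* Strip leading/trailing blanks, collapse internal runs of blanks to one,
* and convert everything to lower case.
replace school = lower(itrim(trim(school)))

* Show the distinct spellings that remain, most frequent first.
tabulate school, sort
```

After this step every comparison has to use lower-case strings, e.g. "southern cal" rather than "Southern Cal".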
As a general strategy, you could compile a list (or data set) of all
the spellings you have, after cleaning up. If you go for a data set,
it could have two variables, CorrectSpelling and WrongSpelling. It
should then be possible to use -merge- to add the correct spelling to
data sets where the wrong spelling is present. For this to work you
need to make sure that there are no ambiguous wrong spellings, i.e.
abbreviations that may relate to more than one school.
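
A rough sketch of this strategy follows. The file names spellings and mydata, the variable name school, and the choice of "miami(fl)" as the spelling to keep are all assumptions for illustration. It uses the -merge m:1- syntax of Stata 11 and later; Stata 10 and earlier use the older -merge school using ...- form on sorted data. Listing each correct spelling as a row of its own means a variable holding a mixture of correct and wrong spellings can be merged in one pass.

```stata
* --- Build the lookup file (hypothetical name: spellings.dta) ---
* One row per known spelling, including the correct spelling itself.
clear
input str20 WrongSpelling str20 CorrectSpelling
"usc"            "usc"
"southern cal"   "usc"
"southcal"       "usc"
"miami(fl)"      "miami(fl)"
"miami florida"  "miami(fl)"
end

* -merge- needs the key variable to have the same name in both files.
rename WrongSpelling school

* Stop with an error if any spelling is listed more than once,
* which would make the lookup ambiguous.
isid school
save spellings, replace

* --- Apply it to the data (hypothetical name: mydata.dta) ---
use mydata, clear
replace school = lower(itrim(trim(school)))
merge m:1 school using spellings, keep(master match)

* Spellings not yet in the lookup file show up with _merge == 1.
tabulate school if _merge == 1

* Overwrite matched names with the correct spelling, then tidy up.
replace school = CorrectSpelling if _merge == 3
drop CorrectSpelling _merge
```

Any spelling reported by the -tabulate- step can be added as a new row of the lookup file before rerunning.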
Hope this helps,
Eva
2008/7/2 Rufus Peabody <[email protected]>:
Hey all,
I'm working with a dataset that contains a few variables containing the
names of different college football teams. The problem is, they are not
spelled consistently (e.g. Miami(FL) and Miami Florida; USC and Southern
Cal). In many cases the spelling differs only in that there is an extra
space after the school name for some. What I'd like to do (and I'm pretty
sure is possible) is create a master file with all the school names and
possible spellings, which I can then somehow merge with my original
dataset (and any future datasets with these teams) to create a consistent
spelling. How do I go about doing this? Specifically, if I have, say,
three variables containing spelling 1, spelling 2, and spelling 3 of a
school, and I want to use spelling 1 in another dataset, how can I merge
with a variable that has some schools with spelling 1 and others with
spelling 2 or 3?
Thanks a lot,
Rufus
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/