This is a very clear and useful statement of procedure.
As I understand it, using Excel in steps 2 and 3 is in essence a matter
of personal taste or convenience. Setting aside the facts that many
Stata users do not have Excel on their machines or prefer strongly never
to use it, it is entirely possible to stay within Stata.
Let me make that explicit by editing Jeph's text:
1. Create a list of all the text names
. u mydatafile, clear
. bys teamname: keep if _n==1
. keep teamname
2. Open the Data Editor
. edit teamname
and add a second variable -code-. In the second variable, give the
same number for every observation that represents the same team.
This is slightly time-consuming, but as they are sorted
on -teamname- it should just take a few minutes. One
advantage here is you can always recheck your codings.
3. Leave the Editor and -save- the dataset
. save codes, replace
4. Now you have a file (codes.dta) which has a list of the team names
in all their variety, and you can merge this into your original
file to assign a common code to every team that is the same
. u mydatafile, clear
. sort teamname
. merge teamname using codes
5. Now you can identify the teams by their codes; if you want you can
assign a common name
. bys code: replace teamname=teamname[_n-1] if _n>1
Nick
[email protected]
Jeph Herrin
I faced the problem recently with hospital data; we only had hospital
names, and needed to identify the hospitals uniquely for the analysis.
I also had the same problem once before with drug info. Here's what I
did both times, translated to your situation (if your variable with
teamnames is -teamname-); it may work for you.
1. create a list of all the text names
. u mydatafile, clear
. bys teamname: keep if _n==1
. outsheet teamname using teamnames.csv, replace
2. read this file teamnames.csv into Excel and add a
second column -code-. in the second column, give the
same number to every row that represents the same team.
this is slightly time consuming, but if they are sorted
on -teamname- it will just take a few minutes. one
advantage here is you can always recheck your codings.
3. export this from excel as codes.csv; read it into Stata
and sort on teamname
. insheet codes.csv, clear
. compress
. sort teamname
. save codes, replace
4. Now you have a file (codes.dta) which has a list of the team names
in all their variety, and you can merge this into your original
file to assign a common code to every team that is the same
. u mydatafile, clear
. sort teamname
. merge teamname using codes
5. Now you can identify the teams by their codes; if you want you can
assign a common name
. bys code: replace teamname=teamname[_n-1] if _n>1
Rufus Peabody wrote:
> I'm working with a dataset that contains a few variable containing the
> name of different college football teams. The problem is, they are
not
> spelled consistently (i.e. Miami(FL) and Miami Florida; USC and
Southern
> Cal). In many cases the spelling differs only in that there is an
extra
> space after the school name for some. What I'd like to do (and I'm
> pretty sure is possible) is create a master file with all the school
> names and possible spellings, which I can then somehow merge with my
> original dataset (and any future datasets with these teams) to create
a
> consistent spelling. How do I go about doing this? Specifically, if I
> have, say three variables containing spelling 1, spelling 2, and
> spelling 3 of a school, and I want to use spelling 1 in another
dataset,
> how can I merge with a variable that has some schools with spellling 1
> and others with spelling 2 or 3?
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/