Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: matching observations for merging
From: "Lachenbruch, Peter" <[email protected]>
To: "'[email protected]'" <[email protected]>
Subject: RE: st: matching observations for merging
Date: Thu, 17 Jun 2010 09:02:06 -0700
Almost - in a similar application, I frequently need to sort on physician name - so there may be a bunch of docs. Unfortunately, there is often no consistency - one time I may see (to use a Statalist contributor, who has never been one of these) WolfeF, WolfF, FWolfe, Fwolfe, etc. This doesn't account for misspellings and typos. The idea of sorting by name will go far, but with many names and no standardization of how to enter the name there's a lot of work to be done. Maarten's idea will be useful to many.
These are often studies from medical records, so there is limited control on spelling, etc.
Tony
Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maarten buis
Sent: Thursday, June 17, 2010 8:56 AM
To: [email protected]
Subject: Re: st: matching observations for merging
--- On Thu, 17/6/10, Abhimanyu Arora wrote:
> I have two files to be merged. Is it possible to merge using
> an approximation of the merging variable? In other words, if
> my merging variable is say, country, there could be a slight change in
> spelling of some countries (Afghanistan/ Afganistan) in the two
> files...Is there a more efficient way than just going through all 200+
> countries and checking spelling consistency?
For countries the quickest way is to
1) keep in each dataset one observation per country
2) merge the 2 datasets
3) keep if _merge != 3
4) sort on country name
5) list
This will display a list of troublesome country names, which is
usually so short that it doesn't pay to do anything more fancy.
With this list you can create a recode do-file that harmonizes
country names before the final merge.
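As a rough illustration of steps 1-5 and the recode step, here is a minimal sketch written in Python rather than Stata. The two country lists and the `recode` table are made-up examples, and the set arithmetic is only a stand-in for `merge` followed by `keep if _merge != 3`.

```python
# Hypothetical country columns from the two files (with a duplicate on purpose).
file_a = ["Afghanistan", "Albania", "Algeria", "Albania"]
file_b = ["Afganistan", "Albania", "Algeria"]


def unmatched(names_a, names_b):
    """Return sorted (name, source) pairs for names found in only one file."""
    a, b = set(names_a), set(names_b)                # step 1: one entry per country
    only_a = [(n, "file A only") for n in a - b]     # steps 2-3: keep the non-matches
    only_b = [(n, "file B only") for n in b - a]
    return sorted(only_a + only_b)                   # step 4: sort on country name


# Step 5: list the troublesome names.
for name, source in unmatched(file_a, file_b):
    print(f"{name:<15}{source}")

# Harmonization table written by hand after reading that list (assumed content).
recode = {"Afganistan": "Afghanistan"}
file_b_clean = [recode.get(n, n) for n in file_b]

# After harmonizing, nothing should be left unmatched.
remaining = unmatched(file_a, file_b_clean)
```

Sorting puts spelling variants such as Afganistan and Afghanistan next to each other, which is what makes the short list easy to read by eye.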
Moreover, this harmonization do-file can be a good starting point
for any subsequent project involving a merge on country names, as the
kinds of inconsistencies in country names are pretty similar across
files. So at the beginning of each project you start by running the
harmonization do-file of the last project, then go through steps 1-5
to find any mismatches that weren't handled in the last do-file, and
add those to your new harmonization file. After 4 or 5 projects you
will hardly find any mismatches anymore.
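The project-to-project workflow might look roughly like the following Python sketch (again not Stata). The carried-over `recode` table, the country lists, and the helper names `harmonize` and `unmatched` are all hypothetical.

```python
def harmonize(names, recode):
    """Replace every name that has an entry in the recode table."""
    return [recode.get(n, n) for n in names]


def unmatched(names_a, names_b):
    """Sorted names that occur in exactly one of the two lists."""
    return sorted(set(names_a) ^ set(names_b))


# Recode table carried over from the last project (assumed content).
recode = {"Afganistan": "Afghanistan"}

# Country columns of the new project's two files (made-up data).
new_a = ["Afghanistan", "Cote d'Ivoire", "Viet Nam"]
new_b = ["Afganistan", "Ivory Coast", "Viet Nam"]

# Run the old harmonization first, then look for what it did not catch.
left = unmatched(harmonize(new_a, recode), harmonize(new_b, recode))
print(left)

# Add the newly found mismatch to the table and check again.
recode["Ivory Coast"] = "Cote d'Ivoire"
left_after = unmatched(harmonize(new_a, recode), harmonize(new_b, recode))
print(left_after)
```

The table only grows, so each new project starts with fewer mismatches left to resolve by hand.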
Hope this helps,
Maarten
--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany
http://www.maartenbuis.nl
--------------------------
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/