
Re: st: string

From	"Vladimir Vakhitov" <[email protected]>
To	[email protected]
Subject	Re: st: string
Date	Mon, 17 Mar 2008 12:03:45 -0400

Dear Victor,

This is a familiar problem; I had it with country names, with up to
20 "variations" of the same country name. Unfortunately, I had to do a
lot of the work manually.
First I made a dictionary of the "clean" values I needed. In your case
this would be a "clean" list of cities. Then I tabulated my "dirty"
variable and manually assigned a correspondence between each "dirty"
value and a "clean" value. Tabulation lets you see the most common
regularities. Then I saved this correspondence in a separate file and
merged the "clean" variable into the original file.

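The dictionary approach above could be sketched roughly as follows. This
is only an illustration, not the exact code I ran: the variable names
(city, city_clean), the temporary files, and the toy data are all made
up, and the -merge m:1- syntax assumes Stata 11 or later.

```stata
* Sketch only: variable names, tempfiles, and data are hypothetical.
clear
input str20 city
"Berlin"
" Berlin"
"Rome"
"Roma,IT"
"Paris, FR"
"Paris/FR"
end
tempfile dirty
save `dirty'

* Step 1: reduce to the distinct "dirty" spellings, most frequent first.
contract city, freq(n)
gsort -n city

* Step 2: assign a clean value to each spelling (done by hand in practice).
generate str20 city_clean = ""
replace city_clean = "Berlin" if inlist(city, "Berlin", " Berlin")
replace city_clean = "Rome"   if inlist(city, "Rome", "Roma,IT")
replace city_clean = "Paris"  if inlist(city, "Paris, FR", "Paris/FR")
keep city city_clean
tempfile crosswalk
save `crosswalk'

* Step 3: attach the clean name to every row of the original file.
use `dirty', clear
merge m:1 city using `crosswalk', keep(match master) nogenerate
```

With 40,000 unique values, step 2 is where the manual work goes; the
point of the separate file is that the assignments are stored once and
can be re-merged whenever the original data change.
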
I used the -regexr()- and -trim()- families of string functions to look
for some regularities and to trim blanks. See -help string_functions-.

It is always good practice to -clonevar- your variables before changing
them, so that the original values stay intact.
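
As a minimal sketch of the string cleaning described above, assuming a
string variable called city (city_fix and has_rom are names invented for
this example, and the regular expression is only one guess at what a
country suffix looks like in your data):

```stata
* Sketch only: city, city_fix, and has_rom are hypothetical names.
clear
input str20 city
" Berlin"
"Roma,IT"
"Paris, FR"
"Paris/FR"
end

* Work on a copy so the original values stay available for checking.
clonevar city_fix = city

* Remove leading and trailing blanks.
replace city_fix = trim(city_fix)

* Drop a trailing two-letter country code such as ",IT", ", FR" or "/FR".
replace city_fix = regexr(city_fix, "[,/] *[A-Z][A-Z]$", "")

* Flag values that contain the substring "Rom".
generate byte has_rom = strpos(city_fix, "Rom") > 0

list city city_fix has_rom
```

This kind of rule handles the blanks and the country suffixes, but a
pair like "Rome" and "Roma" still has to be matched through the manual
dictionary.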

I hope it helps.
Vladimir


2008/3/17, Viktor Slavtchev <[email protected]>:
> Dear list,
>  I want to merge two files where the common variable is a string (names
>  of cities). However, there are non systematic differences in the notions.
>  For example, you can find: "Berlin" in the first file but " Berlin" in
>  the second. In other cases you can find "Rome" and "Roma,IT". Or "Paris,
>  FR" and "Paris/FR"
>  I was not able to find any systematics in the notion. I have over 40.000
>  unique observations.
>  How can I search for substrings in Stata? For example, for "*Rom*", the
>  largest match between "Rome" and "Roma,IT".
>  I think this could help to solve some problems. Or does anybody know a
>  better way to deal with such kind of 'bad' data?
>  thanks
>  viktor
>  *
>  *   For searches and help try:
>  *   http://www.stata.com/support/faqs/res/findit.html
>  *   http://www.stata.com/support/statalist/faq
>  *   http://www.ats.ucla.edu/stat/stata/
>


-- 
__________________
Volodymyr Vakhitov
[email protected]
