Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Repeated names in a string variable, but some have typos. How to correct?
From: Lucas Ferreira Mation <[email protected]>
To: statalist <[email protected]>
Subject: Re: st: Repeated names in a string variable, but some have typos. How to correct?
Date: Mon, 14 Apr 2014 10:16:01 -0300
Thank you, Dimitriy,
it works. The municipal codes are fine.
The actual dataset is quite large (all street names in Brazil), so it
took 2.5 days to run on the server.
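
For readers following the thread, here is a minimal sketch of the step that comes after the grouping: picking one spelling per group and collapsing on it. It assumes the variables city, street_name and number_of_obs from the original example, plus the group variable that strgroup generated (typed in by hand here so the block runs without strgroup installed). The name clean_name is hypothetical, and taking the most frequent spelling as the correct one is an assumption, not something strgroup does for you.

```stata
* Example data from the thread, with the group codes from the strgroup listing
clear
input str1 city str24 street_name number_of_obs group
"A" "Rua Santos Dumont"        1200 1
"A" "Rua Santos Dummont"         30 1
"A" "Rua Satos Dumont"            3 1
"A" "Rua Bandim"                 60 2
"B" "Rua Pedro Alvares Cabral" 4000 3
"B" "Rua Pedro Alvaers Cabral"    3 3
"B" "Rue Pedro Alvares Cabral"    1 3
"B" "Av. Pedro Alvares Cabral"   20 3
"B" "Rua other"                  45 4
end

* Within each city and group, sort by count and take the spelling with the
* most observations (the last row after sorting) as the clean name.
* Assumption: the most frequent spelling is the correct one.
bysort city group (number_of_obs): gen clean_name = street_name[_N]

* Add up the counts over all spellings of the same street
collapse (sum) number_of_obs, by(city clean_name)
list, sepby(city)
```

On the example data this leaves four rows: "Rua Santos Dumont" with 1233 observations and "Rua Bandim" with 60 in city A, and "Rua Pedro Alvares Cabral" with 4024 and "Rua other" with 45 in city B. Ties in number_of_obs within a group would be broken arbitrarily, so it may be worth checking for them on real data.
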
On Fri, Apr 4, 2014 at 3:26 PM, Dimitriy V. Masterov <[email protected]> wrote:
> Lucas,
>
> If I understood your problem, you can try something like this using
> Julian Reif's strgroup:
>
> ssc install strgroup
> bys city: strgroup street_name, gen(group) threshold(0.25)
>
>   city   street_name                number_of_obs   group
>   A      Rua Santos Dumont                   1200       1
>   A      Rua Santos Dummont                    30       1
>   A      Rua Satos Dumont                       3       1
>   A      Rua Bandim                            60       2
>   B      Rua Pedro Alvares Cabral            4000       3
>   B      Rua Pedro Alvaers Cabral               3       3
>   B      Rue Pedro Alvares Cabral               1       3
>   B      Av. Pedro Alvares Cabral              20       3
>   B      Rua other                             45       4
>
> This relies on the city name having a single correct spelling. If
> that's not the case, you can apply this strategy to the city name
> first. It won't work with nicknames (Frisco for San Francisco, to
> give a US example).
>
> You will want to play around with the threshold to match it to your
> tolerance for different types of misclassification.
>
> DVM
>
> On Fri, Apr 4, 2014 at 8:29 AM, Lucas Ferreira Mation
> <[email protected]> wrote:
>> statalisters,
>>
>> I have a large addresses database, identifying street_names, street_number
>> and city, which I need to collapse by street_name and city. Because the
>> street_names can have some typos for some street_numbers, when I collapse
>> some streets appear duplicated within cities (see example below).
>> Duplicated street_names between cities would be OK.
>>
>> Is there a command to do some sort of probabilistic/fuzzy string comparison
>> among the rows of a string variable (similar to what reclink does, but
>> within the variable)?
>>
>> The dataset is quite large: after collapsing I get 2.3 million
>> city-street_name pairs. So I need a smart way to go about it.
>>
>>
>> *Example of the data after collapsing:
>> clear
>> input str1 city str24 street_name number_of_obs
>> "A" "Rua Santos Dumont" 1200
>> "A" "Rua Santos Dummont" 30
>> "A" "Rua Satos Dumont" 3
>> "A" "Rua Bandim" 60
>> "B" "Rua Pedro Alvares Cabral" 4000
>> "B" "Rua Pedro Alvaers Cabral" 3
>> "B" "Rue Pedro Alvares Cabral" 1
>> "B" "Av. Pedro Alvares Cabral" 20
>> "B" "Rua other" 45
>> end
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/