Dear Matthias,
This will be the topic of my presentation at the Canadian Stata User
Group Meeting in Toronto in a couple of weeks:
"Data cleaning in Stata using internet search engines"
I will show how Google can be used to catch exactly the kind of
typos you describe.
The registration is still open until Oct. 16th.
More information here:
http://www.stata.com/meeting/canada09/
Best regards,
Sergiy Radyakin
On Tue, Oct 6, 2009 at 2:53 PM, Matthias Wasser
<[email protected]> wrote:
> I'm working with a dataset of several million observations identified
> by, among other things, string variables. I have a list against which
> I check these to determine if they belong to a certain category. So
> far, so good.
>
> What I would like to do is catch typos, so that "Republic of Frrance"
> gets caught by "Republic of France" or whatever. Simon Moore had a
> similar request
> (http://www.stata.com/statalist/archive/2008-08/msg00467.html); like
> him, I occasionally have multiple words per string, but the kind
> responses to his post assume (if I read them correctly) that there are
> just a few likely substitutions, while I have a couple hundred "red
> lion" equivalents and no idea of what the likely typos for them are.
> The Giuliano code might work, though, even if I don't understand its
> internals. Is Levenshtein distance generally considered the best way
> to search for typos? What edit distance is generally considered
> appropriate?
>
> Thanks so much in advance.
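
By the way, the Levenshtein distance itself is short enough to code
directly in Mata if you do not want to rely on user-written commands.
Below is an untested sketch (the function name levenshtein() is
arbitrary); it fills the usual dynamic-programming matrix and returns
the number of single-character insertions, deletions and substitutions
needed to turn one string into the other:

mata:
// illustrative helper; name and signature are arbitrary
real scalar levenshtein(string scalar a, string scalar b)
{
    real scalar i, j, la, lb, cost
    real matrix d

    la = strlen(a)
    lb = strlen(b)
    d  = J(la+1, lb+1, 0)

    // distance from the empty string is just the string length
    for (i=1; i<=la+1; i++) d[i,1] = i-1
    for (j=1; j<=lb+1; j++) d[1,j] = j-1

    // d[i,j] holds the distance between the first i-1 characters
    // of a and the first j-1 characters of b
    for (i=2; i<=la+1; i++) {
        for (j=2; j<=lb+1; j++) {
            cost   = (substr(a, i-1, 1) == substr(b, j-1, 1) ? 0 : 1)
            d[i,j] = min((d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost))
        }
    }
    return(d[la+1, lb+1])
}

levenshtein("Republic of Frrance", "Republic of France")
end

As for the threshold, an edit distance of 1 or 2 is a common starting
point for flagging likely typos in names of this length, though the
right cutoff depends on how different the legitimate names are from
one another. With several million observations you will probably want
to restrict the comparisons (for example to strings of similar length)
before looping over a couple of hundred reference names.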
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/