
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: neighbourhood size


From   Sergiy Radyakin <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: neighbourhood size
Date   Thu, 25 Jul 2013 10:04:39 -0400

On Wed, Jul 24, 2013 at 10:31 PM, James Beard <[email protected]> wrote:
> If there is some compelling reason for doing this in Stata, have a
> look at -strgroup- and -levenshtein- (-findit strgroup-).
>
> Of course, you may still have the Unicode problem. If the actual
> content of the text is unimportant, and the text is all in the same
> Unicode block, you may be able to pre-process your text to turn it
> into (potentially meaningless) 8-bit characters.

In which case you would need to know how many bytes make up one letter
in the original text, since the distance is formulated in terms of
letters, not bytes: in UTF-8, for example, a character can be
represented by 1 to 4 bytes. With Unicode, even partitioning text into
words is not trivial, and different programs can do it differently.
There were recent changes to these procedures in .NET 3.5, for
example. How would you treat a MONGOLIAN VOWEL SEPARATOR? A ZERO WIDTH
SPACE? See here for details:
http://msdn.microsoft.com/en-us/library/t97s7bs3.aspx
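
To illustrate the letters-versus-bytes point, here is a minimal Python 3
sketch. The two Persian words and the helper name positions_differing
are chosen only for this example; the sketch also assumes that one code
point corresponds to one letter, which holds for these words but not for
all Unicode text (combining marks, zero-width characters).

```python
def positions_differing(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Illustrative words, written with escapes to avoid display problems:
w1 = "\u06a9\u0627\u0631"  # "kar": KEHEH, ALEF, REH
w2 = "\u062a\u0627\u0631"  # "tar": TEH, ALEF, REH

n_letters = len(w1)                 # a Python 3 str counts code points
n_bytes = len(w1.encode("utf-8"))   # each of these letters takes 2 bytes

letter_diff = positions_differing(w1, w2)
byte_diff = positions_differing(w1.encode("utf-8"), w2.encode("utf-8"))

print(n_letters, n_bytes)      # 3 6
print(letter_diff, byte_diff)  # 1 2
```

The two words differ in one letter, but their UTF-8 encodings differ in
two bytes, so a distance computed on raw bytes would not treat them as
one-letter neighbours.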

For algorithms for English, see also the SOUNDEX implementation in
Stata (help soundex).
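
As a rough illustration of the "couple of loops" approach in the quoted
message below, here is a Python 3 sketch of a neighbour count. It rests
on assumptions not settled in this thread: a neighbour is taken to be a
word of the same length differing in exactly one position (substitution
only), and neighbours are counted among distinct corpus words rather
than occurrences. The function names and the toy corpus are
hypothetical.

```python
def is_neighbour(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighbourhood_sizes(word_list, corpus_tokens):
    """For each word in word_list, count distinct corpus words that are neighbours."""
    vocabulary = set(corpus_tokens)  # assumption: count word types, not tokens
    # Outer loop over the word list, inner loop over the corpus vocabulary.
    return {w: sum(is_neighbour(w, v) for v in vocabulary) for w in word_list}

# Toy data built from the 'opsilon' example in the quoted message.
word_list = ["epsilon", "upsilon"]
corpus = "alpha beta opsilon epsilon opsilon gamma".split()

sizes = neighbourhood_sizes(word_list, corpus)
print(sizes)  # {'epsilon': 1, 'upsilon': 2}
```

Under this definition 'opsilon' counts as a neighbour of both 'epsilon'
and 'upsilon'. The cost grows with the word list size times the
vocabulary size, which is why the size of the corpus and of the word
list matters.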

Best, Sergiy


>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> On 24 Jul 2013 at 22:12, Sergiy Radyakin wrote:
>
> Date sent:              Wed, 24 Jul 2013 22:12:47 -0400
> Subject:                Re: st: neighbourhood size
> From:                   Sergiy Radyakin <[email protected]>
> To:                     "[email protected]"
> <[email protected]>
> Send reply to:          [email protected]
>
> Does not sound like a big deal, except that Stata does not work with
> Unicode. However, even in English you will need to decide how to deal
> with ambiguities in the text. Suppose your dictionary is the Greek
> letters: alpha, beta, ... You encounter 'opsilon' in the text. Do you
> increment the frequency of 'epsilon'? Of 'upsilon'? Of both (according
> to your definition)? Or of neither (this is not a valid word but a
> typo)? Once you resolve that: for i=1 { for j=1 {...}}. A couple of
> loops should suffice. Now that can be slow, so then you investigate
> what is special about your word list, what is special about your
> text, and what is acceptable in terms of performance. A lot depends on
> the size of the corpus. If you say it is a page of Google search
> results, we are OK. If it is the contents of JSTOR for the last 20
> years, we might be in trouble. What is the size of the word list? Is
> it two, three, ten keywords? Or is it the contents of a novel?
>
> Why is Stata picked as a tool for solving this problem I wonder?
> http://stackoverflow.com/questions/4520876/counting-the-frequency-of-specific-words-in-text-file
>
> Sergiy
>
>
> On Wed, Jul 24, 2013 at 8:45 PM, Mehdi Bakhtiar <[email protected]>
> wrote:
>>>> Dear Experts,
>>>> I have a question about how to use Stata to calculate
>>>> neighbourhood size for a list of words. Basically, I have my own
>>>> word list and a corpus. I need to tell Stata to count the number of
>>>> neighbours of each word in my word list (words with one letter
>>>> variation) in my corpus. Also, I need to mention that my words
>>>> are in Persian script.
>>>> Many thanks in advance for any attention and support,
>>>> Kind regards,
>>>> Mehdi Bakhtiar
>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>

