Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: neighbourhood size
From: "James Beard" <[email protected]>
To: [email protected]
Subject: Re: st: neighbourhood size
Date: Thu, 25 Jul 2013 02:31:03 -0000
If there is some compelling reason for doing this in Stata, have a
look at -strgroup- and -levenshtein- (-findit strgroup-).
Of course, you may still have the Unicode problem. If the actual
content of the text is unimportant, and the text is all in the same
Unicode block, you may be able to pre-process your text to turn it
into (potentially meaningless) 8-bit characters.
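As a rough illustration of that pre-processing idea, here is a minimal
Python sketch. It is only an assumption about how the remapping could be
done, not a tested recipe: the helper name to_8bit and the choice of the
Arabic block (U+0600 to U+06FF, which holds most Persian letters) are
mine. Each character is replaced by the single-byte character whose code
is its offset within the block.

```python
# Hypothetical helper (not part of Stata or any library): remap words whose
# characters all lie in one 256-code-point Unicode block to 8-bit characters.
ARABIC_BLOCK_START = 0x0600  # assumption: Persian letters lie in U+0600..U+06FF


def to_8bit(word, block_start=ARABIC_BLOCK_START):
    """Return a string of 8-bit characters, one per input character.

    Each output character has code (code point - block_start), so the
    result is meaningless as text but preserves equality and differences
    between letters. Raises ValueError for characters outside the block.
    """
    out = []
    for ch in word:
        offset = ord(ch) - block_start
        if not 0 <= offset <= 0xFF:
            raise ValueError(f"character {ch!r} is outside the block")
        out.append(chr(offset))
    return "".join(out)
```

Two caveats: some offsets land on control characters or quote marks,
which may still be awkward inside Stata strings, and characters outside
the block (spaces, or the zero-width non-joiner common in Persian text)
would need to be split off or handled separately first.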
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 24 Jul 2013 at 22:12, Sergiy Radyakin wrote:
Date sent: Wed, 24 Jul 2013 22:12:47 -0400
Subject: Re: st: neighbourhood size
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Send reply to: [email protected]
Does not sound like a big deal, except that Stata does not work with
Unicode. However, even in English you will need to decide how to deal
with ambiguities in the text. Suppose your dictionary is the Greek
letters: alpha, beta, ... and you encounter 'opsilon' in the text. Do
you increment the frequency of 'epsilon'? Of 'upsilon'? Both (according
to your definition)? Or neither (it is not a valid word but a typo)?

Once you resolve that: for i=1... { for j=1... {...}}. A couple of
nested loops should suffice. Now that can be slow, so then you
investigate what is special about your word list, what is special about
your text, and what is acceptable in terms of performance. A lot depends
on the size of the corpus. If it is a page of Google search results, we
are OK. If it is the contents of JSTOR for the last 20 years, we might
be in trouble. What is the size of the word list? Is it two, three, or
ten keywords? Or is it the contents of a novel?
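To make the two-loop idea concrete, here is a minimal sketch in Python
rather than Stata. It rests on assumptions the thread leaves open: a
"neighbour" is taken to be a corpus word of the same length that differs
in exactly one position (substitution only, no insertions or deletions),
and each distinct corpus word is counted once. The function names are
illustrative, not from any library.

```python
def is_neighbour(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def neighbourhood_sizes(wordlist, corpus_tokens):
    """For each word in wordlist, count the distinct corpus words that are neighbours."""
    vocabulary = set(corpus_tokens)  # assumption: count word types, not tokens
    sizes = {}
    for word in wordlist:  # outer loop over the word list
        # inner loop over the distinct corpus words
        sizes[word] = sum(is_neighbour(word, other) for other in vocabulary)
    return sizes


corpus = "alpha beta zeta eta theta opsilon epsilon upsilon".split()
print(neighbourhood_sizes(["epsilon", "beta"], corpus))
# prints {'epsilon': 2, 'beta': 1}
```

Under this definition 'opsilon' counts towards 'epsilon', and it would
count towards 'upsilon' as well if that word were in the word list. The
cost grows with (word list size) x (distinct corpus words), which is why
the size questions above matter.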
Why was Stata picked as the tool for solving this problem, I wonder?
http://stackoverflow.com/questions/4520876/counting-the-frequency-of-specific-words-in-text-file
Sergiy
On Wed, Jul 24, 2013 at 8:45 PM, Mehdi Bakhtiar <[email protected]>
wrote:
>>> Dear Experts,
>>> I have a question about how to use Stata to calculate
>>> neighbourhood size for a list of my words. Basically, I have my own
>>> word list and a corpus. I need to tell Stata to count the number of
>>> neighbours of each word in my word list (words with one letter
>>> variation) out of my corpus. Also, I need to mention that my words
>>> are in Persian script.
>>> In advance, many thanks for any attention and support,
>>> Kind regards,
>>> Mehdi Bakhtiar
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/