Re: st: Turning text pages into indicators
From: Nick Cox <[email protected]>
To: [email protected]
Date: Wed, 8 Aug 2012 17:03:16 +0100
1.
gen indicator = strpos(string, "word1") | strpos(string, "word2") | ///
    strpos(string, "word3")
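With many words, the same idea can be written as a loop over a word list, or as a single regular expression with alternation. The following is only a sketch: the variable name txt, the three example strings, and the words themselves are hypothetical stand-ins, and both strpos() and regexm() are case-sensitive, so wrap the variable in lower() if case should not matter.

```stata
* Sketch only: txt, the example strings and the search words are hypothetical.
clear
input str40 txt
"this line mentions word2"
"nothing relevant here"
"word3 comes first"
end

* Start at 0 and switch to 1 whenever any listed word is found.
gen byte indicator = 0
foreach w in word1 word2 word3 {
    replace indicator = 1 if strpos(txt, "`w'") > 0
}

* A single regular expression with alternation should give the same result.
gen byte indicator2 = regexm(txt, "word1|word2|word3")

list
```

On these three observations both indicators should be 1, 0, 1. The loop scales to long word lists by editing one line, whereas the regular expression keeps everything in a single command.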
2.
Mata offers support for longer strings. Otherwise, I'd think in terms
of lines = observations, pages = blocks of observations.
Suppose the data are structured as three variables:
page  line  text
1     1     "Once upon a time there was a cat who liked statistics, "
1     2     "and her favourite program was called Stata. She just loved"
1     3     "Stata and thought it was purrfect."
2     1     "The cat knew a big bad wolf who didn't like Stata."
2     2     "The wolf used SAS, Scary Animal Software."
Then you can apply code like that above:
gen lineindicator = strpos(text, "Stata") | strpos(text, "SAS")
egen pageindicator = max(lineindicator), by(page)
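As a self-contained sketch of the lines-within-pages approach, the example data can be typed in with input and the two commands run on it. Page 3 is a hypothetical addition, included only so that one page has no match; the variables pagetag and nmatch are likewise illustrative names, not part of the original suggestion.

```stata
* Sketch only: page 3, pagetag and nmatch are hypothetical additions.
clear
input byte page byte line str80 text
1 1 "Once upon a time there was a cat who liked statistics, "
1 2 "and her favourite program was called Stata. She just loved"
1 3 "Stata and thought it was purrfect."
2 1 "The cat knew a big bad wolf who didn't like Stata."
2 2 "The wolf used SAS, Scary Animal Software."
3 1 "A third animal preferred pencil and paper."
end

* Line level: 1 if the line mentions either word, 0 otherwise.
gen byte lineindicator = strpos(text, "Stata") | strpos(text, "SAS")

* Page level: 1 if any line on the page matched.
egen byte pageindicator = max(lineindicator), by(page)

* Optional extras: number of matching lines per page, one row per page.
egen nmatch = total(lineindicator), by(page)
egen byte pagetag = tag(page)
list page pageindicator nmatch if pagetag, noobs
```

Because the maximum of a 0/1 variable within a group is 1 exactly when at least one member is 1, max() acts as an "any" operator across the lines of each page.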
Nick
On Wed, Aug 8, 2012 at 2:01 PM, Jen Zhen <[email protected]> wrote:
> (1) I'd like to create a list of indicators to cover whether a string
> variable contains at least one out of several words.
> I know I can check whether it contains one specific word with - gen
> indicator=regexm(string,"word1") - but can I also cover several words
> in one command line with this?
> I tried - gen indicator=regexm(string,"word1" "word2") - and gen
> indicator=regexm(string,"word1" | "word2") - and these wouldn't work,
> but maybe there's another way to do this?
> I know I can as well generate a separate indicator for each word and
> then just sum them up, but since I have many words and many strings to
> cover that would be inefficient.
>
> (2) I'm starting with long texts, think half a page or a full page, so
> I presumably can't read the entire page into a single string variable
> on which I can then perform (1) above.
> Do I need to split the text first in, say, Excel, or is there a way
> to read all the text into Stata and then split it into as many
> variables as necessary (but no more)?