From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Turning text pages into indicators
Date: Wed, 8 Aug 2012 17:03:16 +0100

1.

    gen indicator = strpos(string, "word1") | strpos(string, "word2") | strpos(string, "word3")

2. Mata offers support for longer strings. Otherwise, I'd think in
terms of lines = observations, pages = blocks of observations.

If you run code like that above with a structure of three variables:

    page  line  text
    1     1     "Once upon a time there was a cat who liked statistics, "
    1     2     "and her favourite program was called Stata. She just loved"
    1     3     "Stata and thought it was purrfect."
    2     1     "The cat knew a big bad wolf who didn't like Stata."
    2     2     "The wolf used SAS, Scary Animal Software."

Then you can go

    gen lineindicator = strpos(text, "Stata") | strpos(text, "SAS")
    egen pageindicator = max(lineindicator), by(page)

Nick

On Wed, Aug 8, 2012 at 2:01 PM, Jen Zhen <jenzhen99@gmail.com> wrote:

> (1) I'd like to create a list of indicators to cover whether a string
> variable contains at least one out of several words.
> I know I can check whether it contains one specific word with - gen
> indicator=regexm(string,"word1") - but can I also cover several words
> in one command line with this?
> I tried - gen indicator=regexm(string,"word1" "word2") - and gen
> indicator=regexm(string,"word1" | "word2") - and these wouldn't work,
> but maybe there's another way to do this?
> I know I can as well generate a separate indicator for each word and
> then just sum them up, but since I have many words and many strings to
> cover that would be inefficient.
>
> (2) I'm starting with long texts, think half a page or a full page, so
> I presumably can't read the entire page into a single string variable
> on which I can then perform (1) above.
> Do I need to initially split the text in say Excel, or is there a way
> to still read all text in in Stata and then split it into as many
> variables as necessary (but no more)?
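
The lines-and-pages suggestion in the reply can be tried end to end with the toy data from the message. What follows is only a minimal sketch under some assumptions: the `str80` width is a guess that comfortably fits these lines, and the variable names `anyword` and `viaregex` are hypothetical, introduced here for illustration. It also shows two ways to cover many words without typing one `strpos()` call per word: a `foreach` loop, and a single `regexm()` call with the alternatives joined by `|` inside the pattern string. The attempts quoted in the question placed the `|` outside the quotes, which is why they failed.

```stata
clear
* Toy data: one observation per line, with pages as blocks of observations.
input byte page byte line str80 text
1 1 "Once upon a time there was a cat who liked statistics, "
1 2 "and her favourite program was called Stata. She just loved"
1 3 "Stata and thought it was purrfect."
2 1 "The cat knew a big bad wolf who didn't like Stata."
2 2 "The wolf used SAS, Scary Animal Software."
end

* Line-level indicator, as in the reply: strpos() returns 0 when the
* substring is absent, so | yields 1 if any search term occurs.
gen byte lineindicator = strpos(text, "Stata") | strpos(text, "SAS")

* Page-level indicator: 1 if any line on the page matched.
egen byte pageindicator = max(lineindicator), by(page)

* Alternative for a long word list: loop over the words.
* (anyword is a hypothetical name.)
gen byte anyword = 0
foreach w in Stata SAS {
    replace anyword = 1 if strpos(text, "`w'") > 0
}

* Alternative using one regular expression; the | sits inside the pattern.
* (viaregex is a hypothetical name.)
gen byte viaregex = regexm(text, "Stata|SAS")

list page line lineindicator anyword viaregex pageindicator, sepby(page)
```

All of these matches are case-sensitive, so "statistics" does not count as a hit for "Stata". If case should be ignored, one option is to search `lower(text)` for lower-case words instead.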