Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Twitter Message Sub-string Extraction?
From
Richard Fairbanks <[email protected]>
To
[email protected]
Subject
Re: st: Twitter Message Sub-string Extraction?
Date
Mon, 13 Jun 2011 15:56:39 -0400
Dear Statalisters (especially Eric and Nick),
Thanks for the help! You're life-savers.
I still have one issue with another data element that I thought I could
handle. The tweets also contain URLs, which are a mix of letters and
integers. The loops you provided don't work for them (i.e., replacing
things like "@" and "#" with "http:\\" doesn't pick out the full web
address).
Any advice?
Thanks
Richard Fairbanks
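
One possible way to approach the URL question, as a rough sketch rather
than a tested solution for the real data: URLs are written with forward
slashes ("http://"), so a search for "http:\\" finds nothing. The tweets
below are invented examples, and the names tok, url and firsturl are
placeholders. The sketch assumes a URL never contains a space, so after
-split- each URL is one whole token.

```stata
* Sketch only: these three tweets are made up for illustration
clear
input str240 tweet
"*@someuser* reading this http://example.com/abc123 #stata RT @other"
"*@another* no link here #test"
"*@third* two links http://example.org/x1 and https://example.net/y2"
end

* First URL only: match http:// or https:// followed by non-space characters
g firsturl = regexs(1) if regexm(tweet, "(https?://[^ ]+)")

* All URLs: split on spaces, then keep tokens that start with http:// or https://
split tweet, p(" ") gen(tok)
loc n = 1
foreach t of var tok* {
    g url`n' = `t' if strpos(`t', "http://") == 1 | strpos(`t', "https://") == 1
    loc ++n
}
drop tok*

* Drop the url variables that are empty in every observation
foreach v of var url* {
    qui count if `v' != ""
    if r(N) == 0 drop `v'
}
l firsturl url*
```

The surviving url variables keep the position of the token they came
from, so they could be tidied with -sortrows- and -reshape- in the same
way as the topic and recipient variables in Eric's example.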
On Mon, Jun 13, 2011 at 3:15 AM, Nick Cox <[email protected]> wrote:
> A very minor refinement here is that
>
> ds tweet?
>
> foreach t in `r(varlist)' {
>
> could just be
>
> foreach t of var tweet? {
>
> as -foreach- is perfectly capable of coping with the wildcard.
>
> -dropmiss- is from SJ.
>
> Nick
>
> On Sun, Jun 12, 2011 at 9:31 PM, Eric Booth <[email protected]> wrote:
>
>> I'd use -split- to separate each twitter message (esp. since twitter messages are so short) and a combination of strpos() and subinstr() to find the elements you describe across the split tweets.
>> I've provided an example below -- you'll need to install -dropmiss- and -sortrows- (from SSC) via -findit- before running my example.
>>
>> *****************!
>> clear
>> inp str240(tweet)
>> "*@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT @SomeOtherPerson15 @test"
>> "*@somethingelse* I'm asdfasdf at #something! #summer2011 RT @asff15 @afafdf"
>> "*@anothername* I'm asdff t #test! #summer2011 @test15 @YetAnotherPerson"
>> end
>>
>> split tweet, p(" ")
>> ds tweet*
>>
>> **username
>> rename tweet1 username
>> replace username = subinstr(username, "*", "", .)
>> l username
>> **RT
>> g rt = 1 if strpos(tweet, "RT")
>> ta rt
>>
>> **topics & recipients:
>> ds tweet?
>> foreach v in topic recipient {
>> loc sym = cond("`v'" == "topic", "#", "@") //topics start with #, recipients with @
>> loc n = 1
>> foreach t in `r(varlist)' {
>> g `v'`n' = `t' if strpos(`t', "`sym'")
>> loc ++n
>> } //end t loop
>> } //end v loop
>> //get rid of empty vars:
>> drop tweet?? tweet?
>> order topic* recipient*
>>
>> sortrows topic* , replace missing
>> sortrows recipient* , replace missing
>> dropmiss, force
>>
>> **probably want to reshape at some pt:
>> g id = _n
>> order id
>> reshape long topic recipient , i(id) j(tweet_num)
>> compress
>> order id username rt topic reci
>>
>> *****************!
>
>
> On Jun 12, 2011, at 2:39 PM, Richard Fairbanks wrote:
>
>>> I'm preparing a dataset of ~ 2,000 tweets (Twitter messages) for social
>>> network analysis. I'm trying to track who tweeted to whom and the theme
>>> (hashtag) of the message.
>>>
>>> Observations of the single variable look like this.
>>>
>>> *@RndmUsername* I'm having a great time at #Ibiza! #summer2011 RT
>>> @SomeOtherPerson15 @YetAnotherPerson
>>>
>>> For those unfamiliar with Twitter:
>>>
>>> @[Name] - Username of the person sending the tweet. Must be 20 characters or
>>> less, including letters and / or integers in any position.
>>>
>>> RT - "re-tweet" - Think of this like an email "Forward" option for tweets.
>>> No help needed here, just making a dummy variable!
>>>
>>> #[Name] - "hashtag" - An arbitrary code in letters and integers specifying
>>> the topic or adding commentary
>>>
>>> Subsequent @[Name]s - These are people to whom the message is specifically
>>> directed.
>>>
>>> I know how to generate a new variable that contains the message sender
>>> (always the first string after the "@" character) using regular expressions,
>>> although there's probably a simpler way.
>>>
>>> How can I generate a new variable that contains #[Names] and @[Names] after
>>> the first case of a username or hashtag? (That is, using the example, I'm
>>> having trouble extracting #summer2011, @SomeOtherPerson15 and
>>> @YetAnotherPerson.)
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>