Archive for the 'dissertation' category

Solution to my Twitter API - twitteR issues

With lots of help from Bob O’Hara (thank you!), I was able to solve my problems. I am looking at the tweets around #AGU10 but it occurred to me that I wanted to know what other tweets the AGU twitterers were sending while at the meeting because some might not have had the hashtag.

Here goes:

# Load the twitteR package, then get the timeline
library(twitteR)
person <- userTimeline("person", n=500)

# Check to see how many you got
length(person)

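# Check to see if that is far enough back (look at the oldest tweet returned;
# the list is not always exactly 500 long, so index by its actual length)
person[[length(person)]]$getCreated()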

# Get the time it was tweeted
Time = sapply(person,function(lst) lst$getCreated() )

# Get screen name
SN = sapply(person,function(lst) lst$getScreenName() )

# Get any reply to screen names
Rep2SN = sapply(person,function(lst) lst$getReplyToSN())

# Get the text
Text = sapply(person,function(lst) lst$getText())

# fix the date from number of seconds to a human readable format
TimeN <- as.POSIXct(Time,origin="1970-01-01", tz="UTC")

# replace the blanks with NA
Rep2SN.na <- sapply(Rep2SN, function(str) ifelse(length(str)==0, NA, str))

# make it into a data frame
Data.person <- data.frame(TimeN=TimeN, SN=SN, Rep2SN.na=Rep2SN.na, Text=Text)

# save it out to csv
write.csv(Data.person, file="person.csv")

 

I did this by finding and replacing person with the screen name in a text editor and pasting the result into the script window in Rcmdr. I found that 500 was rarely enough. For some people I had to request up to 3200 tweets, which is the maximum. I had to skip one person because 3200 didn’t get me back to December. It’s also worth noting the length() step. It turns out that when you ask for 500 you sometimes get 550 and sometimes get 450 or anywhere in between, and it’s not because there aren’t any more. You may also wonder why I wrote the whole thing out to a csv file. I could have added a step to cut out the more recent and older tweets and keep just the set I wanted for more operations within R, but I need to do qualitative content analysis on the tweets and I plan to do that in NVivo 9.
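The find-and-replace step and the trimming step could also be folded into two small functions. What follows is only a sketch under assumptions: timeline_to_df, keep_window and fake_status are made-up names, the fake statuses stand in for what userTimeline() returns so the example runs without network access, and the window dates are placeholders rather than the dates I actually used.

```r
# Hypothetical helper: turn a list of status objects (as returned by
# userTimeline) into a data frame shaped like Data.person in the script above.
timeline_to_df <- function(statuses) {
  created <- vapply(statuses, function(s) as.numeric(s$getCreated()), numeric(1))
  reply_to <- vapply(statuses, function(s) {
    r <- s$getReplyToSN()
    # empty reply-to becomes NA, as in the Rep2SN.na step
    if (length(r) == 0) NA_character_ else as.character(r)
  }, character(1))
  data.frame(
    TimeN = as.POSIXct(created, origin = "1970-01-01", tz = "UTC"),
    SN = vapply(statuses, function(s) s$getScreenName(), character(1)),
    Rep2SN.na = reply_to,
    Text = vapply(statuses, function(s) s$getText(), character(1)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: keep only tweets created inside [start, end).
keep_window <- function(tweets, start, end) {
  start <- as.POSIXct(start, tz = "UTC")
  end <- as.POSIXct(end, tz = "UTC")
  tweets[tweets$TimeN >= start & tweets$TimeN < end, , drop = FALSE]
}

# Stand-in for a twitteR status object, only so the sketch runs offline.
fake_status <- function(created, sn, reply_to, text) {
  list(getCreated = function() created,
       getScreenName = function() sn,
       getReplyToSN = function() reply_to,
       getText = function() text)
}

statuses <- list(
  fake_status(as.POSIXct("2010-12-14 09:30:00", tz = "UTC"), "person",
              "friend", "@friend made-up tweet during the window"),
  fake_status(as.POSIXct("2010-11-20 12:00:00", tz = "UTC"), "person",
              character(0), "made-up older tweet"),
  fake_status(as.POSIXct("2011-01-15 18:45:00", tz = "UTC"), "person",
              character(0), "made-up newer tweet")
)

df <- timeline_to_df(statuses)

# Placeholder window; substitute the real start and end dates.
in_window <- keep_window(df, "2010-12-06", "2010-12-24")
```

With the real package loaded, the call would presumably look like timeline_to_df(userTimeline("person", n=500)), run once per screen name in a loop instead of once per find-and-replace.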

I didn’t do this for all 860, either. I did it for the 30 or so who tweeted 15 or more times with the hashtag. I might expand that to 10 or more (17 more people). Also, I didn’t keep the organizational accounts (like theAGU).

With that said, it’s very tempting to paste all of these data frames together, remove the text and do the social network analysis using igraph. Even cooler would be to show an automated display of how the social network changes over time. Are there new connections formed at the meeting (I hope so)? Do the connections formed at the meeting continue afterward? If I succumb to the temptation, I’ll let you know. There’s also the text mining package and plugin for Rcmdr. This post gives an idea of what can be done with that.
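A minimal sketch of the pasting-together step, using base R only. Data.alice and Data.bob are invented data frames in the same shape as Data.person, and the names all_tweets, edges and edges_by_day are made up for illustration.

```r
# Two made-up per-person data frames shaped like Data.person
Data.alice <- data.frame(
  TimeN = as.POSIXct(c("2010-12-14 10:00:00", "2010-12-16 11:00:00"), tz = "UTC"),
  SN = "alice",
  Rep2SN.na = c("bob", NA),
  Text = c("@bob made-up reply", "made-up note"),
  stringsAsFactors = FALSE
)
Data.bob <- data.frame(
  TimeN = as.POSIXct(c("2010-12-15 10:05:00", "2010-12-20 08:00:00"), tz = "UTC"),
  SN = "bob",
  Rep2SN.na = c("alice", NA),
  Text = c("@alice made-up reply", "made-up note"),
  stringsAsFactors = FALSE
)

# Paste the data frames together
all_tweets <- rbind(Data.alice, Data.bob)

# Drop the text and keep only rows that are replies to someone
edges <- all_tweets[!is.na(all_tweets$Rep2SN.na), c("SN", "Rep2SN.na", "TimeN")]
names(edges) <- c("from", "to", "time")

# One edge list per day, as a starting point for looking at change over time
edges_by_day <- split(edges[, c("from", "to")], as.Date(edges$time, tz = "UTC"))

# If igraph is installed, the edge list could be handed to it like this:
# g <- igraph::graph_from_data_frame(edges, directed = TRUE)
```

This only captures replies, because that is all the Rep2SN.na column holds; mentions buried in the text would need to be pulled out of Text before it is dropped.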



My ongoing struggle with the Twitter API, R, … copy paste

I’m posting this in hopes that someone with experience in any/all of the above or maybe Perl, can point out that I’m doing something stupid or have overlooked something obvious. If nothing else, you might read this to see what not to try.

Here’s the issue: it’s totally obvious that I need to look at the other tweets that were sent by #agu10 tweeters (the ones not marked with the hashtag) if I want to understand how Twitter was used at the meeting. But it’s now five months later and there are 860 of them (although I would be fine with looking at the most prolific non-institutional tweeters).

I first looked at the Twitter API. By just adding terms to URLs I could get the recent timeline for one user at a time, but I couldn’t see a way to get a user’s timeline for a set period of time (the conference time period plus a week or so on each end).

I asked two experts and they both said that you couldn’t combine the user timeline with a time period.

Darn. So my next idea was to see if I could actually access someone’s timeline that far back through the regular interface. I tried one of the more prolific tweeters and I could. Ok, so if I can pull down all of their tweets, then I could pick out the ones I wanted. Or, even better, I could also look at the evolution of the social network over time. Did people meet at the meeting and then continue to tweet at each other or are these people only connected during the actual meeting?  Did the network exist in the same way before the meeting?

I was looking for ways to automate this a bit and I noticed that there were things already built for Perl and for R. I used Perl with a lot of handholding to get the commenter network for an earlier paper and I used R for both that same article and in lieu of STATA for my second semester of stats. I’m not completely comfortable with either one and I don’t always find the help helpful. I decided to start with R.

The main package is twitteR by Jeff Gentry. I updated my installation of R and installed and loaded that package and the dependencies. First thing I did was to get my own standard timeline:

testtweets <- userTimeline("cpikas")

Then I typed out the first few to see what I got (like when you’re using DIALOG)

testtweets[1:5]

And I saw my tweets in the format:

[[1]]

[1] "username: text"

I checked the length of that and got 18 – the current timeline was 18 items. I tried the same thing substituting user id but that didn’t work. So then I tried to retrieve 500 items and that worked fine, too.

testlonger <- userTimeline("cpikas", n=500)

Great. Now, let me see the dates so I can cut off the ones I want. Hm. Ok, let’s see, how to get the other columns. What type of object is this anyhow? The manual is no help. I tried some things with object$field. No joy. Tried to edit. No joy: it was upset about the < in the image url. And it was also telling me that the object was of type S4. The manual said it wasn’t but I can’t argue if that’s what it’s reading. I somehow figured out it was a list. I tried objectname[[1]][2] and got NULL. Then I eventually tried

str(testtweets)

Hrumph. It says 1 slot. So as far as I can tell, it’s a deprecated object type and it didn’t retrieve or keep all of the other information needed to narrow by date.

When googling around, I ran across this entry by Heuristic Andrew on text mining twitter data with R. I didn’t try his method with the xml package yet (may try that). I did try the package that was listed in the comments tm.plugin.webcorpus by Mario Annau. That does get the whole tweet and put the things in slots the right way (object$author), but it looks like you can only do a word search. Oh wait, this just worked:

testTM <- getMeta.twitter('from:cpikas')

But that’s supposed to default to 100 per page, 100 things returned and it only returned 7 for me. I guess the next thing to try is the XML version unless someone reading this has a better idea?

Edit: forgot the copy paste. When I tried to just look at the tweets I wanted on screen and then copy them into a text document, it crashed Firefox. Who knows why.



Another little teaser on AGU10 tweets: NASA

I’m just starting to analyze the tweets from the American Geophysical Union Fall Meeting from December. These are just the ones with the hashtag #agu10 that were kept in a TwapperKeeper account. The first of these teasers is in the previous post at: http://scientopia.org/blogs/christinaslisrant/2011/04/05/an-early-image-of-the-agu10-twitter-archive/

 

We saw in that last picture how many tweets were tweeted to or mentioning theAGU and NASA. I was wondering what the circumstances were. Turns out that 257 (at least) of the 264 tweets at or mentioning NASA were re-tweets of their press release tweets. In fact, there wasn’t a ton of diversity in what was retweeted.

Number of tweets    Identifying phrase
113                 “loss of ice”
34                  “April Mexico quake”
31                  “mars opportunity”
27                  “some big NASA science”
19                  “creeping faults in Bay area”
18                  “more NASA science”
8                   “electric atmosphere” video
7                   “how hard are we pushing”

Everyone was retweeting the same few stories. I suspect from the graph that these folks weren’t tweeting anything else from the meeting. Were they even there or are they just NASA fans?
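A tally like the one above could be sketched in R along these lines. The tweets here are invented and the variable names are made up; this is not how the counts in the table were produced, just one way such a count might be done.

```r
# Invented tweets, only to show the counting step
tweets <- c(
  "RT @NASA: made-up tweet about loss of ice",
  "RT @NASA: another made-up tweet, Loss of Ice again",
  "RT @NASA: made-up tweet about the Mars Opportunity rover",
  "made-up tweet that matches nothing"
)

# Identifying phrases, in lower case
phrases <- c("loss of ice", "mars opportunity")

# Count the tweets containing each phrase, ignoring case
counts <- vapply(
  phrases,
  function(p) sum(grepl(p, tolower(tweets), fixed = TRUE)),
  integer(1)
)
```

Matching on a short phrase is crude: a tweet that quotes two stories would be counted under both phrases, so the totals are worth checking against the number of tweets.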



An early image of the AGU10 twitter archive

I used TwapperKeeper to capture the AGU10 twitter archive. TwapperKeeper via Summarizr gives some general stats but I was more curious about the connections. At first I thought I could take the from and to columns directly from the export and put them into an SNA package, but alas, the to field only covered tweets that started with @. So that leaves out all of the RT @ messages as well as the mentions where the @ is embedded somewhere in the text. I was despairing a little bit about it, and even got ready to pull out the Perl and regex, but my dear husband was like, why not do text to columns at the @ symbol? Well, why not indeed? So this dataset only has one @ per tweet in it: if more than one person was @-ed, only the first is pulled out right now. I might do something different later.
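For comparison, the same first-@ extraction could be sketched in R with a regular expression instead of text to columns. The tweets are invented, and the pattern assumes screen names are made of letters, digits and underscores.

```r
# Invented tweets covering the cases described above
tweets <- c(
  "@theAGU made-up tweet that starts with a mention",
  "RT @NASA: made-up retweet",
  "made-up tweet with @someone and @another embedded",
  "made-up tweet with no mention"
)

pattern <- "@[A-Za-z0-9_]+"

# regexpr finds only the first match in each tweet, which mirrors
# the one-@-per-tweet limitation of the text-to-columns approach
m <- regexpr(pattern, tweets)
has_mention <- as.integer(m) != -1L

first_mention <- rep(NA_character_, length(tweets))
first_mention[has_mention] <- sub("^@", "", regmatches(tweets, m))

# gregexpr would pull out every mention instead of just the first
all_mentions <- lapply(
  regmatches(tweets, gregexpr(pattern, tweets)),
  function(x) sub("^@", "", x)
)
```

Here first_mention lines up with the from column row by row, while all_mentions is a list with one element per tweet, which would need to be unrolled into one row per from-to pair before it goes into an SNA package.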

Anyhow, so I took that and pasted it into NodeXL, an add-in for Excel 2007 that does SNA. But I was sort of having trouble working the visualization, mostly my inexperience probably. So I exported from there in DL format, imported into UCINET and then opened in NetDraw. There’s lots to see and do yet, but I thought this little bit was interesting:

[Image: agu10 mentions and replies, largest component (781 nodes), sized by in-degree]

This is the same license as the rest of my blog (cc-by), but it’s just a first pass so you might want to keep that in mind if you want to redistribute.

This is the largest component (components are pieces of the graph that are connected to each other but not the rest of the graph). It has 781 nodes. The rest of the components are like 3-5 nodes on average. The nodes are sized by inDegree (how many people tweeted @ them with the agu10 hashtag). What I find interesting about this is the role of institutional bloggers. Only one of the labels is clear but the two largest nodes are NASA, top, and theAGU, bottom. The medium sized one above NASA is NASAjpl. It’s interesting about the institutional bloggers, but also that they really seem to cluster in two camps. Not that many people tweeted @ both.

Certainly, I’m curious about what’s in common with the people in one camp or the other and what the content of the messages is. But this is an extremely early look.

UPDATE: Upon further inspection it became clear that there was an issue with upper and lower case - Twitter isn't sensitive, but my SNA packages are. Nothing I've said above really changes, there are just additional nodes connected to NASA and theAGU.
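A small sketch of the case fix, with an invented edge list: lower-casing both columns before counting makes "NASA" and "nasa" the same node. The column names from and to and the variable names are assumptions for illustration.

```r
# Invented edge list with inconsistent capitalization
edges <- data.frame(
  from = c("a", "b", "A", "c"),
  to = c("NASA", "nasa", "NASA", "theAGU"),
  stringsAsFactors = FALSE
)

# Before the fix, "NASA" and "nasa" are counted as different nodes
nodes_before <- length(unique(edges$to))

# Normalize case, then drop duplicate from-to pairs
edges$from <- tolower(edges$from)
edges$to <- tolower(edges$to)
edges <- unique(edges)

# In-degree: how many distinct people tweeted @ each account
in_degree <- sort(table(edges$to), decreasing = TRUE)
nodes_after <- length(in_degree)
```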

