Archive for the 'social computing technologies' category

Using R TwitteR to Get User Information

I'm gonna keep stating the obvious, because this took me a few hours to figure out. Maybe not working continuously, but still.

So, I have more than 6,000 tweets from one year of AGU alone, so I'm gonna have to sample somehow. Talking this over with my advisor, he suggested that we find some reasonable way to stratify and then sample randomly within the strata. I haven't worked all the details out yet - or really any of them - but I started gathering user features I could base the decision on. Number of tweets with the hashtag was super quick in Excel. But I was also wondering whether they were new to Twitter, whether they tweeted a lot, and whether they had a lot of followers. That's all available through the API using the twitteR package by Jeff Gentry. Cool.

So getUser() is the function to use. I made a list of the unique usernames in Excel and imported that in. Then I went to loop through them.


library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
 data USERdata<-vector()
userInfo<-function(USER){
 temp<-getUser(USER, cainfo="cacert.pem")
 USERdata<-c(USER,temp$created,temp$statusesCount,temp$followersCount)
 return(USERdata)
 }
 #test for users 4-6
 tweeter.info<-sapply(data$user[4:6],userInfo)

But that came out sorta sideways... sapply gave me a column for each user instead of a row... sorta weird. Bob O'H helped me figure out how to transpose it, and I did, but the result was still a little awkward.
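For the record, the transpose step was roughly this (a sketch):

tweeter.info.t <- as.data.frame(t(tweeter.info), stringsAsFactors = FALSE)  # flip columns to rows
colnames(tweeter.info.t) <- c("username", "created", "posts", "followers")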

So then I tried this way:

get.USER.info <- function(startno, stopno){
  # set up the vectors first
  n <- stopno - startno + 1
  USER.name <- character(n)
  USER.created <- numeric(n)
  USER.posts <- numeric(n)
  USER.foll <- numeric(n)
  for (i in startno:stopno) {
    j <- i - startno + 1   # index into the result vectors (so startno > 1 works)
    thing <- getUser(data$user[i], cainfo = "cacert.pem")
    USER.name[j] <- data$user[i]
    USER.created[j] <- thing$created
    USER.posts[j] <- thing$statusesCount
    USER.foll[j] <- thing$followersCount
  }
  return(data.frame(username = USER.name, created = USER.created,
                    posts = USER.posts, followers = USER.foll,
                    stringsAsFactors = FALSE))
}

So that was cool, until it wasn't. It turns out that about 2% of the users have deleted their accounts, blocked me, or gone private or something, so getUser() threw an error the loop couldn't recover from. I tried to test for is.null() and is.na(), but that failed too...
So then I went back to the mailing list, and there was a suggestion to use try(), but eek.
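In hindsight, try() would have looked roughly like this - a sketch that wraps the userInfo() idea from above so a missing account yields NAs instead of killing the run:

userInfo.safe <- function(USER){
  temp <- try(getUser(USER, cainfo = "cacert.pem"), silent = TRUE)
  if (inherits(temp, "try-error")) {
    return(c(USER, NA, NA, NA))  # deleted/protected account: keep a row of NAs
  }
  c(USER, temp$created, temp$statusesCount, temp$followersCount)
}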
So then I noticed that if you have a pile of users to look up, you're actually supposed to use
lookupUsers(users, includeNA=FALSE, ...)
And I did, and I wanted to keep the NAs so that I could align the results with my other data later... but once again, I couldn't figure out what to do with the NAs once I had them. And the result is an object that's a pile of lists... which I was having trouble wrapping my little mind around (others have no issues).
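For future me, the keep-the-NAs route would presumably look something like this - a sketch, assuming the not-found users come back as literal NA entries and that the package's user class is named "user":

all.users <- lookupUsers(data$user, includeNA = TRUE, cainfo = "cacert.pem")
found <- sapply(all.users, is, "user")        # FALSE for the NA placeholders
names(all.users)[!found]                      # the deleted/protected/missing accounts
found.df <- twListToDF(all.users[found])      # data frame for everyone else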
So I went back and used the command again, and this time told it to skip the NAs (the not-found users). Then - I think from the mailing list, or maybe from Stack Overflow? - I had gotten the idea to use unlist. So here's what I did:
easy.tweeters.noNA <- lookupUsers(data$user, cainfo = "cacert.pem")
# check how many fewer this was
length(easy.tweeters.noNA)
# 1247, so there were 29 accounts missing, hrm
testbigdf <- data.frame()
for (i in 1:length(easy.tweeters.noNA)) {
  holddf <- twListToDF(easy.tweeters.noNA[i])   # [i] keeps it a one-element list
  testbigdf <- rbind(testbigdf, holddf)
}

And that created a lovely data frame with all kinds of goodies in it. I guess I'll have to see what I want to do about the 29 missing accounts.
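(To find out which 29, something like this should work - a sketch; screenName is the column twListToDF produces, and case can differ, hence the tolower():)

missing <- setdiff(tolower(data$user), tolower(testbigdf$screenName))
length(missing)  # should be 29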

I really would have been happier if it had handled the users that weren't found more gracefully.

Also, note that for every single command you have to use the cainfo="cacert.pem" thingy... Every time, every command.

ALSO, I had figured out OAuth, but the Twitter address went from http:// to https://, so that broke, but I fixed it. I hope I don't have to reboot my computer soon! (Yeah, I saved my credentials to a file, but I don't know...)

No responses yet

Using R to expand Twitter URLs

So this should be so simple and obvious that it's not worth a post, but I keep forgetting how to do everything, so I'm gonna put this here to remind myself.

Here's where I am. I have a list of 4,011 tweets with the #agu12 or #agu2012 hashtag. A lot of these are coded as "pointers" - their main function is to direct readers' attention somewhere else. So I got to wondering: where? Are they directing people to online versions of the posters? Are they just linking to more NASA press releases? What percentage goes to a .edu?

Of course all the URLs are shortened, and there are services you can use to expand them, but in R it's already right there in the twitteR package as

decode_short_url

This uses the longapi.org API. All you have to do is plug in the URL. Cool!

So here was my original plan: find the tweets with URLs, extract the URLs, expand them, profit! And I was going to do all of this in R. But then it got a little ridiculous.
So instead I used Open Refine to find all the URLs, assigned IDs to all the records, and then used filtering and copying and pasting to get them all into two columns: ID, URL.

Issues: non-printing characters (Excel has a CLEAN command), extra spaces (TRIM didn't really work, so I did a find and replace), random commas (some needed to be there), random other punctuation (find and replace), and # signs.

The idea in R was to do a for loop iterating through each URL, expand it, append it to a vector (or concatenate, whichever), then add that to the data frame and do stats on it, or maybe just export to Excel and monkey with it there.

The for loop was fine; the append, not for love nor money, despite the fact that I have proof I successfully did this in my Coursera class. I don't know. And the API was failing for certain rows. For a couple of rows, I found more punctuation. Then I figured out the rest of the issues were really all about length - they don't expect shortened URLs to be long (duh)! So I had to pick a length and only send URLs shorter than that (50 characters) to the API. I finally gave up on the stupid append, and I just printed the results to the screen and copied them over to Excel. Also, I cheated on how long the for loop had to be - I should have been able to just use the number of rows in the frame, but meh.
Anyhow, this worked:

 setwd("~/ mine")
library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("agu12justurl.csv", colClasses = "character")
#check it out
head(data)
str(data)
#test a single one
decode_short_url(data$url[2])
#this was for me trying to append, sigh
full.vec <- vector(mode="character")
#create a vector to put the new stuff in, then I'll append to the data frame, I hope
#check the for loop 
 for (i in 1:200){print(data.sub$url[i])}
#that works
for (i in 1:3){print(decode_short_url(data.sub$url[i]))}
#that works - good to know, though, that if it can't be expanded it comes back null

#appending to the vector is not working, but printing is so will run with that 
for (i in 1:1502){ if(nchar(data$url[i])>50){
 urlhold<-data$url[i]
 } else {
 urlhold<-decode_short_url(data$url[i])
 }
 print(urlhold)
 #append(full.vec,urlhold)
 }

If anyone wants to tell me what I'm doing wrong with the append, it would be appreciated. I'm sure it must be obvious.
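edit: I think the answer is that append() returns a new vector rather than modifying full.vec in place, so the result has to be assigned back. A sketch of the loop with that fix (and the row count done properly):

full.vec <- character(0)
for (i in 1:nrow(data)) {                 # no more hard-coded 1502
  if (nchar(data$url[i]) > 50) {
    urlhold <- data$url[i]
  } else {
    urlhold <- decode_short_url(data$url[i])
  }
  if (is.null(urlhold)) urlhold <- NA     # unexpandable URLs come back NULL
  full.vec <- append(full.vec, urlhold)   # the fix: assign the result back
}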

So what's the answer? Not sure. I'll probably do a post on string splitting and counting... OR I'll be back in Open Refine. How do people only ever work in one tool?
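For reference, the string-splitting-and-counting bit would be something like this (a sketch; full.vec assumed to hold the expanded URLs from the fixed loop above):

domains <- sub("^https?://([^/]+).*$", "\\1", full.vec)  # pull out the domain
head(sort(table(domains), decreasing = TRUE))            # most common destinations
mean(grepl("\\.edu$", domains))                          # share going to a .edu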

3 responses so far

Keeping up with a busy conference - my tools aren't doing it

I wrote about trying to use twitteR to download AGU13 tweets, and I'm getting fewer and fewer tweets back with each call. So I was very excited to try Webometric Analyst from Wolverhampton, described by Kim Holmberg in his ASIST webinar (BIG pptx, BIG wmv).

One of the things Webometric Analyst will do is repeat a search over and over until you tell it to stop. This was very exciting. But I tried it and, alas, I think Twitter has decided I'm abusive or something, because it was way throttled. I could see the tweets flying up on the screen at twitter.com, but the search was retrieving like 6. I ran the R search mid-day today and got 99 tweets back, which covered 5 minutes O_o. I asked for up to 2,000 from the whole day and had it set to retry if stopped.
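For reference, the R search was roughly this (the dates here are hypothetical; the parameters are twitteR's searchTwitter ones):

tweets <- searchTwitter("#agu13", n = 2000,
                        since = "2013-12-09", until = "2013-12-10",
                        retryOnRateLimit = 120,
                        cainfo = "cacert.pem")
length(tweets)  # 99, covering about 5 minutes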

Sigh.

No responses yet

The #agu12 and #agu2012 Twitter archive

I showed a graph of the agu10 archive here, and more recently the agu11/agu2011 archive here; now for the agu12/agu2012 archive. See the 2011 post for the exact methods used to get and clean the data.

#agu12 and #agu2012 largest component, nodes sized by degree

#agu12 and #agu2012 other components, no isolates, nodes sized by degree (n=1294)

I will have to review methods to show this properly, but from appearances the networks are becoming more like hairballs. In the first year, half the people were connected to theAGU and the other half were connected to NASA, but very few were connected to both, and the other prominent nodes were pretty much all institutional accounts. In 2011 that started to change, and now in 2012 you can't really see that division at all. The top three nodes are still there - two the same as before, plus a NASA robotic mission - but now there's a large second group of individual scientists with degrees (connections to others, combined indegree and outdegree) around 40-80.

2 responses so far

An image of the #agu2011, #agu11 Twitter archive

A loooong time ago I showed the agu10 archive as a graph; here's the same for the combination of agu11 and agu2011. I mentioned already the upper/lower case issues (Excel is oblivious, but my graphing program cares) - this is all lower case (I first tried to correct case by hand but kept missing things, so I just used Excel's =LOWER()). I also discussed how I got the data. I'm probably going to have to go back and redo 2010 if I really want equivalent images, because 1) I only kept the first @ (this has all the @s), and 2) I don't believe I did both agu2010 and agu10, so I probably missed some. For this image I did a little bit of correcting: one Twitter name was spelled wrong, and quite a few people used the_agu or agu instead of theagu. I also took out things like @10am or @ the convention center.

I made this graph by taking my Excel spreadsheet, which was nicely laid out as username, first @, second @, and so on, copying that into Ucinet's DL editor, and saving it as nodelist1. Then I visualized and did basic analysis in NetDraw.
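(For anyone who'd rather stay in R, roughly the same step could be done with igraph - a sketch, not what I actually used; edges.csv is a hypothetical two-column tweeter, mentioned edge list:)

library(igraph)
edges <- read.csv("edges.csv", colClasses = "character")
g <- graph_from_data_frame(edges, directed = TRUE)
V(g)$size <- degree(g, mode = "all")  # size nodes by combined in+out degree
comps <- components(g)
big <- induced_subgraph(g, which(comps$membership == which.max(comps$csize)))
plot(big)  # the largest component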

agu2011 and agu11 largest component, sized by degree

The largest component is 559 nodes of 740, and this time you don't see that split where the people who tweeted @NASA didn't tweet @theAGU. There were 119 isolates and other components with 2, 3, and 10 nodes:

Other components, sized by degree (no isolates)

eta: oh yeah, one other little fix. I took out random punctuation at the end of user names, like "hi @cpikas!" or "hey @cpikas:" or... well, you get the idea.

No responses yet

New, now scientists can use blogs to talk to other scientists about science!

I collect articles on scientists using blogs and Twitter - mostly because it's relevant to my dissertation, but also because I find them interesting. You can see a listing here: http://www.delicious.com/cpikas/meta_science_blogging (it used to be displayed on my UM page, but that broke in the transition).

So one of these articles, which I saw tweeted by about five people at the same time, is Wolinsky, H. (2011). More than a blog. EMBO Reports, 12, 1102-1105. doi:10.1038/embor.2011.201

Of course it starts with the arsenic life discussion. It talks about the immediacy of the blog reaction and the tone of the discussion on the blogs.  Overall a nice article.

I think the subtitle of the piece is unfair, though. It acts like the title of this post, when the article itself is more about where blogs have evolved to right now. There are a lot of differing experiences with blogs and differing uses, some of which have always included talking shop.

4 responses so far

Solution to my Twitter API - twitteR issues

With lots of help from Bob O'Hara (thank you!), I was able to solve my problems. I am looking at the tweets around #AGU10, but it occurred to me that I wanted to know what other tweets the AGU twitterers were sending while at the meeting, because some might not have had the hashtag.

Here goes:

# Get the timeline
person <- userTimeline("person", n = 500)

# Check to see how many you got
length(person)

# Check to see if that is far enough back
person[[500]]$getCreated()

# Get the time each one was tweeted
Time <- sapply(person, function(lst) lst$getCreated())

# Get the screen name
SN <- sapply(person, function(lst) lst$getScreenName())

# Get any reply-to screen names
Rep2SN <- sapply(person, function(lst) lst$getReplyToSN())

# Get the text
Text <- sapply(person, function(lst) lst$getText())

# fix the date from number of seconds to a human-readable format
TimeN <- as.POSIXct(Time, origin = "1970-01-01", tz = "UTC")

# replace the blanks with NA
Rep2SN.na <- sapply(Rep2SN, function(str) ifelse(length(str) == 0, NA, str))

# make it into a data frame
Data.person <- data.frame(TimeN = TimeN, SN = SN, Rep2SN.na = Rep2SN.na, Text = Text)

# save it out to csv
write.csv(Data.person, file = "person.csv")


So I did this by finding and replacing person with each screen name in a text editor and pasting the result into the script window in Rcmdr. I found that 500 was rarely enough; for some people I had to request up to 3,200 tweets, which is the maximum. I had to skip one person because 3,200 didn't get me back to December. It's also worth noting the length() step: it turns out that when you ask for 500, you sometimes get 550 and sometimes 450 or anywhere in between, and it's not because there aren't any more. You may also wonder why I wrote the whole thing out to a CSV file. I could have added a step to cut out the more recent and older tweets, keeping just the set I want for more operations within R, but I need to do qualitative content analysis on the tweets, and I plan to do that in NVivo 9.
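(If I ever do want to trim in R instead, it would be one more step - hypothetical dates here, the meeting window plus about a week on each end:)

keep <- Data.person$TimeN >= as.POSIXct("2010-12-06", tz = "UTC") &
        Data.person$TimeN <= as.POSIXct("2010-12-24", tz = "UTC")
Data.window <- Data.person[keep, ]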

I didn't do this for all 860 tweeters, either. I did it for the 30 or so who tweeted 15 or more times with the hashtag. I might expand that to 10 or more times (17 more people). Also, I didn't keep the organizational accounts (like theAGU).

With that said, it's very tempting to paste all of these data frames together, remove the text, and do the social network analysis using igraph. Even cooler would be an automated display of how the social network changes over time. Are there new connections formed at the meeting (I hope so)? Do the connections formed at the meeting continue afterward? If I succumb to the temptation, I'll let you know. There's also the tm text-mining package and its plugin for Rcmdr; this post gives an idea of what can be done with that.
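If I do succumb, the start would look something like this - a sketch, where per.person is a hypothetical list holding each of the per-person data frames built above:

library(igraph)
all.tweets <- do.call(rbind, per.person)
edges <- na.omit(all.tweets[, c("SN", "Rep2SN.na")])  # reply edges, NAs dropped
g <- graph_from_data_frame(edges, directed = TRUE)    # directed reply network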

2 responses so far

My ongoing struggle with the Twitter API, R, … copy paste

I'm posting this in hopes that someone with experience in any or all of the above (or maybe Perl) can point out that I'm doing something stupid or have overlooked something obvious. If nothing else, you might read this to see what not to try.

Here's the issue: it's totally obvious that I need to look at the other tweets sent by #agu10 tweeters (the ones not marked with the hashtag) if I want to understand how Twitter was used at the meeting. But it's now five months later, and there are 860 of these tweeters (although I would be fine with looking at just the most prolific non-institutional ones).

I first looked at the Twitter API directly: I tried just adding terms to URLs and got the recent timeline for one user at a time, but I couldn't see a way to get a user's timeline for a set period of time (the conference time period plus a week or so on each end).

I asked two experts and they both said that you couldn’t combine the user timeline with a time period.

Darn. So my next idea was to see if I could actually access someone's timeline that far back through the regular interface. I tried one of the more prolific tweeters, and I could. OK, so if I can pull down all of their tweets, then I can pick out the ones I want. Or, even better, I could also look at the evolution of the social network over time. Did people meet at the meeting and then continue to tweet at each other, or are these people only connected during the actual meeting? Did the network exist in the same way before the meeting?

I was looking for ways to automate this a bit, and I noticed that there were things already built for Perl and for R. I used Perl with a lot of handholding to get the commenter network for an earlier paper, and I used R both for that same article and in lieu of Stata for my second semester of stats. I'm not completely comfortable with either one, and I don't always find the help helpful. I decided to start with R.

The main package is twitteR by Jeff Gentry. I updated my installation of R and installed and loaded the package and its dependencies. The first thing I did was get my own standard timeline:

testtweets <- userTimeline("cpikas")

Then I typed out the first few to see what I got (like when you're using DIALOG):

testtweets[1:5]

And I saw my tweets in the format:

[[1]]

[1] "username: text"

I checked the length of that and got 18 - the current timeline was 18 items. I tried the same thing substituting the user ID, but that didn't work. So then I tried to retrieve 500 items, and that worked fine too.

testlonger <- userTimeline("cpikas", n = 500)

Great. Now let me see the dates so I can cut off the ones I want. Hm. OK, let's see, how do I get at the other columns? What type of object is this, anyhow? The manual is no help. I tried some things with object$field. No joy. Tried to edit it - no joy; it was upset about the < in the image URL. And it was also telling me that the object was of type S4 (the manual said it wasn't, but I can't argue if that's what it's reading). I somehow figured out it was a list. I tried testtweets[[1]][2] - NULL. Then I eventually tried

str(testtweets)

Hrumph. It says 1 slot. So as far as I can tell, it's a deprecated object type, and it didn't retrieve or keep all of the other information needed to narrow by date.
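(As the solution post above shows, the fields were actually in there all along - they just come out via accessor methods on each list element:)

testtweets[[1]]$getCreated()
testtweets[[1]]$getScreenName()
testtweets[[1]]$getText()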

While googling around, I ran across this entry by Heuristic Andrew on text mining Twitter data with R. I didn't try his method with the XML package yet (I may). I did try the package that was listed in the comments, tm.plugin.webcorpus by Mario Annau. That does get the whole tweet and puts things in slots the right way (object$author), but it looks like you can only do a word search. Oh wait, this just worked:

testTM <- getMeta.twitter('from:cpikas')

But that's supposed to default to 100 per page, 100 things returned, and it only returned 7 for me. I guess the next thing to try is the XML version, unless someone reading this has a better idea?

edit: I forgot the copy-paste part. When I tried to just look at the tweets I wanted on screen and copy them into a text document, it crashed Firefox. Who knows why.

18 responses so far

Blogs are not dead yet

Various assorted pundits have been heralding the death of the blog as a science communication medium for at least five years, probably longer. Blogs aren't dead; indeed, as far as I can tell, they are now in a revival period in which their true utility and value are becoming more obvious.

This blog post was prompted by a post on the Scholarly Kitchen in which a blogging scientist (or science-trained publisher) blogs about how scientists don't blog (again). David Crotty titled his post Not With A Bang: The First Wave of Science 2.0 Slowly Whimpers to an End. Crotty views the attempted monetization of the science blogosphere as the crest of the first wave. He discusses several examples of for-profit companies that exuberantly jumped into the blogosphere and other Science 2.0 ventures but have since pulled back. I would assert that the attempted monetization and commercialization of Science 2.0 is external to the movement and really a distraction from the slow-growth phase of the innovation adoption curve.

First, all of the bloggers now on a for-profit host started on wordpress.com, Blogger, or some similar service. They garnered enough interest to be attractive to a company that hoped to make money on page views. Many of the early adopters moved over from updating static websites, keeping newsletters, or participating in newsgroups and on bulletin boards. They may have continued to participate on those platforms but saved longer discussions for their blogs. Or they might have used their blogs to re-share links they would normally have put on a static website or on the young Delicious, but that weren't getting enough visibility there. This was the first wave of pioneers.

The idea that a media company could get inexpensive talent by mining the blogosphere came later. In the beginning, the primary for-profit (or for-loss, unfortunately) host was ScienceBlogs. Even at its height, ScienceBlogs was never more than a tiny part of the science blogosphere, but its limited size made it more exclusive and more watched. Others who did not know about the rich online life of scientists saw ScienceBlogs as the entire science blogosphere. Seed Media told a good story and made it look profitable, so others wanted to get in. I'm not sure about Nature, but I suspect they were clear that supporting blogs would not be a profitable effort; I think the goal for them was to support science and to get scientists to spend more of their time online looking at Nature Publishing sites. It's not important.

When some of the shine wore off, and some of the bloggers left, the rest of the blogosphere got more attention. I still feel that the rest of the blogosphere doesn’t get the attention it deserves, but as with everything people do, there’s a long tail.

In the past few months, some of the long-time bloggers went into a blogging funk (including yours truly). At the same time, additional scientists started blogs. Some bloggers went on hiatus, some quit, but others started, and some who quit earlier came back. Societies and non-profits have stepped up to support science blogging. This is a great idea as the purpose of the societies is to support science communication in their subject area.

Some of those who have discussed the death of blogs originally said that wikis would take over. If you've used a wiki, you know they are very good for certain things, but there's almost no overlap with what blogs are good for. Likewise, many people thought Twitter would replace blogs. Using Twitter can be an art form - the concise nuggets of information or questions in under 140 characters - but recently it's become more and more clear that the long form not only still has value, it is still needed: needed to provide context and to tell the whole story.

What about the lack of, or surfeit of, journalism-trained bloggers? Which is it? Does it matter? The science blogosphere has always been made up of practicing scientists, people working in some area adjacent to science with some level of science training (like librarians), and non-scientists who are interested in science. There are bloggers of each variety who communicate well and are good at telling a story. It's very welcome that a lot of the very talented science journalists have taken up blogging; for them it's not a longer form but often a shorter one. I don't think there are too many or too few, nor do I think they are any more important or valuable than the scientists who blog. Nor do I think that all members of the science blogosphere should have journalism training or strive for journalistic standards. We could all stand to write better, but we're all writers. Scientists have to write for their profession, so blogging really isn't that much of a stretch.

As for the question of culture and technology: they co-evolve. Does the science blogosphere change science or science culture? Does science culture determine which technologies will be used and how? Yes. Both. All the time. Is there a lot of inertia? Oh yes.

6 responses so far

Post I’d like to write: trolls by discipline

I noted in both my qualitative study (pdf) of science blogging and my social network analysis study (pdf) that there are more trolls in some areas of science blogging than others, and they're pretty detectable by looking at the patterns of links.

Anyhoo, it seems like although some fields get more trolls than others, each field has a unique set of trolls with different approaches. Now, I'm not talking about people who actually have substantive arguments that further the conversation; I'm talking about obnoxious people who hijack the conversation.

Actually, with that said, it would be kind of interesting to have a typology of the various flavors of pseudoscience activists and whatnot who cause hate and discontent in blog comments. You have the anti-feminists, the anti-vaxxers, etc.

Not really a troll thing, but this post about the sorts of e-mail meteorologists get is what got me started on this. Sigh.

11 responses so far
