Kindle Unlimited. Destruction? No.

(by Christina Pikas) Jul 20 2014

Amazon announced a service where, for a monthly fee (currently $9.99), you can read unlimited ebooks each month (from a stock of 600k). John Dupuis has been linking to various stories about it on Twitter. One from Vox suggests the new service might lead to libraries' destruction, what with funding issues being what they were a few years ago.

Many others have pointed out the issues with that. The first being that most libraries offer a service free (because of your tax dollars) that lets you check out unlimited books a month for just about any device. Mine limits me to 6 books at a time from one of the 3 or so services they offer and a certain number of "units" or "credits" from another of the services.

If you have the $10/month, though, this would mean no waiting, whereas at the public library you often have to wait for popular titles because of idiotic requirements from the publishers that make these services treat ebooks like print books - one user per copy at a time.

Today, John pointed to this piece by Kelly Jensen on Book Riot. Jensen is offended at people saying libraries are the Netflix for books, because libraries do so much more. Well, yes, but I just can't get offended. I've found, and a local survey has shown, that people often don't know what ebooks are, how they can get them, and what is available at their library. They do know about Amazon, but they don't know about Overdrive. Amazon is easy; Overdrive can be hard (once you get going it's very simple, though). I actually think it's helpful and useful, and not too reductionist or problematic, to refer to libraries as the Netflix for books. Emphasizing and publicizing this one small service won't cause people with small children to forget about story time, college students and professors to forget about doing research, or avid readers to forget about print books. It may, however, bring in new users or bring back users/patrons who have been too busy to come in person or who are now homebound for some reason.

What I do wonder about is licensing. There has been lots of discussion about the Big 6 and how they really, really don't want libraries to lend ebooks. Some have done stupid things like say libraries have to rebuy the book after it has been checked out 26 times or so. Others have delays or flat-out won't license ebooks to libraries. (I'm not talking about research libraries and STEM books - we can get them if we want to spend the $ and they aren't textbooks.) One big publisher gave in recently - but sort of slowly.

Amazon's in a big fight with one publisher with all sorts of shenanigans like slowing down shipping for their books. The authors are all up in arms. It's a mess. With that uneasy relationship, I really am curious about publishers participating in this program. Do they see it differently than the library products? Is it just the same ones that do license for library use? 600k books - but which 600k? Presumably the entire Project Gutenberg library is on there (see "catching up on classics") .... and some other books are featured on the home page. I don't have time to do the analysis, but I'm curious.

$10/month could add up... particularly if the catalog isn't that large. I think it sounds like a good service, but the devil is in the details. What does the catalog look like? Are people out of money to spend on entertainment with all the video downloading services and internet and data and what not? Seems like an obvious move for Amazon (they aren't the first - Oyster and Scribd have similar services). We'll see, I guess.


I'm a coding fool... use of the nuggets mentioned last post

(by Christina Pikas) Jul 18 2014

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, amo, chemical physics when I need "physics"). I needed to rank the organizations by articles published in these composite categories, each of which included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio with either 32- or 64- bit depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

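# keep the single column as a plain character vector so is.element() can match against it
physics <- read.delim("physics.txt", header=FALSE, stringsAsFactors=FALSE)$V1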

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
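countPhysics <- function(file){
  #assumed: the same read.xlsx call as in the last post, but reading the counts as text first
  physfile <- read.xlsx(file, 1, startRow=4, colClasses=c("character", "character"))
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #this is so i can make sure i just have numbers in the count field
  #assumed: strip anything that is not a digit, then convert
  physfile$Count <- as.numeric(gsub("[^0-9]", "", physfile$Count))
  #this finds matching records and then keeps just the part we want
  #one of these days i'll just import right the first time instead of this part
  for (j in seq_along(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]], physics)) {
      phys.hold[j] <- physfile$Count[[j]]
    } else {
      phys.hold[j] <- 0
    }
  }
  total <- sum(phys.hold)
  return(total)
}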

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an Excel file I was working on, but this is essentially the answer. I did the same thing for the other categories separately, but obviously I should have checked each line against each/all of my categories before moving to the next line and then output a data frame. Oh well.
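For what it's worth, the "check each line against all of my categories" version might look something like the sketch below. It's in Python rather than R, and everything in it is an assumption for illustration: the composite lists, the category names, and the rows, which stand in for the Analysis.Value and Count columns of one exported file.

```python
from collections import defaultdict

# Hypothetical composite categories; the real lists lived in files like physics.txt.
COMPOSITES = {
    "physics": {"condensed matter", "high energy", "amo", "chemical physics"},
    "chemistry": {"organic chemistry", "chemical physics"},
}

def count_composites(rows, composites=COMPOSITES):
    """Sum the article counts of one file into every composite category.

    rows is an iterable of (subject_category, count) pairs. A count may arrive
    as text with stray non-numeric characters, so only the digits are kept.
    """
    totals = defaultdict(int)
    for category, raw_count in rows:
        digits = "".join(ch for ch in str(raw_count) if ch.isdigit())
        count = int(digits) if digits else 0
        for name, members in composites.items():
            if category.strip().lower() in members:
                totals[name] += count
    # Report every composite, including the ones with no matches.
    return {name: totals[name] for name in composites}

# Made-up rows standing in for one organization's export.
example_rows = [("Condensed Matter", "12"), ("AMO", "3*"), ("Chemical Physics", 5), ("Ecology", 40)]
result = count_composites(example_rows)
print(result)
```

One pass per file gives all the composite totals at once, so a category that belongs to two composites (chemical physics here) is counted in both without reading the file twice.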


Another little R nugget

(by Christina Pikas) Jul 16 2014

Actually 2 - I'm on a tear. Using what I've learned for work!

First: determining if a string is an element of a character vector, with is.element().

Pretty cool, huh? And useful. Found on StackOverflow. Other ideas there, too.


Opening an Excel file. Now what you're probably thinking is: "just save as csv or tab delimited and open that the normal way"... well yes, but I've got like 50 files and they were exports from the database as .xls and I didn't open them each while exporting to re-save.

So opening is one thing but I need a couple of columns out of the first sheet and there's some junk at the top that I don't need.

Package is xlsx. I had just installed Python (thought I had it at work, but I guess not) and was preparing to do it there, but then I was thinking that surely R would have a way.

The issue in my case with my work computer is that it's 64-bit, but I have 32-bit Java, 32-bit Office, etc. I was running R in RStudio as 64-bit. I tried to do everything to get Java updated and 32-bit (but still not killing other things I needed). Finally - duh - I just pointed RStudio at the 32-bit version of R I had installed, and then rJava ran just peachy, and it's required for xlsx.

Here's the command I'm using so far:
file<-read.xlsx("analysis.xls",1,startRow=4, colClasses=c("character", "numeric"))

1 is the sheet index. The colClasses is pretty slick, too. You can also name sheets and only grab certain columns.

So now to iterate through all the files in a directory, opening, counting up the articles listed for the categories I have in another vector and reporting that out. Should be cool. Or not.  And I promised to have the results - however they're done - Friday and I'm going to be out tomorrow. Back to it!


Trying another CAQDAS, MaxQDA

(by Christina Pikas) Jul 12 2014

CAQDAS: Computer Assisted Qualitative Data Analysis Software.

Previously, I'd used NVivo and I found it to be miserable and horribly expensive (I didn't pay, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it or that maybe I was doing it wrong, but instead of saving me time, it was costing me time.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages, but it's really hard to tell. I also started looking at prices and ooh la la!  I was thinking maybe Dedoose, despite their earlier issues, but that's like at least $10/month. Not bad until you think you might do this for a while.

After all that MaxQDA - a German product - seemed like a slightly better choice than some. The student license doesn't expire and is for the full product (they have a semester one, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal. Interview transcripts were even quicker. Adding in my codes was super quick and it was super quick to move one when I stuck it in the wrong hierarchy. I think I'll work with this data for a bit before doing the Twitter archives - particularly since I don't know how I might sample.

I'm still on the 30-day trial. Not sure if I need to try to start paying for it with a week or so to spare so the verification can be completed. My university doesn't have an expiration date on our IDs. Not sure if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features. I was still thinking I might do some more machine learning and other tricks with my data using R which is easy now that it's all in spreadsheets.


Quick note: Now on GitHub

(by Christina Pikas) Jul 12 2014

Scripts mentioned previously are now on GitHub with an MIT license which should hopefully be permissive enough. I can't say that anyone would want to use these, but this also backs up my code even better which is useful.

I'm using RStudio so probably if/when I do more analysis in R, I'll just use Git from there.

The url is, unsurprisingly:


Getting blog post content in a usable format for content analysis

(by Christina Pikas) Jul 04 2014

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level. For example just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review... , career advice, whatever. Maybe I'll also select some to go deeper but it isn't content analysis like linguists or others do at the word level.

Anyway, there are lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found Python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on FriendFeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on Blogger and WordPress blogs. Last time, when I was customizing a Perl script to pull the commenter urls out, every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing, so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent).

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.
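The fill-down step for that csv can be scripted, too. Here's a minimal Python sketch, assuming the blogurl/page/number format described above; the blog address, the number of pages and the "archivep" file prefix are placeholders, and some blogs serve the first page at the bare blog URL instead.

```python
import csv
import io

def archive_rows(blog_url, last_page, prefix="archivep"):
    """Build (archive page url, file name) pairs for blogurl/page/number archives."""
    base = blog_url.rstrip("/")
    return [(f"{base}/page/{n}", f"{prefix}{n}") for n in range(1, last_page + 1)]

# Placeholder blog address and page count.
rows = archive_rows("http://example.com/blog/", 3)

# Write the two columns as csv text; swap the buffer for a real file to save it.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```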

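Read them into R. Then had the function:
 library(RCurl)
 getFullContent <- function(link, fileName){
   UserAgent <- "pick something"
   # pass the variable itself, not the string "UserAgent"
   temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = UserAgent)
   nameout <- paste(fileName, ".htm", sep="")
   write(temp, file=nameout)
 }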

I ended up doing it in chunks.  If you're running this function on just one page it's like:

getFullContent("","archivep1" )

More often I did a few:


So I moved the things around to put them in a folder.

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

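    # name the parser explicitly so the result doesn't depend on what happens to be installed
    soup = BeautifulSoup(open(filename), "html.parser")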
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
             'author': author,
             'content': content,
             'date': date})
    #print posts
    posts = str(posts)
    posts_file.write (posts)

# this is from

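for filename in os.listdir("files"):
    # parse each saved archive page in the folder
    pullcontent(os.path.join("files", filename))

posts_file.close()
print ("All done!")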


So then I pasted it into Word, put in some line breaks and tabs, and pasted it into Excel.  I think I could probably go from that file or the data directly into Excel, but this works.
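Going straight from the list of post dictionaries to something Excel opens is probably only a few lines with Python's csv module. This is a sketch, not what I actually ran: the write_posts_csv name, the output file name and the example post are made up, and it assumes each post is a dict with the same four keys as in the script above.

```python
import csv
import os
import tempfile

def write_posts_csv(posts, path):
    """Write a list of post dicts (title, author, content, date) to a CSV file."""
    fieldnames = ["title", "author", "content", "date"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" skips extra keys (a url, say) instead of raising
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(posts)
    return path

# A made-up post with a comma and a line break, the things that break hand pasting.
example_posts = [{"title": "A post, with a comma", "author": "Someone",
                  "content": "Line one\nline two", "date": "Friday, July 4, 2014"}]
out_path = write_posts_csv(example_posts, os.path.join(tempfile.mkdtemp(), "haposts.csv"))

# Read it back to check that the fields survive the round trip.
with open(out_path, newline="", encoding="utf-8") as handle:
    rows_back = list(csv.DictReader(handle))
print(rows_back)
```

The csv module quotes commas and line breaks inside a field, which is what the manual tabs-and-line-breaks step was working around.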

Really very minor tweaking between blogs. Most I don't actually need an author for but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like NVivo or ATLAS.ti for the analysis. Of course it would be very easy to load it into R as a corpus and then do various text mining operations.


My current to-read list

(by Christina Pikas) Jun 27 2014

I've been keeping a million tabs open at work and at home, because I haven't even had the time to add things to my citation manager... I also have some things in print that I've been carrying back and forth to work every day (boy is my bag heavy!).  Most of these things probably rate a post of their own, but sigh...  Add to that my obsession du jour with screenscraping and text mining using R, Python, and Perl... and the fact that I'm not good at it so everything takes longer (it would also take less time if I actually RTFM instead of just hopping into code and trying it).

So here are some things on my radar (I'm giving no credit to whoever pointed me to these because I honestly don't remember! Sorry):

  • Hadas Shema,  Judit Bar-Ilan,  Mike Thelwall (in press) How is research blogged? A content analysis approach. JASIST. DOI: 10.1002/asi.23239
    She tweeted a link to the pre-print if you don't have access. I got about 2/3 of the way through this as soon as I saw it announced and then realized I had been working on a very important work thing and dropped it. Very interesting so far.
  • Lisa Federer (2014) Exploring New Roles for Librarians: The Research Informationist. Synthesis Lectures on Emerging Trends in Librarianship. New York: Morgan and Claypool. doi:10.2200/S00571ED1V01Y201403ETL001
    I was first like meh about this (another name) but then I relooked and I'm interested in their version of embedding.
  • Vanessa P. Dennen. (2014) Becoming a blogger: Trajectories, norms, and activities in a community of practice. Computers in Human Behavior 36, 350-358, doi: 10.1016/j.chb.2014.03.028
  • Paige Brown (11 June 2014) How Academics are Using Social Media. From the Lab Bench.
    This and all the linked reports look very interesting.
  • Pablo Moriano, Emilio Ferrara, Alessandro Flammini, Filippo Menczer (2014). Dissemination of scholarly literature in social media.
  • Jeff Seaman and Hester Tinti-Kane (2013) SOCIAL MEDIA FOR TEACHING AND LEARNING. Boston: Pearson Learning.
    This was probably cited in the blog post above.
  • Liu, Y., Kliman-Silver, C., Mislove, A. (2014) The tweets they are a-changin': Evolution of Twitter Users and Behavior. ICWSM. (google for it - I have the printout)
    This was mentioned by some folks from MPOW who went to the conference. Provides a nice overview.
  • Tenopir, C., Volentine, R., King, D.W. (2013) Social Media and scholarly reading. Online Information Review 37, 193-216. doi: 10.1108/oir-04-2012-0062
    I might have actually read this but it's still riding around in my bag
  • Nentwich, M., König, R. (2012). Cyberscience 2.0: Research in the age of digital social networks. Frankfurt: Campus Verlag.
    This one is time sensitive as I borrowed it from Columbia.
  • Holmberg, K., Thelwall, M. (2013) Disciplinary differences in Twitter scholarly communication. ISSI Proceedings 2013.  <- that was typed from my handwriting and not checked. Google for it. I think I may have read this, but I have it in the stack to read again.
  • Thelwall et al (in press) Tweeting links to academic articles. Cybermetrics J (google for preprint)
  • Haustein, et al. Tweeting biomedicine: an analysis of tweets and citations in the biomedical literature. ArXiv 1308.1838
  • Sayes, E. (2014) Actor–Network Theory and methodology: Just what does it mean to say that nonhumans have agency? Social Studies of Science 44, 134-149.  doi:10.1177/0306312713511867

And this is just on my screen or in my bag. I think the babies tore up 3 articles I had waiting to be read by my couch :(  So far behind!



Sizing bars in a bar chart in R

(by Christina Pikas) Jun 24 2014

Yet another stupid thing... but I did it so here's to remembering how.

I wanted to show of all the places my place of work (MPOW) published in the past 5 years, what their impact factors were and how many in each venue. (yes, caveat the IF but this is in response to a request)

So I have a citation manager collection with the articles we've written, collected through database alerts in all the major databases. I exported that and cleaned up the journal names in VantagePoint (not affiliated, yadda, yadda... use Open Refine if you don't have VP), and then a co-worker and I laboriously went through and added the IFs. Then I created a shortened name for each journal (woulda been easier if I kept the official abbr) by first replacing journal with j, transactions with trans, proceedings with proc, letters with let, etc. Then I used this Excel formula:

=IF(LEN(A2)>25,LEFT(A2,25),A2)

Then copied values and then saved as CSV: short name, number of articles published, IF.
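The same shortening could be scripted instead of done with find and replace plus the formula. A rough Python sketch follows; the abbreviation list is only the handful named above, the whole-word, case-insensitive matching is my assumption, and the journal titles are just examples.

```python
import re

# The abbreviations named in the post; anything behind the "etc." is not covered.
ABBREVIATIONS = [("journal", "j"), ("transactions", "trans"),
                 ("proceedings", "proc"), ("letters", "let")]

def short_name(title, limit=25):
    """Abbreviate common words, then keep at most `limit` characters."""
    short = title
    for word, abbreviation in ABBREVIATIONS:
        # \b keeps the match to whole words
        short = re.sub(rf"\b{word}\b", abbreviation, short, flags=re.IGNORECASE)
    return short[:limit]

long_example = short_name("Journal of Geophysical Research Letters")
short_example = short_name("Physical Review Letters")
print(long_example)
print(short_example)
```

Slicing with [:limit] does the same job as the IF/LEN/LEFT formula: names already under the limit pass through unchanged.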

Here's how I graphed it.

library(ggplot2)
w <- mpow2010.2014jnl$total.articles
pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))
x <- c (1:257)
y <- mpow2010.2014jnl$Impact.Factor
my.labs <- mpow2010.2014jnl.s$Jnl.Short

p<-ggplot() + 
  geom_bar(aes(x = pos, width = w, y = y, fill = x ), stat = "identity") + 
  scale_x_continuous(labels = my.labs, breaks = pos) 

p + ylab("Impact Factor") + xlab("Journal") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), legend.position = "none")

The labels were a wonky afterthought. not elegant... and I wanted to get rid of the legend. Note I just made a blank vector for x and then added the labels later. It worked... not pretty.
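The pos line is the part I'll forget: each bar is as wide as its article count, so its center sits halfway between the running total before it and the running total after it. Here's the same arithmetic as a small Python sketch with made-up counts, just to show the numbers.

```python
from itertools import accumulate

def bar_midpoints(widths):
    """Center of each bar when bars of the given widths sit side by side from 0."""
    right_edges = list(accumulate(widths))   # running total after each bar
    left_edges = [0] + right_edges[:-1]      # running total before each bar
    return [0.5 * (left + right) for left, right in zip(left_edges, right_edges)]

# Made-up article counts for three journals.
widths = [4, 2, 6]
midpoints = bar_midpoints(widths)
print(midpoints)
```

With widths 4, 2 and 6 the bars cover 0-4, 4-6 and 6-12, so the centers come out at 2, 5 and 9.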

I would like to thank the kind folks on stackoverflow:

Here it is - intentionally smushed so I don't give away too much. I hope. I can remove image if there's an issue, lmk, don't sue.

Where MPOW published 2010-2014 June


I ended up going back and zooming in on pieces. I shoulda made a function so I could just look at whatever part I wanted.. sigh.



Using R twitteR to Get User Information

(by Christina Pikas) Jun 07 2014

I'm gonna keep stating the obvious, because this took me a few hours to figure out. Maybe not working continuously, but still.

So, I have like more than 6000 tweets from one year of AGU alone, so I'm gonna have to sample somehow. Talking this over with my advisor, he suggested that we have to find some reasonable way to stratify and then sample randomly within each stratum. I haven't worked all the details out yet - or really any of them - but I started gathering user features I could base the decision on. Number of tweets with the hashtag was super quick in Excel. But I was wondering if they were new to Twitter, if they tweeted a lot, and if they had a lot of followers. That's all available through the API using the twitteR package by Jeff Gentry.  Cool.
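To make "stratify, then random within" concrete for myself, here's a small Python sketch. None of it is decided: the follower cutoffs, the two-per-stratum sample size and the users are all made up, and the real strata might be based on something else entirely.

```python
import random
from collections import defaultdict

def follower_band(user):
    """Hypothetical strata based on follower count; the cutoffs are made up."""
    if user["followers"] < 100:
        return "under 100"
    if user["followers"] < 1000:
        return "100 to 999"
    return "1000 and up"

def stratified_sample(users, stratum_of, per_stratum, seed=2014):
    """Group users into strata, then draw a simple random sample within each."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for user in users:
        strata[stratum_of(user)].append(user)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }

# Made-up users; real values would come from the Twitter API.
follower_counts = [5, 40, 80, 150, 400, 900, 950, 2500]
users = [{"name": f"user{i}", "followers": f} for i, f in enumerate(follower_counts)]
sample = stratified_sample(users, follower_band, per_stratum=2)
for band, picked in sample.items():
    print(band, [u["name"] for u in picked])
```

A stratum smaller than the requested size is taken whole, so the one big account here is always included.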

So getUser() is the function to use. I made up a list of the unique usernames in Excel and imported that in. Then I went to loop through.

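library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("users.csv", colClasses = "character")  #assumed: the file name is a placeholder
userInfo <- function(USER){
  temp <- getUser(USER, cainfo="cacert.pem")
  #assumed: these three fields, based on the columns used further down
  USERdata <- c(USER, format(temp$created), temp$statusesCount, temp$followersCount)
  return(USERdata)
}
#test for users 4-6
test <- sapply(data$user[4:6], userInfo)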

But that was sorta sideways... I had a column for each user... sorta weird. Bob O'H helped me figure out how to transpose that and I did, but it was sorta weird.

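So then I tried this way:
userInfoRange <- function(startno,stopno){  #assumed: a stand-in name for the function
# set up the vectors first
USER.name <- character(); USER.created <- character(); USER.posts <- numeric(); USER.foll <- numeric()
for (i in startno:stopno) {
thing <- getUser(data$user[i], cainfo="cacert.pem")
USER.name[i] <- data$user[i]
USER.created[i] <- format(thing$created)
USER.posts[i] <- thing$statusesCount; USER.foll[i] <- thing$followersCount
}
return(data.frame(user=USER.name, created=USER.created, posts=USER.posts, followers=USER.foll, stringsAsFactors=FALSE))
}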

So that was cool, until it wasn't. I mean, it turns out that 2% of the users have deleted their accounts or blocked me or are private or something. So it didn't recover from that error, and I tried to test for is.null() and is.na() but it failed....
So then I went back to the mailing list and there was a suggestion to use try() but eek.
So then I noticed that if you have a pile to look through you're actually supposed to use
lookupUsers(users, includeNA=FALSE, ...)
And I did, and I wanted to keep the NA so that I could align with my other data later... but once again, no way to get the NAs out. And it's an object that's a pile of lists... which I was having trouble wrapping my little mind around (others have no issues).
So I went back and used that command again, and this time said to skip the NA (the not found users). Then I think from the mailing list or maybe from Stack Overflow? I had gotten the idea to use unlist. So here's what I did then:
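easy.tweeters.noNA<-lookupUsers(data$user, cainfo="cacert.pem")
#check how many fewer this was
length(easy.tweeters.noNA)
#1247 so there were 29 accounts missing hrm
#assumed: stack the one-row data frames as we go (user.df is a stand-in name)
user.df <- data.frame()
for (i in 1:1247){holddf<-twListToDF(easy.tweeters.noNA[i])
  user.df <- rbind(user.df, holddf)}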

And that created a lovely dataframe with all kinds of goodies for it. I guess I'll have to see what I want to do about the 29 accounts.

I really would have been happier if it was more graceful with users that weren't found.
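What I wanted was one row per name I asked for, in my original order, with a blank where nothing came back. Here's that idea as a Python sketch, assuming the lookup results end up as a mapping from screen name to info; the names and follower counts are invented. Screen names are matched case-insensitively.

```python
def align_users(requested, found):
    """Return one row per requested name, in order, with None where nothing came back."""
    lookup = {name.lower(): info for name, info in found.items()}
    return [(name, lookup.get(name.lower())) for name in requested]

# Invented names: three requested, two returned by the lookup.
requested = ["alice_geo", "deleted_account", "Bob_Rocks"]
found = {"Alice_Geo": {"followers": 120}, "bob_rocks": {"followers": 45}}

aligned = align_users(requested, found)
missing = [name for name, info in aligned if info is None]
print(aligned)
print(missing)
```

Because the output keeps the requested order, it lines up row for row with the other data, and the missing accounts fall out as a simple list.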

Also, note: for every single command you have to use the cainfo="cacert.pem" thingy... Every time, every command.

ALSO, I had figured out oauth, but the twitter address went from http:// to https:// and so that was broken, but I fixed it. I hope I don't have to reboot my computer soon! (Yeah, I saved my credentials to a file, but I don't know... )


Using R to expand Twitter URLs

(by Christina Pikas) May 25 2014

So this should be so simple and obvious that it's not worth a post, but I keep forgetting how to do everything so I'm gonna put this here to remind myself.

Here's where I am. I have a list of 4011 tweets with #agu12 or #agu2012 hashtag. A lot of these are coding as "pointers" - their main function is to direct readers' attention somewhere else. So I got to wondering: where?  Are they directing people to online versions of the posters? Are they just linking to more NASA press releases?  % going to a .edu?

Of course all the URLs are shortened and there are services you can use to expand them, but in R, it's already right there in the twitteR package as

decode_short_url()

This uses the API. All you have to do is plug in the URL. Cool!

So here was my original plan: find the tweets with urls, extract the urls, expand them, profit! And I was going to do all this in R. But then it got a little ridiculous.
So instead I: used Open Refine to find all the urls, then assigned IDs to all the records, and then used filtering and copy and pasting to get them all in two columns: ID, URL.

Issues: non-printing characters (Excel has a clean command), extra spaces (trim - didn't really work so I did a find and replace), random commas (some needed to be there), random other punctuation (find and replace), # sign

The idea in R was to do a for loop to iterate through each url, expand it, append it to a vector (or concatenate, whichever), then add that to the dataframe and do stats on it or maybe just export to Excel and monkey with it there.

For loop, fine, append - not for love nor money despite the fact that I have proof that I successfully did it in my Coursera class. I don't know. And the API was failing for certain rows. For a couple of rows, I found more punctuation. Then I found the rest of the issues were really all about length. They don't expect shortened urls to be long (duh)!  So then I had to pick a length, and only send ones shorter than that (50) to the api. I finally gave up with the stupid append, and I just printed them to the screen and copied them over to Excel. Also I cheated with how long the for loop had to be - I should have been able to just say the number of rows in the frame but meh.
Anyhow, this worked:

 setwd("~/ mine")
library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("agu12justurl.csv", colClasses = "character")
#check it out
#test a single one
#this was for me trying to append, sigh
full.vec <- vector(mode="character")
#create a vector to put the new stuff in, then I'll append to the data frame, I hope
#check the for loop 
 for (i in 1:200){print(data.sub$url[i])}
#that works
for (i in 1:3){print(decode_short_url(data.sub$url[i]))}
#that works - good to know, though, that if it can't be expanded it comes back null

#appending to the vector is not working, but printing is so will run with that 
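for (i in 1:1502){ if(nchar(data$url[i])>50){
    #assumed: too long to be a shortened url, so print it as-is to keep the rows lined up
    print(data$url[i])
 } else {
    print(decode_short_url(data$url[i]))
 }
}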

If anyone wants to tell me what I'm doing wrong with the append, it would be appreciated. I'm sure it must be obvious.
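For comparison, here's the loop I was going for, sketched in Python with a made-up stand-in for the expander (fake_expand and its little lookup table are not a real API). The part that matters is that every input gets exactly one slot in the output, even when the expansion comes back empty. One possible culprit on the R side, and this is only a guess since the failing line isn't shown: c(full.vec, NULL) just returns full.vec, so a null result never takes up a slot and the vector ends up shorter than the data.

```python
def expand_all(urls, expand, max_length=50):
    """Expand each url, keeping one output per input so rows stay aligned."""
    expanded = []
    for url in urls:
        if len(url) > max_length:
            # too long to be a shortened link; keep it as-is
            expanded.append(url)
            continue
        result = expand(url)
        # a failed expansion comes back as None; fall back to the original
        expanded.append(result if result is not None else url)
    return expanded

def fake_expand(url):
    """Stand-in for the real expander: a tiny lookup table, None when unknown."""
    table = {"http://t.co/abc": "http://www.nasa.gov/press"}
    return table.get(url)

urls = ["http://t.co/abc", "http://t.co/gone", "http://example.edu/" + "x" * 60]
full = expand_all(urls, fake_expand)
print(full)
```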

So what's the answer? Not sure. I'll probably do a post on string splitting and counting... OR I'll be back in Open Refine. How do people only ever work in one tool?
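The string splitting and counting part could be as small as the Python sketch below: pull the host out of each expanded URL and tally by its last piece (.edu, .gov, and so on). The URLs are invented and this is only one way to slice it.

```python
from collections import Counter
from urllib.parse import urlparse

def tally_suffixes(urls):
    """Count urls by the last dot-separated piece of their host name."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        suffix = host.rsplit(".", 1)[-1] if "." in host else "(none)"
        counts[suffix] += 1
    return counts

# Invented expanded URLs.
expanded_urls = [
    "http://www.nasa.gov/press/release1",
    "http://fallmeeting.agu.org/2012/posters",
    "http://www.example.edu/~someone/poster.pdf",
    "http://geo.example.edu/talk",
    "not a url",
]
counts = tally_suffixes(expanded_urls)
edu_share = counts["edu"] / len(expanded_urls)
print(counts)
print(edu_share)
```

Anything that doesn't parse to a host lands in "(none)", which also flags leftover junk from the cleaning step.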
