Government cost recovery gone awry: PACER and NTIS

(by Christina Pikas) Aug 27 2014

(reiterating these are just my personal opinion and do not reflect anything from my place of work - if you know what that is - or anything else)

For many years, the US federal government has tried to cut costs by outsourcing anything that isn't inherently governmental, making sure that government doesn't compete with industry, and requiring cost recovery for government agencies that provide services to other agencies (see A-76).

Old examples that might have changed: GPO had to do all printing of history books for military historians, but while the quality was ok, the distribution was crap, and the DoD history organizations and readers had to pay a lot of money. So what they did when I worked there was to give the book to a university press that would do a decent job with it. The books were not copyrightable anyway because they were work for hire by a government employee. Everyone was happy. Another old example was that the Navy was required to send all records to NARA. But then the Navy all of a sudden had to pay NARA to keep the documents (I think this has changed - my example is from the late 1990s). These were things like deck logs. Hugely important documents.

NTIS has long been caught up in this. Agencies producing technical reports are required by law to send them to NTIS (if they are unlimited distribution). NTIS is required to recover the cost of their administration and archiving by selling the documents. This is hard because first, agencies are not thorough in sending stuff to NTIS (often because their central repository isn't even getting copies - even though required by regulations, instructions, etc.) and second, agencies make these documents available for free from their own sites.  NTIS also has picked up a few bucks here and there doing web and database consulting and licensing their abstracting and indexing database to vendors who resell to libraries. Why pay for it from a third-party vendor? Cross search with your favorite engineering database. Better search tools.

PACER is also caught up in this. There's actually a law that says US Courts has to recover the cost of running the system by charging for access or for documents. They do not want to, but there is a law that they must obey. This is information that really should be freely available and easily accessible. A famous activist tried to download the whole thing and make it available, but he was stopped.

The results of forcing these agencies - GPO, NTIS, US Courts - to recover their costs are considerable, and they directly work against the open government we need and deserve. Cost recovery causes the agencies to cut corners and not have the systems they need. It causes customer agencies and citizens alike to distrust and dislike them.

Now, US Courts has removed large collections of historical documents from PACER because of an IT upgrade. Read the Washington Post article. Various people in Congress are trying to shut NTIS down, again. GPO seems to be ok, for now - lots of cool neat things from them.

Libraries - like mine - have been burdened by cost recovery, too, and it often signals the beginning of the end. Superficially, it makes sense to show how much something is valued and by whom. In practice, you need a lot more accounting systems and controls over the professional workers, and those prevent them from doing their job. These services are directly in support of strategic requirements (open government and accountability) but are infrastructure. People are blind to infrastructure until it's no longer there. NTIS, PACER, GPO and others need to stop with this cost recovery business (meaning Congress has to pass a law that removes that requirement) and be funded as infrastructure. Outsource to get needed skills you can't hire in government, but be smart about it.

No responses yet

Fragile knowledge of coding and software craftsmanship

(by Christina Pikas) Aug 11 2014

To continue the ongoing discussion, I think my concerns and experiences with informal education in coding (and perhaps formal education offered to those not being groomed into being software engineers or developers) fall into two piles: fragile knowledge and craftsmanship.

Fragile knowledge.

Feynman (yes, I know) described fragile knowledge of physics as learning by rote and by being able to work problems directly from a textbook but not having a deeper understanding of the science that enables application in conditions that vary even slightly from the taught conditions. I know my knowledge of physics was fragile - I was the poster child of being able to pass tests without fully understanding what was going on. I didn't know how to learn. I didn't know how to go about it any other way. I had always just shown up for class, done what I was asked, and been successful. In calculus I was in a class that had discussion sections in which we worked problems in small groups - is this why my knowledge isn't fragile in that or is it that I did have to apply math to physics problems? Who knows.

Looking back, now, it seems like a lot of the informal education I've seen for how to code is almost intentionally aimed at developing fragile knowledge. It's not about how to solve problems with code or about building a toolkit that has wide application, with lots of examples shown from different programs. It's more like: list the n types of data.

Craftsmanship.



There is actually a movement with this name and I didn't take the time to read enough about it to know if it matches my thoughts. Here I'm talking coding environment, code quality, reproducibility, sharing.... Not only solving the problem, but doing it in a way that is efficient, clean, and doesn't open up any large issues (memory leaks, openings for hackers, whatever else). Then taking that solution and making it so that you can figure out what it did in a week or so or so that you could share with someone else who could see what it did. Solving the problem so that you can solve the same problem with new data the same way. My code is an embarrassment - but I'm still sharing, because it's the best I know how to do and at least there's something to start with.

A couple of people suggested the Software Carpentry classes - they sound great. Maybe SLA or ASIST or another professional association could host one of these as a pre-conference workshop? Maybe local (US - Maryland - DC ) librarian groups could host one?  We could probably get enough people.

No responses yet

What I want/need in a programming class

(by Christina Pikas) Aug 08 2014

Abigail Goben (Hedgehog Librarian) has a recent blog post discussing some of the shortcomings she's identified in the various coding courses she's taken online and the self-study she has done.

I think my view overlaps hers but is not the same. Instead of trying to compare and contrast, I'll say what I've seen and what I need.

I'm probably pretty typical of my age: I had BASIC programming in elementary and high school. This was literally BASIC and was like

10 print "hello"
20 goto 10

I think we did something with graphics in high school, but it was more BASIC. In college, they felt very strongly that physics majors should learn to code, so I took the Pascal for non-CS majors in my freshman year. That was almost like the BASIC programming: no functions, no objects... kinda do this, do this, do this... turn it in. I never did see any connection whatsoever with my coursework in physics. I never understood why I would use that instead of the Mathematica we had to use in diffeq.

In the workforce, I did some self study javascript (before it was cool), html, CSS - not programming, obviously. And then I needed to get data for an independent study I was doing and my mentor for that study wrote a little Perl script to get web pages and pull out links. The script she wrote broke with any modifications to the website template, so after waiting for her to fix it for me, I ended up fixing it myself... which I should have done to start with. ... In the second stats class another student and I asked if we could use R instead of Stata. He was going back to a country with less research funding and I was going to work independently. But then, we just used the regression functions already written out and followed from a book. Elsewhere in the workforce I've read a lot about R and some co-workers and I worked through a book... I did the Codecademy class on Python.

All of these classes - if they weren't in interactive mode, they could have been. What are the various data types. How do you get data in there and back out again. How do you do a for loop. Nobody really goes into any depth about lists in R and they pop up all over the place. I couldn't even get Python installed on my computer at first by myself because everyone teaching me was on a Mac. (btw, use ActivePython and ActivePerl if you're on Windows - not affiliated, but they just work).

The R class on Coursera (same one she complains about) and the data science class by JH there were the first that even really made me do functions. What a difference. I really appreciated them for that.

So here's what I think:

People new to programming - truly new - need to understand the basics of how any program works including data types, getting data in and out, for loops. But also architectural things like functions and objects. They probably need to spend some time with pseudocode just getting through the practice.

Then if you're not new to programming, but you're new to a language - different course. In that course you say this is how this language varies, this is what it does well with, here's where it fails.

Then there needs to be an all about software design or engineering or process course that talks about version control and how to use it. How to adequately document your code. How to write programs in a computationally efficient way. The difference between doing things in memory or not.  What are integrated development environments and when would you use one. This is what I need right now.

If it's something basic, I can follow along a recipe I can read off of stack overflow, but I know nothing about efficiency. Like why use sapply vs. a for loop? Is there a better way to load the data in? Why is it slow? Is it slower than I should expect? I love RStudio - love, love, love! But I tried something like that for Python and could never get it to work. I'm still learning git, but I don't really understand the process of it even though I can go through the steps.

Anyhow, more about me, but I think I'm probably pretty typical. I think there's a huge gap in the middle in what's being taught and I also think that a lot of people need the very basics of programming almost minus the specific language.

6 responses so far

Mom always said: Clean as you go

(by Christina Pikas) Jul 31 2014

But do we listen? Not so much.

Turns out MPOW (a division of a larger institution) has not deleted *any* borrower records since we merged into the larger institution's catalog maybe 10 years ago. We used to have our own catalog back in the day. No idea how much we maintained that stuff either - last time I did a dump from it to create an electronic badge-scanning sign-in system for an open house - we had users with the names "brontosaurus", "washington, george", "gibson, r.e." (which you won't get unless you know more about where I work, well and he's been dead for a couple of decades).  I think the professionals who ran the larger system helped us clean up a bit on migration.

So here we are, ready to integrate further with automatic registration and maintenance of records and we figured we should probably clean up prior. Oy.

Turns out in SirsiDynix Horizon, you have to identify the borrower and *edit* their record to have the option to delete. It was always grayed out for us because we didn't think about having it open for editing first. All this time we've been getting notices of employee actions, but have done nothing. We used to be a required stop on the employee checkout list but they took us off when we got rid of our print collection.

Now, how to match current employees? I can get a list but the export from Horizon shows how poorly we did data entry when creating the accounts. Some have the whole name in various orders in the last name field. Some have that with periods in it. There's a name of a university there (why?). E-mails missing, employee numbers missing, obsolete borrower types. People who have joint appointments with other divisions of the larger institution who have these weird hybrid records.

At first pass with a short R script, I identified 500 records of the 3500 that need to be checked. And that was only using last names so if there are 10 bad Smiths for the one good Smith, then they get a pass. I'm sure we'll get exception reports or something at the load, but we're trying to get ahead of the game.
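
For what it's worth, here is a rough sketch of a stricter first pass. I did mine in R; this version is in Python, and the field names (`name`, `number`), the sample records, and the clean-up rules are all made up for illustration. The idea is to trust an employee number when there is one and only fall back to the last name when there isn't, so a borrower with a stale number doesn't get a pass just for being a Smith.

```python
import re

def normalize_last_name(raw):
    """Best-effort last name from a messy name field (hypothetical rules)."""
    text = raw.strip().lower()
    if "," in text:
        # "smith, j." style: the last name is the part before the comma
        text = text.split(",", 1)[0]
    else:
        # "r.e. gibson" style: assume the last word is the last name
        parts = text.split()
        text = parts[-1] if parts else ""
    # drop periods and anything else that is not a letter, hyphen, or apostrophe
    return re.sub(r"[^a-z'\-]", "", text)

def records_to_check(borrowers, employees):
    """Return borrower records that match no current employee."""
    emp_numbers = {e["number"] for e in employees if e.get("number")}
    emp_last_names = {normalize_last_name(e["name"]) for e in employees}
    flagged = []
    for b in borrowers:
        if b.get("number"):
            # an employee number is present: it has to match exactly
            if b["number"] not in emp_numbers:
                flagged.append(b)
        elif normalize_last_name(b["name"]) not in emp_last_names:
            # no number on file: fall back to the (weak) last-name match
            flagged.append(b)
    return flagged

# made-up sample data
employees = [
    {"name": "Smith, Jane", "number": "1001"},
    {"name": "Gibson, Ann", "number": "1002"},
]
borrowers = [
    {"name": "Smith, Jane", "number": "1001"},
    {"name": "brontosaurus", "number": ""},
    {"name": "R.E. Gibson", "number": ""},
    {"name": "Smith, John", "number": "9999"},
]
to_check = records_to_check(borrowers, employees)
print([b["name"] for b in to_check])
```

Records with no number and a matching last name still slip through, so this narrows the list rather than settling it.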

So kids: do as your mom told you and CLEAN AS YOU GO!

No doubt I will continue not to heed this advice, either :)

No responses yet

Kindle Unlimited. Destruction? No.

(by Christina Pikas) Jul 20 2014

Amazon announced a service where, for a monthly fee (currently $9.99), you can read unlimited ebooks a month (from a stock of 600k). John Dupuis has been linking to various stories about it on Twitter. One from Vox suggests the new service might lead to libraries' destruction, what with funding issues being what they were a few years ago.

Many others have pointed out the issues with that. The first being that most libraries offer a service free (because of your tax dollars) that lets you check out unlimited books a month for just about any device. Mine limits me to 6 books at a time from one of the 3 or so services they offer and a certain number of "units" or "credits" from another of the services.

If you have the $10/month, though, this would mean no waiting where you often have to wait for popular titles at the public library because of idiotic requirements from the publishers that make these services treat ebooks like print books - one user/copy at a time.

Today, John pointed to this piece by Kelly Jensen on Book Riot. Jensen is all offended at people saying libraries are the Netflix for books because libraries do so much more. Well, yes, but I just can't get offended. I've found, and a local survey has shown, that people often don't know what ebooks are, how they can get them, and what is available at their library. They do know about Amazon, but they don't know about Overdrive. Amazon is easy, Overdrive can be hard (once you get going it's very simple, though). I actually think it's helpful and useful and not too reductionist and problematic to refer to libraries as the Netflix for books. Emphasizing and publicizing this one small service won't cause people with small children to forget about story time, college students and professors to forget about doing research, avid readers to forget about print books. It may, however, bring in new users or bring back users/patrons who have been too busy to come in person or who are now home bound for some reason.

What I do wonder about is licensing. There has been lots of discussion about the big 6 and how they really, really don't want libraries to lend ebooks. Some have done stupid things like say libraries have to rebuy the book after it has been checked out 26 times or so. Others have delays or flat-out won't license ebooks to libraries. (not talking research libraries and STEM books - we can get them if we want to spend the $ and they aren't textbooks). One big publisher gave in recently - but sort of slowly.

Amazon's in a big fight with one publisher with all sorts of shenanigans like slowing down shipping for their books. The authors are all up in arms. It's a mess. With that uneasy relationship, I really am curious about publishers participating in this program. Do they see it differently than the library products? Is it just the same ones that do license for library use? 600k books - but which 600k? Presumably the entire Project Gutenberg library is on there (see "catching up on classics") .... and some other books are featured on the home page. I don't have time to do the analysis, but I'm curious.

$10/month could add up... particularly if the catalog isn't that large. I think it sounds like a good service, but the devil is in the details. What does the catalog look like? Are people out of money to spend on entertainment with all the video downloading services and internet and data and what not? Seems like an obvious move for Amazon (they aren't the first - Oyster and Scribd have similar services). We'll see, I guess.

No responses yet

I'm a coding fool... use of the nuggets mentioned last post

(by Christina Pikas) Jul 18 2014

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high level subject categories in which they published. The subject categories were actually narrower than what I needed (think like condensed matter, high energy, amo, chemical physics and I need "physics"). I needed to rank the organizations by articles published in these composite categories that included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio with either 32- or 64- bit depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=F)

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
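countPhysics <- function(file){
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #assumption: the read step was missing here; this reuses the read.xlsx call from the last post
  physfile <- read.xlsx(file, 1, startRow=4, colClasses=c("character", "character"))
  #this is so i can make sure i just have numbers in the count field
  #(assumption: strip anything that isn't a digit, then convert)
  physfile$Count <- as.numeric(gsub("[^0-9]", "", physfile$Count))
  #this finds matching records and then keeps just the part we want
  #one of these days i'll just import right the first time instead of this part
  for (j in seq_along(physfile$Count)){
    #unlist() so this works whether physics is still a data frame or already a vector
    if (is.element(physfile$Analysis.Value[[j]], unlist(physics))) {
      phys.hold[j] <- physfile$Count[[j]]
    } else {
      phys.hold[j] <- 0
    }
  }
  total <- sum(phys.hold, na.rm=TRUE)
  total  #the value sapply collects for this file
}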

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an excel file I was working on but this is essentially the answer. I did the same thing for the other categories separately, but obviously I should have checked each line for matching each/all of my categories before moving to the next line and then outputted a frame. Oh well.
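
Here is roughly what that "should have" version could look like: every row gets checked against all of the composite categories in one pass, and the result is one table. It's a Python sketch rather than the R I actually ran, and the composite definitions, organization names, and counts below are invented for the example.

```python
from collections import defaultdict

# hypothetical composites: composite name -> narrower database categories
COMPOSITES = {
    "physics": {"condensed matter", "high energy", "amo", "chemical physics"},
    "chemistry": {"chemical physics", "organic chemistry"},
}

def count_composites(rows, composites=COMPOSITES):
    """Sum counts per composite for one organization.

    rows is an iterable of (category, count) pairs, i.e. the two
    columns pulled out of one exported file.
    """
    totals = defaultdict(int)
    for category, count in rows:
        key = category.strip().lower()
        # check this one row against every composite before moving on
        for composite, members in composites.items():
            if key in members:
                totals[composite] += count
    # always report every composite, even when the total is zero
    return {name: totals[name] for name in composites}

def build_table(rows_by_org, composites=COMPOSITES):
    """One row per organization, one column per composite category."""
    return {org: count_composites(rows, composites)
            for org, rows in rows_by_org.items()}

# made-up data standing in for two exported files
example = {
    "Org A": [("Condensed Matter", 12), ("Organic Chemistry", 3), ("Ecology", 7)],
    "Org B": [("Chemical Physics", 5), ("AMO", 2)],
}
table = build_table(example)
for org, counts in table.items():
    print(org, counts)
```

A category is allowed to count toward more than one composite here (chemical physics does), which may or may not be what you want for a ranking.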

No responses yet

Another little R nugget

(by Christina Pikas) Jul 16 2014

Actually 2 - I'm on a tear. Using what I've learned for work!



pretty cool, huh? and useful. Found on StackOverflow. Other ideas there, too.


Opening an Excel file. Now what you're probably thinking is: "just save as csv or tab delimited and open that the normal way"... well yes, but I've got like 50 files and they were exports from the database as .xls and I didn't open them each while exporting to re-save.

So opening is one thing but I need a couple of columns out of the first sheet and there's some junk at the top that I don't need.

Package is xlsx (pdf). I had just installed Python (thought I had it at work, but I guess not) and was preparing to do it there, but then I was thinking that surely R would have a way.

The issue in my case with my work computer is that it's a 64-bit machine, but I have 32-bit Java, 32-bit Office, etc. I was running R in RStudio as 64-bit. I tried to do everything to get Java updated and 32-bit (but still not killing other things I needed). Finally - duh - I just pointed RStudio at the 32-bit version of R installed, and then rJava ran just peachy, and it's required for xlsx.

Here's the command I'm using so far:
file<-read.xlsx("analysis.xls",1,startRow=4, colClasses=c("character", "numeric"))

1 is the sheet index. The colClasses is pretty slick, too. You can also name sheets and only grab certain columns.

So now to iterate through all the files in a directory, opening, counting up the articles listed for the categories I have in another vector and reporting that out. Should be cool. Or not.  And I promised to have the results - however they're done - Friday and I'm going to be out tomorrow. Back to it!

One response so far

Trying another CAQDAS, MaxQDA

(by Christina Pikas) Jul 12 2014

CAQDAS: Computer Assisted Qualitative Data Analysis Software (more).

Previously, I'd used NVIVO and I found it to be miserable and horribly expensive (I didn't pay, but I would have to this time). I really did most of my work offline with colored pencils, note cards, etc. I am fully willing to admit that I didn't know how to use it or that maybe I was doing it wrong, but instead of saving me time, it was costing.

I started trying to use it with my dissertation data and ew. It's fine with my interview transcripts, but tweets and blog posts, just yuck. So then I started coding just using Excel and meh.

So back to the drawing board. I read a bunch of reviews of different products on Surrey's pages, but it's really hard to tell. I also started looking at prices and oh la la! I was thinking maybe dedoose, despite their earlier issues, but that's like at least $10/month. Not bad until you think you might do this for a while.

After all that MaxQDA - a German product - seemed like a slightly better choice than some. The student license doesn't expire and is for the full product (they have a semester one, but that doesn't make sense for me) - that's probably the biggest selling point.

So far so good. Importing about a thousand blog posts as individual documents was no big deal. Interview transcripts were even quicker. Adding in my codes was super quick and it was super quick to move one when I stuck it in the wrong hierarchy. I think I'll work with this data for a bit before doing the twitter archives - particularly since I don't know how I might sample.

I'm still on the 30-day trial. Not sure if I need to try to start paying for it with a week or so to spare so the verification can be completed. My university doesn't have an expiration date on our IDs. Not sure if my advisor has to send an e-mail or what.

Yeah, there's something for R (of course), but it doesn't have the features. I was still thinking I might do some more machine learning and other tricks with my data using R which is easy now that it's all in spreadsheets.

One response so far

Quick note: Now on GitHub

(by Christina Pikas) Jul 12 2014

Scripts mentioned previously are now on GitHub with an MIT license which should hopefully be permissive enough. I can't say that anyone would want to use these, but this also backs up my code even better which is useful.

I'm using RStudio so probably if/when I do more analysis in R, I'll just use Git from there.

The url is, unsurprisingly:

2 responses so far

Getting blog post content in a usable format for content analysis

(by Christina Pikas) Jul 04 2014

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level. For example just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review... , career advice, whatever. Maybe I'll also select some to go deeper but it isn't content analysis like linguists or others do at the word level.

Anyway, lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on friendfeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on blogger and wordpress blogs. Last time when I was customizing a perl script to pull the commenter urls out every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent)

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.
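
Filling down by hand worked, but the url/file name list is easy to generate too. A small Python sketch, assuming the blogurl/page/number pattern described above; the example.com address and the `archivep` prefix are placeholders, and it only builds the list, it doesn't fetch anything.

```python
import csv
import io

def archive_rows(blog_url, last_page, prefix="archivep"):
    """Yield (url, file_name) pairs for pages 1..last_page of a blog archive."""
    base = blog_url.rstrip("/")
    for n in range(1, last_page + 1):
        yield (f"{base}/page/{n}", f"{prefix}{n}")

def rows_to_csv(rows):
    """Return the rows as csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "fileName"])
    writer.writerows(rows)
    return buffer.getvalue()

# placeholder blog address; swap in the real one
csv_text = rows_to_csv(archive_rows("http://example.com/blog/", 3))
print(csv_text)
```

Writing `csv_text` out to a file gives the same two-column sheet I made by hand.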

Read them into R. Then had the function:
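library(RCurl)
getFullContent <- function(link, fileName){
  UserAgent <- "pick something"
  #pass the variable, not the literal string "UserAgent"
  temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = UserAgent)
  nameout <- paste(fileName, ".htm", sep="")
  write(temp, file=nameout)
}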

I ended up doing it in chunks. If you're running this function on just one page, it's like:

getFullContent("","archivep1" )

More often I did a few:


So I moved the things around to put them in a folder.

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(open(filename), "html.parser")
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
             'author': author,
             'content': content,
             'date': date})
    #print posts
    posts = str(posts)
    posts_file.write (posts)

# this is from

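for filename in os.listdir("files"):
    # loop body reconstructed (assumption): parse each saved archive page in the folder
    pullcontent(os.path.join("files", filename))

posts_file.close()
print ("All done!")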


So then I pasted it into Word, put in some line breaks and tabs and pasted into Excel. I think I could probably go from that file or the data directly into Excel, but this works.
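
Skipping the Word step would probably look something like this: write the list of post dictionaries out with Python's csv module, which Excel opens directly. This is a sketch I haven't run against my real data; the sample post is made up, and the field names just mirror the ones in the script above.

```python
import csv
import io

FIELDS = ["title", "author", "content", "date"]

def write_posts_csv(posts, handle):
    """Write one row per post; csv handles the quoting of commas and line breaks."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for post in posts:
        writer.writerow(post)

# made-up post standing in for what pullcontent collects
sample_posts = [
    {"title": "A post",
     "author": "Someone",
     "content": "Line one,\nline two",
     "date": "Friday, July 4, 2014"},
]

# an in-memory buffer here so the example runs anywhere
buffer = io.StringIO()
write_posts_csv(sample_posts, buffer)
print(buffer.getvalue())
```

For a real file, something like `open("haposts.csv", "w", newline="", encoding="utf-8")` in place of the buffer should do it; the `newline=""` is what the csv module documentation asks for.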

Really very minor tweaking between blogs. Most I don't actually need an author for but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like nvivo or atlas.ti for the analysis. Of course it would be very easy to load it into R as a corpus and then do various textmining operations.

No responses yet

Older posts »