Archive for the 'Information Science' category

C&RL Vote for articles - write-in campaign for Taylor!

Apr 22 2014 Published by under Information Science

Ok, so the only thing I'm doing for this campaign is posting here, but anyway.

College & Research Libraries is having a 75th anniversary special issue and they're asking readers to vote for the best articles: http://www.ala.org/acrl/publications/crl75 .

I don't know about a lot of the choices. I mean FW Lancaster (framed pdf) FTW, of course! And Kilgour, probably. Plus the anxiety one by Mellon (framed pdf) has definitely had an impact.

BUT they forgot the best one ever:

Taylor, R. S. (1968). Question-negotiation and information seeking in libraries. College & Research Libraries, 29(3), 178-194. (pdf)

Luckily there's a write-in block. So you know what to do... write it in!

 


Post I would like to write: New trend for linking external and internal information

I first noticed this in chemistry but now I'm seeing it in engineering, too. Major publishers (?), content vendors, indexes (?) are starting to offer services whereby your company can use their search tool to index your local content and display it side by side with their content, OR a way to use their API to pull their content into your local tool.

That addresses a common complaint in research labs and in companies. Knowledge management of internal information gets funded and defunded and is cool and then not cool... External information is familiar to people coming out of school... so how can you search across both?

We have the artist formerly known as Vivisimo (now IBM something spherical I think) as an intranet search, and I would love to see it use our discovery layer type search as a tab. I don't see why it couldn't.

This deserves analysis and thought - no time. sorry!


A word on ebook iPhone app usability

Aug 29 2013 Published by under Information Science

I'm not a usability expert although I certainly have read a bunch and seen a bunch of presentations (and know a few experts personally), but there are some basic ideas about understanding your user and the tasks they have to perform with your app or device or site that should be somewhat obvious.

I often read books and articles on my iPhone while nursing/rocking my babies. Maybe it makes me a bad mother but it sure has helped with patience over the past almost 18 months! If they're awake and up to shenanigans, I put the phone away and give them my full attention... but anyway. People are shocked and amazed that I can put up with reading a book on my iPhone. I'm not sure why - it's not a tiny font, I can make the font whatever size I need. I have the phone with me anyway. I don't need a separate light source. I can get new books right there instead of having to connect it to my laptop.

One of the things that is super, super important for an immersive reading experience is the ability to quickly turn pages - without even thinking about it and without losing your train of thought. When you're reading on a small screen, you might have like 4 page turns to every one you would have with a print book so it's something you do a lot. (particularly if you're reading <ahem> trashy bodice ripper romances <ahem> that read very quickly!)

Overdrive is the only app you're supposed to be able to use with the Overdrive license my local publib has. They have made two major mistakes with page turning - it's like they don't really get it? First, a while ago they added animation when you turn a page. So it would look like the corner turning up and going over - what a colossally bad idea! No one turns pages because it's cool - you turn pages to see what happens next. They quickly reversed that and made it an option. In the most recent update they've added a bunch of cool things like syncing across platforms (good), but they've now made it a swipe instead of a tap to turn the page... and you can't even swipe from the side because that opens a menu, you have to swipe in the middle... which is hard to do one-handed while holding the device. And it's slow... it has to think about it before turning. So then you have to go back and check what was happening and then go forward again... I had a book on there that I had had on hold for a while and I just gave up on it. I'm going back to reading about Web Corpus Construction in a pdf reader like Good Reader.

Update: This afternoon Overdrive released a new version that fixes the page turning issue. I can only hope that they learned from it this time when they didn't learn from it last time.


A local way to create a word cloud

Jul 03 2013 Published by under information analysis, Information Science

There's a lot of debate about whether word clouds are useful visualizations, but they do give you some insight into the prevalence of words in some amount of text.

Wordle is a way to do this online but you have to paste your text into a box or provide a URL to an RSS or Atom feed. So that won't work for some content because of sensitivity and it also won't work for batches of files. Here's a way I stumbled upon when I was trying to do some other analysis/visualizations.

Needed:

  • R
  • packages: tm (and all of its many dependencies), wordcloud, RColorBrewer (optional)
  • data

I originally had a 470 page pdf that I wanted to get a feel for. It was divided up into sections, each on a specific program. I ended up saving out each section as rtf by hand although I could have done it programmatically in some other program. I tried saving directly to plain text but that failed. I then used DocFrac to batch convert them to plain text (probably not necessary, but I wasn't sure). So then I had a directory of plain text files.

Load into R the whole directory:

library(tm)
thingy <- Corpus(DirSource("plaintext"), readerControl = list(reader = readPlain))

make a copy in case of emergencies, but mostly to use at a later step:

thingy.orig <- thingy

Then I did a bunch of clean up tasks pretty typical for this type of work (there is an easier way to do it but this way works. I didn't take into account the order of doing this, which I probably should have)

#remove stop words
exchanger <- function(x) removeWords(x, stopwords("english"))
thingy<- tm_map(thingy, exchanger)

#stem
exchanger <- function(x) stemDocument(x)
thingy<- tm_map(thingy, exchanger)

#lower case
exchanger <- function(x) tolower(x)
thingy<- tm_map(thingy, exchanger)

# remove numbers
exchanger <- function(x) removeNumbers(x)
thingy<- tm_map(thingy, exchanger)

# using gsub to remove punctuation
# I used this instead of the built in because I wanted it to leave a space where the punctuation had been
exchanger <- function(x) gsub('[[:punct:]]', ' ', x)
thingy<- tm_map(thingy, exchanger)
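
About the order and the "easier way": here is a rough sketch of the same cleanup with lower-casing done first, so that capitalized stop words like "The" get caught by the lower-case stop word list. It assumes a newer tm (0.6 or later, where content_transformer exists) and that SnowballC is installed for stemming. The two-document toy corpus and the name docs are made up for illustration; this is not what I actually ran.

```r
library(tm)

# Toy corpus standing in for the directory of plain text files (made-up text)
docs <- VCorpus(VectorSource(c("The Program's 3 goals: launching, testing!",
                               "Testing the launch of 12 instruments.")))

# 1. lower case first, so "The" matches the lower-case stop word list
docs <- tm_map(docs, content_transformer(tolower))
# 2. swap punctuation for a space (same idea as the gsub step in the post)
docs <- tm_map(docs, content_transformer(function(x) gsub("[[:punct:]]", " ", x)))
# 3. numbers, stop words, stemming, then tidy up the leftover spaces
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)

# pull the cleaned text back out as a character vector to eyeball it
cleaned <- sapply(docs, content)
print(cleaned)
```

Wrapping plain functions like tolower and gsub in content_transformer keeps the documents as tm documents in newer versions; the built-in transformations (removeNumbers, removeWords, stemDocument, stripWhitespace) can be passed to tm_map directly.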

In addition to the regular stop words, I removed a list of words that were particular to this text.

#custom list
mystop <- c(" "," ")
exchanger <- function(x) removeWords(x, mystop)
thingy<- tm_map(thingy, exchanger)

Next you should take a peek at at least one specific document.

inspect (thingy[3])

And you can add more terms to the custom list or do a gsub or whatever.

Then you need to build a document-term matrix (documents in rows, terms in columns).

# create document-term matrix
dtm <- DocumentTermMatrix(thingy)

And it needs to be in matrix format. I understand there's a limit to how big these can be. There are some options to make them smaller.

dtm.mat <- as.matrix(dtm)
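
On the size limit: one way around it, as far as I can tell, is to skip the dense matrix and sum the columns of the sparse one directly, or to drop rare terms before converting. A small sketch, assuming the slam package (which tm uses for its sparse matrices) is available. The three toy documents and the 0.5 threshold are made up for illustration.

```r
library(tm)
library(slam)

# Toy corpus (made-up text) so the sketch runs on its own
docs <- VCorpus(VectorSource(c("alpha beta beta", "beta gamma", "beta delta delta")))
dtm <- DocumentTermMatrix(docs)

# Option 1: term frequencies straight from the sparse matrix, no as.matrix()
freq_sparse <- sort(col_sums(dtm), decreasing = TRUE)

# Option 2: drop terms that are missing from too many documents,
# then convert the smaller matrix
dtm_small <- removeSparseTerms(dtm, 0.5)
dtm_small_mat <- as.matrix(dtm_small)

print(freq_sparse)
print(Terms(dtm_small))
```

The vector from col_sums can be used in place of v in the next step. removeSparseTerms changes which words can show up in the cloud, so the threshold is a judgment call.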

Then you need to get a vector of all the words and it's nice to have them in order of decreasing frequency. Plus you need the frequencies.

v <- sort(colSums(dtm.mat), decreasing=TRUE)
myNames <- names(v)

Then you pick your colorbrewer color scheme and draw your cloud:

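library(wordcloud)
library(RColorBrewer)
pal <- brewer.pal(12,"Paired")
#add in seed to make it reproducible
set.seed (300)
# same call as before, with the positional arguments named
wordcloud(myNames, v, scale=c(8,.3), min.freq=7, random.order=TRUE, random.color=TRUE, rot.per=.15, colors=pal)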

Now here's the issue. Our words are still stemmed. Well, there is a stem completion function which goes through and finds the most prevalent complete version of that word in a dictionary. In the examples, they often use the original corpus as the dictionary... the problem is that it's very, very, very slow. In fact, it can take days. I have some options I'm considering to get around this:

  1. try the wordnet package
  2. use python nltk to lemmatize prior to doing the r stuff (lemmas are real words unlike the stems)
  3. change the script to make it not iterate so much

Stem completion ran for a week on my computer before it was forced to reboot from an update... I haven't had a chance to try the other methods yet.
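
For option 3, one idea I have not tested on the real data: build a lookup table once that maps each stem to its most frequent full word, then complete only the stems that end up in the cloud. This is a stand-in for tm's stem completion, not the same function, and it assumes the SnowballC stemmer gives the same stems that were used in the cleanup. The word list and the names lookup and complete_stems are hypothetical.

```r
library(SnowballC)

# Hypothetical tokens standing in for the words of the original corpus
original_words <- c("programs", "program", "programming",
                    "launch", "launched", "launches", "launched")

counts <- table(original_words)          # frequency of each full word
full_words <- names(counts)
stems <- wordStem(full_words, language = "english")

# For each stem, keep the most frequent full word that reduces to it
lookup <- sapply(split(full_words, stems),
                 function(w) w[which.max(counts[w])])

# Complete a vector of stems; unknown stems are returned unchanged
complete_stems <- function(x) {
  out <- unname(lookup[x])
  ifelse(is.na(out), x, out)
}

print(complete_stems(c("launch", "program", "zzz")))
```

The table is built in one pass over the distinct words, so the cost depends on the vocabulary size and not on how many times each stem has to be looked up.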


Instrument bibliographies, data citation, searching for data

Jan 11 2013 Published by under bibliometrics, Information Science

My place of work builds spacecraft and instruments that fly on other folks' spacecraft. So one of the things that we need to do is to come up with a list of publications that use our data. It's the same thing with telescopes and it ends up being a lot more difficult than you might expect. There are a few articles on how to do it from ADS staff and from librarians from ESO (Uta Grothkopf and co-authors), STScI, and other telescopes. It turns out that you really have to do a full text search to get closer to comprehensive. ADS has a fulltext search covering some things, but I asked the experts in the Physics-Math-Astro Division of SLA and many of them also use the fulltext searches on the journal publisher pages (which are of varying quality). I found that Google Scholar was the only thing that got book chapters. This is all pretty complicated if your instrument has a pretty common name or a name that is a very common word.

Other suggestions were to use funding data from Web of Science or elsewhere (soon to be part of CrossRef data), but that really only gets the science team for the instrument. Our main point is to find out who is downloading the data from the NASA site (or elsewhere?) and doing good science with it.

Heather Piwowar has done a lot of research on data citation (is it done, how do you find it), but I believe mostly with life sciences data. Joe Hourclé has also presented several times on data citation and there is the DataCite organization to work on this issue. But this is all future stuff. Right now it's the wild west.


ASIST2012: Other random sessions

Oct 31 2012 Published by under Information Science

These are random notes from the sessions I attended Sunday. I need a new laptop so I didn't bring my tired old one to live blog - these are from my scribbled notes on paper.

How much change do you get from 40$ - Erik Choi - this was a typology of failures in social q&a. The system offers some suggestions for how to do better questions but I think their intention was to use this research to help people ask better questions. As Joe Hourclé pointed out in questions - Stack Exchange supports query negotiation/refinement but they're looking at what to do with Yahoo, which is the most popular and has a lot of failed questions. Their big categories were: unclear, complex, inappropriate (prank, awkward...), multiquestion.

Dynamic query suggestions - dynamic search results - Chirag Shah. This was looking at Google's way of showing you results as you type and also offering search completions as you type. Google says it saves 2-5s per search, but they wanted to test it. They did it in a laboratory setting with 3 conditions - neither, only autocompletion, all. They gave a task asking users to search for information on the velvet revolution and other revolutions and they looked at the number of pages viewed, concepts (noun phrases?) used, eye tracking. The dynamic stuff didn't change the number of concepts in a query; the queries were shorter but not necessarily better.

How do libraries use social networking software to communicate to users - they looked at big libraries in English-speaking countries and "greater China" (Taiwan + Hong Kong + PRC). They looked at the posts and interviewed a librarian from each. Some discussion afterward how Weibo is better at supporting conversations than Twitter - it would almost have to be :)

Barriers to collaborative information seeking in organizations - I'll have to read this paper... he spent too much time on methods and really cut his results section short.


Clustering articles using Carrot2

Sep 11 2012 Published by under bibliometrics, Information Science

I did a very basic intro to using some social network analysis tools for bibliometrics here. This post will also be a brief how-I-did-something for people with similar skills to mine. In other words, if you're a computer programmer or the like, this will be too basic.

I've been getting into information analysis more and more at work. I've tried to learn this stuff and keep up as time goes on because I know that it's a very useful tool but it's taken this long to get any real uptake at my place of work. Typically I'll use metadata from either WoS or Scopus or from a collection of specific research databases filtered through RefWorks. Once I have the metadata from the research database I'll often run it through Vantage Point for some cleaning and to make matrices (co-authorship, author x keyword, etc). More recently, I've been using Sci2 for some of this.

All of the tools I use so far work with metadata but I get a lot of calls for doing mining with the text. I do know of tons of tools to do this but I think they all take a little more commitment than I'm willing to give right now (learning to program, for example). Some things can be done in R, but I really haven't tried that either as there is still a steep learning curve.

Anyway, a software developer (well he's really a lot more than that - he does rapid prototyping for human language technologies) buddy of mine from work has recommended Carrot2 a bunch of times. I now have a project that gives me an excuse to give it a whirl. We're mapping an area of research to pick out the key institutions, key authors, key venues... but also looking at the major areas of research. This could be done with author co-citation or bibliographic coupling, but another way is to cluster the articles based on their abstracts - I used Carrot2 for this. A reason not to use Sci2 with WoS data to do ACA or bib coupling is that for this particular research area I was having a very hard time getting a nice clean tight search in WoS whereas some social sciences databases were yielding great results. As I was just telling a group at work, a lot depends on your starting set - if you do a crap search with lots of noise, then your bibliometrics aren't reliable and can be embarrassing.

Carrot2 out of the box will cluster search engine results from Bing, Google, and PubMed. If you download it, you can incorporate it into various programming thingies, and you can also use the document clustering workbench on any xml file or feed (like rss). They have a very simple xml input format and you use an xslt to get your base file or feed to look like that. I exported my records from RefWorks in their xml and I started reading up on XSLT...after some playing around I had an epiphany - I could just make a custom export format to get the input format directly from RefWorks!

I started from the RW xml but could have gone from scratch. In the output style editor, bibliography settings:

reference list title: <?xml version="1.0"  ?>\n<searchresult>\n<query></query>\n

text after bibliography: </searchresult>

Then I only defined generic and all the ref types use that:

refid precede with <document id="

follow with ">

Basically do the same for title, primary; abstract (call it snippet); and url

Then add text to output: </document>

You end up with

<?xml version="1.0" ?>
<searchresult>
<query>paste your subject here, its supposed to help</query>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
</searchresult>

More or less, that is. I had some spaces I needed to remove. There was also one weird character that caused an error.
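
If you'd rather script it than fight with an output style, here is a rough R sketch that writes the same input format from a table of records and cleans up the kind of thing that tripped me up. I'm assuming the weird character was something like a stray control character or an unescaped ampersand, which I never actually confirmed. The function names (xml_escape, to_carrot2_xml), the column names, and the sample records are all made up for illustration.

```r
# Escape the characters XML cares about and strip control characters
xml_escape <- function(x) {
  x <- gsub("[[:cntrl:]]", " ", x)           # control characters (incl. tabs/newlines)
  x <- gsub("&", "&amp;", x, fixed = TRUE)   # must come before the other entities
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub('"', "&quot;", x, fixed = TRUE)
  trimws(x)                                   # drop leading/trailing spaces
}

# Build the searchresult/document structure shown above as one string
to_carrot2_xml <- function(records, query = "") {
  docs <- sprintf(
    '<document id="%s">\n<title>%s</title>\n<url>%s</url>\n<snippet>%s</snippet>\n</document>',
    xml_escape(records$id), xml_escape(records$title),
    xml_escape(records$url), xml_escape(records$abstract))
  paste(c('<?xml version="1.0" ?>', "<searchresult>",
          sprintf("<query>%s</query>", xml_escape(query)),
          docs, "</searchresult>"), collapse = "\n")
}

# Made-up records standing in for a RefWorks export
records <- data.frame(
  id = c("1", "2"),
  title = c(" R&D in small labs ", "Searching <internal> content"),
  url = c("http://example.org/1", "http://example.org/2"),
  abstract = c("First abstract\fwith a form feed.", "Second abstract."),
  stringsAsFactors = FALSE)

xml <- to_carrot2_xml(records, query = "knowledge management")
cat(xml, "\n")
# writeLines(xml, "carrot2_input.xml")  # then point the workbench at this file
```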

Then in Carrot2 workbench you select XML and then identify the location of the file and shazaam! You get 3 different visualizations and you can export the clusters. One of my biggest was copyright Sage but it can be tuned and you can add to the stopword list if you want. I still want to play with the tuning and the different clustering methods.

 


Re-post: Commentary on: The persistence of behavior and form in the organization of personal information

Feb 10 2012 Published by under Information Science

This was originally posted on my blog on November 17, 2007. Deborah Barreau passed away this morning from cancer. There's a lovely message from Gary Marchionini on asis-l.

---

This post is a review and commentary on: Barreau, D. (2008). The persistence of behavior and form in the organization of personal information. Journal of the American Society for Information Science and Technology 59, 307-317. DOI: 10.1002/asi.20752

Goal: Barreau re-visits her 1993 study (published in 1995) in which she interviewed seven managers to determine how they manage electronic documents. In particular, in her 1995 study, her goal was to examine how Kwasnik's (1991) dimensions of organization of print materials translated into the electronic domain. In this study, her goal is to learn what has changed in the more than ten years and what impact new technologies have had.

Methods: Her sample consists of 4 of the 7 managers interviewed in her earlier study. She asked the participants broad questions on what personal information they have in their office, how they got it, how they organize it, and how they find things in it. She also asked what changes they would like to see in the technology.

The responses were coded using Kwasnik's dimensions. No information is provided on how the interviews were conducted and how the coding was actually performed. There are mentions of transcripts and notes, however. A sample of the statements were "double-coded" and an intercoder reliability check was done. (I almost missed this bit because the html is a bit goofy to read)

Results: I will just pull out a few interesting points here.

  • participants saw their intranet as an extension of personal space when they had bookmarked or used send to desktop as a link to keep some information.
  • they bookmark stuff and then never use it
  • participants were split between keeping a clean e-mail inbox by acting on or deleting things immediately and reporting that their e-mail was out of control
  • retrieval is through browsing an ordered list

Changes they would like to see: synchronized single sign on

Conclusions: Many things remained the same. The way the managers name files, and use catch-all directories were two things in particular. Some things that have changed include the extension of the personal space to include bookmarked things from the web and the sheer number of different systems required to do the job. New dimensions are suggested to update Kwasnik's listing.

Commentary: My immediate reaction to this article was very positive -- mostly perhaps because it resonates with my own findings (Pikas, 2007). More information on methods is required to adequately judge the validity and transferability of this work.

She makes the point that corporations need to do better to back up users' work. This is something that also came out in my study. It could be that the corporations *are* doing a good job of backing information up but are not *communicating* well enough so that users trust the backups.

She also makes the point that organizations need to do better with e-mail. First, for records management purposes, they should discourage the retention of older e-mails. I strongly, strongly disagree with this. Much valuable information is included in e-mails - only in e-mails - and there should not be an arbitrary retention policy requiring their deletion if the user finds them useful (yes, I do know about e-discovery, but if you're not doing anything wrong - I guess I'm naive). Second, she states that organizations should do something about advertising e-mails received (ok, this is fine), about broader distribution lists than are required for the job (ok, I was getting e-mails in Maryland once for things lost and found in the Philadelphia office - so this is clearly a management issue), and about too many interruptions. I disagree about the interruptions truly being something that the organization as a whole can/should fix through rule making. This article speaks to me that more training is required on the effective use of e-mail and IM. Perhaps the users should employ a do not disturb message on IM and log out of e-mail if they are working on an intensive task.

This is my first use of the BPR3 logo so I would be happy to take comments on that (or complaints if I'm not doing it right!)

---
Barreau, D. K. (1995). Context as a factor in personal information management systems. Journal of the American Society for Information Science, 46(5), 327-339. DOI:10.1002/(SICI)1097-4571(199506)46:5<327::aid-asi4>3.0.CO;2-C

Kwasnik, B. H. (1991). The importance of factors that are not document attributes in the organization of personal documents. Journal of Documentation, 47(4), 389-398.

Pikas, C. K. (2007). Personal information management strategies and tactics used by senior engineers. Proceedings of the Annual Meeting of the American Society for Information Science and Technology, Milwaukee, WI, 44, paper 14. (This will be made available open access 90 days after the conference)



Research Database Vendors Should Know

Research database vendors - the ones who design the interfaces that the end users use - should know that data export is not a trivial addition. Rather it is an essential part of their product.

Over and over and over again, librarians complain about one interface that works one day and doesn't work the next. The one that doesn't output the DOI unless you select complete format. The one that all of a sudden stopped exporting the journal name. The interfaces that don't work with any known citation manager. The ones that download a text file with 3 random fields instead of directly exporting the full citation and abstract.

But you blow us off and you act like it's not important.

Well. I was just talking to a faculty member at another institution - even though a particular database is most appropriate for her research area and she finds interesting papers there, she now refuses to use it because it doesn't export to EndNote right. She's tired of the frustration and she is tired of finding that she has to edit everything she's imported so she's just given it up.

Once again librarians are telling you something and you need to listen. Researchers and faculty are super busy. They will not keep using your product if it makes their life harder. If they don't use your product then we'll stop subscribing. That's all there is to it.


It’s not programming, it’s not in my job description, but it took a lot of time these past 2 weeks

Aug 07 2011 Published by under Information Science

I didn’t participate in the library day exercise this time – that’s when you document a day in the life in your job as a librarian so that others can see what it’s like to be a librarian. I have a somewhat non-traditional job so my normal day in the life isn’t like the normal day in the life for most other librarians. It struck me earlier this week – as I got frustrated with things going wrong on 3 fronts at once – that I’m not even sure how to describe the work I’m doing.

Project one: we have an intranet search service that goes way beyond the standard appliance you might purchase. In addition to connecting to our document repository, sucking in the index from our web crawler, and indexing our SharePoint installation, it has an “expertise” tab. This is an index of custom compiled profiles for our employees with information pulled from their resumes, their MySites (a part of SharePoint where you can have a profile), the corporate directories, and, more recently, internal grant submissions and social network participation. One of the obvious things that’s missing is a listing of the external research articles the employees have written. For various complicated reasons, another librarian and I maintain the most complete listing of these documents. We take alerts from all of our various databases, import the records into RefWorks, and then export them in a custom built Movable Type format which we then import into our listing which is on a MT blog. The tool we’re using can’t just index the web listing, as the author names are not pulled out and we need to link the articles to the directory IDs so that they turn up in the right profiles. The obvious thing was to just export the RefWorks records in some tagged or XML format. The only thing is that this would show the citation, but not provide any help on getting to the full text (we have a very hard time marketing our services). Great, so you can’t update an export format but you can make a custom bibliography with an OpenURL link. Anyway, to end this long description, I had to completely make an export format from scratch so that another librarian could import it into Excel, run a script to parse out authors and match them to the directory ID (most of the time), and then upload them to a SharePoint list (ew.)

Project two: my larger institution has been working very hard on migrating to a new interface to our catalog. This runs on Blacklight, an open source effort led by UVa. It’s fabulous and we’re all very excited about it. Unfortunately, this means that other tools that made calls to the old interface will have to be changed. This includes Z39.50 services and things like LibX. If you aren’t aware of LibX, it’s a fabulous browser add-on that adds cues to bookstore websites so you can see if the book you’re viewing is available at your library; hotlinks PubMed IDs, DOIs, ISBNs, ISSNs so that you can click on them to see if the resource is available from your library; lets you reload a page through your proxy server for off-site access; and lets you search for highlighted things on the page in your catalog or other services you’ve added. Ok, so obviously my larger institution’s LibX needed to be updated. I’m the only one left who knows how (although I should have been training 2 other people but have not - not their fault, but mine since I’ve been busy) and I’m the primary maintainer. Mostly because I volunteered. Anyhow, I was totally baffled by the edition builder for a while, but then I was able to see what UVa did with theirs and then edit that. Basically I tried to take theirs and then substitute in any changes I knew about and then sent the information to our real programmer (who has been slammed with work) to see if I was close. He made a couple of suggestions and we were off to the races… except for … CRAP! It uses an OCLC service that needed to have our institution’s registry updated. Neither of us knew who was *supposed* to fix that, so I put in a couple of tickets and changed my lab’s registry information while the programmer changed the larger institution’s registry. And crap again, because then it came back thinking we could only ever search one ISBN at a time which is not true – you can OR a large (if not infinite) number.
Finally, I got that semi-fixed (it now works for up to 25 ISBNs coming back from the xISBN service). We’ve tested it and I pushed it out live on Friday.

Project three: we’re updating our internal portal page for information services – we had planned to suck in listings and descriptions and what not from the services offered by our parent institution… but ARGH, the site is being built in some version of SharePoint and of course it’s acting all wonky. Even embedding a catalog search is turning out to be a hassle. So I started listing resources and building resource guides – which is actually a very typical job for a librarian… next job is to help the people figure out how to embed these services even though a) I’m not a programmer b) I don’t know anything about SP and c) I have other stuff I need to be doing

Project four: I’m embedded in a team in a sponsor-facing department working on a distributed knowledge management system that’s running on Semantic MediaWiki. Unfortunately the two members of the team with whom I work most closely have been completely pulled off for a few weeks to work on another project so that leaves me. So I’ve learned how to write these hugely complicated nested queries using arraymaps and templates to display results. Once again, it’s not programming, but it’s not literature searching either. It doesn’t seem that difficult … but it was, for me. I wish I could show this off, maybe eventually once it’s delivered.

Project five: It started with standard scientometrics, but I was working with an HCI/visualization expert who prototyped a new system for exploring and visualizing connections, etc. I had already done a lot of data cleanup and visualization, but he needed data in a different format. So this is another example of me massaging data exports from various tools (Sci2, VantagePoint, and originally WoS) for import to another system. We also did a proposal for an internal grant for next year to continue working on this so that took some time.

 

I think there were more, but these were the ones that struck me. It’s not programming, it’s really messing with settings with existing products, but that doesn’t seem to really capture the complexity or frustration. Oh, and in the middle of this my institution changed over to default to the new RefWorks interface – good – so I had to redo the tutorials (only got one done). About 18 hours after I did the update and announced it, the interface changed again so I needed to redo a few screenshots… sigh.

