I had a little slow period a month or so ago at work (not anymore, for sure!) and I decided it was time to start working on a goal I had set for myself for the year: learn to do some analysis that actually uses the full text of the document vs. just the metadata. Elsewhere I have discussed using Sci2, VantagePoint, bibliometrics, and actually Carrot2 (using the text of the abstract), but I need to go further. I don't aspire to become an expert in natural language processing (NLP) but there are some times I end up having to stop before I want to because I just don't know how to go on.
Anyhoo... first step was to see what I could do in R using the TM package and whatever else. I figured out how to do a word cloud but meh on some of the other tm stuff. I tried a little LDA but my corpus didn't work well with that. When doing the word cloud I realized I really wanted to lemmatize instead of stem. I looked around for ways to do it in R, and there is a WordNet package for R (thanks Greg Laden for pointing it out!) but it just wasn't doing it for me. I had recently worked my way through a bunch of the Python lessons on Code Academy and also bookmarked nltk - the natural language toolkit that works in python so I thought - ah-ha!
The first big deal was installing the stupid thing - language. Argh. I started with Eclipse and PyDev but alas, I am so not willing to figure out how that really works. I got one sample program running but next program it kept running the first one so meh.
I started working my way through the nltk book, and that used the shell mode I guess? where you get immediate responses? Installing packages - I never did figure out how to do that in Perl - it's easy in R but alas... so I gave up on PyDev and installed ActivePython which has a handy-dandy package installer which lo and behold works for people like me who only know enough to be dangerous.
The other things I'm learning: holy cow ignore what your computer is and do everything 32 bit for the love of chocolate. A bunch of problems from installing 64 bit where everything is looking for 32 bit. Uninstall and try again.
I still haven't figured out how to use the programming environment (?) that ships with ActivePython. I really like how RStudio completes things and that's why I wanted to use Eclipse. I'll have to try that next.
Anyway, I hope to take some notes and leave them here for my future recall as it's easy to forget how things worked.