Wednesday, January 23, 2013

Struggling with R

I used to have problems understanding large volumes of data. Then I installed R. Now I have more problems.

This evening I took a bus over to Kent Ridge, where the campus of the National University of Singapore sits. I was there for an introductory course on R, which made my brain melt and try to fall out of my ears, and now I'm sat very, very still on my bed, waiting for my grey matter to solidify again so I can walk around without it evacuating itself.

The course on R was taught by a very enthusiastic chap, but unfortunately that enthusiasm meant he overlooked some of the flaws of R. For a start, there are at least four different ways to ask R for help on a specific function. It wouldn't be so bad if they gave different sorts of help, but if ?something and help(something) both tell you exactly the same things about something, then it just seems like complexity for complexity's sake. I suppose if half the people had keyboards that were missing the question mark key, and the other half didn't have any parentheses, then it might have been helpful, but if we had computers that bad then we shouldn't have been mucking about trying to learn statistical packages.
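For the record, here is roughly what the four-ways gag looks like in practice, using mean() as a stand-in for whatever function you're stuck on (the assignments are just there to stop R printing four copies of the same page):

```r
# At least four ways to ask for the documentation of mean(),
# all of which land you on exactly the same help page:
h1 <- help(mean)           # the canonical function call
h2 <- help("mean")         # quoted form; needed for operators, e.g. help("+")
h3 <- ?mean                # shorthand that expands to help(mean)
h4 <- help.search("mean")  # fuzzy search across installed packages, a.k.a. ??mean
```

The only one that genuinely earns its keep is help.search(), since it searches rather than looks up; the rest are interchangeable.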

R is the Swiss Army Knife of statistical packages, if a Swiss Army Knife had 600 different blades, half of which do exactly the same thing as the other blades, but to use them you have to waggle the knife with a particular set of unique gestures that are documented in mind-bending detail somewhere else. I like the idea of it being very flexible and powerful, but the trouble is that when you're trying to learn it, faced with a multiverse of different, almost indistinguishable ways to perform each abstract task, it's very hard to actually retain much. I wanted somebody to show me how to light up das blinkenlighten, and then walk me back through how that worked, rather than spend 90 minutes demonstrating the many and varied ways that you can populate a vector.
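To give you a taste of those 90 minutes, here are five of the many and varied ways to produce the same five-element vector (picking the numbers 1 to 5 purely as an example):

```r
# Five ways to make the numbers one to five:
a <- c(1, 2, 3, 4, 5)    # concatenate explicit values
b <- 1:5                 # the integer sequence operator
v <- seq(1, 5)           # seq(), the general-purpose sequence function
d <- seq_len(5)          # a sequence of a given length
e <- vector("numeric", 5) # preallocate a zero-filled vector...
e[] <- 1:5                # ...then fill it in place
```

They all compare equal, though pedantically some are doubles and some are integers, which is exactly the sort of distinction the course dwelt on.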

So that was a bit frustrating, especially as we kept seeing glimpses of things that you could do, like using sentiment analysis on Twitter to figure out what people thought of the current by-election. (Somebody else has used R to mine all the diplomatic cables released by Wikileaks, which led them to the rather obvious conclusion that Saddam Hussein was less popular with American diplomats when there was an actual war being run against Iraq, but at least there was a pretty graph.) Instead of looking at those in detail, we had to make do with more abstract things. It was a bit like wanting to be an Olympic sprinter, and ending up in a class run by Bertrand Russell that tried to prove the existence of shoelaces from first principles.

But hope springs eternal. I still think that if I plug away at this for a while, I may get something useful out, and if nothing else, I learned tonight about CRAN's big list of Task Views, which includes a whole view on Natural Language Processing that I didn't know about. That in turn has given me the idea for my own Twitter-based sentiment analysis, where I'll go and find the most popular things on Twitter on any one day, and then say something contrary.

Yes, I know it's quicker in the short term to just go on Twitter and say something contrary, but using lots of computing power and complex statistical equipment means I could be aggravating people on the internet 24-7, and that's got to be worth something, right?
