Thursday, February 21, 2013

Fiddling about with words

Over the last two years, I wrote an awful lot of words about the 23 Bond films that had been made up until the end of 2012. (We don't count the David Niven Casino Royale, or Never Say Never Again, because they were, well, shit.) Inspired by Jon Millward's Deep Inside, a statistical analysis of porn stars, I thought I'd do some data mining of my own. Unfortunately, I'm still not very familiar with R, that programmatic workbench beloved of statisticians, so to get my feet wet I thought I'd start by analysing and displaying the frequency of different words in what I wrote. And here's what you get:

This is just the Brosnan and Craig films, but as you can see there's two words that come up a lot more often than any others: 'film' (unsurprisingly) and 'theres', which did surprise me a little. (Part of what happens when we analyse the corpus of words that I've churned out is to eliminate extraneous punctuation, which I think is why we see 'theres' rather than 'there's'.) Perhaps I need to watch out for my overuse of 'there's' in future.
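For the curious, the counting step boils down to something like the sketch below. It's written in Python rather than the R I actually ran, and the function name, the tiny stopword list and the sample sentence are all made up for illustration. What it does show is how stripping punctuation turns "there's" into "theres" before anything gets counted.

```python
import re
from collections import Counter

# Hypothetical stopword list, far shorter than anything a real
# text-mining package would ship with.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "it"}

def word_frequencies(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords, count."""
    lowered = text.lower()
    # Delete every character that is not a letter, digit or whitespace.
    # Apostrophes go too, which is how "there's" becomes "theres".
    stripped = re.sub(r"[^a-z0-9\s]", "", lowered)
    words = [w for w in stripped.split() if w not in STOPWORDS]
    return Counter(words)

# Invented sample text, not taken from the reviews.
sample = "There's a film. There's a villain, and there's the film's ending."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

A word cloud is then just these counts drawn with the font size scaled to the frequency.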

Being lazy, and possibly rather too pleased with myself (as anyone who starts text mining their own output probably is), I then fed all the other Bond reviews I'd written into R, and regenerated the word cloud, which now looked like this:

Again, you'll have to zoom in to see the detail, but it appears that "women" and "woman" feature more frequently in the earlier films, and with the onset of Brosnan and Craig, things went a bit more masculine than before.

Or I'd got sick of writing about Bond girls.

There's a few odd things showing up - it looks like 'Goldfinger' was a word that cropped up time and time again, but is it worrying that 'salaryman' appears to be more common than 'happy'? Does that say something about how my mood deteriorated, or is it just that I didn't clean the rubric "An English salaryman surveys his..." from the file before submitting it for analysis? Ah, so many questions...

But this is only a start. There's some funky things within the wordcloud package for R that should make it possible to compare multiple data sets, which means I should have a graphical way to display how I got bored of Sean Connery over time, and grew more forgiving of Roger Moore. Perhaps. Blogalongabond now seems like a strange, quasi-unbelievable dream.
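The idea behind comparing two sets of reviews can be sketched without any plotting at all: work out each word's share of each set, then look for the words where the shares differ most. Again, this is a Python sketch under my own assumptions, not what the R package does internally. The function names are hypothetical and the word lists are invented toy data, not real counts from the reviews.

```python
from collections import Counter

def relative_frequencies(words):
    """Map each word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def most_distinctive(words_a, words_b, top=3):
    """Words whose share of set A most exceeds their share of set B."""
    freq_a = relative_frequencies(words_a)
    freq_b = relative_frequencies(words_b)
    vocab = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocab}
    # Largest positive difference first; ties broken alphabetically.
    return sorted(diffs, key=lambda w: (-diffs[w], w))[:top]

# Invented toy word lists, purely for illustration.
early = ["women", "women", "women", "film", "gadget"]
later = ["film", "film", "film", "gritty", "women"]

print(most_distinctive(early, later, top=1))
print(most_distinctive(later, early, top=1))
```

Using shares rather than raw counts matters here, because otherwise whichever set has more words in it would dominate the comparison.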

I'm doing less than scratching the surface right now - the most I've done is to copy some existing code from here, and then batter it around until it absorbs the Bond wordage and spits it out the way I want. What I really wanted to do tonight was mash together crime and weather statistics, but sadly the FBI don't make a freely available breakdown of crimes by day that I could join to weather data, so I'm a way off being able to show that rainy days make people kill one another. Perhaps this is a start, though.

If you're interested, here's the word clouds for Dr No and Skyfall, respectively:

Dr No looks strangely like a pink hand grenade if you squint
Whereas with Skyfall, can you spot the words "complain", "saville" and "egg"?
