Sunday, June 27, 2010

Playing with language - the technical gubbins

I'm really enjoying using the command line today.  I spend most of my time at work in Windows, and for most spreadsheet work that's ok (I don't want to draw a graph from the command line, thanks very much), but if you want to do something iteratively, a point and click interface often isn't the best way, especially once your desired workflow is a bit different to what the program thinks it should be.  Plus, Linux comes with lots more free tools that are easily accessible, as long as you don't mind getting your hands a bit dirtier.  If mucking around with scripts isn't your thing, don't read any further.

So, first thing this morning I installed curl, then wrote this script:

    # fetch pages 6 to 2794 in turn, appending each one to content_base
    for ((  i = 6 ;  i <= 2794;  i++  ))
    do
      # quoting the URL stops the shell worrying about the ? while still expanding $i
      curl "http://www.yourwebsitehere.com/content/page.php?id=$i" >> content_base
      sleep 15
    done
That sits in a terminal window and chunters away to itself, every 15 seconds going off to yourwebsitehere.com and collecting the next page.  I'm trying to be polite and not hammer the site, so I take a pause and count to fifteen between page requests.  That makes the process a bit slower, but I can let it sit here and chug along all day, because I don't have to click 'OK' on a dialogue box for each of the 2789 occurrences.  Consider for a moment how dull it would be to click "Save As ..." for each of those pages...

This gives me an enormous file, content_base, that is still full of lots of html tags.  They'll get in the way of my word frequency counts later on, so we use grep to clean them out:
    grep -v '\.blockhead' content_base > clean1
    grep -v 'font-family' clean1 > clean2
Each time, grep will take all the lines that don't include one of those phrases and output them to the file at the end of the command.  There's probably a cleaner way to do this where it doesn't create all these intermediate files, but I was too dumb this morning to figure it out.  (I did try grep -v 'html' clean1 >> clean1 but that didn't turn out to be so clever; I ended up appending almost all of clean1 back to itself, again and again and again until I killed that process.)
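For what it's worth, you could probably skip the intermediate files entirely by handing grep several patterns at once (or by piping one grep straight into the next).  Something along these lines, though this is just a sketch rather than what I actually ran, and clean_all is a made-up name:

    # -e lets grep take more than one pattern; -v still drops any line that matches any of them
    grep -v -e '\.blockhead' -e 'font-family' content_base > clean_all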

This gave me a lot of intermediate files I didn't really have any need for, but because I'd been giving them a logical naming structure (start at clean1 and increment until you get to cleanaleph-zero) I could just write another shell script to get rid of them:

    for ((  i = 1 ;  i <= 26;  i++  ))
    do
      rm clean$i
    done

Tidy!  And because I'm lazy and forgetful, I put this at the end of the shell script I use to generate the intermediate files, so I don't have to worry about remembering to run it after I've run the cleaning processes.
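So the shape of the thing is roughly this (a sketch of how the pieces slot together, not the script verbatim):

    # the grep passes come first, each one reading the previous clean file...
    grep -v '\.blockhead' content_base > clean1
    grep -v 'font-family' clean1 > clean2
    # ...more passes as needed, finishing at the last numbered clean file...
    # ...and then the tidy-up loop runs at the very end
    for ((  i = 1 ;  i <= 26;  i++  ))
    do
      rm clean$i
    done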

And then I realised I had a whole lot of <p> tags that needed getting rid of too.  Grep wouldn't help here because I'd exclude the whole line, and every line of interesting content has <p> at the start too.  Never mind!  I can use this single line of perl to zap them all:
    perl -pi -e "s/<p>//g;" clean27
If I'd wanted to replace all the <p> with "I'm a big hairy elephant" then I'd have just done
    perl -pi -e "s/<p>/I'm a big hairy elephant/g;" clean27  
... although I had no need for that this time round.

Now I have a big file full of words.  You can use wc to count them up, and tr to fiddle around with them, but both of these are a bit awkward to handle: although the little pipeline below will give you all the words, along with their frequencies, it orders them alphabetically, whereas I want to see the most frequent words at the start of my file.
    tr ' ' '\n' < clean28 | sort | uniq -c > wordcount
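(You could probably get the ordering I'm after by bolting a reverse numeric sort onto the end of that pipeline, something like the line below, though I didn't take that route:)

    # sort -rn sorts on the leading counts, biggest first
    tr ' ' '\n' < clean28 | sort | uniq -c | sort -rn > wordcount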
So instead I installed wf, a word frequency tool, from http://www.async.com.br/~marcelo/wf/, which cost nothing and which I could set up in a couple of minutes (once I remembered that I needed to give the installer permission to write to the directory it lived in).

Then you can start to look for common words (-i gives us case insensitivity (yay!) and -s sorts the output, largest to smallest):
    wf clean28 -s -i > wordcount
At which point you'll find that there's still a lot of garbage, or at least words that are too common to be interesting:

 9214 the
 7579 i
 6230 to
 5479 and
 5253 a
 4425 of
 3814 that
 3629 it
 3186 was
 3126 in
 1992 t
 1887 my
 1799 is


Unsurprisingly, because it's a personal blog, the word 'I' is very frequent. But other words like 'and', 'the', 'it' and 'was' are not going to be very helpful for figuring out what the most important content is. We'll need to build an exclusion list to deal with them.
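Something like this would probably do the trick: keep a file of words to ignore, one per line, and use grep to knock out any line of the word count whose word is on the list.  This is just a sketch, and stopwords.txt is a made-up name for a list I haven't actually built yet:

    # stopwords.txt: one boring word per line (the, i, to, and, a, of, ...)
    # -w matches whole words only, -F treats them as plain strings, -f reads them from the file
    grep -v -w -F -f stopwords.txt wordcount > wordcount_filtered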

2 comments:

Anonymous said...

I have no idea what this post is about, but I am very impressed nevertheless.

Mr Cushtie said...

Ah, it's part of my 'Enry 'Iggins scheme to teach a computer to speak proper, like. This is the unglamorous boiler-room stuff - soon the shiny fun output should appear, and all and sundry will be impressed by my technological wizardry.
