So, first thing this morning I installed curl, then wrote this script:
for (( i = 6 ; i <= 2794; i++ ))
do
curl "http://www.yourwebsitehere.com/content/page.php?id=$i" >> content_base
sleep 15
done

That sits in a terminal window and chunters away to itself, every 15 seconds going off to yourwebsitehere.com and collecting the next page. I'm trying to be polite and not hammer the site, so I take a pause and count to fifteen between page requests. That makes the process a bit slower, but I can let it sit here and chug along all day, because I don't have to click 'OK' on a dialogue box for each of the 2789 occurrences. Consider for a moment how dull it would be to click "Save As ..." for each of those pages...
This gives me an enormous file, content_base, that is still full of HTML tags. They'll get in the way of my word frequency counts later on, so we use grep to clean them out:
grep -v '\.blockhead' content_base > clean1
grep -v 'font-family' clean1 > clean2

Each time, grep takes all the lines that don't include one of those phrases and outputs them to the file named at the end of the command. There's probably a cleaner way to do this that doesn't create all these intermediate files, but I was too dumb this morning to figure it out. (I did try grep -v 'html' clean1 >> clean1, but that didn't turn out to be so clever; I ended up appending almost all of clean1 back to itself, again and again and again until I killed that process.)
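(For what it's worth, the tidier version I couldn't see this morning is probably just to chain the greps together with pipes, so only one output file ever gets written. A sketch, with an illustrative pattern list and filename:)

# a sketch: pipe each grep into the next, so no intermediate files are created
grep -v '\.blockhead' content_base | grep -v 'font-family' | grep -v 'html' > clean_all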
This gave me a lot of intermediate files I didn't really have any need for, but because I'd been giving them a logical naming structure (start at clean1 and increment until you get to cleanaleph-zero) I could just write another shell script to get rid of them:
for (( i = 1 ; i <= 26; i++ ))
do
rm clean$i
done
Tidy! And because I'm lazy and forgetful, I put this at the end of the shell script I use to generate the intermediate files, so I don't have to worry about remembering to run it after I've run the cleaning processes.
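(The combined cleaning script, then, ends up shaped roughly like this sketch, with the middle grep stages abbreviated and the tidy-up loop tacked on at the end:)

# a sketch of the combined cleaning script
grep -v '\.blockhead' content_base > clean1
grep -v 'font-family' clean1 > clean2
# ... further grep stages, the last one writing clean27 ...
for (( i = 1 ; i <= 26; i++ ))
do
rm clean$i
done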
And then I realised I had a whole lot of <p> tags that needed getting rid of too. Grep wouldn't help here, because it would exclude the whole line, and every line of interesting content starts with <p>. Never mind! I can use this single line of perl to zap them all:
perl -pi -e "s/<p>//g;" clean27If I'd wanted to replace all the <p> with "I'm a big hairy elephant" then I'd have just done
perl -pi -e "s/<p>/I'm a big hairy elephant/g;" clean27... although I had no need for that this time round.
Now I have a big file full of words. You can use wc to count them up and tr to fiddle around with them, but both of these are a bit difficult to handle, because although this script will give you all the words, along with their frequency, it orders them alphabetically, whereas I want to see the most frequent words at the start of my file:
tr ' ' '
' < clean28 | sort | uniq -c > wordcount

So instead I installed wf, a word frequency tool, from http://www.async.com.br/~marcelo/wf/, which cost nothing and which I could set up in a couple of minutes (once I remembered that I needed to give the installer permission to write to the directory that it lived in).
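(For completeness, a second, numeric, reversed sort tacked onto the end of the tr pipeline above would probably also have put the biggest counts first; a sketch:)

# a sketch: add sort -rn so the largest counts come out on top
tr ' ' '
' < clean28 | sort | uniq -c | sort -rn > wordcount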
Then you can start to look for common words (-i gives us case insensitivity (yay!) and -s sorts the output, largest to smallest):
wf clean28 -s -i > wordcount

At which point you'll find that there's still a lot of garbage, or at least words that are too common to be interesting:
9214 the
7579 i
6230 to
5479 and
5253 a
4425 of
3814 that
3629 it
3186 was
3126 in
1992 t
1887 my
1799 is
Unsurprisingly, because it's a personal blog, the word "I" is very frequent. But other words like "and", "the", "it" and "was" are not going to be very helpful for figuring out what the most important content is. We'll need to build an exclusion list to deal with them.
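The exclusion step itself shouldn't need more than another grep; a sketch, assuming a hand-made stopwords file with one word per line (the filenames here are just placeholders):

# a sketch: drop any count line whose word appears in stopwords.txt
# -w matches whole words only, -f reads the patterns from a file
grep -v -w -f stopwords.txt wordcount > wordcount_filtered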
2 comments:
I have no idea what this post is about, but I am very impressed nevertheless.
Ah, it's part of my 'Enry 'Iggins scheme to teach a computer to speak proper, like. This is the unglamorous boiler-room stuff - soon the shiny fun output should appear, and all and sundry will be impressed by my technological wizardry.