Friday, June 25, 2010

Playing with language

Today I was looking at Get TEDPad, where a chap has run some statistical analysis on transcripts of the first five hundred or so TED lectures and figured out how to generate snippets of new, non-existent TED speeches.  They sound like either convincing, good-quality English or (when he switches it to the dark side) good-quality English spoken by an idiot obsessed with whatever the New York Times tells him is important at this particular moment in time.

I looked at this a few weeks ago when Boing Boing linked to it, and at the time I thought it was an amusing little distraction, but nothing more.  It was only when it floated to the top of my conscious mind again today that I realised quite how clever it was, and, upon inspection of the explanation, how easily replicable it might be in other areas where we have lots of sentences that we might occasionally want to generate more of.

Like, say, if I want to take a week off blogging and not have anyone notice, perhaps I could use a similar approach to mine my own old posts and generate new content.

I don't know if anyone would notice.  I hope I'd notice, but then you can never be sure.

More important, perhaps, is the ability to couple this with a friend's attempt to teach his computer to speak English by feeding it wall posts from Facebook.  If we could combine the convincing stitching together of language that Get TEDPad shows is possible with some API feeds back into Facebook, my dream of a Fakebook application could become reality.  I could generate a whole ecosystem of non-existent people, all talking to one another on Facebook and all being my friends.  Well, not my friends, exactly, so much as the friends of the other version of me.  The one that I could point people at if they asked me to add them on Facebook, when I didn't really want to let them see my innermost secrets.

I think this is brilliant.  Not only would I be helping poison the well of social networking in some small way, I'd be expending vast amounts of time and effort (and wasting other people's time too) in order to avoid slightly offending somebody I don't even know who might hypothetically want to add me to their contact list.

And when I'm done there, I'll go on to LinkedIn.  And Bebo.  And Myspace.

And anywhere else where you need convincing English skills to fit in.

OK, maybe not the last two sites then.

Technical details, for anyone else who wants to try it, below.  (If you're not interested in such things, stop now.  If you are interested, this is only the high-level version: email me for more details, or wait for me to post up the success or otherwise of this after the weekend.)

  1. Harvest a large set of sentences from one speaker, or a set of people with similar writing style.
  2. Determine the twenty most common words used.  (Probably filtering out filler words like and, the, so, because, etc.)
  3. Generate a file of all the phrases that start and end with one of the key words - so every group of words starting with foo and ending in bar.
  4. Analyse that file to find the most common 3-grams and 4-grams between the words.
  5. Write a bit of code to stick the different key words back together again, interpolating the most common 3/4-grams so that we have natural transitions between the key words.
  6. Figure out a way to automate this so that it can iterate through any number of target speakers, spitting out posts to Blogger / Wordpress / any website you care to name.
  7. ...
  8. Profit!

Well, maybe not profit yet.  But if you're reading this in a week's time, and you've come here via a link from a blog that seems obsessed with Hong Kong and reading the fortune from the sight of Jason Statham's sweaty armpits, you'll know that my hivemind has reached at least some form of sentience.
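
For the curious, here is a rough Python sketch of how steps 2 to 5 might hang together.  It is my guess at the shape of the thing, not the method the TED generator actually uses.  Everything named here (STOPWORDS, key_words, bridges, generate) is made up for illustration, the stop list is far too short to be useful, and instead of strictly counting 3-grams and 4-grams it takes a shortcut: it records whatever short run of words (at most max_gap of them) appears between two consecutive key words, then reuses the most common run as the bridge.

```python
import random
import re
from collections import Counter, defaultdict

# Assumption: a tiny hand-picked stop list; a real one would be much longer.
STOPWORDS = {"and", "the", "so", "because", "a", "an", "of", "to",
             "in", "is", "it", "that", "i", "we"}


def tokenize(sentence):
    """Lower-case a sentence and split it into rough word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def key_words(sentences, n=20):
    """Step 2: the n most common words, ignoring the stop list."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    return [w for w, _ in counts.most_common(n)]


def bridges(sentences, keys, max_gap=4):
    """Steps 3 and 4: map (start_key, end_key) to a Counter of the word
    runs seen between those two key words (runs longer than max_gap
    words are skipped)."""
    keyset = set(keys)
    table = defaultdict(Counter)
    for s in sentences:
        words = tokenize(s)
        positions = [i for i, w in enumerate(words) if w in keyset]
        for a, b in zip(positions, positions[1:]):
            gap = tuple(words[a + 1:b])
            if len(gap) <= max_gap:
                table[(words[a], words[b])][gap] += 1
    return table


def generate(table, start, length=6, rng=random):
    """Step 5: hop from key word to key word, filling each hop with the
    most common bridge seen between that pair."""
    out = [start]
    current = start
    for _ in range(length):
        options = [(end, gaps) for (s, end), gaps in table.items()
                   if s == current]
        if not options:
            break  # no known way onwards from this key word
        end, gaps = rng.choice(options)
        out.extend(gaps.most_common(1)[0][0])
        out.append(end)
        current = end
    return " ".join(out)
```

To try it, feed a list of sentence strings to key_words, pass the result to bridges, and call generate with one of the key words as the starting point.  Always picking the single most common bridge will make the output repetitive; sampling bridges in proportion to their counts would probably read more naturally, but I haven't tried it.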

