Monday, August 23, 2010

Working on a phrase gang

After work today I came home and carried on slogging through my auto-gibberish generator.  It's been ages (well, 2 months since I started working on this) - partly because I've been rather lackadaisical about working on it, what with everything else going on, and partly because it's harder than it looked initially.

Let's start by defining some terminology: we're trying to build a graph, which is a set of nodes connected by lines.  Rather like a string bag, or at least a string bag that may exist in a many-dimensional space.  Or be full of knots.  Each node is one of the fifteen significant words that I've chosen, and the connection between each pair of nodes is made up of the most statistically significant 4-word phrase found between those two words.[1]  When it's complete, a random path through the graph from any point to any other should yield an interesting, or at least readable, bit of prose.
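To make that concrete, here's a minimal sketch of the structure in Python (my choice of language for illustration; the names edges and random_path are invented, and the phrase entries are placeholders lifted from an example further down, not the real curated data):

    import random

    # edges[(a, b)] holds the most frequent 4-word phrase linking significant
    # word a to significant word b.  These few entries are placeholders taken
    # from an example below; the real graph has one per ordered pair of words.
    edges = {
        ("last", "it"): "any of the other",
        ("it", "before"): "the end of the",
        ("before", "people"): "but I don't think",
    }

    def random_path(steps=3):
        """Walk the graph, alternating node words and connecting phrases."""
        node = random.choice([a for (a, _) in edges])
        parts = [node]
        for _ in range(steps):
            choices = [b for (a, b) in edges if a == node]
            if not choices:        # dead end: no outgoing edge curated yet
                break
            nxt = random.choice(choices)
            parts.extend([edges[(node, nxt)], nxt])
            node = nxt
        return " ".join(parts)

    print(random_path())

With these placeholder edges, the walk produces exactly the sort of raw output shown below.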

The simple part was extracting the phrases; what turns out to be quite a bit more hassle is sticking them back together again.  My inspiration, the getTedpad, only uses ten different significant words to form the graph of terms to permute through, whereas I've chosen fifteen, which means I have 225 permutations (15 × 15 ordered pairs, counting a word paired with itself) to join together - more than twice as many phrases as the getTedpad creator had to curate.
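The extraction side can be sketched in a few lines of Python too - a hypothetical reconstruction, not the actual routine.  itertools.product gives the 225 ordered pairs mentioned above, and "most statistically significant" is implemented as "most frequent", per the footnote at the end:

    import re
    from collections import Counter
    from itertools import product

    def best_phrases(text, words, gap=4):
        """For each ordered pair of significant words, count every run of
        `gap` words that sits between them, and keep the most frequent run."""
        tokens = re.findall(r"[\w']+", text.lower())
        counts = {pair: Counter() for pair in product(words, repeat=2)}
        for i, tok in enumerate(tokens):
            j = i + gap + 1                  # index of the closing word
            if tok in words and j < len(tokens) and tokens[j] in words:
                counts[(tok, tokens[j])][" ".join(tokens[i + 1 : j])] += 1
        return {pair: c.most_common(1)[0][0] for pair, c in counts.items() if c}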

Further, it's not just a matter of looking for the most common four-word phrase between two of the significant words, and then jamming them back together.  That would just get you unhelpful results like
... last any of the other it the end of the before but I don't think people ...
which lacks any flow and is rather obviously the result of a computer jamming words together at random.  Can you see the joins?
... last | any of the other | it | the end of the | before | but I don't think | people ...
Thus we have to think about how to fit the phrases back together naturally, and while it's easy to solve for this particular case:
... last any of the other audience members who don't like it will show that the end of the beginning was long before but I don't think it will have any effect on anyone; you have to remember people ...
it's still quite a struggle to put together lots of combinations of phrases in a way that will tie together nicely.

Punctuation helps.  Full stops are pretty much essential, because otherwise we get horrible run-on sentences that never seem to end.  As the original phrase extraction routines only pick phrases from within a single sentence, I have to look for as many places to start and finish sentences as possible, while retaining the 4-word phrases that capture what's significant about the writer whose work I've mined to generate my corpus.
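That within-a-sentence constraint can be honoured by splitting the corpus on sentence-ending punctuation before counting - again a sketch under the same assumptions as above, not the actual code:

    import re
    from collections import Counter
    from itertools import product

    def sentences(text):
        """Crude splitter: break on ., ! and ? so that no extracted phrase
        ever straddles a sentence boundary."""
        return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

    def best_phrases_in_sentences(text, words, gap=4):
        """Same counting as the earlier sketch, but tallied one sentence at a
        time so a phrase can never bridge a full stop."""
        counts = {pair: Counter() for pair in product(words, repeat=2)}
        for sent in sentences(text):
            tokens = re.findall(r"[\w']+", sent.lower())
            for i, tok in enumerate(tokens):
                j = i + gap + 1
                if tok in words and j < len(tokens) and tokens[j] in words:
                    counts[(tok, tokens[j])][" ".join(tokens[i + 1 : j])] += 1
        return {pair: c.most_common(1)[0][0] for pair, c in counts.items() if c}

Because every stored phrase then lives entirely inside one sentence, a full stop can safely be dropped in after any node word when the walk is stitched back into prose.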

The other thing you have to be very aware of, and avoid, is the spectre of the garden-path sentence: that is, where a sentence is constructed to make you expect a certain context, and then, when you've been led sufficiently up the garden path, it deviates and leaves you feeling like an idiot.  (My favourite example is here.)  Useful in a comedic setting, but if the whole gibberish generator does nothing but spit out garden-path sentences, it may not give us a convincing paragraph.  (Although thinking about it, possibly a garden-path sentence generator would be quite useful.  Or it would just appear that you were talking to somebody with some fairly serious problems.)

However, it's difficult to avoid making garden-path sentences.  Since I have words in the graph like 'last' and 'work' and 'show', which aren't reliably nouns or verbs or adjectives, it's quite easy to build nodes that fit well between the start and end words, but will often result in garbage once several nodes are joined together.  The only clean approach I've found is to look at the phrases that end with a particular word - let's say 'last' - and then fix all the nodes that begin with 'last' so that any combination of those fits together.  Then you have to go on to the next word, and ensure that it flows neatly into the phrases attached to the corresponding fifteen nodes.
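A small helper makes that curation pass less painful - this is my own suggestion rather than anything from the post - by printing every two-edge combination through a given word, so they can all be sanity-checked in one sitting:

    def joins_through(word, edges):
        """Yield every 'in-phrase, word, out-phrase' combination passing
        through `word`, for eyeballing during the curation pass."""
        incoming = [(a, p) for (a, b), p in edges.items() if b == word]
        outgoing = [(b, p) for (a, b), p in edges.items() if a == word]
        for a, p_in in incoming:
            for b, p_out in outgoing:
                yield f"... {a} {p_in} {word} {p_out} {b} ..."

    # With the placeholder edges from the first sketch,
    # joins_through("before", edges) yields
    #   "... it the end of the before but I don't think people ..."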

Needless to say, I only did this after I'd been through and picked the 'easy' phrases first, where it seemed as though there was a natural way to link the first and last words through the appropriate phrase, without realising that when you then joined two nodes, you got something that made no sense.  No sense at all.

Mind you, the alternative is to overuse words like 'that' and 'who' and 'which'; it may be that I will need a further pass over all this when the first version is complete, to tune it for something more readable.

It's not complete yet.  I've curated, for want of a better word, all the nodes that begin with 8 of the important words, and all the nodes that end with 5 of the important words.  That doesn't give a complete set, but it does mean I have a 4 by 4 graph to begin with, which gives results like
... work to be able to piss on the grave of all my co-writers from Week Ending.  And still have some left to make you who expect to be able to talk to someone at any volume, but surely not at the beginning of day when I am going to peel a potato, crying, alone in my “Free Hot Dog – bring the buns” t-shirt. When I first saw it, I was going to wear it all ...
and
... you who expect to be able to talk to someone at any volume, but surely not at the beginning of day when I am going to peel a potato, crying, alone in my fear that I am never going to do something that will work, which makes sense, until you realise that's just the plot of last night's episode of 'The Simpsons'. Which makes me think that the ...
Convinced?  Well, it's getting closer, right?

[1] Apologies to the statistically minded.  For this particular context, 'significant' is going to be taken to be identical in meaning to "most frequent".  It's up to the reader to consider some of the problematic consequences of this approach...
