i don't recall how my friend did it, but you can easily implement this in c++ (or java i guess) using a multimap:
http://www.cppreference.com/cppmultimap/index.html
simply loop over your text and add each pair of consecutive words as a (word, next-word) entry to the multimap. that's it for the (simplistic) building.
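the build loop really is just a few lines. a rough sketch in c++ (untested; reading whitespace-separated words from stdin is just my choice of input):

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    int main() {
        std::multimap<std::string, std::string> model;
        std::string prev, word;
        if (std::cin >> prev) {
            while (std::cin >> word) {
                // one entry per occurrence, so duplicates act as the counts
                model.insert(std::make_pair(prev, word));
                prev = word;
            }
        }
        std::cout << "recorded " << model.size() << " transitions\n";
        return 0;
    }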
to generate the body:
1. pick a random word in the mmap
2. construct a discrete probability density function (PDF) over the words that follow the current word, and use it to pick the next:
2a. find the sum S of the counts of all associated words
2b. pick a random integer K in the range [1,S]
2c. use that to pick the next word (i.e. walk the cumulative counts and take the word whose bin K falls into)
3. emit the word and goto 2 until your thesis is done
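a minimal end-to-end sketch (untested; the sample text, the random seed word, and the 20-word cutoff are all mine). the nice bit: since the multimap stores one entry per occurrence, equal_range already hands you the PDF bins, and a uniform index into that range is exactly the weighted pick of steps 2a-2c:

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    typedef std::multimap<std::string, std::string> Chain;

    // steps 2a-2c: duplicate entries under a key ARE the counts, so a
    // uniform index into equal_range is the binned PDF pick
    std::string next_word(const Chain& model, const std::string& cur) {
        std::pair<Chain::const_iterator, Chain::const_iterator> r =
            model.equal_range(cur);
        int S = std::distance(r.first, r.second);   // 2a: total count S
        if (S == 0) return "";                      // dead end, no successor seen
        Chain::const_iterator it = r.first;
        std::advance(it, std::rand() % S);          // 2b+2c: K in [1,S] -> bin
        return it->second;
    }

    int main() {
        std::srand(std::time(0));
        std::istringstream text("the cat sat on the mat and the dog sat on the cat");
        Chain model;
        std::string prev, word;
        text >> prev;
        while (text >> word) { model.insert(std::make_pair(prev, word)); prev = word; }

        Chain::const_iterator start = model.begin();      // step 1: random word
        std::advance(start, std::rand() % model.size());
        std::string cur = start->first;

        for (int i = 0; i < 20 && !cur.empty(); ++i) {    // step 3: emit, loop
            std::cout << cur << ' ';
            cur = next_word(model, cur);
        }
        std::cout << '\n';
        return 0;
    }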
what we're doing here is constructing a markov chain:
http://en.wikipedia.org/wiki/Markov_chain
let it be known to all that the monte carlo method is the greatest thing on earth.
edit: i just thought of a rather natural generalisation of the algorithm, which ought to produce much more coherent text. what we do is start taking context into account (i.e. it's no longer context-free); the downside is that with enough context you'll start reproducing long verbatim runs of the source text, i.e. output that is too structured. how to get around it? switch between the two modes. how to do that? probabilistically, based on the word PDF (i.e. if the word-pair match is a poor one for the current word, we'll tend not to use it). since this is probably not making sense to anyone, here's the algorithm (completely untested), which is easily implemented using the stl pair and tuple templates:
0. make two multimaps, one for (word, word) pairs and one for ((word, word), word) pairs; call them C0 and C1.
1. pick a random word in C0 (C1 can only be consulted once two words have been emitted, so the first step necessarily uses C0)
2. construct (up to) two discrete probability density functions on the current context and use them to pick the next word:
2a. find the sum S1 of the counts of all words associated with the current word-pair in C1
2b. pick a random integer U in the range [1, S1] and get the probability P1 of the word it maps to (P1 = count(word) / S1).
2c. if P1 is greater than some "context binding" threshold (i.e. is a good match), goto 3
2d. otherwise repeat the process using the PDF given by C0 (keyed on the last word alone) to get the next word
3. emit the word and goto 2 until your lecture notes on quantum field theory are done
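since i claimed this is easy with the stl pair template, here's a rough sketch of the two-order version (completely untested, as promised; the sample text, seed words, and the 0.3 threshold are all choices i'm making up):

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    typedef std::multimap<std::string, std::string> C0Map;
    typedef std::multimap<std::pair<std::string, std::string>, std::string> C1Map;

    // weighted pick from whatever equal_range gives us; one entry per
    // occurrence means a uniform index is already the binned PDF pick.
    // optionally reports P = count(picked word) / S for the threshold test.
    template <class Map, class Key>
    std::string pick(const Map& m, const Key& key, double* prob = 0) {
        std::pair<typename Map::const_iterator, typename Map::const_iterator> r =
            m.equal_range(key);
        int S = std::distance(r.first, r.second);
        if (S == 0) return "";
        typename Map::const_iterator it = r.first;
        std::advance(it, std::rand() % S);
        if (prob) {
            int c = 0;  // count how many of the S entries are the picked word
            for (typename Map::const_iterator j = r.first; j != r.second; ++j)
                if (j->second == it->second) ++c;
            *prob = double(c) / S;
        }
        return it->second;
    }

    int main() {
        std::srand(std::time(0));
        std::istringstream text("the cat sat on the mat and the cat ate the mat");
        C0Map c0; C1Map c1;
        std::string a, b, c;
        text >> a >> b;
        c0.insert(std::make_pair(a, b));
        while (text >> c) {                                  // step 0: build both
            c0.insert(std::make_pair(b, c));
            c1.insert(std::make_pair(std::make_pair(a, b), c));
            a = b; b = c;
        }

        const double threshold = 0.3;          // the "context binding" threshold
        std::string w1 = "the", w2 = "cat";    // seed words, chosen by hand here
        std::cout << w1 << ' ' << w2;
        for (int i = 0; i < 15; ++i) {
            double p1 = 0;
            std::string next = pick(c1, std::make_pair(w1, w2), &p1);  // 2a-2b
            if (next.empty() || p1 <= threshold)   // 2d: poor pair match, fall
                next = pick(c0, w2);               // back to the 0th-order PDF
            if (next.empty()) break;
            std::cout << ' ' << next;              // step 3
            w1 = w2; w2 = next;
        }
        std::cout << '\n';
        return 0;
    }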
this process of "dependent context boosting" can be extended to arbitrarily high orders, with the higher orders cascading down to the lowest (0th order). the "context binding" thresholds should then follow some exponential schedule (instead of just a constant) to compensate for the P^n falloff (P < 1) in the probability of any particular run of n sequential words.
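to make the schedule concrete, something like this is what i have in mind (t0 and lambda are knobs pulled out of thin air):

    #include <cmath>

    // order-n "context binding" threshold, decaying exponentially to mirror
    // the roughly P^n falloff of an n-word context's probability
    double binding_threshold(int n, double t0 = 0.5, double lambda = 0.6) {
        return t0 * std::pow(lambda, n);
    }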
i'd be surprised if more than *thumbsuck* 4 words of context makes a noticeable difference, but then any REAL results are more than welcome.
of course the more context one uses, the larger the body of text used to build the PDFs should be, to give the context matching "more legroom" in which to work.