Text Analysis, Markov Chains and a Bible Quotes Generator
Text analysis and patterns are very interesting topics for me. I think every author has their own unique, distinctive pattern: the way they build sentences and phrases, not only in terms of tone, but in the subconscious choices of word pairings and sequences. How their sentences begin and end, and what they put in the middle; they leave their cerebral fingerprints on paper, blog posts or even Facebook status updates.
I think it is very achievable, given enough base data, to identify the author of a random, anonymous note, post or piece of text by compiling a list of possible authors sorted by their calculated probability of being the one.
It’s very nice to blabber about all that, so let’s start with something very simple. We will take a relatively small piece of sample text, analyze it and create a text generator that will compile sentences based on the results of the analysis.
For the sample text we will take random pieces from the Bible. I found a *.txt version of it here. The choice of the Bible was somewhat random, but mostly because it is compiled of small verses. Using our simple analysis we will try to generate small texts (pseudo quotes) whose pattern will somewhat follow the pattern of the original verse authors.
Since I’m learning Python, all code samples here will be in Python. Code review and comments are greatly appreciated.
Text Analysis
How are we going to analyze our text? It is very interesting and very simple: we will split the sample text into pairs of adjacent words, and for every unique word we will count all the words that follow it. This approach will allow us to use a Markov Chain for our text generator.
Bear with me and everything will make sense soon. Here is our code snippet:
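The original snippet is not reproduced here, so below is a minimal sketch of what an `analyze` function like the one described might look like. The function name, the `sample` string and the exact output format are my assumptions, not the post's actual code.

```python
from collections import defaultdict

def analyze(text):
    """Map each unique word to the list of words that follow it,
    sorted by how often each follower appears (most frequent first)."""
    words = text.split()
    counts = defaultdict(lambda: defaultdict(int))
    # Walk the text in overlapping pairs: (w0, w1), (w1, w2), ...
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Keep only the follower lists, ordered by descending frequency.
    return {
        word: sorted(followers, key=followers.get, reverse=True)
        for word, followers in counts.items()
    }

if __name__ == "__main__":
    # Hypothetical sample text, just to show the shape of the result.
    sample = "I will not fear I will not fall I will bring light"
    for word, nexts in analyze(sample).items():
        print(word, "=>", nexts)
```

The nested `defaultdict` counts pair frequencies in one pass; the final dictionary comprehension throws the counts away and keeps only the ordering, which is all the generator will need.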
So what the analyze method does is split the text into word pairs and then compile a hash map where the keys are unique words and the values are arrays of the words that follow each given word, sorted by frequency.
In example.txt I have the text of my latest article. If you run this script, this is what you will see in the console:

As you can see, in my previous article only the word “carrying” followed the word “while”, which usually means that there was only one occurrence of the word “while”. If you look at the word “will”, you will see that I like to put the word “not” or “bring” after it.
What does this give us? A very important piece of information: if I’m writing the word “will”, you can (statistically) “predict” that the word that follows will, as you have guessed, be “not”. Now we have enough information to build our text generator using a Markov Chain process.
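To make that “prediction” concrete, here is a tiny self-contained sketch. The `followers` map and its counts are hypothetical illustration data, shaped like what the analysis above would produce before the counts are discarded.

```python
# Hypothetical follower-frequency map for a few words, as the
# analysis step might count them before sorting.
followers = {
    "will": {"not": 5, "bring": 2},
    "while": {"carrying": 1},
}

def predict(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word)
    if not counts:
        return None
    return max(counts, key=counts.get)

print(predict("will"))   # "not" wins, with 5 occurrences to 2
```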
Markov Chain

There are plenty of articles with formulas explaining what Markov Chains are and how to use them. We will not go into mathematical jungles today and will concentrate on just one property of Markov Chains: memorylessness.
What does memorylessness mean? It means that the only thing available to the system for deciding its next move is the current state. At any given point in time our system does not know (or care) about the past. Even though a Markov Chain is defined as a random process, in our case it’s not entirely random (why would we do the analysis then, you know?..).
Memorylessness is the perfect property for us, since we do not care about the past; the only thing we care about is the present (the current word). Given the present state, we can use the hash map to “predict” what the next word will be. Let’s take a look at the final implementation, where we will limit our quote to 30 words:
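The post's actual implementation is not shown here, so below is a hedged sketch of such a generator: a memoryless walk that picks each next word with `randint` from the current word's observed followers, capped at 30 words. Function names and the `sample` text are my assumptions.

```python
import random

def analyze(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chains = {}
    for current, following in zip(words, words[1:]):
        chains.setdefault(current, []).append(following)
    return chains

def generate(chains, max_words=30):
    """Memoryless walk: start at a random word, then repeatedly pick
    one of the current word's followers, up to max_words words."""
    word = random.choice(list(chains))
    quote = [word]
    while len(quote) < max_words:
        followers = chains.get(word)
        if not followers:  # dead end: the text's final word
            break
        # randint-style pick among possible next states, as the post
        # describes, so frequent followers are not the only choice.
        word = followers[random.randint(0, len(followers) - 1)]
        quote.append(word)
    return " ".join(quote)

if __name__ == "__main__":
    sample = "and God said let there be light and there was light"
    print(generate(analyze(sample)))
```

Note that only the current `word` feeds each step; nothing already in `quote` influences the next pick, which is exactly the memorylessness property described above.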
If you run it, you will see the resulting semi-adequate “quotes”:

Mission accomplished. In 54 lines of code we created a script that does very basic and simple text analysis and generates quotes. If you take a look at the sentences it’s generating, I would say it is not too bad: even without any “smarty” sprinkles it doesn’t look like complete gibberish.
I would still say that our method was Markov-ish, because we did not actually calculate the probabilities of the next states. We could have, but I promised no mathematical jungles, remember? ;) And if you noticed, I had to use randint to pull one of the possible next states, which I think should have been avoided. One of the reasons I decided to take a shortcut and use randint is infinite recursion: after n cycles the Markov Chain settles into an infinitely repetitive pattern, and you will see the same phrase repeating itself over and over again.
The next step for this project would be optimization. The script I have created is not very performant with large blocks of text (1 MB and up). We also need to be more careful with punctuation and build a block of logic around it.