How my tweets were created

The Adventures of Tom Sawyer is in the public domain, meaning it's perfect for this sort of project since it's free of most copyright issues. I obtained the plain text version of Tom Sawyer from the good folks at Project Gutenberg. Here's how I converted this text file into a year's worth of tweets...

(code available on GitHub)


Once I removed the Project Gutenberg header/footer and the Table of Contents from the file (they would make for some really boring tweets, after all), I wanted to further compress the text where possible. There were some easy substitutions I could make to save space (e.g. "--" to "-"), but I decided not to translate the entire text to 1337speak because I find it annoying. Here's a nearly complete list of the substitutions I made:

  • All written numbers to numeric characters (e.g. "one" to "1", "fifteen" to "15", "hundred's" to "100s", "twenty dollars" to "$20", and so on)
  • "and" to "&" (this alone saved over 6000 characters, or about 45 tweets)
  • "you" to "U"
  • "be" to "B"
  • Removed the apostrophes from "can't", "don't", "ain't"
  • "with" to "w/", "without" to "w/o"

These substitutions took the text from a total of 387,560 characters down to 374,802, about 3.3% compression. In other words, with only 5 minutes of work and a handful of substitutions I shaved 91 tweets off the project. I was surprised this compression wasn't larger, but I was not aggressive or clever (e.g. "anyone" to "NE1").
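
A minimal sketch of how substitutions like these might look in Python. The rule table and the function names are hypothetical, not taken from the project's code; the rules are assumed to be case-sensitive and to match whole words only, and the number rewriting ("fifteen" to "15", "twenty dollars" to "$20") is left out.

```python
import re

# Hypothetical rule table covering only the simple word swaps listed above.
# The \b word boundaries keep "with" from matching inside "without" and
# "be" from matching inside "because".
RULES = [
    (r"--", "-"),
    (r"\bwithout\b", "w/o"),
    (r"\bwith\b", "w/"),
    (r"\band\b", "&"),
    (r"\byou\b", "U"),
    (r"\bbe\b", "B"),
    (r"\b(can|don|ain)'t\b", r"\1t"),  # drop the apostrophe only
]

def compress(text):
    """Apply every substitution in order (case-sensitive, whole words)."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

def percent_saved(before_len, after_len):
    """Share of characters removed, as a percentage of the original length."""
    return 100.0 * (before_len - after_len) / before_len

print(compress("don't you go without Tom and Huck--they will be with Joe"))
# -> dont U go w/o Tom & Huck-they will B w/ Joe

# Checking the totals quoted above.
print(round(percent_saved(387560, 374802), 1))  # -> 3.3
print((387560 - 374802) // 140)                 # -> 91
```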

Aside: If it doesn't exist already, somebody should create a package called "bybbreviate" or similar that implements progressively harsher phonetic substitutions to get the maximum text compression that maintains readability... human testing of readability for this idea would also be amusing, like a game of telephone.

It was important for me to preserve the readability of the text, hoping that people might actually want to follow along, so I opted not to make a few other obvious substitutions (e.g. "to/too" to "2", which causes confusion with numbers). After the compression, I would naïvely have about 2,700 tweets if I used a running 140-character window through the text. Given all the dialog, this would quickly degenerate and become hard to follow as a reader (tweets ending mid-word would be a nightmare!). Thus I needed to break the text into more logical units: sentences!
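
To see why the naïve window is a problem, here is a small sketch of fixed-width slicing. The function names are made up for illustration; the only numbers taken from the post are the 140-character limit and the 374,802-character total.

```python
def naive_chunks(text, width=140):
    """Cut text into fixed-width slices, ignoring word and sentence boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]

def estimated_tweets(total_chars, width=140):
    """Ceiling division: how many slices a text of this length needs."""
    return -(-total_chars // width)

# A tiny width makes the mid-word breaks easy to see.
print(naive_chunks("Tom said hello to Becky", width=10))
# -> ['Tom said h', 'ello to Be', 'cky']

print(estimated_tweets(374802))  # -> 2678, i.e. "about 2,700"
```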

NLTK comes with a great sentence tokenizer (à la Punkt) that did a good job of parsing the text into sentences, taking into account dialog, questions, exclamations, etc. According to this method, there are 4,856 sentences in Tom Sawyer. However, many of these sentences (about 700 in fact) are longer than 140 characters. Since my primary blog is all about these kinds of details, here's a cool histogram of the sentence lengths:
That's a sweet exponential function.
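
A rough sketch of the tokenizing and counting step. It uses NLTK's sent_tokenize (which is backed by Punkt) when NLTK and its Punkt model data are installed, and otherwise falls back to a much cruder regex splitter that is purely an assumption for illustration. The helper names and the 20-character bin width are hypothetical too.

```python
import re
from collections import Counter

def regex_sentences(text):
    """Crude fallback: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_sentences(text):
    """Prefer NLTK's Punkt-based tokenizer; fall back if it is unavailable."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):  # NLTK or its Punkt data is missing
        return regex_sentences(text)

def too_long(sentences, limit=140):
    """Sentences that do not fit in a single tweet."""
    return [s for s in sentences if len(s) > limit]

def length_histogram(sentences, bin_width=20):
    """Count sentences per length bin, keyed by each bin's lower edge."""
    counts = Counter((len(s) // bin_width) * bin_width for s in sentences)
    return dict(sorted(counts.items()))

sentences = split_sentences("Tom! Come here. Why did you run off like that?")
print(len(sentences), len(too_long(sentences)))
print(length_histogram(sentences))
```

The fallback splitter will trip over abbreviations like "Mr." and over quoted dialog, which is exactly the kind of thing Punkt handles better.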

For the 700 sentences longer than Twitter's 140-character limit, I split them into sub-tweets of roughly even length, making sure that no words were broken. Since this increased the number of tweets again, I also combined consecutive sentences shorter than 140 characters into single tweets. Given all the dialog, this produced a huge savings in tweets.
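
The two steps above could be sketched like this. This is one plausible reading of the description, not the code from the repo: split_evenly aims each part at an equal share of the sentence and closes a part when adding another word would move it further from that target, and pack greedily merges neighbours while they still fit. Both function names are hypothetical.

```python
import math

def split_evenly(sentence, limit=140):
    """Split a long sentence into roughly equal parts without breaking words."""
    if len(sentence) <= limit:
        return [sentence]
    words = sentence.split()
    if any(len(w) > limit for w in words):
        raise ValueError("a single word is longer than the limit")
    parts = math.ceil(len(sentence) / limit)
    target = math.ceil(len(sentence) / parts)  # ideal length of each part
    chunks, current = [], ""
    for word in words:
        candidate = f"{current} {word}" if current else word
        overshoot = len(candidate) - target   # how far past the target we'd go
        undershoot = target - len(current)    # how far short we are now
        if current and (len(candidate) > limit or overshoot > undershoot):
            chunks.append(current)
            current = word
        else:
            current = candidate
    chunks.append(current)
    return chunks

def pack(pieces, limit=140):
    """Greedily merge consecutive pieces while the joined text fits in a tweet."""
    tweets, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > limit:
            tweets.append(current)
            current = piece
        else:
            current = candidate
    if current:
        tweets.append(current)
    return tweets

long_sentence = " ".join(["word"] * 60)            # 299 characters
print([len(p) for p in split_evenly(long_sentence)])  # -> [99, 99, 99]
print(pack(['"Tom!"', "No answer.", '"TOM!"', "No answer."]))
# -> ['"Tom!" No answer. "TOM!" No answer.']
```

Whether sub-tweets from a split sentence should also be merged with their short neighbours is a design choice the description leaves open; running pack over the output of split_evenly would do that.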

In the end I arrived at 3,709 tweets, just about right for posting 10 per day for a year.
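
As a quick check of that schedule, here is a tiny sketch that groups tweets into daily batches. The helper names are hypothetical; only the tweet count and the 10-per-day rate come from the post.

```python
import math

def days_needed(total_tweets, per_day=10):
    """Number of days required to post everything at a fixed daily rate."""
    return math.ceil(total_tweets / per_day)

def daily_batches(tweets, per_day=10):
    """Group the tweet list into consecutive per-day batches."""
    return [tweets[i:i + per_day] for i in range(0, len(tweets), per_day)]

print(days_needed(3709))  # -> 371
```

At 10 per day the run lasts 371 days, so it spills about six days past a 365-day year.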