Nikhil Mannem.
Welcome to the fifth and final installment of spoontistics! I'm getting teary-eyed already... we're approaching the end of our fabulous journey of numbers and analysis. Today's article will be rather different from the other ones -- rather than looking at the actual spoons game, we'll turn to its Goderators. Specifically, we'll take a look at the way they speak! We'll analyze the English of their "sarcastic and arrogant Goderator personas" in the emails and the cannon articles... oh boy, this will be fun, I can already feel it.
Note that I've included some code in this article. Feel free to skip it. I'm trying out a hipster thing where people integrate code with their analysis to, like, "show their work". If you're ever interested in the method behind the madness, knock yourself out. If not, just ignore the code.
Okay, let's get started!
from bs4 import BeautifulSoup
from collections import Counter, defaultdict
import os
import string
import random
import requests

website_path = 'website/lynbrook2016.co.nf/spoons'
First off, let's download all the pages from the spoons website: every single Daily Cannon article and every single email for immunities and trump cards.
%%capture
#!wget -r lynbrook2016.co.nf/spoons -P website/
Now we need to process the data a bit. We don't care about titles like "THE DAILY CANNON" or the lists of people who died each day -- we only want the sassy stuff that the Goderators wrote. So let's extract that text from all the pages.
def get_text(html):
    'Given html of a spoons page, return string of its Goderator-written content'
    soup = BeautifulSoup(html, 'lxml')
    paragraphs = soup.find_all('p', style='margin-bottom: 50px;')
    text = '\n'.join(paragraph.get_text() for paragraph in paragraphs)
    # Drop blank lines
    text = '\n'.join(line for line in text.splitlines()
                     if line and not line.isspace())
    return text
texts = []
for file_name in os.listdir(website_path):
    if not file_name.endswith('.html'):
        continue
    file_path = os.path.join(website_path, file_name)
    with open(file_path) as f:
        html = f.read()
    texts.append(get_text(html))
all_text = '\n\n'.join(texts)
Now we have all the text that the Goderators have ever written under their personas on their website. Yay! Let's take a look at various aspects of the data.
random_index = random.randrange(len(all_text))
snippet = all_text[random_index:random_index + 200]

print('Sample of goderator text:\n')
print('"' + snippet + '"')
Sample of goderator text:
"g's shoulder. The auditorium held a moment of silence for the dead for about 5 seconds, after which basically all hell broke loose. I mean, who's gonna run Denmark now? WHERE IS FORTINBRAS??? AHHHHHhh"
no_punc = ''.join(c for c in all_text if c not in string.punctuation)
words = no_punc.lower().split()
vocab = set(words)

print('Number of articles:', len(texts))
print('Total number of words:', len(words))
print('Total number of characters:', len(all_text))
Number of articles: 68
Total number of words: 30532
Total number of characters: 172772
from textstat.textstat import textstat  # shittyass package
grade_level = textstat.automated_readability_index(all_text)
print('Reading level:', grade_level)
Reading level: 7.3
The Goderators have written a total of 68 articles, including posts on The Daily Cannon, emails about immunities, and emails about trump cards. In total, that amounts to about 30 thousand words, or about 173 thousand characters. That's enough words to write the Common App essay 46 times! Another way to look at it: if they earned a dollar for every key they pressed on their keyboard, they'd now have enough to buy a Lamborghini. Not that anyone would ever earn a dollar for something as simple as smacking a keyboard...
Their English reads at the level of a seventh or eighth grader -- reasonably advanced prose, but not overly formal or complex. For comparison, Wikipedia reads at about a 10th/11th grade level on average, whereas Green Eggs and Ham reads below a first-grade level.
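If you're curious how that 7.3 number gets computed, here's a quick sketch of the standard Automated Readability Index formula (textstat's own character/word/sentence counting may differ slightly from this toy version, and the numbers below are made up, not the real spoons counts):

```python
# Standard ARI formula:
#   ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
def ari(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Toy numbers: 5 characters per word and 12.5 words per sentence
# land around an 8th-grade level.
print(round(ari(characters=500, words=100, sentences=8), 2))  # 8.37
```

Longer words and longer sentences both push the score up, which is why legalese scores high and Dr. Seuss scores low.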
c = Counter(words)
top_words = [word for word, count in c.most_common(200)]

print('30 most frequent words:\n')
print(', '.join(top_words[:30]))
30 most frequent words:
the, to, a, of, and, we, in, you, that, was, for, it, this, on, is, as, be, at, your, his, with, her, but, killer, all, not, were, are, out, have
# Compare to a list of the 1000 most frequent words in English in general
freq_url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'
r = requests.get(freq_url)
top_words_general = set(r.text.splitlines())
unusual_words = [word for word in top_words
                 if word not in top_words_general]

print(len(unusual_words), 'unusually frequent words:\n')
print(', '.join(unusual_words))
50 unusually frequent words:
killer, not, immunity, today, spoons, spoon, into, heshe, its, because, going, days, really, video, around, kills, dont, trump, target, killed, being, goderators, email, youre, actually, didnt, hisher, 3, i, another, someone, thats, weve, died, killers, gauri, immunities, away, rate, spooners, everyone, without, quota, friends, alive, decided, things, daily, looking, shayok
As expected, the most frequent words the Goderators use are words like "the", "to", "a", "and", and other common English words. However, there are also a lot of unusual words -- words that the Goderators use a lot but that aren't normally used much in English. A full 25% of their top 200 words aren't even in the list of the top one thousand English words. Talk about weird.
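The overlap check above boils down to a simple list filter. Here's a tiny self-contained version with made-up word lists (hypothetical data, not the real frequency lists):

```python
# Made-up examples standing in for the real top-200 and top-1000 lists
top_words = ['the', 'killer', 'to', 'spoon', 'immunity', 'and']
top_words_general = {'the', 'to', 'and', 'of', 'a'}

# Keep only the words that don't appear in the general frequency list
unusual = [w for w in top_words if w not in top_words_general]
share = len(unusual) / len(top_words)

print(unusual)         # ['killer', 'spoon', 'immunity']
print(f'{share:.0%}')  # 50%
```

On the real data, that share comes out to 50 of the top 200 words, i.e. 25%.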
Well, let's take a look at some of these bizarre words that characterize Goderator-speak.
First off, a lot of these unusual words are spoons-specific jargon -- stuff like "spoon(s)", "killer(s)", "died", "alive", "goderators", "immunity/immunities", "trump", "quota", and more. If you didn't know that spoons was just a game, the website would sound like the classified documents of some violent drug cartel with an obsession with utensils that keep them alive and immune from the dangers of Trump... or something like that. Bleh. It's got some pretty cliquish insider vocabulary, that's for sure.
Other words relate to the daily nature of the Daily Cannon: "today", "daily", "days", etc. You may have noticed, too, that the Daily Cannon always likes to use certain phrases when describing how a killer ruthlessly pursues their target -- like "looking [for their target]", "decided [to do something]", and "[run] away". These phrases are frequent enough that their words make it onto the list of unusual words.
Two other specific words are pretty unusual: "he/she" and "his/her". As many of you already know, this is the Goderators' (feeble) attempt at preserving killers' identities... at the expense of making the articles sound slightly awkward. C'mon, guys. Just use "they" and "their" like any other normal person.
Finally (and most importantly), the curiously common words reveal the Goderators'... heartlessness. Wouldn't you know that both "without" and "friends" are on this list of unusual words? Or that the word "not" ranks incredibly high -- as in, the odds are not in your favor? What about all the sassy words like "really" and "actually"? Or the high frequency of "that's", as in, "that's how you'll die"? Yeah, I thought so.
Now, to wrap up our analysis with a bang, it's time for some pretty crazy shit -- we're gonna try to get our computer to generate fake Goderator English! Mwahahaha! I think we've got enough data to make a reasonable model. For those of you who are curious, I'll be using an "unsmoothed maximum-likelihood character-level language model". (WHAAAA?????) I adapted my code from this article, which does an extremely good job of explaining how the model works. It's super duper cool and surprisingly simple to understand -- definitely check it out if you're up for it.
Technical section:
The model guesses the next character based on the previous 7 characters. For example, if the previous 7 characters are " a spoo", the next character is probably 'n'. Then it just goes on and on, feeding in the previous 7 characters and generating characters one at a time.
How the model guesses the next character is pretty straightforward -- it's based on how often each character follows the previous ones in the data. Mathematically, that's P(c|h), where c is the next character and h is the previous n characters (here, n = 7). We learn this function with a ridiculously simple count-and-divide approach: find all the times h appears in the data, and count how many times each c appears after it. Then divide by the number of times h appears to get the probability of c given h.
For example, out of the 7 times that " a kill" appears in the data, 'e' appears next 1 time (for the word "killer"), a space appears next 4 times, and a comma appears next 2 times. That means after the generator generates the text " a kill", there's a 14% chance it'll generate 'e', a 57% chance it'll generate ' ', and so on and so forth. That's all there is to it!
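The count-and-divide step can be sketched in a few lines. Here it is on a made-up toy string (not the real spoons data), with a shorter history so the counts are easy to eyeball:

```python
from collections import Counter

# Toy data: the 4-character history "a ki" appears three times,
# each time followed by a different character.
text = 'a kill, a kiss, a kite'
order = 4
history = 'a ki'

# Count which character follows each occurrence of `history`...
nexts = Counter(text[i + order]
                for i in range(len(text) - order)
                if text[i:i + order] == history)

# ...then divide by the total count to get probabilities.
total = sum(nexts.values())
probs = {c: n / total for c, n in nexts.items()}
print(probs)  # 'l', 's', and 't' each with probability 1/3
```

The real model just does this for every history that appears in the Goderators' text, all at once.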
def train_char_lm(text, order=7):
    """Train model on text. Return dictionary that takes the previous `order`
    characters as input and returns a probability distribution over the next
    character.
    """
    lm = defaultdict(Counter)
    pad = '~' * order
    text = pad + text
    for i in range(len(text) - order):
        history, char = text[i:i + order], text[i + order]
        lm[history][char] += 1

    def normalize(counter):
        s = sum(counter.values())
        return [(c, cnt / s) for c, cnt in counter.items()]

    return {hist: normalize(chars) for hist, chars in lm.items()}
def generate_letter(lm, history, order=7):
    'Sample the next character given the last `order` characters of history.'
    history = history[-order:]
    dist = lm[history]
    x = random.random()
    for c, v in dist:
        x -= v
        if x <= 0:
            return c
def generate_text(lm, order=7, nletters=2000):
    history = '~' * order
    out = []
    for _ in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return ''.join(out)
model = train_char_lm(all_text)
print(generate_text(model))
Welcome back, it wasn't holding food to be retold again...
At 11:59 PM.
If the bloodshed, then you are vulnerable, difficult to continue. Because we're confirmed that this all your spoon, her killer (who coincidentally have no idea what's Santa without a double kill we mean better at spoons" in general. But here's a correlations! You may win.
The Tricksters. After school in LA and betraying his legs together. Obviously, we were noteworthy.
Now, to end with a fairly peaceful for every brunch or frowning or expression of them are still new and soothing -- 0% of the earlier in the extra layer of submission to you. Just things aside, this one of his Spoons define as double kill! Yes, we can measures as well. Unfortunately, when hope seemed out of sight, out of second day of, you do. Keep up with roses, chocolates, and soon as she appropriate this honor. Multiple cameras)! If you all run around its eerie parallel to a play we have to being killed and corners like a human-cocoon. To be immune, you must:
- Tie a leash held it to our heart for tomorrow, we're not a deadly but silently snuck up again saw that they tried recovery emails to Trump Card was, we really anticipated. We're ending, too. That actually. But don't we just splendid. And Nikkia, we supposed to get his guard down and we hope, you'll have cross country runner. Especial education if her fingers forming Lynbrook's first diagram of Alvin's routes at labeled specific points to reindeer?
Just kidding we talked about Joanne, so he/she simpler terms, he was there are very clearly wrong hands.
In his fifth period, and we have the chest (must be the Percy Jackson movies, the weekend with his Immunity is not of story. For those of you who don't leash at all. So today:
- Why the Eff You Lyin'
- DJ Khaled Inspirational video. Then it takes are marked in red -- then today, that in an interesting killed.
In the middle of the lack of freshman, sophomores, and the Fina
Holy cow! That's hilarious! Wow. The generator turned out a lot better than I expected -- I'm thoroughly impressed. Whoooooo.
Guess what, Kenny: you're out of a job. Remember those slow days without anything interesting to write about, when you resorted to literary-analysis-worthy descriptions of rocks outside your window? Well, those days are over. Just turn on this generator, and you've got your blog post for the day. Here is a longer sample of generated text if you want more fake Goderator entertainment. Or if you want to play around with the generator yourself, all the code's available on this blog post! ;)
On that strong note, spoontistics comes to an end! I sincerely hope this entertaining and enlightening experience has shown you guys how awesome statistics can be. It's a really beautiful field that Iams' brutal tests fail to do justice to. Thanks so much, everyone, for coming along on the journey!