Research Retrospective #4: A Little Out Of Character: On Computational Analyses of Dramatic Text

In which the author reminisces about a series of coincidences along the academic road and writing a Master’s thesis on computational stylometry.
(Warning, post contains misty-eyed sentimentality and may not be to everyone’s taste)

And as the spotlights fade away,
And you’re escorted through the foyer,
You will resume your callow ways,
But I was meant for the stage.

The Decemberists
I Was Meant for the Stage
Her Majesty The Decemberists


The choice to study German instead of the default French at second-level was to a great extent accidental, as the German class was in need of extra bodies to make up the numbers that year or face extinction.
Swapping the language of Baudelaire for that of Brecht led to a multi-disciplinary undergraduate experience in computer science, linguistics and language, along with a revolutionary Erasmus experience at the LMU Munich.

The author’s choice to complete a Master’s by Research was also the road less travelled at that particular junction point, and involved reneging on a planned post-college year as an English language assistant in deepest Bavaria. The offer of an all-expenses paid stipend plus teaching hours proved too much to pass up. However, this offer was also subject to the first-choice candidate dropping out of the race at the last moment, a simple twist of fate once more.

Looking back at these early halcyon days through the telescopic rose-tinted lens of hindsight, it would prove to be a rare opportunity to have carte blanche to carry out research on any topic of interest, no matter how esoteric.

The originally submitted application for the TCD scholarship concerned linguistic judgments of grammaticality, which would have also made a fine topic of study if it hadn’t been for a chance departmental seminar on the stylistic changes in a writer with Alzheimer’s disease (Irish-born British author Iris Murdoch).
As fate would have it, this topic would go on to be examined thoroughly in the field of cognitive linguistics and psychology (see Garrard et al. 2005). Inspired in part by this glimpse of the power of stylometry, the author swapped syntax for statistical text analysis, and so began a long day’s journey into the scholarly twilight of authorship attribution and the so-called digital humanities.


The main research question dealt with during those two years concerned the textual stylometry of characters written by playwrights, or more succinctly:

Do playwrights create stylistically distinct characters in their works?

The inspiration for this work came from a study of character in the work of Irish poet Brendan Kennelly, at that time a Professor in English at TCD. Vogel and Brisset (2007) found a number of recurrent characters in his poetry, in particular the character of Ozzie, fluent in the dialect of Dublin’s Northside:


ozzie is stonemad about prades
so he say kummon ta Belfast for the 12th
an we see de Orangemen beatin the shit outa de drums
beltin em as if dey was katliks heads

from The Book Of Judas by Brendan Kennelly, presented in Vogel and Brisset (2007)


The methodology was borrowed from the corpus linguistics literature. Relative frequencies of n-grams were compared to one another using the chi-squared test; then, for each category, within-category and outside-category similarities were compared using the Mann-Whitney rank method. Thus, a textual segment was found to be more similar either to its own category (character, play, author) or to everything else.
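For the curious, the two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the thesis: the function names, the choice of character bigrams, the averaging of the chi-squared contributions and the normalised U score are all hypothetical choices made for the example.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count overlapping character n-grams in a text segment."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chi_squared_distance(counts_a, counts_b):
    """Average chi-squared contribution per n-gram type; lower means more alike."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if not total_a or not total_b:
        raise ValueError("each segment must contain at least one n-gram")
    vocab = set(counts_a) | set(counts_b)
    score = 0.0
    for gram in vocab:
        observed_a, observed_b = counts_a[gram], counts_b[gram]
        combined = observed_a + observed_b
        # Expected counts if both segments drew from one shared distribution.
        expected_a = combined * total_a / (total_a + total_b)
        expected_b = combined * total_b / (total_a + total_b)
        score += (observed_a - expected_a) ** 2 / expected_a
        score += (observed_b - expected_b) ** 2 / expected_b
    return score / len(vocab)


def mann_whitney_u(xs, ys):
    """U statistic: number of pairs (x, y) with x < y, ties counted as one half."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def own_category_preference(name, segments, categories, n=2):
    """Share of (within, outside) distance pairs where the within-category
    distance is the smaller one. Values above 0.5 suggest the segment sits
    closer to its own category than to everything else."""
    profiles = {key: ngram_counts(text, n) for key, text in segments.items()}
    within, outside = [], []
    for other in segments:
        if other == name:
            continue
        distance = chi_squared_distance(profiles[name], profiles[other])
        same = categories[other] == categories[name]
        (within if same else outside).append(distance)
    if not within or not outside:
        raise ValueError("need at least one segment inside and outside the category")
    return mann_whitney_u(within, outside) / (len(within) * len(outside))
```

Here `segments` maps a segment name to its text and `categories` maps the same names to a label such as a character, play or author. A proper analysis would also attach a significance level to the U statistic, which this sketch leaves out.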

Once the system had been created to separate character contributions from one another, the analysis could begin in earnest.

Playing a role

Playwrights (and screenwriters) were chosen from numerous epochs including:

  • Elizabethan/Jacobean (Shakespeare, Marlowe, Jonson, Webster)
  • Victorian/Celtic Revival (Shaw, Wilde, Synge)
  • 20th Century American (Eugene O’Neill)
  • Modern Screenplays (Cameron Crowe, William Goldman)

Based on the results of the experiments, those playwrights who incorporated dialectal orthography were more likely to produce distinct characters. To sum up:

“Spelling variation to indicate dialectic variation was captured as a stylistic feature”

Characters of this nature included Swedish sea captain Chris Christopherson from Eugene O’Neill’s Anna Christie, who speaks in a strange Swedish-English patois that is stylistically distinct within O’Neill and his contemporaries.

“Py yiminy, Ay forgat. She say she come right avay, dat’s all. Ay gat speak with Larry. Ay be right back. Ay bring you oder drink.”

from Anna Christie by Eugene O’Neill

Another O’Neill character of note is Yank from The Hairy Ape, whose Noo Yawk aphorisms are clearly marked in speech:

G’wan! Tell it to Sweeney!
Say, who d’yuh tink yuh’re bumpin’? Tink yuh
own de oith?

from The Hairy Ape by Eugene O’Neill

Or, as put more eloquently by those in the literary criticism community:

It notes that characterization in O’Neill’s one-act sea plays is largely a matter of stage-dialect.

Field (1996)

Across the class divide

The character of Doolittle, the father of Eliza, from Shaw’s Pygmalion was found to be distinctive amongst those of Shaw’s characters, apparently by virtue of his addressing Higgins in the formal manner of a Cockney squire.
Shaw does not employ dialectal orthography here but manages to convey class and dialect through the method of address.

I thank you, Governor.

“Well, the truth is, I’ve taken a sort of fancy to you, Governor; and if you want the girl, I’m not so set on having her back home again but what I might be open to an arrangement. Regarded in the light of a young woman, she’s a fine handsome girl. As a daughter she’s not worth her keep; and so I tell you straight. All I ask is my rights as a father; and you’re the last man alive to expect me to let her go for nothing; for I can see you’re one of the straight sort, Governor. Well, what’s a five pound note to you? And what’s Eliza to me?”

from Pygmalion by George Bernard Shaw

The villain of the piece

Among Shakespeare’s contemporaries, Ben Jonson’s character Tucca from The Poetaster displays a choice command of era-specific insults:

“sort of goslings, when they suffered so sweet a breath to perfume the bed of a stinkard:
thou hadst ill fortune, Thisbe; the Fates were infatuate, they were, punk, they were.

I am known by the name of Captain Tucca, punk; the noble Roman, punk: a gentleman, and a commander, punk. I’ll call her.

–Come hither, cockatrice: here’s one will set thee up, my sweet punk, set thee up.

Aha, stinkard! Another Orpheus, you slave, another Orpheus! an Arion riding on the back of a dolphin, rascal! Shew them, bankrupt, shew them; they have salt in them, and will brook the air, stinkard.”

from The Poetaster by Ben Jonson

Although J.M. Synge is well regarded for doing his part in perpetrating the stage “Oirish” stereotypes that have punctuated drama and film in the 20th century, his characters were not found to possess a distinctive voice. In fact, he was one of the least distinctive authors in the corpus when it came to creating character.

“Ten thousand blessings upon all that’s here, for you’ve turned me a likely gaffer in the end of all, the way I’ll go romancing through a romping lifetime from this hour to the dawning of the judgment day.”

from The Playboy of the Western World by John Millington Synge

All’s fair in love and corpus linguistics?

The main conclusions from the thesis and related work were:

  1.  In general, not all characters of the dramatists studied are created equal(ly distinct)
  2.  If they are different from the others, it’s generally due to
    a. Orthography by way of dialect (Swedish, New York, Cockney)
    b. Use of epithets (punk, cockatrice)
    c. Rhyme scheme

Some of the features discovered related to “class distinctions” and archetypes. Shortcomings, however, included the lack of examination of stylistic features such as sentence length and lexical richness, and of other combinations of features from the corpus linguistics literature.

As became evident during the literature review and post-submission corrections for the thesis, there is actually a very rich tradition of studying characterization on a textual level, going back to the late eighties with digital humanities pioneer John Burrows’ (1987) work on Jane Austen. This study was groundbreaking in that it focused not on drama, which is easily separated by character, but on fiction, which required painstaking separation of speech and descriptive text:

He examines the relationship of style within particular character idiolects and using the thirty most common words in each idiolect and three passages of three hundred words, carries out tests using linear regression which assign the highest correlation between the selected dialogue passages and their corresponding character idiolects, in other words, subsections of character idiolects match the rest of that characters dialogue text.

Lynch (2009, p 17)

Most frequent words were examined using the Delta statistic described in Burrows (2002), which has since become the textual metric de rigueur in the field of digital humanities.
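As a rough illustration of the idea behind Delta, the sketch below z-scores the relative frequencies of the most frequent words across a small collection and takes the mean absolute difference between two texts. It is a simplified pairwise form written for this post, not Burrows’ original procedure or any existing package: the function names are hypothetical, texts are assumed to be non-empty token lists, and the population standard deviation is used purely for convenience.

```python
from collections import Counter
from statistics import mean, pstdev


def z_score_profiles(texts, n_words=30):
    """Map each text name to z-scores of its relative frequencies for the
    n_words most frequent words in the whole collection."""
    pooled = Counter(token for tokens in texts.values() for token in tokens)
    vocab = [word for word, _ in pooled.most_common(n_words)]
    relative = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        relative[name] = [counts[word] / len(tokens) for word in vocab]
    # One column per word, holding that word's frequency in every text.
    columns = list(zip(*relative.values()))
    means = [mean(column) for column in columns]
    spreads = [pstdev(column) for column in columns]
    return {
        name: [(f - m) / s if s else 0.0 for f, m, s in zip(row, means, spreads)]
        for name, row in relative.items()
    }


def delta(profile_a, profile_b):
    """Mean absolute difference between two z-score profiles."""
    return mean(abs(a - b) for a, b in zip(profile_a, profile_b))
```

A lower value suggests two texts use the common words in more similar proportions. With character idiolects, each entry in `texts` would hold the tokenised speech of one character.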

Recently, Rybicki (2006) looked at character in translation, visualising the characters of Henryk Sienkiewicz’s epic dramas by character type over two translations, which inspired some more work by Lynch (2009) which attempted to do the same for Henrik Ibsen in translation. The following plot displays the preservation of character style in translation by clustering using the 100 most frequent words.

[Figure: Ibsen Characters. Plot of main characters in Ibsen’s Ghosts by an early version of Rybicki and Eder’s Delta plotter for R]

Very recent work by Jon Reeve investigated class in Shakespearean drama, drawing distinctions between the language of kings, queens and others using computational stylometry.

Lin and Walker (2011) not only analyse the language of character in 862 film scripts and learn to categorise characters by genre, gender, director and film period with varying accuracy; they also use their learned models to generate speech for computer game characters in the style of a particular film character, controlling parameters such as hedges and use of stammer.

Alvy (Annie Hall): I don’t know. People say Cartmill is st-strange, alright? Err… on the other hand, I don’t rush to judgment.
Indy (Indiana Jones): I don’t rush to judgment, but people say Cartmill is strange.

Alvy (Annie Hall): Right, I am not sure, would you be? I will tell something you because you br-brought me cabbage.
Indy (Indiana Jones): I will tell something you since you brought me cabbage.

Alvy (Annie Hall): Oh I am not sure. Wolf wears a hard shell. On the other hand, he is ge-ge-gentle, isn’t he?
Indy (Indiana Jones): Wolf is gentle but he wears a hard shell.

Alvy (Annie Hall): I see, I don’t know. I respect Wolf, wouldn’t you? He, however, isn’t my close friend.
Indy (Indiana Jones): Wolf isn’t my close friend. But I respect him.

Alvy (Annie Hall): Yeah, I don’t know. Sparrow conveys excitement to my life, so I am fr-fr-friends with her.
Indy (Indiana Jones): I am friends with Sparrow since she brings excitement to my life.

Lines as spoken by Alvy from Annie Hall and Indy from Indiana Jones from Lin (2011)


Lin, Grace I., and Marilyn A. Walker. (2011) “All the World’s a Stage: Learning Character Models from Film.” AIIDE.

Rybicki, J. (2006). Burrowing into translation: Character idiolects in Henryk Sienkiewicz’s trilogy and its two English translations. Literary and Linguistic Computing, 21(1), 91–103.

Lynch, G. (2009). Computational Stylometry and Analysis of Style: A Study of Characterization in Playwrights (Master’s dissertation, Trinity College Dublin).

Vogel, C., & Lynch, G. (2008). Computational Stylometry: Who’s in a Play?. In Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction (pp. 169-186). Springer Berlin Heidelberg.

Vogel, C., & Brisset, S. (2007). Hearing Voices in the Poetry of Brendan Kennelly. Belgian Journal Of English Language and Literature, 1-16.

Garrard, Peter, et al. (2005). “The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author.” Brain 128.2: 250-260.

Field, Brad (1996). “Characterization in O’Neill: Self-doubt as an Aid to Art.” The Eugene O’Neill Review 126-131.

Burrows, J. (1987). Computation into criticism: A study of Jane Austen’s novels and an experiment in method. Clarendon Pr.

The Style Counselor: AKA We Are How We Write

On predicting authorial demographics from writing style and the implications of same.

To write it, it took three months;
to conceive it three minutes;
to collect the data in it all my life.

Francis Scott Key Fitzgerald (1896-1940)

Given the knowledge that both our personal and public data is processed on a daily basis by multinational concerns who manage, curate and regurgitate it wholesale to digital Mad Men with designs on our wallets, we may start to become a tad wary about what and how we post online.

Facebook likes have been shown to predict our personality more accurately than family and friends can (Youyou et al. 2015), but the style of language and the words we use can also betray our demographics.

Language and Gender

A recent PLoS ONE paper by Schwartz et al. (2013) analysed the language used in a large number of Facebook posts and determined defining word usage frequencies for gender, personality type and age, where related topical clusters highlight the age difference between getting wasted (19-22) and drinking responsibly (23-29).

In their dataset, on average males talk more about Xbox, sports, war and taxes replete with a higher volume of expletives, while women converse about shopping, love, family, pets and friends. Pretty broad strokes across middle America, but interesting analysis all the same.

Moving a little off topic, writing style has also been found to betray gender-specific traits. The seminal 2002 investigation by Moshe Koppel and colleagues into literary text from the British National Corpus established stylistic features of male and female language.

In brief, male authors use more determiners (the, a) and female authors use a higher proportion of prepositions such as for and with, coupled with a general preference towards a higher proportion of pronoun usage for female authors. Negation markers are used more frequently by women, and combinations of these features were used to label the gender of an author with an 80% accuracy rate.
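To make the feature types concrete, here is a small sketch that measures the proportion of determiners, prepositions, pronouns and negation markers in a text. The word lists and the `style_profile` function are hypothetical and deliberately tiny; they are not the feature set from the Koppel study, and a classifier would still have to be trained on profiles like these before any gender label could be assigned.

```python
import re

# Illustrative word lists only; real studies use far larger feature sets.
FEATURE_WORDS = {
    "determiners": {"the", "a", "an"},
    "prepositions": {"for", "with", "in", "on", "of", "to", "at", "by", "from"},
    "pronouns": {
        "i", "me", "my", "you", "your", "he", "him", "his", "she", "her",
        "it", "its", "we", "us", "our", "they", "them", "their",
    },
    "negation": {"not", "no", "never"},
}


def style_profile(text):
    """Return the share of tokens falling into each feature category."""
    # Normalise curly apostrophes so contractions stay in one token.
    tokens = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    if not tokens:
        return {label: 0.0 for label in FEATURE_WORDS}
    profile = {}
    for label, words in FEATURE_WORDS.items():
        hits = 0
        for token in tokens:
            # Contractions such as "don't" count as negation markers.
            if token in words or (label == "negation" and token.endswith("n't")):
                hits += 1
        profile[label] = hits / len(tokens)
    return profile
```

Because the matching is purely lexical, ambiguous words such as “to” or “her” are counted crudely; a part-of-speech tagger would be the obvious next step.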

Casual social media users should not be lured into a false sense of security, however, as systems don’t need to read your unpublished Great American Novel to predict your details. The TweetGenie system (website is in Dutch only) takes your tweets and predicts age and gender. (This author’s demographics were predicted accurately.)

Another tool, GenderGuesser from Dr. Neal Krawetz, based on previous work by Argamon, Koppel and associates, predicts author gender from formal and informal text in a web browser (try it here). It is claimed to achieve 60-70% accuracy in guessing gender and is trained on US English, meaning that European English can outwit it somewhat, although investigations using this blog as a source appear promising.

The “Write” Profile

Although links between gender and language use have been studied extensively, personality type, native language, political affiliations and other more fine-grained characteristics can also be extrapolated from our writing style in a similar fashion.

Gill et al. (2009) found that the blogs of neurotics reflected more negative sentiment and more statements about themselves, while extraverts expressed more balanced emotions and made more references to events and happenings, choosing to focus on third parties more frequently.

Wong and Dras (2011) examined syntactic parses of the writing of non-native English speakers and found that an author’s native language could be classified with 80% accuracy on a corpus of 90 learner essays each from a sample of seven languages (Bulgarian, Czech, French, Russian, Spanish, Chinese, and Japanese). Their machine learning system identified language-specific parse rules such as noun phrases without determiners (indicative of Chinese native speakers) and prepositional phrases such as “according to”, which corresponds to a direct equivalent in Chinese and is used less frequently by authors from other language backgrounds.

Author Unknown?

These research projects demonstrate that no matter how we try to obfuscate our profiles with fake names, ages and profile pictures, our real selves can be revealed by a simple Oxford comma, self-referential tweet or more likely, a longitudinal analysis of our writing style.

This raises an interesting research question:

Could a language processing system be developed to obfuscate such variables?

An onion router for our blog posts, or a voice distortion system for our online social presence?
The human voice is frequently manipulated using sonic frequency analysis. Could the same be done for our writing using word frequency analysis?

Imagine specifying a gender, personality and age setting and allowing a system to take our words and synthesise them in another style.
Such a system could also be useful for text personalisation and summarization, taking a wordy blog post and summarising it succinctly in a tweet, or dialing down the complexity of a legal policy document for a non-native speaker.

However, a darker side to the availability of such a system in the wild could be the severe ethical implications regarding child safety on the Web. A large body of research in the domain of deception detection is currently dedicated to systems that detect cyber-predators in online chat scenarios using linguistic and non-linguistic features (see Bogdanova et al. 2012).

As a coda to this cautionary tale, read about how J.K. Rowling’s gender-bending pseudonymous work The Cuckoo’s Calling was unmasked by computational stylometry expert Dr Patrick Juola on the Language Log.

Caveat scriptor!


Moshe Koppel, Shlomo Argamon and Anat Rachel Shimoni (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.

Nature news article on Koppel’s work

TweetGenie press coverage and scientific basis

Nguyen, D., Gravel, R., Trieschnigg, D., & Meder, T. (2013). “How Old Do You Think I Am?”: A Study of Language and Age in Twitter. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media. AAAI Press.

Wong, S. M. J., & Dras, M. (2011). Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1600-1610). Association for Computational Linguistics.

Youyou, Wu, Michal Kosinski, and David Stillwell. “Computer-based personality judgments are more accurate than those made by humans.” Proceedings of the National Academy of Sciences (2015): 201418680.

Gill, Alastair J., Scott Nowson, and Jon Oberlander. “What Are They Blogging About? Personality, Topic and Motivation in Blogs.” ICWSM. 2009.

Schwartz, H. Andrew, et al. “Personality, gender, and age in the language of social media: The open-vocabulary approach.” PloS one 8.9 (2013): e73791.

Bogdanova, Dasha, Paolo Rosso, and Thamar Solorio. “Modelling fixated discourse in chats with cyberpedophiles.” Proceedings of the Workshop on Computational Approaches to Deception Detection. Association for Computational Linguistics, 2012.