In which the author reminisces about a series of coincidences along the academic road and writing a Master’s thesis on computational stylometry.
(Warning, post contains misty-eyed sentimentality and may not be to everyone’s taste)
And as the spotlights fade away,
And you’re escorted through the foyer,
You will resume your callow ways,
But I was meant for the stage.
I Was Meant for the Stage
Her Majesty The Decemberists
The choice to study German instead of the default French at second-level was to a great extent accidental, as the German class was in need of extra bodies to make up the numbers that year or face extinction.
Swapping the language of Baudelaire for that of Brecht led to a multi-disciplinary undergraduate experience in computer science, linguistics and language, along with a revolutionary Erasmus experience at the LMU Munich.
The author’s choice to complete a Master’s by Research was also the road less travelled at that particular junction point, and involved reneging on a planned post-college year as an English language assistant in deepest Bavaria. The offer of an all-expenses paid stipend plus teaching hours proved too much to pass up. However, this offer was also subject to the first-choice candidate dropping out of the race at the last moment, a simple twist of fate once more.
Viewed through the telescopic rose-tinted lens of hindsight, those early halcyon days offered a rare opportunity: carte blanche to carry out research on any topic of interest, no matter how esoteric.
The originally submitted application for the TCD scholarship concerned linguistic judgments of grammaticality, which would also have made a fine topic of study had it not been for a chance departmental seminar on the stylistic changes in a writer with Alzheimer’s disease (Irish-born British author Iris Murdoch).
As fate would have it, this topic would go on to be examined thoroughly in cognitive linguistics and psychology (see Garrard 2005). Inspired in part by this glimpse of the power of stylometry, the author swapped syntax for statistical text analysis, and so began a long day’s journey into the scholarly twilight of authorship attribution and the so-called digital humanities.
The main research question dealt with during those two years concerned the textual stylometry of characters written by playwrights, or more succinctly:
Do playwrights create stylistically distinct characters in their works?
The inspiration for this work came from a study of character in the work of Irish poet Brendan Kennelly, at that time a Professor in English at TCD. Vogel (2007) found a number of recurrent characters in his poetry, in particular the character of Ozzie, fluent in the dialect of Dublin’s Northside:
ozzie is stonemad about prades
so he say kummon ta Belfast for the 12th
an we see de Orangemen beatin the shit outa de drums
beltin em as if dey was katliks heads
from The Book Of Judas by Brendan Kennelly, presented in Vogel (2007)
The methodology was borrowed from the corpus linguistics literature. Relative frequencies of n-grams were compared to one another using the chi-squared test; then, for each category, within-category and outside-category similarity functions were computed using the Mann-Whitney rank method. Thus, a textual segment was found to be more similar either to its own category (character, play, author) or to everything else.
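The comparison described above can be sketched roughly as follows. This is only an illustrative toy and not the thesis code: the use of character bigrams, the averaging of the chi-squared statistic over n-gram types, and the function names `ngrams`, `chi_squared_distance`, `mann_whitney_u` and `own_category_preference` are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def ngrams(text, n=2):
    """Count character n-grams in one text segment (bigrams are an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chi_squared_distance(a, b):
    """Chi-squared statistic between two n-gram profiles, averaged over n-gram types."""
    total_a, total_b = sum(a.values()), sum(b.values())
    keys = set(a) | set(b)
    if not keys or not total_a or not total_b:
        return 0.0
    grand = total_a + total_b
    score = 0.0
    for gram in keys:
        row = a[gram] + b[gram]
        expected_a = row * total_a / grand
        expected_b = row * total_b / grand
        score += (a[gram] - expected_a) ** 2 / expected_a
        score += (b[gram] - expected_b) ** 2 / expected_b
    return score / len(keys)


def mann_whitney_u(within, outside):
    """U statistic: pairs where the within-distance is smaller; ties count half."""
    u = 0.0
    for w in within:
        for o in outside:
            if w < o:
                u += 1.0
            elif w == o:
                u += 0.5
    return u


def own_category_preference(segments, label, n=2):
    """Share of (within, outside) pairs in which the within-category distance is smaller.

    Values above 0.5 suggest the segments resemble their own category more
    than everything else. Needs at least two segments under `label` and one elsewhere.
    """
    profiles = {lab: [ngrams(t, n) for t in texts] for lab, texts in segments.items()}
    own = profiles[label]
    others = [p for lab, ps in profiles.items() if lab != label for p in ps]
    within = [chi_squared_distance(x, y) for x, y in combinations(own, 2)]
    outside = [chi_squared_distance(x, y) for x in own for y in others]
    return mann_whitney_u(within, outside) / (len(within) * len(outside))
```

Called with a dictionary that maps a category label (a character, play or author) to a list of text segments, `own_category_preference` returns a number between 0 and 1. A full analysis would also attach a significance level to the rank comparison, which this sketch leaves out.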
Once the system had been created to separate character contributions from one another, the analysis could begin in earnest.
Playing a role
Playwrights (and screenwriters) were chosen from numerous epochs including:
- Jacobean/Elizabethan (Shakespeare, Marlowe, Jonson, Webster)
- Victorian/Celtic Revival (Shaw, Wilde, Synge)
- 20th Century American (Eugene O’Neill)
- Modern Screenplays (Cameron Crowe, William Goldman)
Based on the results of the experiments, those playwrights who incorporated dialectal orthography were more likely to produce distinct characters. To sum up:
“Spelling variation to indicate dialectic variation was captured as a stylistic feature”
Characters of this nature included the Swedish sea captain Chris Christopherson from Eugene O’Neill’s Anna Christie, who speaks in a strange Swedish-English patois and is stylistically distinct within O’Neill and his contemporaries.
“Py yiminy, Ay forgat. She say she come right avay, dat’s all. Ay gat speak with Larry. Ay be right back. Ay bring you oder drink.”
from Anna Christie by Eugene O’Neill
Another O’Neill character of note is Yank from The Hairy Ape, whose Noo Yawk aphorisms are clearly marked in speech:
“G’wan! Tell it to Sweeney!
Say, who d’yuh tink yuh’re bumpin’? Tink yuh
own de oith?”
from The Hairy Ape by Eugene O’Neill
Or, as put more eloquently by those in the literary criticism community:
It notes that characterization in O’Neill’s one-act sea plays is largely a matter of stage-dialect.
Across the class divide
The character of Doolittle, the father of Eliza, from Shaw’s Pygmalion was found to be distinctive among Shaw’s characters, apparently by virtue of his addressing Higgins in the formal manner of a Cockney squire.
Shaw does not employ the post-modern dialectal orthography but manages to convey class and dialect through the method of address.
I thank you, Governor.
“Well, the truth is, I’ve taken a sort of fancy to you, Governor; and if you want the girl, I’m not so set on having her back home again but what I might be open to an arrangement. Regarded in the light of a young woman, she’s a fine handsome girl. As a daughter she’s not worth her keep; and so I tell you straight. All I ask is my rights as a father; and you’re the last man alive to expect me to let her go for nothing; for I can see you’re one of the straight sort, Governor. Well, what’s a five pound note to you? And what’s Eliza to me?”
from Pygmalion by George Bernard Shaw
The villain of the piece
Among Shakespeare and his contemporaries, Ben Jonson’s character Tucca from The Poetaster displays a choice command of era-specific insults:
“sort of goslings, when they suffered so sweet a breath to perfume the bed of a stinkard:
thou hadst ill fortune, Thisbe; the Fates were infatuate, they were, punk, they were.
I am known by the name of Captain Tucca, punk; the noble Roman, punk: a gentleman, and a commander, punk. I’ll call her.
–Come hither, cockatrice: here’s one will set thee up, my sweet punk, set thee up.
Aha, stinkard! Another Orpheus, you slave, another Orpheus! an Arion riding on the back of a dolphin, rascal! Shew them, bankrupt, shew them; they have salt in them, and will brook the air, stinkard.”
from The Poetaster by Ben Jonson
Although J.M. Synge is well known for doing his part in perpetrating the stage “Oirish” stereotypes that have punctuated drama and film in the 20th century, his characters were not found to possess a distinctive voice. In fact, he was one of the least distinctive authors in the corpus when it comes to creating character.
“Ten thousand blessings upon all that’s here, for you’ve turned me a likely gaffer in the end of all, the way I’ll go romancing through a romping lifetime from this hour to the dawning of the judgment day.”
from The Playboy of the Western World by John Millington Synge
All’s fair in love and corpus linguistics?
The main conclusions from the thesis and related work were:
- In general, not all characters of the dramatists studied are created equal(ly distinct)
- If they are different from the others, it’s generally due to
a. Orthography by way of dialect (Swedish, New York, Cockney)
b. Use of epithets (punk, cockatrice)
c. Rhyme scheme
Some of the features discovered related to “class distinctions” and archetypes. Shortcomings, however, included the lack of examination of stylistic features such as sentence length and lexical richness, and of other combinations of features from the corpus linguistics literature.
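For readers unfamiliar with the features named above, the two simplest ones can be sketched in a few lines. This is only an illustration under stated assumptions, not something used in the thesis: sentences are split naively on `.`, `!` and `?`, lexical richness is approximated by the type-token ratio, and the function names are made up for the example.

```python
import re


def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence (naive splitting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)


def type_token_ratio(text):
    """Distinct word forms divided by total word tokens, one crude measure of lexical richness."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return len(set(tokens)) / len(tokens)
```

The type-token ratio is sensitive to text length, so comparisons between characters would need samples of similar size.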
As evidenced during the literature review and post-submission corrections for the thesis, there is actually a very rich tradition of studying characterization on a textual level, going back to the late eighties and digital humanities pioneer John Burrows’ (1987) work on Jane Austen. This study was groundbreaking in that it focused not on drama, which is easily separated by character, but on fiction, which required the painstaking separation of speech from descriptive text:
He examines the relationship of style within particular character idiolects and using the thirty most common words in each idiolect and three passages of three hundred words, carries out tests using linear regression which assign the highest correlation between the selected dialogue passages and their corresponding character idiolects, in other words, subsections of character idiolects match the rest of that characters dialogue text.
Lynch (2009, p 17)
Most frequent words were examined using the Delta statistic described in Burrows (2002), which has since become the textual metric de rigueur in the field of digital humanities.
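As a rough illustration of the Delta statistic: relative frequencies of the most frequent words are converted to z-scores across the corpus, and the Delta between two texts is the mean absolute difference of their z-scores. The sketch below assumes whitespace tokenisation, lower-casing and the population standard deviation, and the name `burrows_delta` is chosen for the example. It is not the implementation used in the thesis or in any published tool.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(texts, n_words=100):
    """Return {(name_a, name_b): delta} for every pair of texts in `texts`.

    `texts` maps a name to a raw string; tokenisation is naive (an assumption).
    """
    tokenised = {name: text.lower().split() for name, text in texts.items()}

    # Vocabulary: the n most frequent words over the whole corpus.
    corpus_counts = Counter(tok for toks in tokenised.values() for tok in toks)
    vocab = [word for word, _ in corpus_counts.most_common(n_words)]

    # Relative frequency of each vocabulary word in each text.
    freqs = {}
    for name, toks in tokenised.items():
        counts = Counter(toks)
        freqs[name] = [counts[word] / len(toks) for word in vocab]

    # z-score every word across the texts; a word with zero spread contributes 0.
    names = sorted(freqs)
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)

    # Delta: mean absolute difference between the z-scores of two texts.
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Smaller values indicate more similar usage of common words. In practice each text would be one character's collected dialogue, and the corpus would need to be much larger than a toy example for the z-scores to be meaningful.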
Recently, Rybicki (2006) looked at character in translation, visualising the characters of Henryk Sienkiewicz’s epic dramas by character type over two translations. This inspired further work by Lynch (2009), which attempted to do the same for Henrik Ibsen in translation. The following plot displays the preservation of character style in translation by clustering using the 100 most frequent words.
Plot of main characters in Ibsen’s Ghosts by an early version of Rybicki and Eder’s Delta plotter for R
Very recent work by Jon Reeve investigated class in Shakespearean drama, drawing distinctions between the language of kings, queens and others using computational stylometry.
Lin and Walker (2011) analyse the language of character in 862 film scripts and learn to categorise characters by genre, gender, director and film period with varying accuracy. They also use their learned models to generate speech for computer game characters in the style of a particular film character, controlling parameters such as hedges and the use of stammer.
| Annie Hall: Alvy | Indiana Jones: Indy |
|---|---|
| I don’t know. People say Cartmill is st-strange, alright? Err… on the other hand, I don’t rush to judgment. | I don’t rush to judgment, but people say Cartmill is strange. |
| Right, I am not sure, would you be? I will tell something you because you br-brought me cabbage. | I will tell something you since you brought me cabbage. |
| Oh I am not sure. Wolf wears a hard shell. On the other hand, he is ge-ge-gentle, isn’t he? | Wolf is gentle but he wears a hard shell. |
| I see, I don’t know. I respect Wolf, wouldn’t you? He, however, isn’t my close friend. | Wolf isn’t my close friend. But I respect him. |
| Yeah, I don’t know. Sparrow conveys excitement to my life, so I am fr-fr-friends with her. | I am friends with Sparrow since she brings excitement to my life. |
Lines as spoken by Alvy from Annie Hall and Indy from Indiana Jones, from Lin and Walker (2011)
Lin, G. I., & Walker, M. A. (2011). All the World’s a Stage: Learning Character Models from Film. AIIDE.
Rybicki, J. (2006). Burrowing into translation: Character idiolects in Henryk Sienkiewicz’s trilogy and its two English translations. Literary and Linguistic Computing, 21(1), 91–103.
Lynch, G. (2009). Computational Stylometry and Analysis of Style: A Study of Characterization in Playwrights (Master’s dissertation, Trinity College Dublin).
Vogel, C., & Lynch, G. (2008). Computational Stylometry: Who’s in a Play? In Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction (pp. 169-186). Springer Berlin Heidelberg.
Vogel, C., & Brisset, S. (2007). Hearing Voices in the Poetry of Brendan Kennelly. Belgian Journal Of English Language and Literature, 1-16.
Garrard, P., et al. (2005). The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author. Brain, 128(2), 250-260.
Field, B. (1996). Characterization in O’Neill: Self-doubt as an Aid to Art. The Eugene O’Neill Review, 126-131.
Burrows, J. (1987). Computation into criticism: A study of Jane Austen’s novels and an experiment in method. Clarendon Press.