Death Becomes Us

In which the author waxes lyrical about a euphemistic passive form in Irish death notices and muses on the notion of an Irish news dialect


Recently I was piqued by a Twitter message from a French lecturer in Linguistics who mused on the classic Irish news formulation “The death has occurred/taken place of X”.

He went further, verifying the phrase in the NOW corpus and showing that it was indeed almost solely confined to Irish news sources with the occasional example on sites from Commonwealth countries. Our “patient zero” example of the form is unfortunately not to be found as the NOW corpus, interesting a resource as it is, dates back only as far as 2010.

It got me thinking however about other favourite phrases of the Irish newsroom. “Out socialising” as a euphemism for someone who was out drinking is a particular bugbear of mine and I was curious to check the provenance of this egregious phrase. Another one I’ve noticed is “died tragically”, which I had assumed was a euphemism for suicide, although the corpora bore out that this is not always the case.

First, the example of tragic death appears also to be an Irish newsism, a new-phemism if you will?


Even in the one example from a non-Irish source, the phrase is died tragically young, where tragically, although adverbial, modifies young, whereas in all of the Irish cases the subject died tragically in or following an event (usually an accident).

Or perhaps, there’s something else going on. Searches for the phrase tragically died return a good deal more results for other dialects of English, although it is still more frequent in an Irish context. Is died tragically then in fact a grammatical quirk of Hiberno-English?

Social animals

Sure enough, the phrase “out socialising” was also considerably more common in Irish news sources, with the context almost exclusively referring to being out “on the town” during the evening, compared with one example from Sri Lanka which seems to refer to more general non-nocturnal socialising.



Returning to the initial example of the death which has occurred, it is unclear where this phrase stems from. One suggestion which was posted on Twitter was that it may have been a translation from the Irish language, although this seems unlikely, as the Irish for someone dying is fuair sé bás (lit. “he got death”), which does not involve a passive form.

I’m sure there are many more examples of idiosyncratic forms in Irish news text and seeing as how the NOW corpus is online, maybe this will now become a vibrant area of research?

New-phemisms of the world unite!

We can but dream.


Lost in the New World


In which big data and digital humanities tools are brought to bear to shed light on a family mystery.

German poem commemorating early German settlers by Elfriede Petersen, Nova Petropolis, Rio Grande do Sul, Brazil

A simple chapel like this here

built by sturdy German hand

allowed the first forest pioneers

to trust this foreign land

The old country is lost to him

A new one he’s devoutly found

children and grandchildren born to him

By hard work here he’s bound.

(Rough translation, G Lynch 2018)


Sometimes we choose a project, and sometimes, a project chooses us.

Some time between 1920 and 1929 or so, a German airman from the border town of Bautzen in the former Kingdom, now the Bundesland of Saxony, with a surname indicative of membership of a Slavic ethnic minority whose language and rights are still protected under German law today, boarded an ocean liner (probably in Hamburg although the exact departure port has been lost in the mists of time) which would take him across the Atlantic to the New World for a new life in Southern Brazil.

From what we know of post-war Germany at the time, his journey was not uncommon. Indeed, German emigration to Brazil spiked during the World Wars, although the largest proportion of emigrants had left before 1870 for the valley towns of the Serra Gaúcha in Rio Grande do Sul and nearby Santa Catarina, the streets of São Paulo and the plantations of Minas Gerais. As was common with recent émigrés, many took on new names and identities and left families and traditions back in die Heimat, never to return.

Those artifacts which remain of our airman are photos of his international pilot’s license and some scanty online records of his marriage and death in Southern Brazil. He also had several children with at least two women, one of whom was the grandmother of a Brazilian lawyer who would arrive in Europe in 2010, where she met a computational linguist/digital humanist, and the rest, as they say, is history.

This computational linguist was always on the lookout for a historical story, some tale which could be at least partially illuminated through data, or archival analysis, or some form of exotic research which would discover family secrets and fill in the gaps left by history.


The first, and most pressing question, was when did this gentleman arrive in Brazil?

Surely there must have been a record of entry, like the Ellis Island records of the Irish in America. As it turns out, Brazil was no stranger to bureaucracy, and the Museu da Imigração is home to a treasure trove of entry records from the Port of Santos, the biggest passenger and freight port in the country. Even more amazing was the fact that they had scanned copies of a massive number of entry cards going back even to the 19th Century. Hurray for the bureaucrats!

Of course, there was the small matter of getting the records from the system. Luckily, said computational linguist had a little bit of web scraping expertise using the excellent Python Scrapy package. And so, over a few nights of spidering PDF files from the website of the museum (whose system administrators must have either been very patient souls indeed or out celebrating for Carnaval), a large proportion of the boarding cards corpus, particularly those from 1918 to 1930, was downloaded locally.

(As an aside, it’s possibly worth mentioning the sheer beauty of some of these documents; see the cover page below)



(Roughly translated, regarding the immigrants who depart from this port today on the vessel Fortunatta, processed by Mr Gustavo Gavotti with destination Santos on the orders of the Messrs Angelo Fiorita and company of Rio De Janeiro and in accordance with the Respectable Governor of the Province of São Paulo, in accordance with contract of the 21st of August 1894, Genoa, 27th of August, 1895)

However, the pleasure endeth herewith, as the actual records themselves were the first roadblock along the way.


Optical character recognition on the above document might work with a properly trained model. However, handwritten text recognition on 100-year-old scanned documents is what one might refer to as a “hard problem”.

Luckily, there are some plucky digital humanists out there trying to solve these tough challenges and, even better, they are sharing the proceeds on GitHub.

If you manage to get access to a large cluster to build the model, the LAIA system from the Universitat Politècnica de València has been created for the explicit task of English-language handwriting recognition. Potentially, with some tweaking of the language models, it may be possible to get this system running on the Portuguese records. Harald Scheidl also has a great tutorial on how to build a simple handwriting recognition system using TensorFlow, with tips for dealing with cursive text.

Ocropy is another excellent toolkit for doing OCR on structured documents, and this worked relatively well on some of the ships’ records.

Finally, an EU project named Transkribus allows users to leverage cloud-trained models for handwriting recognition.

The past may not remain hidden for much longer.





Dr Who? Towards an attribution of the authorship of “Ireland in Tears”

In which the author wades into an age-old case of who-wrote-it, familiarising himself with eighteenth century wit, history and current affairs in the meantime.

During my Master’s studies concerning character idiolects in drama, I became fascinated with Real-Life Stories of authorship attribution and literary detective work. Reading tales of CIA helicopters descending on football fields to whisk literary detective and successful self-promoter Dr Don Foster off to Langley, VA to solve a Federal crime conjured up Indiana-Jones-esque adventures rife with derring-do and textual intrigue.

Imagine a morning of expert witness proceedings at the Old Bailey. Tea at the Ritz. An audience with the Queen and the Royal Society. 60 Minutes. Time Magazine.

The reality, of course, is probably a tad more benign.

Some guy writes anonymous pamphlet. Nothing major happens. Guy is forgotten in the mists of time. A student writes an M.Sc thesis on pamphlets. Nothing much happens. Thesis is covered in the Guardian. Mild Twitter buzz. Invited talk at a Midlands Redbrick.

However, glory aside, there still remains something immensely satisfying about the settling of disputed or unknown authorship, like an Agatha Christie whodunnit, and “rewriting history”, even if it is only one footnote at a time.

A question of authorship

Recently, I was alerted to an intriguing open authorship attribution question, namely that of the authorship of a number of anonymous 18th century satirical political pamphlets concerning Irish affairs.

The topic of satire has been pertinent in Irish current affairs of late, as He Who Shall Not Be Named unleashed another legal gagging order, this time with Irish satirical media outlet Waterford Whispers News as the target of the injunction. Dean Swift would probably empathise, were he around today.

Never tire of satire

As mentioned, Ireland has a long, proud tradition of satirical political correspondence, going back to the days of Swift’s own Drapier’s Letters, and it is indeed here that the topics of authorship attribution and satire are united. A number of the anonymous satirical pamphlets brought to my recent attention are signed M.B. Drapier, the same pseudonym used by Swift in the Letters, although the language is not believed to be Swiftian.

The author of these pieces is believed to be Tobias Smollett MD, Scottish sawbones, adventurer and erstwhile novelist, author of such delightfully titled yarns as “The Adventures of Roderick Random”. New Zealand-based historian Don Shelton, the proposer of the notion that Smollett was the author of these anonymous satires, first encountered Smollett in an unrelated project on 18th century murder and man-midwifery.

Disputed Works

The first pamphlet in question is titled Ireland in tears, or, a letter to St. Andrew’s eldest daughter’s youngest son. The catalogue entry at Marsh’s Library Dublin describes it as:

An attack on Lionel Cranfield Sackville, Duke of Dorset, with special reference to the case of Arthur Jones Nevill who was expelled from the Irish House of Commons on Nov. 23, 1753.

It is a riveting read in which the author expresses, amongst other things, a fondness for the Aul’ Sod and a particular Dublin University. Shelton presents a relatively solid case for authorship in his blog post.

“The use of the pseudonym Major Sawney MacCleaver as the author, who appears as a character in previous works by Smollett, the references to the author’s Scottish origin throughout and the very title of the pamphlet (Ireland in Tears) being undeniably similar to one of Smollett’s most famous works, The Tears of Scotland, indicates at least the possibility that Smollett is a likely candidate for authorship. Other pamphlets which may have been authored by Smollett are presented in the end notes of Ireland in Tears.”

These include such jolly titles as:

A Genuine Letter from a Freeman of Bandon to George Faulkner, Occasioned by a Lying Extract of a Letter from Bandon, Inserted in His Journal the 24th of December Last, Dublin, 1755, 11pp. A Genuine Letter from a Freeman of Bandon, to George …

A Vindication of the Ministerial Conduct, of his Grace the Duke of Dorset, London, M Griffiths, 1755, 24pp., by a servant of the Crown, (Eleazar Albin?), London, M Griffiths. A Vindication of the Ministerial Conduct, of His Grace the …

And one attributed commonly to Swift:

The Imitation of Beasts; or, the Irish Christian Doctrine, Dublin, J Swift, 1755.

By implication, Smollett likely authored other Irish themed pamphlets, including:

A Just and True Answer to a Scandalous Pamphlet Call’d, A Genuine Letter from a Freeman of Bandon to George Faulkner, 1755, 15pp.

Ireland Disgraced, Or, The Island of Saints Become an Island of Sinners: Clearly Proved, in a Dialogue Between Doctor B–tt and Doctor B–ne, by (?) John Brett (Rector of Moynalty), Dublin, S Hooper and A Morley, 1758, 75pp.

A Letter to the Right Hon. the Lord *******, with an Account of the Expulsion of the Late Surveyor-General from the House of Commons of Ireland, in Answer to Thoughts on the Affairs of Ireland.

Known works

Smollett is perhaps best known for his delightfully-titled series of picaresque novels:

  • The Adventures of Roderick Random
  • The Adventures of Peregrine Pickle
  • The Adventures of Ferdinand Count Fathom
  • The Expedition of Humphry Clinker

and as a translator of Don Quixote and French picaresque classic Gil Blas. Apparently quite beloved in past years (George Orwell was a fan), he has lately slipped out of the public consciousness. This is quite a shame, as he is eminently quotable, as per the examples below.

“Nothing agrees with me so well as hard exercise, which, however, the Indolence of my Disposition, continually counteracts “

“Some folks are wise, and some are otherwise”


In order to investigate authorial style similarities, it will be necessary to compile a corpus of Smollett’s writing. As the genre of the pamphlets in question is most similar to letters, a collection of Smollett’s personal correspondence was made, extracted from (Noyes 1926) in the HathiTrust Collection. This correspondence consists of ca. 70 letters dating from the period 1737 to 1771, when Smollett died in Italy of complications arising from tuberculosis. It has been collected and OCR-corrected to allow for comparison with the anonymous pamphlets in question, and also with the writings of Smollett contemporaries such as David Hume, Jonathan Swift, David Garrick, Laurence Sterne, Samuel Johnson and others, using the General Imposters method, currently sweeping the boards at authorship attribution shared tasks such as the PAN series.

There may be issues with different versions of 18th century texts, some of which have been “corrected” to bring them up to modern typographic standards in Project Gutenberg and some of which Remain of a Stylistic Temperament more Fitting to the Period.


Smollett was the Editor of the London Monthly Review for a number of years and according to Shelton, there are likely hundreds, if not thousands of anonymous letters from his hand, although not all are even available online. An authorship attribution problem such as this is reminiscent of a recent find of an annotated copy of Charles Dickens’ All The Year Round, attributing a number of hitherto anonymous articles which scholars had puzzled over for decades.

After all, good satire, like art, should stand the test of time.


The Letters of Tobias Smollett, M.D., collected and edited by Edward S. Noyes (Cambridge: Harvard University Press, 1926).

Koppel, Moshe, Jonathan Schler, and Elisheva Bonchek-Dokow. “Measuring Differentiability: Unmasking Pseudonymous Authors.” Journal of Machine Learning Research 8.2 (2007): 1261-1276.

Foster, Donald W. Author unknown: On the trail of anonymous. Macmillan, 2000.

Never Set A Word*: My Own Private Fitzcarraldo….

In which the author describes a long-running project on engineering creative headline generation using a range of natural language processing libraries and draws analogies to a Herzogian tale of obsession.

*Sample title generated by Pundit on a draft of an academic paper on this topic.

The idea of the epic or quest is as old as history itself and has been depicted at length throughout Classical Antiquity, through the tales of the Crusades and Norse sagas right on up to Melville’s whaling epics, the novels of Thomas Pynchon and the cinema of Werner Herzog. Some people (generally, but not exclusively, male) attach themselves to chasing insane dreams which may come to nothing. “Think big, young man” someone said once, and they never stopped believing.

The 1982 motion picture Fitzcarraldo tells such a tale, of an Irish(?) rubber baron who transports a steamship through the Amazon, portrayed in typical manic fashion by Herzog muse Klaus Kinski.

Stranger than fiction is the parallel insanity of making such a film on location in the Amazon: the mad dream of Herzog and his quarrels with natives draw comparison between the German director and the real-life Peruvian Carlos Fitzcarrald, on whom the film was based. Not a million miles away from the grand designs of modern titans of industry such as Howard Hughes and Elon Musk and their projects involving titanic aircraft and interstellar transportation.

However, my own Amazonian steamboat, White Whale or Mars One, a.k.a. the folly quest that just won’t die, is the Pundit system.

Initially conceived as a side project during a particularly fallow period of my Ph.D years, it was a system which allowed the user to input phrases and receive a number of nonsensical puns, as seen below:

  • Harry Potter and the Half Blood Prince -> Hairy Potter and the Half Cut Prince
  • The Motorcycle Diaries -> The Motorcycle Dowries

It has in the years since morphed into something beyond a mere distraction, incorporating context and a barrage of natural language techniques to simulate linguistic creativity on topic and on tap. In the meantime, a pun generator appeared online, using Wikipedia as a data source and providing similar functionality in a dynamic fashion. The only missing feature was context.

The Pun also Rises

Long derided as the lowest form of wit or as dad jokes, puns still retain favour among news outlets such as The Economist, which has eked a lot of headline value from puns concerning the Chinese economy (Yuan small step for man..). However, coming up with puns which are both creative and relevant to an article is by no means a cognitively cheap endeavour. Given my interest in natural language processing, I began to think about how such a process could be engineered or, at the very least, augmented. Imagine a web service that suggested relevant on-topic creative article titles to you as you compose a blog post or news article. A Clippy for puns?

It started with a spark

As an example of the process, take the article title (Love Me Tinder) which has been common of late in articles discussing a certain swipe-happy dating application. It is almost over-used to the point of cliché: I counted a handful of articles and even a short film with this title at the time of writing. As the zeitgeist has taken a shine to this particular phrase, it should be a prime candidate for deconstructing the mechanics. The very name of the dating app Tinder has connotations of matches, sparks and other such metaphors used for coupling and finding love. Such an article may also contain mentions of terms such as love, sex, dating, marriage, apps, technology, iPhones and other related terminology.

Imagine for a moment that we have access to a large database of stock phrases, book titles and cultural memes (such a database existed and was used in the early version of this system, although it has recently been acquired and will be taken offline). We could query this database for phrases that contain our keywords, or combinations of same (Game Set Match, Match Point). As an added constraint, we also search for phrases that contain one of our keywords and a term that sounds like another keyword. Tinder sounds sort of like tender and cinder, and our search now returns a number of results containing tender.

  • Tender is the Night – F Scott Fitzgerald novel (1934), Blur song (1999)
  • Try A Little Tenderness – Otis Redding song (1966)
  • Love me Tender – Elvis Presley song and film (1956)
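
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PHRASES list is a tiny hypothetical stand-in for the stock-phrase database, SOUNDS_LIKE is a hand-written table rather than a real phonetic matcher, and pun_candidates is a made-up helper name.

```python
# Hypothetical stand-in for the stock-phrase database (entries taken from the text).
PHRASES = [
    "Tender is the Night",
    "Try A Little Tenderness",
    "Love Me Tender",
    "Game Set Match",
    "Match Point",
]

# Hand-written sound-alike table (an assumption, not a phonetic algorithm).
SOUNDS_LIKE = {"tinder": ["tender", "cinder"]}


def pun_candidates(keyword, phrases=PHRASES, sounds_like=SOUNDS_LIKE):
    """Return (original phrase, punned phrase) pairs for one keyword."""
    results = []
    for phrase in phrases:
        words = phrase.split()
        for i, word in enumerate(words):
            for alt in sounds_like.get(keyword, []):
                if word.lower().startswith(alt):
                    # Swap the sound-alike for the keyword, keeping any suffix.
                    swapped = keyword + word[len(alt):]
                    if word[0].isupper():
                        swapped = swapped.capitalize()
                    punned = " ".join(words[:i] + [swapped] + words[i + 1:])
                    results.append((phrase, punned))
    return results


for original, punned in pun_candidates("tinder"):
    print(f"{original} -> {punned}")
```

Matching on a prefix is what lets Tenderness become Tinderness; a stricter whole-word match would miss it.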

Once we’ve obtained a list of possible candidates, now the fun begins!

We need to rank these according to some combination of metrics so that the most appropriate rise to the top.

In this case, it seems pretty clear why Love me Tinder is the ideal (albeit overused) title. It encapsulates the name of the product (sort of), a related topical concept (love) and, to top it all, it is the title of not one, but two creative works featuring one of the most beloved recording artists of all time. I’m still partial to Try a Little Tinderness (1,080 Google hits) or even Tinder is the Night (18,000 Google hits), although the internet and the publishing industry appear to think differently (41,000 Google hits for Love me Tinder).

Now we’ve seen the process in action, can we automate it computationally?

We can do wordplay for you wholesale

Going back to our article, our first goal is to extract topics. The very definition of a topic is vague; however, generally we want to group keywords and phrases by semantic relatedness. In the example above, we can imagine a dating topic (love, date, romance, marriage, couple, heart) and a technology topic (app, technology, internet, web, data, swipe..).

There are numerous algorithms which attempt to extract topics from text. Among the most widely used is the family of topic modeling approaches, the most popular of which are LDA and NMF.

There are issues here however, and using these algorithms may require a large corpus of text, which is fine for longform articles but not necessarily feasible for shorter opinion pieces or breaking news. So, in the spirit of the project, I implemented my own approach.

The first part, which post hoc appears similar to the RAKE algorithm, looks like:

  • Remove stopwords based on a standard list
  • Keep any tokens that appear at least twice in the text
  • From this list, keep only nouns and verbs.
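
The three steps above might look something like the following. It is a minimal sketch, not the Pundit code: the stopword set is abbreviated, and TOY_POS is a hypothetical toy lexicon standing in for whatever part-of-speech tagger is actually used.

```python
import re
from collections import Counter

# Abbreviated stopword list (assumption: a real system would use a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with"}

# Hypothetical toy POS lexicon standing in for a real tagger.
TOY_POS = {
    "love": "NOUN", "app": "NOUN", "date": "NOUN",
    "swipe": "VERB", "new": "ADJ", "quickly": "ADV",
}


def extract_keywords(text, stopwords=STOPWORDS, pos_lexicon=TOY_POS, min_count=2):
    """Stopword removal, frequency threshold, then a noun/verb filter."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    frequent = [t for t, c in counts.items() if c >= min_count]
    return sorted(t for t in frequent if pos_lexicon.get(t) in {"NOUN", "VERB"})


sample = "Love is in the app. Swipe, swipe and swipe on the new app for love. A new date quickly."
print(extract_keywords(sample))
```

On the sample sentence, "new" passes the frequency threshold but is dropped by the noun/verb filter, and "date" is dropped for appearing only once.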

Video, Audio, DISCO

With the resulting list of n keywords, I computed the semantic similarity of each keyword with all of the others using an external library called DISCO. This tool allows comparison of words by semantic similarity, in a manner similar to word2vec.

(K)-Means to an end

This matrix of word similarities is converted to a Euclidean distance matrix and then a k-means clustering is carried out to group the tokens into topics. As with any unsupervised clustering, the trick here was to choose a good number of clusters, neither too many nor too few. I settled on five topic clusters as a rough rule of thumb, although there are existing methods to determine the optimum number of clusters if the additional processing time is available. One extra edge case was added: if the number of keywords is less than ten, only three clusters are computed.
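
As a rough sketch of this step, the snippet below treats each keyword's row of similarity scores as a feature vector and runs a plain k-means with Euclidean distance over those rows. That reading of "converted to a Euclidean distance matrix" is an assumption, as are the toy similarity values and the deterministic seeding with the first k rows. Only the rule of thumb for the number of clusters comes from the text.

```python
import math


def choose_k(n_keywords):
    """Rule of thumb from the text: five clusters, three if fewer than ten keywords."""
    return 3 if n_keywords < 10 else 5


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids are seeded with the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy similarity matrix (made-up values), rows and columns in this order.
words = ["love", "app", "date", "web"]
similarity = [
    [1.0, 0.1, 0.8, 0.1],
    [0.1, 1.0, 0.1, 0.7],
    [0.8, 0.1, 1.0, 0.2],
    [0.1, 0.7, 0.2, 1.0],
]
labels = kmeans(similarity, k=2)  # k=2 only to keep the toy example readable
print(dict(zip(words, labels)))
```

In practice a library implementation with several random restarts would be more robust than this fixed seeding.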

Hanging on the Metaphone

 Once we have a set of topic clusters, the next step involves augmentation of these. There are two processes for augmentation, semantic and phonetic. 

The first step involves leveraging the DISCO API, and obtaining the top five most similar terms for each term in the topic cluster. Any multi-word terms returned are discarded. Another possible approach here would be to use data sources such as ConceptNet or WordNet to find synonyms and semantically related terms.

The phonetic step is where the pun mechanism is incorporated, and this involves using the Metaphone library to return the top five most similar sounding terms for each of the topic terms. I’d like to think inventor Lawrence Philips was aware of the deviant use cases for his orthographical matching technology described in his 1990 paper title which riffs on a New Wave classic made famous by Blondie.
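
To give a flavour of the phonetic step without reproducing Metaphone itself, here is a deliberately crude sketch. The crude_key function is NOT the Metaphone algorithm, just a consonant skeleton invented for illustration, and the ranking by difflib similarity is likewise an assumption rather than how the Metaphone library scores matches.

```python
from difflib import SequenceMatcher


def crude_key(word):
    """Very rough consonant skeleton; a stand-in for, not a copy of, Metaphone."""
    word = word.lower()
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiouyhw":
            continue  # drop vowels and weak consonants
        if ch != key[-1]:
            key += ch  # collapse doubled letters
    return key


def sound_alikes(term, vocabulary, top_n=5):
    """Return the top_n vocabulary words whose keys are closest to the term's key."""
    target = crude_key(term)
    scored = [
        (SequenceMatcher(None, target, crude_key(w)).ratio(), w)
        for w in vocabulary
        if w != term
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_n]]


vocabulary = ["tender", "cinder", "glove", "night", "timber"]
print(crude_key("tinder"), crude_key("tender"))
print(sound_alikes("tinder", vocabulary, top_n=2))
```

Here tinder and tender share the same key, which is the property the pun mechanism relies on; a real phonetic encoder handles far more of English spelling than this skeleton does.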

Memes to an end

As we saw above, tinder sounds like tender, cinder and possibly other more obscure terms, and these are added to our topic list. For each topic cluster pair, we try to find stock phrases, titles and memes that contain a term from each. However, there is of course the possibility that the vast majority of juxtapositions are unlikely or ungrammatical, so we employ a filtering step.

For each word pair, we query a language source to see if the words co-occur in an existing corpus sequence of length five. If this is the case, they are probably a sensible match. The original point of this step was to reduce the query load on the Freebase database, so in theory, it can be skipped if running locally, although it cuts down on possible query pairs in a neat way. The only downside is that you run the risk of missing a possible creative pairing that may be slightly less common in your language model. I used the COCA 5grams set, but the Google Books Ngrams corpus may be a wiser move, depending on computational storage space available.
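
A minimal sketch of this filter, assuming the 5-grams are available as plain strings: the three sample 5-grams below are invented for illustration and do not come from COCA, and build_cooccurrence and plausible are hypothetical helper names.

```python
def build_cooccurrence(fivegrams):
    """Index every unordered word pair that shares at least one 5-gram."""
    pairs = set()
    for gram in fivegrams:
        words = {w.lower() for w in gram.split()}
        for a in words:
            for b in words:
                if a < b:
                    pairs.add((a, b))
    return pairs


def plausible(word_a, word_b, pairs):
    """True if the two words were seen together in some 5-gram."""
    a, b = sorted((word_a.lower(), word_b.lower()))
    return (a, b) in pairs


# Invented sample 5-grams, standing in for a real n-gram corpus.
sample_fivegrams = [
    "love me tender love me",
    "the tender trap is set",
    "a glove on the hand",
]
pairs = build_cooccurrence(sample_fivegrams)
print(plausible("tender", "love", pairs))
print(plausible("glove", "tender", pairs))
```

Building the pair index once up front means each later check is a single set lookup, which matters when many candidate pairs are tested per article.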

Pair programming

Once a list of possible pairs is obtained, we query against our source corpus. The Freebase database was used in the prototype system, and textual queries returned book titles, song titles, names of musical groups and titles of film and TV programmes.

There are a number of possible matches given our two query terms, e.g. tinder and love. Pun matches are restricted to content words only, e.g. nouns or verbs, although POS taggers don’t work so well on short sentences and titles (which can actually benefit creativity).

  1. Phrase matches tinder : none
  2. Phrase matches love : Love Hurts, Love Bites, Love in the Time of Cholera….
  3. Phrase contains tinder and love : none
  4. Phrase contains Pun(tinder) : Tender is the Night, Tenderness, The Tender Trap
  5. Phrase contains Pun(love) : Hand in Glove, Maple Leaf Rag
  6. Phrase contains Pun(tinder) + love : Love Me Tender
  7. Phrase contains Pun(love) + tinder : none
  8. Phrase contains Pun(tinder) + Pun(love) : Tender Glove, Tender Leaf

Given these combinations and depending on the size of our corpus, we may obtain hundreds of plausible results. It was decided to focus only on the specific combinations (3, 6, 7, 8) in this case.

Rank and File (IO)

Even with the added restrictions, the system may return a high number of results. Accurate filtering is required to ensure the system is useable in a real application:

  1. Remove duplicate titles (These could also be incremented and used in sorting)
  2. Remove long titles (Optimum length of >=2 and <=6 was used in the experiments)
  3. Remove subtitles and bracketing
  4. Remove non-English titles (more frequent than we might think)
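
The four filters above could be sketched as follows. Several details are assumptions made for the sake of a runnable example: length is counted in words, subtitles are taken to follow a colon or a spaced hyphen, titles are cleaned before de-duplication so that bracketed variants collapse together, and the ASCII check is only a crude stand-in for real language identification.

```python
import re


def clean_title(title):
    """Strip bracketed text and anything after a subtitle marker."""
    title = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", title)
    title = re.split(r":| - ", title)[0]
    return title.strip()


def filter_titles(titles, min_words=2, max_words=6):
    seen = set()
    kept = []
    for raw in titles:
        title = clean_title(raw)
        key = title.lower()
        if key in seen:
            continue  # duplicate title
        if not (min_words <= len(title.split()) <= max_words):
            continue  # too short or too long
        if not title.isascii():
            continue  # crude non-English check (assumption)
        seen.add(key)
        kept.append(title)
    return kept


candidates = [
    "Love Me Tender",
    "Love Me Tender (1956 film)",
    "Tender is the Night: A Novel",
    "Tenderness",
    "Zärtlich ist die Nacht",
    "The Long and Winding Road That Leads to Your Door",
]
print(filter_titles(candidates))
```

The ASCII test will wrongly reject English titles containing curly apostrophes or accented loanwords, so a proper language detector would be the obvious upgrade.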

Once the filtering is done, the remaining output must be ranked.

Currently, ranking is done based on:

  1. Length : rank shorter phrases higher
  2. Edit distance between output and source phrase : ranks puns lower
  3. Semantic distance between topics and pun : Attempt to rank based on similarity with original topic
  4. Corpus frequency of original keywords : Are these commonly occurring terms?
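
Here is a minimal sketch of the first two ranking criteria only (length and edit distance); the semantic distance and corpus frequency criteria are left out because they depend on external resources. Combining the criteria as a simple lexicographic sort, with an alphabetical tie-break, is an assumption and not necessarily how Pundit weights them.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank(candidates):
    """candidates: list of (output title, source phrase) pairs.

    Shorter outputs first, then outputs closer to their source phrase.
    """
    def key(pair):
        output, source = pair
        return (len(output.split()), levenshtein(output.lower(), source.lower()), output)

    return [output for output, _ in sorted(candidates, key=key)]


candidates = [
    ("Try a Little Tinderness", "Try a Little Tenderness"),
    ("Love Me Tinder", "Love Me Tender"),
    ("Tinder is the Night", "Tender is the Night"),
]
print(rank(candidates))
```

A weighted score would allow the criteria to trade off against each other, at the cost of having weights to tune.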

The ranking is not an exact art, and several other methods could be useful, for example:

  • Sentiment analysis: Is there a discrepancy between the tone of the title and original article?
  • Additional terms occurrence: Do the other non-topic terms in the phrase occur in the original article or not?
  • Grammaticality and/or euphony: Does the phrase flow, is there an even or odd number of terms, does it rhyme (If the glove don’t fit, you must acquit)?


A trial evaluation was carried out, where the system was fed ten articles and generated a number of titles for each. These titles were ranked using the initial four criteria, and a number of candidates were presented to the user.

100 titles in total were presented for ranking, and seven non-expert users were asked to give values between one and five for:

  • Grammaticality: How does the title read
  • Relevance: Does it correspond to the article
  • Appropriateness: Could it be construed as offensive to print this?
  • Creativity: Can this title be classed as creative?

One lesson learned from this evaluation is that reading ten articles is a bit too time-intensive for willing volunteers. Future evaluations will contain less stringent reading tasks, perhaps a paragraph summary to allow users to evaluate titles. Other critiques bemoaned the lack of a baseline headline for evaluation; in this case the actual article title could be used to compare user preferences.

Sample output.

Below is a list of the original ten articles, side by side with generated titles which were preferred and disliked. Links are given to the original creative title which spawned them.

Link Ranked above average Ranked below average Original title/topic
1 Obama of the People, Other People’s Good News People, News and Views, People in the News On Twitter, a number of high profile users dominate the conversation
2 Cancer of the Country Age of Cancer, Country Blues, Number One British have lower rates of cancer, less likely to survive than Americans
3 Shellshock, Day Late, Dollar Short, Another Day, Another Dollar Two Dollar Day, The Dollar-a-year Man Greece’s weak debt-ridden, jobless future
4 Mother Misery’s Favourite Child, Father Music, Mother Dance, Pre-Paid Child Support, Some Mother’s Son Sweet Child O’ Mine, Sleeping Beauty Overture The divorce divide: How the US legal system screws poor parents
5 none Fear, Anxiety and Depression, Rip Than Thing Tech has a depression problem
6 The Peanut Butter Solution, Pay Beyond the Bill, Helping People Help People Wandering Child Come Home, The Peanut Butter Genocide A US Cafe’s peanut butter sandwich charity campaign
7 The Food Fist Way, Food Time for Change, Soul Food Song Farm Lacy’s Kitchen, When Something is Food The Norwegian women making a song and dance about farming
8 Glitch in the System, January, February Video Computer System The numbers that lead to disaster
9 Wisdom of Life, High Rise Low Life, Hi-Fi Low-Life Modern Life, Changing People, High on Life, Where Low Life Grows The downsides of being clever
10 Half Man Half Woman, World of a Woman, No Man’s Woman Wicked Woman, Foolish Man, Woman Beat Their Man, Thirteen Woman What Norway can teach the US about getting more women into boardrooms

Transferrable skills

Turns out principles of software engineering are actually pretty useful.

When you have a bunch of vaguely related linguistic resources, it makes sense to pre-compute similarities, load semantic matrices into memory and organize efficient data structures for Metaphone search.

Otherwise, the system can actually take hours to generate headlines for a single article.

Other ideas to implement in the future are an interactive web interface which allows users to manually enter URLs and/or topic lists for the system to operate on. A trace feature explaining each step of the process may help users in understanding the building blocks of the creativity process.


Of course, with any research topic, once one digs deeper a wealth of associated research is found.

Carlo Strapparava and his team at FBK Trento have been working on computational linguistic creativity for decades, automating advertising slogan creation, company name brainstorming and even the (excellently titled) EU project HahaCronym. The latter project gifted the world the popular Wordnet-Affect resource, a valuable side product for a grant whose initial goals were to create humorous acronyms a la Central Ineptitude Agency or the Fantastic Bureau of Intimidation and a veritable poster-child for the benefits of “basic research”.

The New Yorker magazine in conjunction with Microsoft is carrying out trials of AI to evaluate its famous cartoons.

A group of Finnish researchers recently developed a system that creates semantically vacuous yet plausible raps from a database of existing rap lyrics.

Deep learning can be used to train chat-bots with the sum total of knowledge from the Golden Age of Hollywood, so they can answer Big Questions with a cynical slant.

Frankly my dear, I don’t give a damn

Maybe the idea wasn’t so crazy after all.

Research Retrospective #5: Source (Language), (POS) Tags and (Secret) Codes : Investigating the implicit stylistic patterns in translated text

In which the author investigates the uncanny style of translated text using machine learning and finds, among other stylistic features, evidence of national stereotypes in literature. 

Poetry is what gets lost in translation

Robert Frost 

Reading a poem in translation…is like kissing a woman through a veil.

Anne Michaels


When reading a translation, particularly a literary one, one often becomes aware of a certain unheimlich nature to the prose, as if an unknown force is dragging down on the sentence structure and creating an eerie hint of disfluency. The old cliché speaks of meaning and feeling being lost in translation, but we can also view a translation as having imperceptibly gained a certain something. Thus, the textual style of a translation is neither that of the source language nor that of the target, but a language apart, often referred to as the third code. Translation studies scholars also use the term translationese to describe the subset of a language consisting solely of translations into that language, and computational linguists have become curious about the properties of this uncanny dialect.

A large proportion of translation-related research in computational linguistics focuses on training machines to translate. Once researchers (Kurokawa 2009) figured out that the direction of a parallel corpus could be useful for MT (a DE-EN corpus direction for DE-EN translation, for example), attention returned to translationese and translation direction detection, subjects which had attained something of an academic “cult following”, mostly among translation studies scholars.

During my doctoral work I focused heavily on the stylistic properties of translations, and one of these properties concerned the source language traces found in texts. This is especially prevalent when one is familiar with the source language in question:

e.g. “This sentence like German reads”

The task of detecting the features which illustrate this was an interesting challenge for machine learning tools. A fine-grained problem such as this one locates itself within the realm of stylistic classification, alongside questions such as native language identification, personality detection, sentiment analysis, gender detection, temporal period classification and others.


As the existing literature focused on commonly used corpora such as EUROPARL (van Halteren 2008) and large multi-lingual newspaper collections (Koppel and Ordan 2011), we decided to examine literary text, an oft-neglected genre in traditional text analytics. The first port of call for literary texts is usually Project Gutenberg, although this usually limits one to texts from a specific historical period. In order to keep all confounding factors constant, the texts were all drawn from 19th century prose translations. The source languages examined were Russian, German, French and original English, and five novels were obtained to represent each set.

In order to negate any effect of individual translatorial or authorial style, no translator or author was repeated in the corpus. This work represented the beginning of my analysis of the dichotomy of textual features. N-grams, or sequences of characters, words and parts-of-speech are a potent force in text analytics. On the other hand, a text is made up of more than discrete sequences of characters, and another set of features can capture larger deviations of style. This set of metrics includes readability scores, used to measure textual complexity, metrics such as type token ratio and the ratio of various parts of speech to total words.
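As a rough illustration of what such document-level metrics look like in practice, here is a minimal Python sketch. The function name, the regex-based tokenisation and the sentence splitting are my own simplifications for this post, not the tooling used in the study; the ARI formula is the standard one, and part-of-speech ratios are left out because they need a tagger.

```python
import re

def document_metrics(text):
    """Return a few document-level style metrics for a text (illustrative only)."""
    # Sentences: split on whitespace that follows ., ! or ? (a rough heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: runs of letters, digits or straight apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    chars = sum(len(w) for w in words)
    types = {w.lower() for w in words}
    return {
        "type_token_ratio": len(types) / len(words),
        "avg_word_length": chars / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Automated Readability Index: 4.71 * chars/word + 0.5 * words/sentence - 21.43
        "ari": 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43,
    }

sample = "The cat sat on the mat. The dog sat too."
metrics = document_metrics(sample)
print(metrics)
```

Each text chunk then becomes a short vector of numbers like these, which is a very different representation from the thousands of sparse n-gram counts.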

With the corpus assembled, random chunks of text were extracted from each work. Each textual chunk consisted of two hundred kilobytes of text, and this chunk was further divided into five equal sections. Each source language was represented by five works; these can be viewed in the paper.


Using the eighteen document metrics only, the system results were relatively low. The classification problem was a four-way affair with balanced classes, giving a chance baseline of 25%; against this, an accuracy of 67% was obtained using a Support Vector Machine classifier with ten-fold cross-validation. By comparison, using word unigram features only and doing feature selection within the cross-validation folds, the system reported accuracies of nearly 100 percent, dropping off at the top 100 word features.

Content words

The most highly ranked features in the word unigram set consisted, unsurprisingly, of content words. Words such as francs, paris, rue and monsieur characterized translations from French, with German texts talking of berlin, and characters named von <something>. The Russian translations were similarly marked with content words such as cossack and names such as Anton, Olenin and the like.

To test the robustness of the classifier, the top 200 features in a mixed set of bigrams, POS bigrams and word unigrams were selected, and all nouns were removed. Using the remaining fifty features and a Simple Logistic Regression classification algorithm, the sparse feature set managed an accuracy result of 85.5%.

A little untoward

The prepositions toward and towards were found to be discriminatory for texts translated from German, and this was where a confounding factor in the data came into play. The translations of the German texts had all been published in the US, where toward is more prevalent than towards. This particular trait is perhaps not associated with source language as strongly as other frequent word tokens.

 A contraction in terms

Contractions were found to be discriminatory. Translations from Russian had higher frequencies of the contractions it’s and that’s than those from French, which reported higher frequencies of the non-contracted forms. In a contradictory fashion, translations from Russian reported a higher frequency of both I am and I’m than the other texts, while French and German reported higher frequencies of the non-contracted forms.

Exact reasons for this behaviour remain difficult to pinpoint. On the one hand both French and German display the first person be form as two words (Ich bin and je suis) which could influence the translated text more towards the expanded form. On the other hand a Russian-speaking conference attendee who saw the work at Coling didn’t seem to think the Russian examples were related to any Russian language-specific transfer and the source of this phenomenon remains to be investigated.

The adverbial conjunction

The translations from French reported a higher frequency of the POS n-gram RB-CC, which translates as an adverb followed by a coordinating conjunction. The extract below shows occurrences of this n-gram in the French translation corpus, based on the following simplified text search:

grep "ly and" FrenchCorpus.txt

We can even see that in a number of cases the coordinating conjunction joins pairs of adverbs (RB CC RB) together, which may represent a grammatical structure more common in the original French.

… No head was raised more proudly and more radiantly …
… offer which she eagerly and gratefully accepted …
… unceremoniously and with no notice at all …
… But after this I mean to live simply and to spend nothing …
… I placed myself blindly and devotedly at your service …
… Outwardly and in the eyes of the world …
… They had parted early and she was returning home …
… as the English law protects equally and sternly the religions of the Indian people …
… vain attempts of dress to augment it, was peculiarly and purely Grecian …

(Lynch 2012)
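For anyone who would rather stay inside a script than shell out to grep, the same simplified search can be sketched in Python. The function name and the context window size are made up for illustration, and the pattern is only a crude proxy for the RB-CC tag pair: it will happily match non-adverbs such as "family and" and miss adverbs that do not end in -ly.

```python
import re

def adverb_conjunction_hits(text, window=30):
    """Yield short contexts around '<word>ly and', a rough stand-in for RB-CC."""
    for match in re.finditer(r"\b\w+ly and\b", text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield text[start:end]

sample = (
    "No head was raised more proudly and more radiantly. "
    "They had parted early and she was returning home."
)
hits = list(adverb_conjunction_hits(sample))
for hit in hits:
    print("…", hit, "…")
```

A proper version would run a POS tagger first and search the tag sequence rather than the surface spelling.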

Other source-language distinguishing frequent tokens included:

  • Russian: anyone, though, suddenly, drink (allowing one to indulge a national stereotype 😉 )
  • French: resumed, towards, thousand, (apparently related to the denomination of francs)
  • German: nodded
  • English: presently, sense (common or otherwise)

Regarding the document metrics, the following were discriminant:

  • Russian: Ratio of finite verbs (higher), ARI readability score (lower)
  • French: Ratio of nouns (higher), ARI readability score (higher), ratio of conjunctions (higher)
  • German: Ratio of prepositions (lower)

Testing the waters

The document-level trained model was tested on an unseen corpus of contemporary literary texts. The model managed only 43% accuracy, compared with the 67% cross-validation result, although this was still above the baseline of 25%. A training set comprising a larger and more diverse range of texts may result in a more robust classification model.

Final thoughts

 This study focused on detecting patterns in literary translated text indicative of source language. The study identified a number of such effects both in terms of ratios of different parts of speech combinations, frequencies of individual common words and also frequency of content words. A classifier was trained which performed well on the training set but exhibited a lower accuracy on an unseen test set of fresh literary translations.

Of course, this work is still at a relatively nascent stage. The coarse grained nature of the corpus (five texts per language only) meant that any features learned could be heavily biased towards those particular texts, as seen in the evaluation results on the hold-out unseen set. A larger set of source texts would have resulted in a more generalizable model.

An expansion on this theme currently under peer review focuses on an expanded feature set and brings syntactic parse features into play on a larger (8 language + ca. 400 text) set of contemporaneous translations which can hopefully capture a deeper sense of structural transfer from the source language.

Content words have been ignored in this study as it was believed that they tended to capture topical distinctions rather than stylistic idiosyncrasies. This distinction may be somewhat crude, as it can be difficult to separate topical norms, cultural norms and trends of linguistic transfer.

Obviously, mentions of Muscovites and cossacks might lead us towards Russian as a likely source language for a translation, but these features are not robust, as plenty of Anglophone authors (for example) may also write novels set in those regions. Likewise with the higher frequency of drink and snow. These can reflect socio-cultural norms, perhaps more frequent in texts from a particular literary tradition but do they represent true source language effects?


Perhaps applications of textual clustering metrics such as LDA will shed light on topical clusters related to culture and tradition in literary corpora.


This experiment sought to shed light on source-language specific traits in translated text. A number of these traits were found, although cross-corpus testing indicates that some may be more particular to the literary works examined than to the source language.

  • Replicability: A comparable experiment was carried out by Klaussner (2014), who examined a completely distinct set of parallel literary translations. Some differences here were the use of POS trigram features, a number of which were found to be indicative of original English. Contractions were also identified as markers of source language, together with ratios of conjunctions, average word length and type-token ratio.
  • Comparability: One interesting additional step taken by Klaussner (2014) which was also done by Baroni (2006) was to present humans with the task of classifying whether a text was translated or original. Baroni (2006) found that the machine was generally more consistently accurate than the average human, although one of the ten human evaluators (an expert in translation studies) outperformed the machine. Klaussner found that human evaluators performed the task of translation vs. original classification with ca. 70% accuracy on seventeen excerpts of translated and original text, with a Kappa score of 0.406, indicating moderate agreement.
  • Transferability: A hot topic in the machine translation and computational linguistics community currently is the idea of quality estimation, roughly “Automatically detecting how bad a machine translation is and how much it might cost a human to fix it”. Approaches similar to those used here could be used to determine how similar a translation is stylistically to a corpus of translations from the same source language, with one train of thought imagining that the more stylistically similar to the source a translation is, the more post-editing it requires.
  • Expandability: Another area of research seeking to identify textual interference features is the field of native language detection. This seeks to identify native language influences on an author’s L2, with various applications for second language learning and author profiling. An interesting experiment could compare non-native writing and translation from the same L1 to investigate similarities/differences related to transfer.

 Further reading/References:

Linguist list post about translationese

Baroni, M., & Bernardini, S. (2005). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing.

Kurokawa, D., Goutte, C., & Isabelle, P. (2009). Automatic detection of translated text and its impact on machine translation. Proceedings. MT Summit XII, The twelfth Machine Translation Summit International Association for Machine Translation hosted by the Association for Machine Translation in the Americas.

Klaussner, C., Lynch, G., & Vogel, C. (2014). Following the trail of source languages in literary translations. In Research and Development in Intelligent Systems XXXI (pp. 69-84). Springer International Publishing.

Koppel, M., & Ordan, N. (2011, June). Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1318-1326). Association for Computational Linguistics.

Lynch, G., & Vogel, C. (2012). Towards the automatic detection of the source language of a literary translation. In 24th International Conference on Computational Linguistics (p. 775).

van Halteren, H. (2008). Source language markers in EUROPARL translations. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 937-944). Association for Computational Linguistics.

Text Time Round AKA In Search of Lost Time(stamps)

After a short interlude, a tale of how the author got his diachronic text classification groove back, following from trying to date the Donation of Constantine a few years previously.

Time isn’t holding us
Time isn’t there for us

Talking Heads
Once in a Lifetime


Language is in constant flux. New words are coined at a speed which the official lexicographers can struggle to keep up with, and older ones fall out of favour gradually. Culturomics is a loosely defined mashup of data and social science which uses large databases of scanned and OCR’ed book text to investigate cultural phenomena through language usage. These large textual databases allow for hitherto impossible queries over our mass cultural heritage.

Late last year I attended a presentation by Dr. Terrence Szymanski of the Insight Centre for Data Analytics in UCD. He presented preliminary work on a very interesting problem in text analytics, namely the chronological dating of a text snippet.

This work was related to a shared task at the SemEval 2015 workshop. For the uninitiated, a shared task is a form of academic Hunger Games where different teams battle it out to obtain the best performance on a set academic challenge, normally using a shared dataset.

Philosophical questions about such methods of engaging in research aside, a shared task does have advantages, namely a constrained set of parameters and data and, to the victor, the spoils (usually only honour and glory, but sometimes prizes). This methodology was adopted by the data science challenge site Kaggle, where companies can offer challenges and datasets for real monetary reward.

My interest was piqued by the description of the challenge, having encountered the same subject in the doctoral work of a former colleague at TCD, and we set to work on a text analytics system which could tackle this particular problem.


Delving into the literature, it seemed there were three main schools of thought on the subject of diachronic text classification:

The first was likely born out of research by corpus linguists and dealt with the self-same task of dating a text chronologically, but focused on what I like to refer to as document-level metrics such as sentence length, word length, readability measures and lexical richness measures such as type-token ratio. This group looked at large traditional language corpora such as the British National Corpus or the Brown Corpus (Stajner 2011).

The second faction concerned itself with smaller, more quotidian matters, such as the temporal ordering of works (e.g. Early < Middle < Late Period) by a particular author or group of authors, rather than labelling a work with a specific year of composition.

Work by (Whissell, 1996) looked at the lyrical style of Beatles compositions and found them to be

“less pleasant, less active, and less cheerful over time”

Other works focused on ranking Shakespearean drama and Yeats plays by composition order; the common traits were the small-scale nature of the corpora and the focus on creative works in the humanities (Stamou, 2008).

The third and final group of studies had a different focus, namely the quantification of word meaning change over time. This could then be used to infer temporal period (Mihalcea 2012).

Take for example the Google Ngrams plot for the word hipster.


Three extracts below show the shift in meaning for this term over time. The word originated during the Beat Generation as a slightly derogatory term for a (generally Caucasian) scenester who made it his business to hang out with jazz musicians and imitate their sartorial style.

As the Fifties drew to a close this bohemian counterculture archetype was replaced by hippies, draft dodgers and free-love enthusiasts, but the term has enjoyed a surprising renaissance since the 2000s, as evidenced in the third extract.

Examining the word windows in each era, we might expect our 1950s hipster to collocate with jazz, music, instruments, groove etc., while the hipster of today has different bedfellows: irony, Pabst Blue Ribbon, thrift stores, fixies, etc.

  1. (1956) The Story of Jazz, p 223: The hipster, who played no instrument, fastened onto this code, enlarged it, and became more knowing than his model.
  2. (1988) Interview with Norman Mailer: Well, you would say that hipsters do this in a vacuum. I don’t. It’s just that a hipster’s notion of morality is so complex. A great many people hate hip because it poses a threat to them. (Kerouac, Beat Generation, Jazz)
  3. (2009) Time Magazine: Hipsters are the friends who sneer when you cop to liking Coldplay. They’re the people who wear t-shirts silk-screened with quotes from movies you’ve never heard of and the only ones in America who still think Pabst Blue Ribbon is a good beer.


  • Assign a date range to short news text
  • Dates range from 1700 to 2014
  • Training set contains 3100 texts, 225k words (71 words per text)


Our approach to the problem fit squarely in the first camp: given a snippet of text, can we estimate a date of production? In reality, an exact date match was not expected (although the system often came close!), and the results were evaluated with respect to temporal spans of 6 years, 12 years, 20 years and 50 years.

The text was represented using a number of main feature types:

  • Character n-grams (d_w_i)
  • POS tag n-grams (DT_NNP)
  • Word n-grams (the_man_who)
  • Syntactic and phrase structure n-grams: slices of a sentence (S -> NP VP) and also terminal nodes (N -> cat)

The syntactic n-grams contained information about semantic role (subject/object of a sentence) in addition to part-of-speech and terminal information.
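To make the word and character n-gram features concrete, here is a small Python sketch. The helper names, the "w:" and "c:" prefixes and the whitespace tokenisation are my own illustrative choices; I have also assumed that the underscore in the examples above simply joins adjacent items. POS tag and syntactic n-grams are omitted since they need a tagger and parser.

```python
from collections import Counter

def ngrams(items, n):
    """Return all contiguous n-grams of a sequence, joined with underscores."""
    return ["_".join(items[i:i + n]) for i in range(len(items) - n + 1)]

def ngram_features(text, max_n=3):
    """Count word and character n-grams up to length max_n (illustrative only)."""
    words = text.lower().split()
    chars = list(text.lower())  # spaces are kept as characters
    features = Counter()
    for n in range(1, max_n + 1):
        features.update("w:" + gram for gram in ngrams(words, n))
        features.update("c:" + gram for gram in ngrams(chars, n))
    return features

features = ngram_features("the man who knew the man")
print(features["w:the_man_who"], features["w:the_man"], features["c:t_h_e"])
```

Even a six-word snippet produces dozens of distinct features, which gives a feel for how quickly the feature space grows on a real corpus.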

Two main approaches were taken to generate features. One set of features was generated from the shared task textual corpus itself, and the other set was taken from the Google Syntactic N-Grams corpus, which is a date-tagged corpus of n-grams extracted from millions of books.

External corpus features

The first step required calculating a number of probabilities for words given a year, and then multiplying these together and obtaining the max probability, making the naïve assumption that years are uniformly distributed in the corpus.

p(“hipster”|1956) = 0.2

p(“hipster”|2009) = 0.3

p(“Pabst Blue Ribbon”|1956) = 0.0

p(“Pabst Blue Ribbon”|2009) = 0.2

The example (two-term) document contains both hipster and Pabst Blue Ribbon.

p(“hipster Pabst Blue Ribbon”|1956) ~= 0.2 * 0.0 = 0.0

p(“hipster Pabst Blue Ribbon”|2009) ~= 0.3 * 0.2 = 0.06

In reality, we used the log probability and normalized it in the range (0,1).
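The scoring idea can be sketched in a few lines of Python. Everything here is a toy: the probabilities are made-up figures in the spirit of the hipster example, the floor for unseen terms and the min-max scaling are assumptions of mine rather than the exact smoothing and normalisation we used, and the function names are hypothetical.

```python
import math

def year_log_scores(tokens, probs_by_year, floor=1e-6):
    """Sum log p(token | year) per year; zero probabilities are floored (assumption)."""
    scores = {}
    for year, probs in probs_by_year.items():
        scores[year] = sum(math.log(max(probs.get(t, 0.0), floor)) for t in tokens)
    return scores

def normalise(scores):
    """Min-max scale the scores to the range [0, 1] (one possible normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {year: 0.0 for year in scores}
    return {year: (s - lo) / (hi - lo) for year, s in scores.items()}

# Toy per-year term probabilities (illustrative figures only).
probs_by_year = {
    1956: {"hipster": 0.2, "Pabst Blue Ribbon": 0.0},
    2009: {"hipster": 0.3, "Pabst Blue Ribbon": 0.2},
}

scores = year_log_scores(["hipster", "Pabst Blue Ribbon"], probs_by_year)
best_year = max(scores, key=scores.get)
year_features = normalise(scores)
print(best_year, year_features)
```

Working in log space avoids multiplying many small numbers together, and the normalised per-year scores are the kind of values that can be handed to a downstream classifier as features.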

This feature set worked fairly well for classification, with the following cross-validation values on the training set using a Naïve Bayes classifier, and the probabilities of the words in the document.

An improvement was obtained when these probabilities were used as features for a Support Vector Machine classifier. 309 features were generated for each text: the normalized log probabilities of the text having been written in each of the years for which documents were present in the training set.

Internal corpus features

The other feature set used was generated from the corpus itself. The entire training corpus was tagged with the Stanford CoreNLP tools and a number of feature types were employed. Ngrams of length <= 3 were used for words, characters and POS tags.

The feature set generated was large, with 11,109 features in total. Reducing the feature set size using feature selection improved the classification results.

Special features

The most highly ranked features in the full feature set using the Information Gain metric were POS tags and character unigrams. These included the NN tag for common nouns, the full stop (.) and other metrics (ROOT->S) from the grammatical parses which are likely a proxy feature for sentence length.

A number of letter unigrams (i,a,e,n,s,t,l,o) were in the top 20 most discriminating features.

Reasons for the prevalence of these characters are currently difficult to articulate. The English language contains an uneasy marriage of Latinate and Germanic vocabulary and a shift in usage could manifest itself in the frequencies of certain letters changing. A change in verbal form or orthography (will -> going to, ‘d to -ed) for example, could change the ratio of character ngrams in documents.
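For readers unfamiliar with Information Gain, the ranking boils down to asking how much knowing a feature's value reduces uncertainty about the class. A minimal Python sketch follows, assuming discrete feature values (real-valued frequencies would need to be binned first); the toy period labels and feature names are invented for illustration and are not taken from the task data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group) for group in by_value.values())
    return entropy(labels) - conditional

# Toy data: four snippets from two periods (illustrative only).
labels = ["1800s", "1800s", "1900s", "1900s"]
has_messrs = [True, True, False, False]  # perfectly separates the periods
has_the = [True, True, True, True]       # tells us nothing

ig_messrs = information_gain(has_messrs, labels)
ig_the = information_gain(has_the, labels)
print(ig_messrs, ig_the)
```

A feature that splits the classes cleanly gets the full bit of gain, while one that appears everywhere gets none, which is why a high-ranking single letter is such a curious result.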


Below is a list of the fifty most distinctive word n-gram features from the feature set:

the, a, . the, in, is, on, said, it, its, and, of, of the u.s., president, government, has, american, today king, majesty, united, international, the united, the said, would, to, it is, messrs, minister, national, as letters, special, china, official, on the, public, central economic, not, more, mr, in the, million, can, was recent, prince, chinese, dollars, talks, Russian,
she, group, south, that, the king

The system identifies a number of adjectives (national, international, economic, public, central, official) which rise during the 20th century, also words related to world leaders (president, government, king, majesty, minister) and dominant countries (China, Russia, United States, American).

Various arcana such as “the said gentleman” and Messrs were on the wane as the nineteenth century drew to a close.


 And to the winner, the spoils…..

Our system trained on the full set of 11,000+ features was entered into one subtask of the shared task and it obtained the best overall result of the three competing systems. Full details on accuracy results and results in the task can be found in the papers.

Special mention must be given to the USAAR-CHRONOS team who expertly hacked the diachronic text classification task, using Google queries for the texts to assign date information based on metadata extracted from the source. Well played sirs!

Post-game analysis

Although the character ngram features performed excellently in the task, it can be difficult to interpret these results as a linguistic evolution of style. The prevalence of the period character and other sentence-specific tagging may be due to the fact that the system has identified sentence length as a discriminatory feature by proxy, although we did not measure it explicitly.

A number of competing systems took metrics such as sentence length and lexical richness measures into account and these may indeed be useful for future experimentation, in concert with existing features.

Another approach taken by a participating team was to extract epoch-specific named entities from the text and use date occurrence information in articles from an external database such as Wikipedia or DBpedia to assign date information to these texts. A processing framework to handle this could be an excellent addition to the classifier approach undertaken here and future work evaluating both approaches back-to-back would be useful.

Time after time

 When the dust settles after the battle, it is important to take stock of what has been learned.

  • A task like this is a great introduction to a research question and often necessitates getting up to speed on a particular topic in a very short space of time.
  • My Two Cents: If you have the technical know-how, it can often save time in the long run implementing your own text concordancing tools rather than relying on a mix of off-the-shelf packages roughly cobbled together.
  • Going to departmental seminars increases the chance of serendipitously sparking a fruitful collaboration.
  • Machine-optimised textual features, although useful for classification tasks, are not necessarily the most intuitive for human interpretation.
  • Character n-grams do have a certain degree of “black magic” and are not all equally useful (Sapkota 2015), although their flexibility captures syntactic (gaps between words, period frequency as proxy for sentence length), morphological and orthographical shifts (‘d -> ed, -ing) and semantics (short words)
  • More focus in future studies should be given to variables such as sentence length, type token ratio and other statistics computed on an entire text.

References and Further Reading:

F. Frontini, G. Lynch, and C. Vogel. Revisiting the ‘Donation of Constantine’. In Proceedings of AISB 2008, pages 1–9, 2008. (Earlier blog post on this work here)

C. Stamou. Stylochronometry: Stylistic Development, Sequence of Composition, and Relative Dating. In Literary and Linguistic Computing, pages 181–199, 2008.

R. Forsyth. Stylochronometry with substrings, or: A poet young and old. In Literary and Linguistic Computing, 14(4), pages 467–478, 1999.

S. Stajner and R. Mitkov. Diachronic stylistic changes in British and American varieties of 20th century written English language. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage, RANLP, 2011.

Y. Goldberg and J. Orwant. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Proceedings of *SEM 2013, pages 241–247, 2013.

R. Mihalcea and V. Nastase. Word epoch disambiguation: Finding how words change over time. In Proceedings of ACL 2012, 2012.

O. Popescu and C. Strapparava. SemEval-2015 Task 7: Diachronic text evaluation. In Proceedings of SemEval 2015, 2015.

T. Szymanski and G. Lynch. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of SemEval 2015, 2015.

U. Sapkota, S. Bethard, M. Montes, and T. Solorio. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In Proceedings of NAACL HLT, 2015.

Whissell, Cynthia. “Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon.” Computers and the Humanities 30.3, 1996: 257-265.

Research Retrospective #4: A Little Out Of Character : On Computational Analyses of Dramatic Text

In which the author reminisces about a series of coincidences along the academic road and writing a Master’s thesis on computational stylometry.
(Warning, post contains misty-eyed sentimentality and may not be to everyone’s taste)

And as the spotlights fade away,
And you’re escorted through the foyer,
You will resume your callow ways,
But I was meant for the stage.

The Decemberists
I Was Meant for the Stage
Her Majesty The Decemberists


The choice to study German instead of the default French at second-level was to a great extent accidental, as the German class was in need of extra bodies to make up the numbers that year or face extinction.
Swapping the language of Baudelaire for that of Brecht led to a multi-disciplinary undergraduate experience in computer science, linguistics and language, along with a revolutionary Erasmus experience at the LMU Munich.

The author’s choice to complete a Master’s by Research was also the road less travelled at that particular junction point, and involved reneging on a planned post-college year as an English language assistant in deepest Bavaria. The offer of an all-expenses paid stipend plus teaching hours proved too much to pass up. However, this offer was also subject to the first-choice candidate dropping out of the race at the last moment, a simple twist of fate once more.

Looking back at these early halcyon days through the telescopic rose-tinted lens of hindsight, it would prove to be a rare opportunity to have carte blanche to carry out research on any topic of interest, no matter how esoteric.

The originally submitted application for the TCD scholarship concerned linguistic judgments of grammaticality, which would have also made a fine topic of study if it hadn’t been for a chance departmental seminar on the stylistic changes in a writer with Alzheimer’s disease (Irish-born British author Iris Murdoch).
As fate would have it this topic would go on to be examined thoroughly in the field of cognitive linguistics and psychology (see Garrard 2005), and inspired in part by this glimpse of the power of stylometry, syntax was swapped for statistical text analysis and so began a long day’s journey into the scholarly twilight of authorship attribution and the so-called digital humanities.


The main research question dealt with during those two years concerned the textual stylometry of characters written by playwrights, or more succinctly:

Do playwrights create stylistically distinct characters in their works?

The inspiration for this work came from a study of character in the work of Irish poet Brendan Kennelly, at that time a Professor in English at TCD. Vogel (2007) found a number of recurrent characters in his poetry, in particular the character of Ozzie, fluent in the dialect of Dublin’s Northside:


ozzie is stonemad about prades
so he say kummon ta Belfast for the 12th
an we see de Orangemen beatin the shit outa de drums
beltin em as if dey was katliks heads

from The Book Of Judas by Brendan Kennelly, presented in Vogel (2007)


The methodology used was borrowed from the corpus linguistics literature. Relative frequencies of n-grams were compared to one another using the chi-squared test; then, for each category, within-category and outside-category similarity functions were computed using the Mann-Whitney ranks method. Thus, a textual segment was found to be more similar to either its own category (character, play, author) or everything else.
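The first half of that pipeline, the chi-squared comparison of n-gram frequencies between two segments, can be sketched as below. This is a generic corpus-comparison formulation written for this post, with hypothetical function names and word unigrams standing in for the n-grams; it is not the exact implementation from the thesis, and the Mann-Whitney ranking step that follows is not shown.

```python
from collections import Counter

def word_counts(text):
    """Count lower-cased whitespace-separated tokens (stand-in for n-grams)."""
    return Counter(text.lower().split())

def chi_squared(counts_a, counts_b):
    """Chi-squared statistic comparing two n-gram count distributions."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for gram in set(counts_a) | set(counts_b):
        observed_a, observed_b = counts_a[gram], counts_b[gram]
        pooled = observed_a + observed_b
        # Expected counts if both segments drew from the same distribution.
        expected_a = pooled * total_a / grand
        expected_b = pooled * total_b / grand
        stat += (observed_a - expected_a) ** 2 / expected_a
        stat += (observed_b - expected_b) ** 2 / expected_b
    return stat

ozzie = word_counts("de drums de drums")
standard = word_counts("the drums the drums")
same = chi_squared(ozzie, ozzie)
different = chi_squared(ozzie, standard)
print(same, different)
```

Lower values mean the two segments use their vocabulary in more similar proportions; a dialect spelling like "de" for "the" pushes the statistic up sharply, which hints at why orthography ended up mattering so much in the results.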

Once the system had been created to separate character contributions from one another, the analysis could begin in earnest.

Playing a role

Playwrights (and screenwriters) were chosen from numerous epochs including:

  • Jacobean/Elizabethan (Shakespeare, Marlowe, Jonson, Webster)
  • Victorian/Celtic Revival (Shaw, Wilde, Synge)
  • 20th Century American (Eugene O’Neill)
  • Modern Screenplays (Cameron Crowe, William Goldman)

Based on the results of the experiments, those playwrights who incorporated dialectal orthography were more likely to produce distinct characters. To sum up:

“Spelling variation to indicate dialectic variation was captured as a stylistic feature”

Characters of this nature included Swedish sea captain Chris Christopherson from Eugene O’Neill’s Anna Christie, who speaks in a strange Norwegian-English patois and is stylistically distinct among the characters of O’Neill and his contemporaries.

“Py yiminy, Ay forgat. She say she come right avay, dat’s all. Ay gat speak with Larry. Ay be right back. Ay bring you oder drink.”

from Anna Christie by Eugene O’Neill

Another O’Neill character of note is Yank from The Hairy Ape, whose Noo Yawk aphorisms are clearly marked in speech:

G’wan! Tell it to Sweeney!
Say, who d’yuh tink yuh’re bumpin’? Tink yuh own de oith?

from The Hairy Ape by Eugene O’Neill

Or, as put more eloquently by those in the literary criticism community:

It notes that characterization in O’Neill’s one-act sea plays is largely a matter of stage-dialect.

Field (1996)

Across the class divide

The character of Doolittle, the father of Eliza, from Shaw’s Pygmalion was found to be distinctive among Shaw’s characters, apparently by virtue of his addressing Higgins in the formal manner of a Cockney squire.
Shaw does not employ the post-modern dialectal orthography here but manages to convey class and dialect through the method of address.

I thank you, Governor.

“Well, the truth is, I’ve taken a sort of fancy to you, Governor; and if you want the girl, I’m not so set on having her back home again but what I might be open to an arrangement. Regarded in the light of a young woman, she’s a fine handsome girl. As a daughter she’s not worth her keep; and so I tell you straight. All I ask is my rights as a father; and you’re the last man alive to expect me to let her go for nothing; for I can see you’re one of the straight sort, Governor. Well, what’s a five pound note to you? And what’s Eliza to me?”

from Pygmalion by George Bernard Shaw

The villain of the piece

Turning to Shakespeare and his contemporaries, Ben Jonson’s character Tucca from The Poetaster displays a choice command of era-specific insults:

“sort of goslings, when they suffered so sweet a breath to perfume the bed of a stinkard:
thou hadst ill fortune, Thisbe; the Fates were infatuate, they were, punk, they were.

I am known by the name of Captain Tucca, punk; the noble Roman, punk: a gentleman, and a commander, punk. I’ll call her.

–Come hither, cockatrice: here’s one will set thee up, my sweet punk, set thee up.

Aha, stinkard! Another Orpheus, you slave, another Orpheus! an Arion riding on the back of a dolphin, rascal! Shew them, bankrupt, shew them; they have salt in them, and will brook the air, stinkard.”

from The Poetaster by Ben Jonson

Although J.M. Synge is well regarded for doing his part in perpetuating the stage “Oirish” stereotypes that have punctuated drama and film in the 20th century, his characters were not found to possess a distinctive voice; in fact, he was one of the least distinctive authors in the corpus when it came to creating character.

“Ten thousand blessings upon all that’s here, for you’ve turned me a likely gaffer in the end of all, the way I’ll go romancing through a romping lifetime from this hour to the dawning of the judgment day.”

from The Playboy of the Western World by John Millington Synge

All’s fair in love and corpus linguistics?

The main conclusions from the thesis and related work were:

  1.  In general, not all characters of the dramatists studied are created equal(ly distinct)
  2.  If they are different from the others, it’s generally due to
    a. Orthography by way of dialect (Norwegian, New York, Cockney)
    b. Use of epithets (punk, cockatrice)
    c. Rhyme scheme

Some of the features discovered related to “class distinctions” and archetypes. Shortcomings, however, included the lack of examination of stylistic features such as sentence length, lexical richness and other combinations of features from the corpus linguistics literature.

As evidenced during the literature review and post-submission corrections for the thesis, there is actually a very rich tradition of studying characterization on a textual level, going back to the late eighties with digital humanities pioneer John Burrows’ (1987) work on Jane Austen. This study was groundbreaking in that it focused not on drama, which is easily separated into characters, but on fiction, which required the painstaking separation of speech from descriptive text:

He examines the relationship of style within particular character idiolects and using the thirty most common words in each idiolect and three passages of three hundred words, carries out tests using linear regression which assign the highest correlation between the selected dialogue passages and their corresponding character idiolects, in other words, subsections of character idiolects match the rest of that characters dialogue text.

Lynch (2009, p 17)

Most frequent words were examined using the Delta statistic described in Burrows (2002), which has since become the textual metric de rigueur in the field of digital humanities.
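For readers who have not met Delta before, the following is a minimal sketch of the usual textbook formulation: take the most frequent words in a corpus, turn each text’s relative frequencies into z-scores, and average the absolute z-score differences between two texts. The function names, the toy tokenised input format and the use of the population standard deviation are assumptions of this sketch rather than details taken from Burrows (2002).

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(corpus, name_a, name_b, n_words=100):
    """Mean absolute z-score difference between two texts.

    corpus maps a text name to its (non-empty) list of tokens.
    Lower values suggest more similar styles.
    """
    # The n most frequent words across the whole corpus.
    pooled = Counter()
    for tokens in corpus.values():
        pooled.update(tokens)
    vocabulary = [word for word, _ in pooled.most_common(n_words)]

    freqs = {name: relative_frequencies(tokens, vocabulary)
             for name, tokens in corpus.items()}

    differences = []
    for i in range(len(vocabulary)):
        column = [freqs[name][i] for name in freqs]
        sigma = pstdev(column)
        if sigma == 0:
            continue  # word used at the same rate everywhere: no signal
        mu = mean(column)
        z_a = (freqs[name_a][i] - mu) / sigma
        z_b = (freqs[name_b][i] - mu) / sigma
        differences.append(abs(z_a - z_b))
    return mean(differences) if differences else 0.0
```

Calling `burrows_delta(corpus, "Mrs Alving", "Oswald")` on a dictionary of tokenised character speeches (the keys here are only an example) would give one cell of the distance matrix that clustering tools then turn into a plot.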

Recently, Rybicki (2006) looked at character in translation, visualising the characters of Henryk Sienkiewicz’s epic dramas by character type over two translations, which inspired some more work by Lynch (2009) which attempted to do the same for Henrik Ibsen in translation. The following plot displays the preservation of character style in translation by clustering using the 100 most frequent words.

Ibsen Characters

Plot of main characters in Ibsen’s Ghosts by an early version of Rybicki and Eder’s Delta plotter for R

Very recent work by Jon Reeve investigated class in Shakespearean drama, drawing distinctions between the language of kings, queens and others using computational stylometry.

Lin and Walker (2011) not only analyse the language of character in 862 film scripts and learn to categorise characters by genre, gender, director and film period with varying accuracy, but also use their learned models to generate speech for computer game characters in the style of a particular film character, controlling parameters such as hedges and the use of stammer.

Annie Hall: Alvy | Indiana Jones: Indy
I don’t know. People say Cartmill is st-strange, alright? Err… on the other hand, I don’t rush to judgment. | I don’t rush to judgment, but people say Cartmill is strange.
Right, I am not sure, would you be? I will tell something you because you br-brought me cabbage. | I will tell something you since you brought me cabbage.
Oh I am not sure. Wolf wears a hard shell. On the other hand, he is ge-ge-gentle, isn’t he? | Wolf is gentle but he wears a hard shell.
I see, I don’t know. I respect Wolf, wouldn’t you? He, however, isn’t my close friend. | Wolf isn’t my close friend. But I respect him.
Yeah, I don’t know. Sparrow conveys excitement to my life, so I am fr-fr-friends with her. | I am friends with Sparrow since she brings excitement to my life.

Generated lines in the style of Alvy from Annie Hall and Indy from Indiana Jones, from Lin and Walker (2011)


Lin, Grace I., and Marilyn A. Walker. (2011) “All the World’s a Stage: Learning Character Models from Film.” AIIDE.

Rybicki, J. (2006). Burrowing into translation: Character idiolects in Henryk Sienkiewicz’s trilogy and its two English translations. Literary and Linguistic Computing, 21(1), 91–103.

Lynch, G. (2009). Computational Stylometry and Analysis of Style: A Study of Characterization in Playwrights (Master’s dissertation, Trinity College Dublin).

Vogel, C., & Lynch, G. (2008). Computational Stylometry: Who’s in a Play?. In Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction (pp. 169-186). Springer Berlin Heidelberg.

Vogel, C., & Brisset, S. (2007). Hearing Voices in the Poetry of Brendan Kennelly. Belgian Journal Of English Language and Literature, 1-16.

Garrard, Peter, et al. (2005). “The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author.” Brain 128.2: 250-260.

Field, Brad. (1996): “Characterization in O’Neill: Self-doubt as an Aid to Art.” The Eugene O’Neill Review 126-131.

Burrows, J. (1987). Computation into criticism: A study of Jane Austen’s novels and an experiment in method. Clarendon Press.

The Style Counselor: AKA We Are How We Write

On predicting authorial demographics from writing style and the implications of same.

To write it, it took three months;
to conceive it three minutes;
to collect the data in it all my life.

Francis Scott Key Fitzgerald (1896-1940)

Given the knowledge that both our personal and public data is processed on a daily basis by multinational concerns who manage, curate and regurgitate it wholesale to digital Mad Men with designs on our wallets, we may start to become a tad wary about what and how we post online.

Facebook likes have been shown to reveal more about our personality than our family and friends can judge (Youyou et al. 2015), but the style of language and the words we use can also betray these demographics.

Language and Gender

A recent PLoS ONE paper by Schwartz et al. (2013) analysed the language used in a large number of Facebook posts and determined defining word usage frequencies for gender, personality type and age, where related topical clusters highlight the age difference between getting wasted (19-22) and drinking responsibly (23-29).

In their dataset, on average males talk more about Xbox, sports, war and taxes replete with a higher volume of expletives, while women converse about shopping, love, family, pets and friends. Pretty broad strokes across middle America, but interesting analysis all the same.

Moving a little off topic, writing style has also been found to betray gender-specific traits: Moshe Koppel’s seminal 2002 investigation into literary text from the British National Corpus established stylistic features of male and female language.

In brief, male authors use more determiners (the, a) and female authors use a higher proportion of prepositions such as for and with, coupled with a general preference towards a higher proportion of pronoun usage for female authors. Negation markers are used more frequently by women, and combinations of these features were used to label the gender of an author with an 80% accuracy rate.
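To make this kind of feature concrete, here is a small sketch that measures what proportion of a text’s words fall into each of the classes mentioned above. The word lists are short illustrative samples chosen for this example, not the feature set used by Koppel and colleagues, and the function name is hypothetical.

```python
import re
from collections import Counter

# Illustrative word lists only; a real study would use far fuller ones.
WORD_CLASSES = {
    "determiners": {"the", "a", "an"},
    "prepositions": {"for", "with", "in", "on", "of", "at", "by", "from"},
    "pronouns": {"i", "me", "my", "you", "your", "he", "him", "his",
                 "she", "her", "we", "us", "our", "they", "them",
                 "their", "it", "its"},
    "negations": {"not", "no", "never"},
}


def style_profile(text):
    """Proportion of tokens in each word class (0.0 for an empty text)."""
    # Lower-case words, optionally with one internal apostrophe (don't, I’m).
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    if not tokens:
        return {name: 0.0 for name in WORD_CLASSES}
    counts = Counter()
    for token in tokens:
        for name, words in WORD_CLASSES.items():
            if token in words:
                counts[name] += 1
        # Contracted negation such as "don't" or "wouldn’t".
        if token.endswith(("n't", "n’t")):
            counts["negations"] += 1
    return {name: counts[name] / len(tokens) for name in WORD_CLASSES}
```

Profiles like these, computed over many labelled texts, are the sort of input a classifier would learn from; the proportions on their own say nothing about any single author.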

Casual social media users should not be lured into a false sense of security, however, as systems don’t need to read your unpublished Great American Novel to predict your details. The TweetGenie system (website is in Dutch only) takes your tweets and predicts your age and gender. (This author’s demographics were predicted accurately.)

Another tool, GenderGuesser from Dr. Neal Krawetz, based on previous work by Argamon, Koppel and associates, predicts author gender from formal and informal text in a web browser (Try it here). The authors claim that it achieves 60-70% accuracy in guessing gender. It is trained on US English, meaning that European English can outwit it somewhat, although investigations using this blog as a source appear promising.

The “Write” Profile

Although links between gender and language use have been studied extensively, personality type, native language, political affiliations and other more fine-grained characteristics can also be extrapolated from our writing style in a similar fashion.

Gill et al. (2009) found that the blogs of neurotics reflected more negative sentiment and statements about themselves, while extraverts expressed more balanced emotions and referred more to events and happenings, choosing to focus on third parties more frequently.

Wong and Dras (2011) examined syntactic parses of the writing of non-native English speakers and found that an author’s native language could be classified with 80% accuracy on a corpus of 90 learner essays each from a sample of seven languages (Bulgarian, Czech, French, Russian, Spanish, Chinese, and Japanese). Their machine learning system identified language-specific parse rules such as noun phrases without determiners (indicative of Chinese native speakers) and prepositional phrases such as according to, which corresponds to a direct equivalent in Chinese and was used less frequently by authors from other language backgrounds.

Author Unknown?

These research projects demonstrate that no matter how we try to obfuscate our profiles with fake names, ages and profile pictures, our real selves can be revealed by a simple Oxford comma, self-referential tweet or more likely, a longitudinal analysis of our writing style.

This raises an interesting research question:

Could a language processing system be developed to obfuscate such variables?

An onion router for our blog posts, or a voice distortion system for our online social presence?
The human voice is frequently manipulated using sonic frequency analysis; could the same be done for our writing using word frequency analysis?

Imagine specifying a gender, personality and age setting and allowing a system to take our words and synthesise them in another style?
Such a system could also be useful for text personalisation and summarization, taking a wordy blog post and summarising it succinctly in a tweet, or dialing down the complexity of a legal policy document for a non-native speaker.

However, a darker side to the availability of such a system in the wild could be the severe ethical implications regarding child safety on the Web. A large body of research in the domain of deception detection is currently dedicated to systems that detect cyber-predators in online chat scenarios using linguistic and non-linguistic features (see Bogdanova et al. 2012).

As a coda to this cautionary tale, read on the Language Log about how J.K. Rowling’s authorship of her gender-bending pseudonymous work The Cuckoo’s Calling was unmasked by computational stylometry expert Dr Patrick Juola.

Caveat scriptor!


Moshe Koppel, Shlomo Argamon and Anat Rachel Shimoni (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.

Nature news article on Koppel’s work

TweetGenie press coverage and scientific basis

Nguyen, D., Gravel, R., Trieschnigg, D., & Meder, T. (2013). “How Old Do You Think I Am?”: A Study of Language and Age in Twitter. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media. AAAI Press.

Wong, S. M. J., & Dras, M. (2011). Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1600-1610). Association for Computational Linguistics.

Youyou, Wu, Michal Kosinski, and David Stillwell. “Computer-based personality judgments are more accurate than those made by humans.” Proceedings of the National Academy of Sciences (2015): 201418680.

Gill, Alastair J., Scott Nowson, and Jon Oberlander. “What Are They Blogging About? Personality, Topic and Motivation in Blogs.” ICWSM. 2009.

Schwartz, H. Andrew, et al. “Personality, gender, and age in the language of social media: The open-vocabulary approach.” PloS one 8.9 (2013): e73791.

Bogdanova, Dasha, Paolo Rosso, and Thamar Solorio. “Modelling fixated discourse in chats with cyberpedophiles.” Proceedings of the Workshop on Computational Approaches to Deception Detection. Association for Computational Linguistics, 2012.

Lá “Fail”-a Pádraig* – Paddy’s Day Postscript

On the challenges of spoken language identification in a live setting…

* Lá Fhéile Pádraig – St Patrick’s Day

The current President of Ireland (Uachtarán na hÉireann) is a poet, university lecturer and former senator from Galway named Michael D. Higgins.

On the Irish national holiday on March 17th, he delivered his annual address in Irish, the national and first official language of the Irish State.

This address was uploaded to YouTube, where the captions were generated by automatic speech recognition (ASR). YouTube has been providing this service for a number of years now, and although the results are not always stellar, they provide useful input for content-based video indexing, topic modelling and other natural language processing techniques where a non-word-for-word transcription is sufficient.

Of course, the YouTube subtitling ASR system used was an English language one, which resulted in subtitles such as:

s a Lolla pottery longer talk to me to Kayla

which should have been:

“Is é Lá ‘le Pádraig an lá go dtagaimid le chéile…”
St Patrick’s Day is the day that we come together.

And the rather more embarrassing:

cock merely she can’t play in his scaly

which actually reads:

“…imirceach míle sé chéad bliain ó shin…”
Immigrants one thousand six hundred years ago

More fun mis-transcriptions here, eerily reminiscent of a recent controversy in the Northern Irish Assembly.

Spoken language recognition is a trickier beast than its written-language counterpart. The technology exists, and is no doubt used for more nefarious purposes than this one, but whether it belongs in the pipeline for YouTube uploads may not be a question for today’s post.

As far as I am aware, Google does not currently deploy an Irish-language ASR system at all, although their Google Translate offering for Irish is not bad (at least in the Irish-English direction), if maybe not fit for official purposes. Official documentation of the Irish state, and also a subset of EU documentation, must by law be translated into Irish.

Furthermore, there does not appear to be any great interest in Irish-language speech recognition in academic circles either, as alluded to by Judge et al. (2012), who note:

“In the area of automated speech recognition there has been no development to date for Irish, however many of the resources which have been developed for synthesis are also crucial for speech recognition and to this extent the foundations for this aspect of technological development are being laid”

As the language is currently spoken on a daily basis by only an ever-dwindling number of people, the demand for such a system could be rather low. Although there are plenty of Irish language recordings in the wild which could be used in its creation, the differences in dialect and pronunciation between the regions where Irish is still spoken could make training an ASR system difficult. However, digitisation and the creation of such systems could be a very valid step in the preservation of the language. The Phonetics and Speech Laboratory at TCD also has extensive experience in Irish language technology systems, although synthesis is the prevailing modality here also.

So cad a cheapann tú (what do you reckon) Google? Maybe by next Paddy’s Day, you could train up a small Irish-language speech recognition system and give Michael D. the subtitling he deserves!

Either that or upload a manual translation in advance, but sure where’s the fun in that.

Is maith an scéalaí an aimsir.
(lit) Time is a good storyteller.
Time will tell.


John Judge, Ailbhe Ní Chasaide, Rose Ní Dhubhda, Kevin P. Scannell, and Elaine Uí Dhonnchadha. An Ghaeilge sa Ré Dhigiteach – The Irish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Springer, Heidelberg, New York, Dordrecht, London, September 2012. Georg Rehm and Hans Uszkoreit (series editors)

Sorry for your Troubles: Hiberno-English and a history of euphemism.

In which the author gives out* about the Hiberno-English tendency towards euphemism

*tabhairt amach
(lit) to give out
to complain about something

Níl aon tinteán mar do thinteán féin.
(lit) There is no fireside like your own fireside
There’s no place like home
Irish proverb

A topical post for the day that’s in it (An lá atá inniu ann – given the special occasion of this day, being St Patrick’s Day).

I was reminded recently about the Irish skill at avoiding the more delicate subjects of conversation, despite being the supposed keepers of the gift of the gab.

Take two of the more turbulent periods in 20th century Irish (and world) history, the Second World War and the conflict in Northern Ireland.

In Hiberno-English and historically in Ireland, these are referred to as The Emergency and The Troubles respectively. The latter in particular sounds to me like a very understated description of a 30-year-long armed conflict claiming over three thousand lives, with the former evoking a much less serious event than the deadliest conflict in the history of humanity.

Of course, Ireland’s neutrality and the ever-present loss in translation between the Irish language and English may have influenced the nomenclature in the case of The Emergency at least, although other neutral countries did not skirt the issue in their native language. Neutral Switzerland referred to the period as Grenzbesetzung (Border Occupation) 1939–45, although their closer proximity to matters at hand may have played a role in this description.

Jennifer O’Connell writes in the Irish Times (paywall) about the supposed Irish humility and love of the word sorry, remarking that:

That’s because when an American – or an English person or an Australian – says “sorry”, they usually mean they are sorry. Out of the mouth of an Irish person, ‘sorry’ can mean anything from “get out of my way” to “I didn’t hear you” to “I’m sad for you”. It is rarely used to denote an actual apology, probably because we are seldom wrong.

At the same time, the long-windedness of Irish English can appear deferential, as can its habit of leading in the negative, as in:

You wouldn’t fancy a pint now, would ya?

You wouldn’t be wanting to head out out, would ya?

The Irish language’s lack of words for yes and no gave Hiberno-English its delightful ’tis as an affirmation, and a conspicuous absence of the word yes, allowing Irish politicians to skirt delicate issues with ease.

Another unfriendly Hiberno-English euphemism that has arisen in recent years is the teeth-gritting epithet non-national, a softened word for a foreigner who has probably been resident in the Auld Sod for more years than he or she cares to remember and could even hold citizenship of the Emerald Isle but happens to have a place of birth which is not the Land of Saints and Scholars.

Indeed, the term has been ridiculed in both traditional and online media, with a possible interpretation of denoting a form of statelessness akin to Tom Hanks’ character in The Terminal, which of course was not the intended meaning, being an abbreviation of the Irish Immigration Bureau’s officialese term non-EEA-national.

At risk of sounding like a typical Irish begrudger, I’d like to say my piece by reminding readers that, despite its fondness for euphemism and other foibles, Hiberno-English has given the world the judged-Nobel-worthy poetry and prose of Yeats, Heaney, Shaw and Beckett, the equally-Nobel-worthy-IMHO writings of Joyce, Synge and Swift, the words smithereens and phoney and, through the Anglicisation of my own Gaelic family name O’Loinsigh, a handy term for mob murder found in many languages worldwide!

Happy Paddy’s Day everyone!