Research Retrospective #3 : The (not so) Constant Garnett

In which the author wades headlong into a centuries-old battle for the soul of Russian literature and realises that puns based on a certain John le Carré novel are starting to wear thin, and probably not helping with SEO…

Traduttore, traditore
Translator, traitor
Italian proverb

El original es infiel a la traducción
The original is unfaithful to the translation
Jorge Luis Borges (1899-1986)

During the summer of 2014, while packing to travel to the ACL conference in Baltimore, MD, I happened upon a 2005 article in the New Yorker by David Remnick entitled The Translation Wars, which I promptly downloaded to my Kindle to read on the transatlantic flight.

In this long and fascinating treatise on translation, obsession and literature, the author investigates the work and lives of several English-language translators of Russian greats such as Fyodor Dostoyevsky, Leo Tolstoy and Anton Chekhov during the nineteenth and twentieth centuries.

The aspect of the article which piqued my curiosity was the rather harsh treatment of one of the first English-language translators of Russian literature, Mrs. Constance Garnett.

The unfortunate Mrs. Garnett was a Victorian-era librarian by training, whose love affair with the Russian masters was likely sparked by a flesh-and-blood love affair with a dashing Russian revolutionary exile whom she met in London at the turn of the 20th century. She translated over 70 works of Russian literature into English during her lifetime, working mostly alone in a stone cottage in Kent, translating the giants of Russian literature by candlelight, her solitary work punctuated by the occasional trip to Russia for inspiration.

I had encountered Garnett’s work in a previous study on source-language influence in translation, but had not investigated her canon in any in-depth way prior to then.
In The Translation Wars, Remnick describes the reaction of prominent Russian authorities of the day to Garnett's translations. Vladimir Nabokov of Lolita fame took issue with Joseph Conrad's assessment of her work, as described by Remnick (2005):

On the blank left-hand page, Nabokov has written a quotation from Conrad, who told Garnett’s husband, Edward, “Remember me affectionately to your wife, whose translation of Karenina is splendid. Of the thing itself I think but little, so that her merit shines with greater lustre.” Angrily, Nabokov scrawls, “I shall never forgive Conrad this crack”—he ranks Tolstoy at the top of all Russian prose writers and “Anna” as his masterpiece—and pronounces Garnett’s translation “a complete disaster.”

Fellow Russian émigré, academic, author and Nobel Laureate Joseph Brodsky was perhaps less cruel, but was heard to remark:

“The reason English-speaking readers can barely tell the difference between Tolstoy and Dostoevsky is that they aren’t reading the prose of either one. They’re reading Constance Garnett.”

In her defence, Remnick mentions that Garnett worked at almost breakneck speed, with little time for fine-tuning and revisions, simply skipping sections which she did not understand, and almost going blind during the long process of translating War and Peace.

Despite the shortcomings raised by Nabokov and others, it remained clear that Garnett's contribution to the awareness of Russian literature in the Anglosphere was tremendous, and so I decided to attempt to vindicate the lady who had given body and soul to the cause.

Research question

Focusing in particular on the quotation from Brodsky, I imagined a means by which computational stylometry could be leveraged to show that, interpretive questions aside, Garnett rendered the text of each author she translated in a unique manner, preserving a distinctive style for each within her own parameters.

As Garnett's translations were among the first and oldest English translations of certain Russian authors, it was no surprise that many could be found on Project Gutenberg and other online sources. I managed to assemble a list of works by three authors: Chekhov, Turgenev and Dostoyevsky.

Stories by Anton Chekhov
The Bishop & Other Stories
The Cook’s Wedding
The Chorus Girl
The Darling
The Duel
The Horse-Stealers
The Schoolmaster
The Party
The Wife
The Witch
Love & Other Stories

Stories by Fyodor Dostoyevsky
A Raw Youth
The Brothers Karamazov
Crime & Punishment
The Insulted and The Injured
The Possessed
White Nights
Five Stories

Stories by Ivan Turgenev
A House of Gentlefolk
Fathers & Children
On The Eve
Rudin
Smoke
The Torrents of Spring
The Jew

Once these texts had been assembled, they were split into equally sized chunks of 10 kilobytes each, resulting in 942 chunks for the whole corpus. The features calculated were word and POS n-grams along with eighteen document-level features: descriptive statistics for a passage of text, including average sentence length, type-token ratio and other such measures.
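A minimal sketch of the chunking step (this is not the original scripts; `chunk_text` and the toy corpus are illustrative stand-ins):

```python
# Split plain texts into fixed-size 10 KB chunks, as described above.
CHUNK_SIZE = 10 * 1024  # 10 kilobytes

def chunk_text(text, size=CHUNK_SIZE):
    """Split a text into consecutive chunks of `size` bytes,
    dropping a final undersized remainder."""
    data = text.encode("utf-8")
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [c.decode("utf-8", errors="ignore") for c in chunks if len(c) == size]

corpus = {"chekhov": "word " * 30000}  # toy stand-in for a collection of texts
segments = [(author, chunk) for author, text in corpus.items()
            for chunk in chunk_text(text)]
print(len(segments))  # prints 14 for this toy text
```

Each (author, chunk) pair then becomes one labelled instance for the classifier.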


Taking each author as a category, the question was:

Do the textual segments pertaining to each author cluster together?

Using ten-fold cross-validation and an SVM classifier with only the document-level features, there was already strong clustering by author, with 87% accuracy reported in cross-validation. Adding word and POS features into the mix, the final classifier reported 94% accuracy for authorial category.

To prevent proper nouns, such as character names and locations from each novel or author, from facilitating classification by clustering together, all noun features were removed from the word feature set.
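The noun-removal step might look something like this; the Penn Treebank tags and the `drop_nouns` helper are my own illustrative assumptions, not the original code:

```python
# Strip noun features before building the word feature set,
# assuming tokens arrive as (token, POS tag) pairs from a tagger.
def drop_nouns(tagged_tokens):
    """Keep only tokens whose tag is not a noun (NN, NNS, NNP, NNPS)."""
    return [tok for tok, tag in tagged_tokens if not tag.startswith("NN")]

tagged = [("Raskolnikov", "NNP"), ("walked", "VBD"), ("slowly", "RB"),
          ("towards", "IN"), ("the", "DT"), ("bridge", "NN")]
print(drop_nouns(tagged))  # ['walked', 'slowly', 'towards', 'the']
```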

Examining the feature sets ranked by Information Gain, the most distinctive features proved to be common verbs, adverbs and adjectives, ratios of certain POS types (pronouns, nouns) and average word and sentence length ratios.
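For the curious, Information Gain for a binary feature reduces to a small calculation (the study used a standard toolkit; this pure-Python version is only to show the idea):

```python
# Information Gain of a present/absent feature with respect to
# author labels: H(labels) minus the entropy remaining after the split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(has_feature, labels):
    """H(labels) - H(labels | feature present/absent)."""
    n = len(labels)
    total = entropy(labels)
    for value in (True, False):
        subset = [lab for h, lab in zip(has_feature, labels) if h == value]
        if subset:
            total -= len(subset) / n * entropy(subset)
    return total

labels = ["chekhov", "chekhov", "turgenev", "turgenev"]
perfect = [True, True, False, False]   # splits the authors exactly
useless = [True, False, True, False]   # tells us nothing
print(info_gain(perfect, labels), info_gain(useless, labels))  # 1.0 0.0
```

Ranking features by this quantity surfaces those that best separate the authorial categories.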

Common verbs had been examined in several previous studies in translation stylometry and found to be distinctive among parallel translations of the same text, but in this study, they were found to discriminate between translations of different authors by the same translator.

Reaction and analysis

On presenting this work at the 25th International Conference on Computational Linguistics (COLING) in Dublin later that year, the first question was generally the following:

Interesting work, but have you considered the source text?

Indeed, a parallel clustering of the original Russian sources of the translations would no doubt throw up some interesting questions as to whether the translator preserved source-text trends in translation or created a parallel stylistic projection of their own making. Unfortunately, due to my own lack of proficiency in Russian and the vagaries of the World Wide Web, I found it rather difficult to locate digital copies of the source texts of the works examined in my study; I would, however, be very happy to collaborate on a further study with a Russian speaker who wished to assist in this manner.

Another very valid point was that this work was only carried out on one translator and a handful of authors, and this is indeed crucial. Unfortunately, it is also rather difficult to obtain parallel translations of the same source online, let alone find translators who have translated multiple authors and also from multiple languages, but future studies will seek to investigate such norms across authors, source languages and perhaps even target languages.

Further work would also seek to leverage more interesting features such as syntactic parses, used by Lucic and Blake (2011) in their work on translations of Rainer Maria Rilke. Other interesting studies on the topic of translatorial vs. authorial style include work by Forsyth and Lam (2013), who investigate parallel translations of the Van Gogh brothers' correspondence for authorial and translatorial style (evidence of both is found, with authorial style the more prominent), and Rybicki (2012), who fails to find consistent stylometric markers of translators across translations of different authors in various languages, though Rybicki and Heydel (2013) find elements of stylistic variation when a translator changes mid-translation.


Lynch, G. (2014). A supervised learning approach towards profiling the preservation of authorial style in literary translations. In Proceedings of the 25th COLING, 376-386.

Remnick, D. (2005). The translation wars. The New Yorker, 7, 98-109.

Rybicki, J., & Heydel, M. (2013). The stylistics and stylometry of collaborative translation: Woolf's Night and Day in Polish. Literary and Linguistic Computing, 28(4), 708-717.

Rybicki, J. (2012). The great mystery of the (almost) invisible translator. In Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research, 231.

Lucic, A., & Blake, C. (2011). Comparing the similarities and differences between two translations. In Digital Humanities 2011, 174. ALLC.

Forsyth, R. S., & Lam, P. W. Y. (2013). Found in translation: To what extent is authorial discriminability preserved by translators? Literary and Linguistic Computing.

Article about Constance Garnett’s life and work:

Moser, C. A. (1988). Translation: The achievement of Constance Garnett. The American Scholar, 431-438.

Research Retrospective #2: The Constant(ine) Stylometer

Where the author collaborates on some classical computational sleuthing and thereafter earns a jaunt to the Granite City.

During my MSc research, I was fortunate enough to be able to collaborate on a range of interesting side-projects, thanks to the generosity and broad-mindedness of my supervisor.

One such project was born out of a chance research meeting in the Italian university city of Pavia (alma mater of the great Alessandro Volta and a handful of Nobel Laureates) between my then supervisor, Prof. Carl Vogel of TCD, and Dr. Francesca Frontini, at the time a PhD student in Italian corpus linguistics at the University of Pavia.

So, as my own slightly hazy memory of the story goes, supervisor and student were strolling the grounds of the University of Pavia (as with any good classical European university established in the 14th century, these are interspersed around the city itself) when they happened upon a statue of the Renaissance humanist Lorenzo Valla.

According to legend, the student then explained the noteworthiness of this prestigious alumnus of the university and his contribution to scholarship: his analysis of the Donation of Constantine in 1440 and his pronouncement of the text as a forgery, among the first to do so, a verdict since accepted as the truth of the matter.

Donation of Constantine: A Primer

The Donation of Constantine was a historical Latin text which was purported to have been written by the Roman Emperor Constantine himself, transferring the authority of Rome and the Roman Empire to the Pope.

The document was namechecked by Dante Alighieri himself in the Divina Commedia; its authenticity had been doubted by many since it was unearthed, but it was not until Valla's definitive analysis of 1440 that its status as a forgery was accepted.

Valla's analysis consisted of a stylistic analysis of the language used in the document, claiming that it was more similar to texts from the 8th century than to its purported 4th-century origins, but it also comprised a content-based approach, decrying logical fallacies in the interpretation of the text itself.

Being an eager stylistics scholar always on the lookout for a new project, my supervisor’s interest was piqued by this tale of forgery and fallacy and resolved to bring the technology of the 21st century to bear on this ancient text.

Revisiting the Donation

With the help of some scripts written by myself during my ongoing analysis of character idiolects in drama for my MSc, and with the aforementioned Dr Frontini contributing the lion's share of the analysis and the expertise in Latin scholarship, he set about creating an experiment to compare the Donatio across temporal corpora of Latin text, in order to investigate whether the stylistic idiosyncrasies pointed out by Valla could be uncovered using computational means.

The study consisted of two parts: an authorship component which sought to compare the Donation with almost 300 Latin texts from different authors, some of them anonymous, and a second, diachronic step which sought to place the text in a fitting historical period. Here I have focused on the authorship step for the sake of brevity, but the curious should consult the paper for a more thorough insight into both parts.

One possibly interesting aspect of the study is the feature set used for stylistic matching: in this case, letter bigrams were used as a means of capturing stylistic variation in Latin. Character unigrams have been used in a number of stylometric projects of mine and can often produce interesting, albeit difficult-to-interpret, results (a more recent study on diachronic text dating confirmed this, to be discussed in a later post).
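A back-of-envelope sketch of the letter-bigram idea: build a relative-frequency profile per text and compare profiles, here with cosine similarity (the actual distance measure used in the paper may well differ):

```python
# Letter-bigram profiles as a crude stylistic fingerprint for Latin text.
from collections import Counter
from math import sqrt

def bigram_profile(text):
    """Relative frequencies of character bigrams over letters and spaces."""
    text = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm

a = bigram_profile("quo usque tandem abutere Catilina patientia nostra")
b = bigram_profile("quam diu etiam furor iste tuus nos eludet")
print(round(cosine(a, a), 2))  # a profile is identical to itself: 1.0
```

Matching a disputed text against temporal corpora then amounts to asking which period's profiles it sits closest to.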

The temporal periods used were as follows:

archaic age: early Latin text until 100 B.C.

classical age: 100 B.C. – 250/300 A.D.

late imperial Latin: 300 – 600

early middle ages: 600 – 1000

high middle ages: 1000 – 1400

humanists: 1400-1650

modern and contemporary Latin: 1700 – today

More details on the corpus are elegantly described in the paper (Frontini et al., 2008), which was published at the AISB'08 Symposium on Style in Text at the AISB 2008 convention in Aberdeen, Scotland.

Based on the first experiment, the Donation was found to match most closely the writings of Ammianus, a 4th-century Roman historian writing around the time when the Donation was purported to have been written. As we were careful to state in the paper, this does not degrade Valla's claim one bit, but it poses an interesting question as to a possible source for the document itself, given that the temporal period matches the claimed one.

However, there is a twist in the tale:

From Frontini et al. (2008):

It is worth pursuing a possibility that Ammianus provided the source language input that shaped the forger’s concept of fourth-century Latin. There is however, substantial reason to doubt such direct influence. Although Ammianus was Greek and his native language was Greek, he composed History in Latin, as the work was intended for Roman readers. The work consisted of 31 books and earned the author a considerable reputation in his day. It maintained at least some of its popularity until the 6th century, but then fell into neglect and is not mentioned during the Middle Ages. His work, given scholarship methods of the time, would not have been natural for an 8th century forger to stumble upon…

This analysis highlights the importance of collaborative scholarship between humanists and those of a more computational leaning; without this domain knowledge, the study could have made a number of claims which would ultimately ring false.

TL;DR: Humanities scholars are very nice folks, full of knowledge and interesting ideas; go talk to them!

Lessons learned

Taken as a whole, the study is an interesting application of stylometric research, but also a cautionary tale against reliance on technology alone in authorship and stylometric studies. Common sense should prevail in the end: we must accept the shortcomings of our tools and use our own judgement in tandem.

The paper's future work proposed investigating the corpus further using an expanded feature set of letter trigrams, word unigrams and bigrams. However, as with many best-laid plans of an academic nature, to the best of my memory this analysis never came to fruition.

In the interim, it is heartening to see that research marrying quantitative textual analysis and classical languages has rumbled along; for an example, see (Remissong, 2011).

Recently I was pleasantly surprised on Twitter by the release of a massive (9 GB) Latin corpus based on work by Bamman and Smith (2012), part of their "Google Ngrams for Latin" Perseus project built on Latin text from the Internet Archive.

As for my own work, I recently applied a character n-gram feature set (N = 1, 2, 3) as part of an ensemble classifier for a diachronic text classification task at SemEval 2015, which ended up performing quite well at the task at hand (detecting the era of composition of a written news text), although the results remain perplexing in terms of the why of the matter.

And finally, what of the aforementioned Granite City? Well first impressions were cold and dark and not particularly friendly, but the Indian food was good, the university was venerable and plenty of new academic friendships were born over an ale or three!


Frontini, F., Lynch, G., & Vogel, C. (2008). Revisiting the Donation of Constantine. In AISB 2008 Convention: Communication, Interaction and Social Intelligence (Vol. 1, p. 1).

Remissong, A. J. (2011). Dulce et Utile: On the Use of Quantitative Textual Analysis in Latin Literary Analysis (Doctoral dissertation, Emory University).

Bamman, D., & Smith, D. (2012). Extracting two thousand years of Latin from a million book library. Journal on Computing and Cultural Heritage (JOCCH), 5(1), 2.

Interesting resources

Looking at Latin with traditional corpus analysis tools

Latin corpus from Bamman with details:

Research Retrospective #1 : Parsing the Play (‘s the Thing)…

In the first of a series of retrospective posts, I take a look at past publications and cast a critical eye over them in light of new developments and results in the field. 

“I wish that I knew what I know now, when I was younger”

Ooh La La, The Faces (1973)

In December 2007, I presented my first ever academic publication (Lynch, 2008) at the Specialist Group on Artificial Intelligence conference of the British Computer Society, an annual conference held in the majestic surroundings of Peterhouse College, Cambridge.

As a budding academic and M.Res student, I was spellbound by the Hogwarts-esque location and the conference dinner, held by candlelight in the Great Dining Hall of Peterhouse, one of the more traditional of the Cambridge colleges, established in 1284 and a very fitting venue for studies marrying literature and technology, although I recall thinking that my work was not exactly fitting for the traditional AI crowd (I had yet to discover the field of digital humanities).

The after-dinner speaker for the evening was an elderly computer scientist who had worked at Bletchley Park with Alan Turing himself, and listening to this elder statesman pontificate about the travails of the discipline of artificial intelligence and the spotted history of computer science funding in the 20th century was rather inspiring. I recall feeling at the time that I was part of a great tradition stretching back to Turing and his forebears.

The paper in question was a by-product of my Master’s research which was concerned with investigating the stylistic divergence between character idiolects created by playwrights. Although the medium of drama is indeed ideal for character studies such as this, it became clear that separating a play into constituent character contributions was not necessarily as trivial as it initially seemed.

Our approach consisted of implementing a two-pass parser which operated on an ASCII text file. The first pass attempted to identify character contribution markers, which marked the speech of a character and were usually sentence-initial, of indeterminate length and possibly terminated by punctuation.

From The Two Gentlemen of Verona by William Shakespeare:


VALENTINE.
And why not death rather than living torment?
To die is to be banish’d from myself;
And Silvia is myself: banish’d from her
Is self from self: a deadly banishment!
What light is light, if Silvia be not seen?
What joy is joy, if Silvia be not by?
Unless it be to think that she is by
And feed upon the shadow of perfection
Except I be by Silvia in the night,
There is no music in the nightingale;
Unless I look on Silvia in the day,
There is no day for me to look upon;
She is my essence, and I leave to be,
If I be not by her fair influence
Foster’d, illumined, cherish’d, kept alive.
I fly not death, to fly his deadly doom:
Tarry I here, I but attend on death:
But, fly I hence, I fly away from life.


After the first pass, which collected a possible list of characters, a form of virtual dramatis personae, the user was then asked to filter out erroneous entries in the list, often corresponding to misspellings from OCR or transcription, stage directions, or otherwise non-character textual functions. Later iterations of the system displayed the frequency of each character string beside the string, lest the user be unsure of the provenance of an entry. Once a valid list of characters had been collected, the system performed its second pass through the data, segmenting it into separate files named after the characters.
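A hedged reconstruction of the two-pass idea in miniature (the original system was considerably more elaborate, and the regex here is my own assumption about marker form):

```python
# Pass one harvests candidate speaker markers with a regex;
# pass two segments speech under each confirmed character name.
import re
from collections import defaultdict

# Assumed marker form: a line of capitals ending in a full stop, e.g. "VALENTINE."
MARKER = re.compile(r"^([A-Z][A-Z ]+)\.\s*$", re.MULTILINE)

def find_candidates(play_text):
    """Pass 1: collect candidate character markers with their frequencies."""
    counts = defaultdict(int)
    for m in MARKER.finditer(play_text):
        counts[m.group(1).strip()] += 1
    return dict(counts)

def segment(play_text, characters):
    """Pass 2: gather each confirmed character's lines of speech."""
    speeches = defaultdict(list)
    current = None
    for line in play_text.splitlines():
        m = MARKER.match(line)
        if m and m.group(1).strip() in characters:
            current = m.group(1).strip()
        elif current:
            speeches[current].append(line)
    return speeches

play = ("VALENTINE.\nAnd why not death rather than living torment?\n"
        "PROTEUS.\nNo more; unless the next word that thou speak'st\n")
print(sorted(find_candidates(play)))  # ['PROTEUS', 'VALENTINE']
```

The user-filtering step described above sits between the two passes, pruning the candidate dictionary before `segment` is run.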

Issues with the approach included the inclusion or exclusion of stage directions and other elaboration, and of course the difficulty of ensuring that detected character introductions were actually valid characters (which requires knowledge of the play in question).

A further iteration of the system was planned, and an early systems draft was submitted to the Digital Humanities Conference in 2010; this version captured data on the interactions between characters and created a rough visualisation of the contributions in the form of a Gantt chart. This work was unfortunately rejected, and due to time constraints and shifting priorities the project was ultimately abandoned around that time.

The system itself was used extensively for the experiments which made their way into my M.Sc. thesis, but has lain dormant since 2009 or so.

Meanwhile, related work on social networks in literature and drama has gained traction (Gil et al., 2011); a particularly fine example is Elson et al. (2010), which focuses on literary fiction itself, a more onerous task indeed, but one which they tackle with suitable aplomb.

Very recent work presented at the 2014 LaTeCH workshop at EACL (Agarwal et al., 2014) has reignited interest in the extraction of information from dramatic texts; in this case the researchers focused on social network structure in screenplays using more complex machine learning and natural language processing techniques, remarking that:

While there is motivation in the literature to parse screenplays, none of the aforementioned work addresses the task formally.

and also:

While researchers have previously motivated the need for parsing movie screenplays, to the best of our knowledge, there is no work that has presented an evaluation for the task. Moreover, all the approaches in the literature thus far have been regular expression based.

To misquote the Bard:

All’s fair in love and academe…


Lynch, Gerard, and Carl Vogel. “Automatic character assignation.” Research and Development in Intelligent Systems XXIV. Springer London, 2008. 335-348.

Elson, David K., Nicholas Dames, and Kathleen R. McKeown. “Extracting social networks from literary fiction.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.

Gil, Sebastian, Laney Kuenzel, and Caroline Suen. Extraction and analysis of character interaction networks from plays and movies. Technical report, Stanford University, 2011.

Agarwal, Apoorv, et al. “Parsing screenplays for extracting social networks from movies.” EACL 2014 (2014): 50-58.

Style over substance

This blog is dedicated to computational stylistics and stylometry research, which is a subfield of computational linguistics, born out of research and curiosity manifested in authorship attribution studies spanning centuries.

Other topics may include research in digital humanities, which is often concerned with the stylistic analysis of textual data from literary works in order to uncover hitherto-unknown patterns of style.

The title is a riff on the 2006 Oscar-winning German film Das Leben der Anderen (The Lives of Others), which incidentally deals with the science of stylistics as a major plot point. Stay tuned for a post about computational stylometry in popular culture.