Research Retrospective #2: The Constant(ine) Stylometer

Where the author collaborates on some classical computational sleuthing and thereafter earns a jaunt to the Granite City.

During my MSc research, I was fortunate enough to be able to collaborate on a range of interesting side-projects, thanks to the generosity and broad-mindedness of my supervisor.

One such project was born of a chance research meeting in the Italian university city of Pavia (alma mater of the great Alessandro Volta and a handful of Nobel laureates) between my then supervisor, Prof. Carl Vogel of TCD, and Dr. Francesca Frontini, then a PhD student in Italian corpus linguistics at the University of Pavia.

As my own slightly hazy memory of the story goes, supervisor and student were strolling the grounds of the University of Pavia (as with any good classical European university established in the 14th century, these are scattered around the city itself) when they happened upon a statue of the Renaissance humanist Lorenzo Valla.

According to legend, the student then explained the significance of this prestigious alumnus of the university and his contribution to scholarship: his 1440 analysis of the Donation of Constantine, in which he pronounced the text a forgery. He was among the first to do so, and his verdict has since become the accepted truth of the matter.

Donation of Constantine: A Primer

The Donation of Constantine is a Latin text purportedly written by the Roman Emperor Constantine himself, transferring the authority of Rome and the Roman Empire to the Pope.

Namechecked by Dante Alighieri himself in his Divina Commedia, the document had its authenticity doubted by many from the time it was unearthed, but it was not until Valla’s definitive analysis in 1440 that its status as a forgery was accepted.

Valla’s analysis included a stylistic examination of the language used in the document, claiming that it was more similar to texts from the 8th century than to those of its purported 4th century origins. It also took a content-based approach, decrying logical fallacies in the interpretation of the text itself.

Being an eager stylistics scholar always on the lookout for a new project, my supervisor had his interest piqued by this tale of forgery and fallacy, and he resolved to bring the technology of the 21st century to bear on this ancient text.

Revisiting the Donation

With the help of some scripts I had written during my ongoing MSc analysis of character idiolects in drama, and of the aforementioned Dr Frontini, who contributed the lion’s share of the analysis and the expertise in Latin scholarship, he set about creating an experiment to compare the Donation against temporal corpora of Latin text, in order to investigate whether the stylistic idiosyncrasies pointed out by Valla could be uncovered by computational means.

The study consisted of two parts: an authorship component, which compared the Donation with almost 300 Latin texts from different authors (some of them anonymous), and a diachronic component, which sought to place the text in a fitting historical period. Here I focus on the authorship step for the sake of brevity, but the curious should consult the paper for a more thorough insight into both parts.

One possibly interesting aspect of the study is the choice of features for stylistic matching: letter bigrams were used as a means of capturing stylistic variation in Latin. Character unigrams have featured in a number of my stylometric projects and can often produce interesting, albeit difficult to interpret, results (a more recent study on diachronic text dating confirmed this, to be discussed in a later post).
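For readers who have not met letter bigrams before, here is a minimal sketch of the general idea: turn each text into a profile of relative letter-pair frequencies, then rank candidate texts by how similar their profiles are to that of the disputed document. To be clear about what is assumed: the function names, the toy corpus and the choice of cosine similarity are mine, picked for illustration. This is not the similarity measure or the code used in the paper.

```python
from collections import Counter
import math


def letter_bigrams(text):
    """Relative frequencies of adjacent letter pairs, counted within words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation attached to a word is ignored.
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (an assumed measure)."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_candidates(query_text, corpus):
    """Rank the texts in `corpus` (name -> text) by similarity to `query_text`."""
    query = letter_bigrams(query_text)
    scores = {
        name: cosine_similarity(query, letter_bigrams(text))
        for name, text in corpus.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, purely for illustration.
toy_corpus = {
    "same_text": "arma virumque cano",
    "unrelated": "xyz xyz",
}
ranking = rank_candidates("arma virumque cano", toy_corpus)
```

On real data the profiles would be built from whole texts rather than a single line, and the ranking would run over every candidate author in the corpus.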

The temporal periods used were as follows:

archaic age: early Latin text until 100 B.C.

classical age: 100 B.C. – 250/300 A.D.

late imperial Latin: 300 – 600

early middle ages: 600 – 1000

high middle ages: 1000 – 1400

humanists: 1400-1650

modern and contemporary Latin: 1700 – today

More details on the corpus are elegantly described in the paper (Frontini et al., 2008), which was published at the AISB’08 Symposium on Style in Text at the AISB 2008 conference in Aberdeen, Scotland.

Based on the first experiment, the Donation was found to match closely the writings of Ammianus, a 4th century Roman historian writing around the time the Donation purports to have been written. As we were careful to state in the paper, this does not weaken Valla’s claim one bit, but it does raise an interesting question about a possible source for the document, given that Ammianus wrote in the very period the Donation claims for itself.
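For the diachronic step, the periods listed above can be thought of as a simple lookup table. The sketch below is my own illustration, not code from the study, and it rests on several assumptions: B.C. years are written as negative numbers, each period includes its start year and excludes its end year, the classical age is taken to begin at 100 B.C. (so that it follows on directly from the archaic age) and to end at 300, and years in the unlisted 1650 to 1700 gap return None.

```python
# (name, start_year, end_year); None means open-ended. Years B.C. are negative.
PERIODS = [
    ("archaic age", None, -100),
    ("classical age", -100, 300),
    ("late imperial Latin", 300, 600),
    ("early middle ages", 600, 1000),
    ("high middle ages", 1000, 1400),
    ("humanists", 1400, 1650),
    ("modern and contemporary Latin", 1700, None),
]


def period_for_year(year):
    """Return the name of the period containing `year`, or None if none does."""
    for name, start, end in PERIODS:
        after_start = start is None or year >= start
        before_end = end is None or year < end
        if after_start and before_end:
            return name
    return None
```

Under these assumptions, Constantine’s own 4th century falls in late imperial Latin, and an 8th century forger in the early middle ages.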

However, there is a twist in the tale:

From (Frontini et al., 2008):

It is worth pursuing a possibility that Ammianus provided the source language input that shaped the forger’s concept of fourth-century Latin. There is however, substantial reason to doubt such direct influence. Although Ammianus was Greek and his native language was Greek, he composed History in Latin, as the work was intended for Roman readers. The work consisted of 31 books and earned the author a considerable reputation in his day. It maintained at least some of its popularity until the 6th century, but then fell into neglect and is not mentioned during the Middle Ages. His work, given scholarship methods of the time, would not have been natural for an 8th century forger to stumble upon……

This analysis highlights the importance of collaborative scholarship between humanists and those of a more computational leaning: without this domain knowledge, the study could have made a number of claims which would ultimately ring false.

TL;DR: Humanities scholars are very nice folks, full of knowledge and interesting ideas. Go talk to them!

Lessons learned

Taken as a whole, the study is an interesting application of stylometric research but also a cautionary tale against relying on technology alone in authorship and stylometric studies. Common sense should prevail in the end: we must accept the shortcomings of our tools and use our own judgement in tandem.

The paper’s future work section proposes investigating the corpus further using an expanded feature set of letter trigrams, word unigrams and bigrams. However, as with many best-laid plans of an academic nature, this analysis did not, to the best of my memory, come to fruition.

In the interim, it is heartening to see that research marrying quantitative textual analysis and classical languages has rumbled along; for an example, see (Remissong, 2011).

Recently I was pleasantly surprised on Twitter by the release of a massive (9GB) Latin corpus based on work by (Bamman and Smith, 2012; Bamman, 2006), as part of their “Google Ngrams for Latin” Perseus project, built from Latin text in the Internet Archive.

As for my own work, I have recently applied the character n-gram feature set (N = 1, 2, 3) as part of an ensemble classifier for a diachronic text classification task at SemEval 2015. It ended up performing quite well for the task at hand (detecting the era of composition of a written news text), although the why of the matter remains perplexing.
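To make the N = 1, 2, 3 feature set concrete, here is a rough sketch of how such features might be extracted from a text. It is an illustration only: the function name and the (n, gram) feature keys are my own invention, and neither the classifier nor the ensemble from the SemEval system is shown.

```python
from collections import Counter


def char_ngram_features(text, orders=(1, 2, 3)):
    """Relative frequencies of character n-grams, keyed by (n, gram)."""
    features = {}
    for n in orders:
        # Slide a window of width n over the raw text, spaces included.
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        for gram, count in grams.items():
            # Normalise within each order so texts of different lengths compare.
            features[(n, gram)] = count / total
    return features


example = char_ngram_features("abab")
```

Keying each feature by its order keeps the unigram, bigram and trigram frequencies separate, so they can be fed to one model or to one model per order.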

And finally, what of the aforementioned Granite City? Well, first impressions were cold, dark and not particularly friendly, but the Indian food was good, the university was venerable and plenty of new academic friendships were born over an ale or three!


Frontini, F., Lynch, G., & Vogel, C. (2008). Revisiting the Donation of Constantine. In AISB 2008 Convention Communication, Interaction and Social Intelligence (Vol. 1, p. 1).

Remissong, A. J. (2011). Dulce et Utile: On the Use of Quantitative Textual Analysis in Latin Literary Analysis (Doctoral dissertation, Emory University).

Bamman, D., & Smith, D. (2012). Extracting two thousand years of Latin from a million book library. Journal on Computing and Cultural Heritage (JOCCH), 5(1), 2.

Interesting resources

Looking at Latin with traditional corpus analysis tools

Latin corpus from Bamman with details:

