After a short interlude, a tale of how the author got his diachronic text classification groove back, following from trying to date the Donation of Constantine a few years previously.
Time isn’t holding us
Time isn’t there for us
Once in a Lifetime
Language is in constant flux. New words are coined faster than official lexicographers can keep up with, and older ones gradually fall out of favour. Culturomics is a loosely defined mashup of data and social science which uses large databases of scanned and OCR’ed book text to investigate cultural phenomena through language usage. These large textual databases allow for hitherto impossible queries over our mass cultural heritage.
Late last year I attended a presentation by Dr. Terrence Szymanski of the Insight Centre for Data Analytics in UCD. He presented preliminary work on a very interesting problem in text analytics, namely the chronological dating of a text snippet.
This work was related to a shared task at the SemEval 2015 workshop. For the uninitiated, a shared task is a form of academic Hunger Games in which different teams battle it out for the best performance on a set challenge, normally using a shared dataset.
Philosophical questions about such methods of engaging in research aside, a shared task does have advantages: a constrained set of parameters and data, and to the victor, the spoils (usually only honour and glory, but sometimes prizes). This methodology was adopted by the data science challenge site Kaggle, where companies can offer challenges and datasets for real monetary reward.
My interest was piqued by the description of the challenge, as I had encountered the same subject in the doctoral work of a former colleague at TCD, and we set to work on a text analytics system which could tackle this particular problem.
Delving into the literature, it seemed there were three main schools of thought on the subject of diachronic text classification:
The first was likely born out of research by corpus linguists and dealt with the self-same task of dating a text chronologically. It focused, however, on what I like to refer to as document-level metrics: sentence length, word length, readability measures and lexical richness measures such as type-token ratio. This group looked at large traditional language corpora such as the British National Corpus or the Brown Corpus (Stajner 2011).
The second faction concerned itself with smaller, more quotidian matters, such as the temporal ordering of works (e.g. Early < Middle < Late Period) by a particular author or group of authors, rather than labelling a work with a specific year of composition.
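To make the idea of document-level metrics concrete, here is a minimal Python sketch that computes three of them for a single text. The function name, the naive sentence splitter and the tokenisation regex are my own assumptions for illustration, not the method of any of the cited studies.

```python
import re

def document_metrics(text):
    """Compute a few document-level style metrics for one text (illustrative sketch)."""
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: tokens are lower-cased alphabetic strings, apostrophes kept inside words.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens or not sentences:
        return {"avg_sentence_length": 0.0, "avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Distinct tokens divided by total tokens (a simple lexical richness measure).
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

if __name__ == "__main__":
    print(document_metrics("The cat sat. The cat ran away."))
```

Type-token ratio is sensitive to text length, so comparisons of this kind are only fair between snippets of similar size.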
Whissell (1996) looked at the lyrical style of Beatles compositions and found them to be
“less pleasant, less active, and less cheerful over time”.
Other works focused on ranking Shakespearean drama and Yeats plays by composition order. Common traits were the small scale of the corpora and the focus on creative works in the humanities (Stamou, 2008).
The third and final group of studies had a different focus, namely the quantification of word meaning change over time, which could then be used to infer temporal period (Mihalcea 2012).
Take for example the Google Ngrams plot for the word hipster.
Three extracts from the corpus below show the shift in meaning for this term over time. The word originated during the Beat Generation as a slightly derogatory term for a (generally Caucasian) scenester who made it his business to hang out with jazz musicians and imitate their sartorial style.
As the Fifties drew to a close, this bohemian counterculture archetype was replaced by hippies, draft dodgers and free-love enthusiasts, but the term has enjoyed a surprising renaissance since the 2000s, as evidenced in the third extract.
Examining the word windows in each era, we might expect our 1950s hipster to collocate with jazz, music, instruments, groove etc., while the hipster of today has different bedfellows: irony, Pabst Blue Ribbon, thrift stores, fixies, etc.
- (1956) The Story of Jazz, p 223: The hipster, who played no instrument, fastened onto this code, enlarged it, and became more knowing than his model.
- (1988) Interview with Norman Mailer: Well, you would say that hipsters do this in a vacuum. I don’t. It’s just that a hipster’s notion of morality is so complex. A great many people hate hip because it poses a threat to them. (Kerouac, Beat Generation, Jazz)
- (2009) Time Magazine: Hipsters are the friends who sneer when you cop to liking Coldplay. They’re the people who wear t-shirts silk-screened with quotes from movies you’ve never heard of and the only ones in America who still think Pabst Blue Ribbon is a good beer.
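The notion of a word window can be sketched in a few lines of Python. This is only an illustration: the function name, the window size and the crude prefix match on the target word are assumptions of mine, and a real study would run over a large dated corpus and filter out stopwords.

```python
import re
from collections import Counter

def collocates(texts, target, window=5):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            # Crude match so that "hipsters" and "hipster's" also count.
            if tok.startswith(target):
                lo, hi = max(0, i - window), i + window + 1
                # Add every neighbour in the window except the target itself.
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts

if __name__ == "__main__":
    fifties = ["The hipster, who played no instrument, fastened onto this code."]
    noughties = ["Hipsters are the friends who sneer when you cop to liking Coldplay."]
    print(collocates(fifties, "hipster").most_common(5))
    print(collocates(noughties, "hipster").most_common(5))
```

Comparing the two resulting counters, era by era, is one simple way of seeing a word's company change over time.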
The shared task, in summary:
- Assign a date range to short news text
- Dates range from 1700 to 2014
- Training set contains 3100 texts, 225k words (71 words per text)
Our approach to the problem fit squarely in the first camp: given a snippet of text, can we estimate a date of production? In practice, an exact date match was not expected (although the system often came close!), and the results were evaluated with respect to temporal spans of 6, 12, 20 and 50 years.
The text was represented using a number of main feature types:
- Character n-grams (d_w_i)
- POS tag n-grams (DT_NNP)
- Word n-grams (the_man_who)
- Syntactic and phrase structure n-grams: slices of a sentence (S -> NP VP) and also terminal nodes (N -> cat)
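As a rough illustration of span-based evaluation, the sketch below counts a prediction as correct when it falls inside a window of the given width centred on the true year. That centring rule and the function names are assumptions made for the example; the official scoring scheme is defined in the task description paper and may well differ.

```python
def within_span(predicted_year, true_year, span):
    """True if the prediction lies inside a `span`-year window centred on the true year."""
    # Assumption: the window is symmetric around the true year.
    return abs(predicted_year - true_year) <= span / 2

def accuracy_by_span(predictions, gold, spans=(6, 12, 20, 50)):
    """Fraction of predictions counted as correct at each span width."""
    return {
        span: sum(within_span(p, g, span) for p, g in zip(predictions, gold)) / len(gold)
        for span in spans
    }

if __name__ == "__main__":
    predicted = [1956, 2000, 1800]
    actual = [1958, 2009, 1850]
    print(accuracy_by_span(predicted, actual))
```

Wider spans are more forgiving, so accuracy should never decrease as the span grows.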
The syntactic n-grams contained information about semantic role (subject/object of a sentence) in addition to part-of-speech and terminal information.
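A minimal Python sketch of how word, character and POS tag n-grams of this kind might be counted is given below. The function names, the feature name prefixes and the `<sp>` placeholder for spaces are hypothetical choices for the example. The POS tags are assumed to come from an external tagger and are simply passed in as a list, and the syntactic n-grams, which need a parser, are left out.

```python
from collections import Counter

def ngram_counts(sequence, max_n=3, prefix=""):
    """Count all n-grams of length 1..max_n over a sequence, joined with underscores."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(sequence) - n + 1):
            counts[prefix + "_".join(sequence[i:i + n])] += 1
    return counts

def extract_features(text, pos_tags=None, max_n=3):
    """Build one bag of n-gram features for a text snippet."""
    text = text.lower()
    # Word n-grams, e.g. word:the_man_who
    feats = ngram_counts(text.split(), max_n, "word:")
    # Character n-grams; spaces are kept as a visible placeholder token.
    chars = ["<sp>" if c == " " else c for c in text]
    feats += ngram_counts(chars, max_n, "char:")
    # POS tag n-grams, only if tags were supplied by an external tagger.
    if pos_tags is not None:
        feats += ngram_counts(list(pos_tags), max_n, "pos:")
    return feats

if __name__ == "__main__":
    features = extract_features("the man who", pos_tags=["DT", "NN", "WP"])
    print(features["word:the_man_who"], features["char:t_h_e"], features["pos:DT_NN"])
```

Keeping the space as its own token means character n-grams can still pick up word boundaries.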
Two main approaches were taken to generate features. One set of features was generated from the shared task textual corpus itself, and the other set was taken from the Google Syntactic N-Grams corpus, a date-tagged corpus of n-grams extracted from millions of books.
External corpus features
The first step required calculating a probability for each word given a year, multiplying these together for each candidate year, and taking the year with the maximum probability, under the naïve assumption that years are uniformly distributed in the corpus.
p(“hipster”|1956) = 0.2
p(“hipster”|2009) = 0.3
p(“Pabst Blue Ribbon”|1956) = 0.0
p(“Pabst Blue Ribbon”|2009) = 0.2
The example (two-term) document contains both hipster and Pabst Blue Ribbon.
p(“hipster Pabst Blue Ribbon”|1956) ~= 0.2 * 0.0 = 0.0
p(“hipster Pabst Blue Ribbon”|2009) ~= 0.3 * 0.2 = 0.06
In reality, we used the log probability and normalized it to the range (0,1).
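The calculation above can be sketched in Python using toy probabilities for two candidate years. Two details here are my own assumptions rather than a description of the system: a small probability floor stands in for smoothing, so that a zero probability does not break the logarithm, and min-max scaling is used as one possible way of normalising the log scores.

```python
import math

def year_log_probs(terms, term_probs, floor=1e-6):
    """Sum log p(term|year) over the document's terms for every candidate year."""
    scores = {}
    for year, probs in term_probs.items():
        # Assumption: unseen or zero-probability terms are floored instead of smoothed.
        scores[year] = sum(math.log(max(probs.get(t, 0.0), floor)) for t in terms)
    return scores

def normalise(scores):
    """Min-max scale the log scores so that the best year gets 1 and the worst 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {year: 1.0 for year in scores}
    return {year: (s - lo) / (hi - lo) for year, s in scores.items()}

if __name__ == "__main__":
    # Toy numbers for illustration only.
    term_probs = {
        1956: {"hipster": 0.2, "Pabst Blue Ribbon": 0.0},
        2009: {"hipster": 0.3, "Pabst Blue Ribbon": 0.2},
    }
    scores = year_log_probs(["hipster", "Pabst Blue Ribbon"], term_probs)
    best_year = max(scores, key=scores.get)
    print(best_year, normalise(scores))
```

Working in log space turns the product into a sum, which avoids numerical underflow once a document has more than a handful of terms.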
This feature set worked fairly well for classification, with the following cross-validation values on the training set using a Naïve Bayes classifier, and the probabilities of the words in the document.
An improvement was obtained when these probabilities were used as features for a Support Vector Machine classifier. 309 features were generated for each text: the normalized log probabilities of the text having been written in each of the years for which documents were present in the training set.
Internal corpus features
The other feature set used was generated from the corpus itself. The entire training corpus was tagged with the Stanford CoreNLP tools and a number of feature types were employed. Ngrams of length <= 3 were used for words, characters and POS tags.
The feature set generated was large, with 11,109 features in total. Reducing the feature set size using feature selection improved the classification results.
The most highly ranked features in the full feature set using the Information Gain metric were POS tags and character unigrams. These included the NN tag for common nouns, the full stop (.) and other features from the grammatical parses (ROOT->S), which are likely proxies for sentence length.
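For readers unfamiliar with Information Gain, here is a small Python sketch of the metric for a single binary feature. Treating each feature as simply present or absent in a document is a simplifying assumption for the example, and the function names and toy period labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """Reduction in label entropy from knowing whether a feature is present."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            # Subtract the entropy of each split, weighted by its share of the data.
            gain -= (len(subset) / total) * entropy(subset)
    return gain

if __name__ == "__main__":
    periods = ["1800s", "1800s", "1900s", "1900s"]
    has_messrs = [True, True, False, False]   # perfectly separates the periods
    has_the = [True, False, True, False]      # tells us nothing about the period
    print(information_gain(has_messrs, periods), information_gain(has_the, periods))
```

Ranking every feature by this score and keeping only the top slice is one straightforward form of feature selection.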
A number of letter unigrams (i,a,e,n,s,t,l,o) were in the top 20 most discriminating features.
Reasons for the prevalence of these characters are currently difficult to articulate. The English language contains an uneasy marriage of Latinate and Germanic vocabulary and a shift in usage could manifest itself in the frequencies of certain letters changing. A change in verbal form or orthography (will -> going to, ‘d to -ed) for example, could change the ratio of character ngrams in documents.
Below is a list of the fifty most distinctive word n-gram features from the feature set:
the, a, . the, in, is, on, said, it, its, and, of, of the u.s., president, government, has, american, today king, majesty, united, international, the united, the said, would, to, it is, messrs, minister, national, as letters, special, china, official, on the, public, central economic, not, more, mr, in the, million, can, was recent, prince, chinese, dollars, talks, Russian, she, group, south, that, the king
The system identifies a number of adjectives (national, international, economic, public, central, official) which rise during the 20th century, along with words related to world leaders (president, government, king, majesty, minister) and dominant countries (China, Russia, United States, American).
Various arcana such as “the said gentleman” and Messrs were on the wane as the nineteenth century drew to a close.
And to the winner, the spoils…
Our system trained on the full set of 11,000+ features was entered into one subtask of the shared task and it obtained the best overall result of the three competing systems. Full details of the accuracy figures and task results can be found in the papers.
Special mention must be given to the USAAR-CHRONOS team who expertly hacked the diachronic text classification task, using Google queries for the texts to assign date information based on metadata extracted from the source. Well played sirs!
Although the character ngram features performed excellently in the task, it can be difficult to interpret these results as a linguistic evolution of style. The prevalence of the period character and other sentence-specific tagging may be due to the fact that the system has identified sentence length as a discriminatory feature by proxy, although we did not measure it explicitly.
A number of competing systems took metrics such as sentence length and lexical richness measures into account and these may indeed be useful for future experimentation, in concert with existing features.
Another approach taken by a participating team was to extract epoch-specific named entities from the text and use date occurrence information in articles from an external database such as Wikipedia or DBpedia to assign date information to these texts. A processing framework to handle this could be an excellent addition to the classifier approach undertaken here, and future work evaluating both approaches back-to-back would be useful.
Time after time
When the dust settles after the battle, it is important to take stock of what has been learned.
- A task like this is a great introduction to a research question and often necessitates getting up to speed on a particular topic in a very short space of time.
- My Two Cents: If you have the technical know-how, it can often save time in the long run implementing your own text concordancing tools rather than relying on a mix of off-the-shelf packages roughly cobbled together.
- Going to departmental seminars increases the chance of serendipitously sparking a fruitful collaboration.
- Machine-optimised textual features, although useful for classification tasks, are not necessarily the most intuitive for human interpretation.
- Character n-grams do have a certain degree of “black magic” and are not all equally useful (Sapkota 2015), although their flexibility captures syntactic (gaps between words, period frequency as proxy for sentence length), morphological and orthographical shifts (‘d -> ed, -ing) and semantics (short words).
- More focus in future studies should be given to variables such as sentence length, type token ratio and other statistics computed on an entire text.
References and Further Reading:
F. Frontini, G. Lynch, and C. Vogel. Revisiting the Donation of Constantine. In Proceedings of AISB 2008, pages 1–9, 2008. (Earlier blog post on this work here)
C. Stamou. Stylochronometry: Stylistic Development, Sequence of Composition, and Relative Dating. In Literary and Linguistic Computing, pages 181–199, 2008.
R. Forsyth. Stylochronometry with substrings, or: A poet young and old. In Literary and Linguistic Computing, 14(4), pages 467–478, 1999.
S. Stajner and R. Mitkov. Diachronic stylistic changes in British and American varieties of 20th century written English language. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage, RANLP, 2011.
Y. Goldberg and J. Orwant. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Proceedings of *SEM 2013, pages 241–247, 2013.
R. Mihalcea and V. Nastase. Word epoch disambiguation: Finding how words change over time. In Proceedings of ACL 2012, 2012.
O. Popescu and C. Strapparava. Semeval-2015 task 7: Diachronic text evaluation. In Proceedings of SemEval 2015, 2015.
T. Szymanski and G. Lynch. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of SemEval 2015, 2015.
U. Sapkota, S. Bethard, M. Montes, and T. Solorio. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In Proceedings of NAACL HLT, 2015.
C. Whissell. Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. In Computers and the Humanities, 30(3), pages 257–265, 1996.