Research Retrospective #5: Source (Language), (POS) Tags and (Secret) Codes: Investigating the implicit stylistic patterns in translated text

In which the author investigates the uncanny style of translated text using machine learning and finds, among other stylistic features, evidence of national stereotypes in literature. 

Poetry is what gets lost in translation

Robert Frost 

Reading a poem in translation…is like kissing a woman through a veil.

Anne Michaels

Introduction

When reading a translation, particularly a literary one, one often becomes aware of a certain unheimlich quality to the prose, as if an unknown force is dragging down on the sentence structure and creating an eerie hint of disfluency. The old cliché speaks of meaning and feeling being lost in translation, but we can also view a translation as having imperceptibly gained a certain something. Thus, the textual style of a translation is neither that of the source language nor that of the target, but a language apart, often referred to as the third code. Translation studies scholars also use the term translationese to describe the subset of a language consisting solely of translations into that language, and computational linguists have become curious about the properties of this uncanny dialect.

A large proportion of translation-related research in computational linguistics focuses on training machines to do translation, so once researchers (Kurokawa 2009) figured out that the direction of a parallel corpus could be useful for MT (a corpus originally translated DE-EN for building DE-EN translation systems, for example), attention returned to translationese and translation direction detection, a subject which had until then attained something of an academic "cult following", mostly among translation studies scholars.

During my doctoral work I focused heavily on the stylistic properties of translations, and one of these properties concerned the source language traces found in texts. These are especially noticeable when one is familiar with the source language in question:

e.g. “This sentence like German reads”

The task of detecting the features which illustrate this was an interesting challenge for machine learning tools. A fine-grained problem such as this one sits within the realm of stylistic classification, alongside questions such as native language identification, personality detection, sentiment analysis, gender detection, temporal period classification and others.

Corpus

As the existing literature focused on commonly used corpora such as EUROPARL (van Halteren 2008) and large multi-lingual newspaper collections (Koppel and Ordan 2011), we decided to examine literary text, an oft-neglected genre in traditional text analytics. The first port of call for literary texts is usually Project Gutenberg, although this usually limits one to texts from a specific historical period. In order to keep all confounding factors constant, the texts were all drawn from 19th century prose translations. The source languages examined were Russian, German and French, alongside original English, and five novels were obtained to represent each set.

In order to negate any effect of individual translatorial or authorial style, no translator or author was repeated in the corpus. This work also marked the beginning of my interest in a dichotomy of textual features. On one hand, n-grams, i.e. sequences of characters, words or parts of speech, are a potent force in text analytics. On the other hand, a text is made up of more than discrete sequences of characters, and another set of features can capture larger deviations of style: document-level metrics such as readability scores (used to measure textual complexity), type-token ratio, and the ratios of various parts of speech to total words.
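As a rough illustration, here is a minimal sketch of how such document-level metrics might be computed. The helper below is hypothetical, uses NLTK's default tokeniser and tagger, and is not the tooling used in the original study.

```python
# Hypothetical sketch (not the original study's code): document-level style metrics
# computed with NLTK's default tokeniser and tagger.
import nltk

def document_metrics(text):
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    tags = [tag for _, tag in nltk.pos_tag(words)]
    chars = sum(len(w) for w in words)

    # Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    ari = 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

    return {
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "noun_ratio": sum(t.startswith("NN") for t in tags) / len(words),
        "conjunction_ratio": sum(t == "CC" for t in tags) / len(words),
        "preposition_ratio": sum(t == "IN" for t in tags) / len(words),
        "ari": ari,
    }
```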

With the corpus assembled, random chunks of text were extracted from each work. Each chunk consisted of two hundred kilobytes of text and was further divided into five equal sections. Each source language was represented by five works; these are listed in the paper.

Results

Using the eighteen document metrics only, the system results were relatively low. Given that the classification problem was a four-way affair and all classes were balanced, an accuracy of 67% was obtained using a Support Vector Machine classifier with ten-fold cross-validation. By comparison, using word unigram features only and performing feature selection within the cross-validation folds, the system reported accuracies of nearly 100 percent, dropping off only at the top 100 word features.
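A minimal sketch of that evaluation setup (illustrative only, using scikit-learn rather than the tooling from the original experiments), with feature selection kept inside the cross-validation folds so that the selected features never see the held-out fold:

```python
# Illustrative sketch only (the original experiments used different tooling):
# word-unigram features with top-k selection performed inside each CV fold to avoid leakage.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels, k=100, folds=10):
    """texts: list of chunk strings; labels: source language of each chunk."""
    pipeline = Pipeline([
        ("unigrams", CountVectorizer()),            # word unigram counts
        ("select", SelectKBest(chi2, k=k)),         # top-k features, refit per fold
        ("svm", LinearSVC()),
    ])
    return cross_val_score(pipeline, texts, labels, cv=folds).mean()
```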

Content words

The most highly ranked features in the word unigram set consisted, unsurprisingly, of content words. Words such as francs, paris, rue and monsieur characterized translations from French, with German texts talking of berlin and characters named von <something>. The Russian translations were similarly marked with content words such as cossack and names such as Anton, Olenin and the like.

To test the robustness of the classifier, the top 200 features in a mixed set of bigrams, POS bigrams and word unigrams were selected, and all nouns were removed. Using the remaining fifty features and a Simple Logistic Regression classifier, this sparse feature set still managed an accuracy of 85.5%.

A little untoward

The words toward and towards were found to be discriminatory for texts translated from German, and this is where a confounding factor in the data came into play. The translations of the German texts had all been published in the US, where toward is more prevalent than towards. This particular trait is therefore perhaps not associated with source language as strongly as other frequent word tokens.

 A contraction in terms

Contractions were also found to be discriminatory: Russian had higher frequencies of the contractions it’s and that’s than French, which reported higher frequencies of the non-contracted forms. Somewhat contradictorily, translations from Russian reported a higher frequency of both I am and I’m than the other texts, while French and German reported higher frequencies of the non-contracted form.

Exact reasons for this behaviour remain difficult to pinpoint. On the one hand, both German and French render the first-person form of be as two words (ich bin and je suis), which could pull the translated text towards the expanded form. On the other hand, a Russian-speaking conference attendee who saw the work at COLING didn’t think the Russian examples were related to any Russian-specific transfer, so the source of this phenomenon remains to be investigated.

The adverbial conjunction

The translations from French reported a higher frequency of the POS n-gram RB-CC, an adverb followed by a coordinating conjunction. The examples below show occurrences of this n-gram extracted from the French translation corpus, based on the following simplified text search:

grep “ly and” FrenchCorpus.txt

We can even see that in a number of cases the coordinating conjunction joins pairs of adverbs (RB CC RB) together, which may represent a grammatical structure more common in the original French.

  • …No head was raised more proudly and more radiantly…
  • …an offer which she eagerly and gratefully accepted…
  • …unceremoniously and with no notice at all…
  • …But after this I mean to live simply and to spend nothing…
  • …I placed myself blindly and devotedly at your service…
  • …Outwardly and in the eyes of the world…
  • …They had parted early and she was returning home…
  • …as the English law protects equally and sternly the religions of the Indian people…
  • …vain attempts of dress to augment it, was peculiarly and purely Grecian…

(Lynch 2012)
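The grep above is only a crude proxy, since it matches any word ending in -ly followed by and. A slightly more faithful, though still hypothetical, way to pull out RB-CC occurrences is to tag the text and scan the tag sequence:

```python
# Hypothetical sketch: extract adverb + coordinating conjunction (RB CC) bigrams.
import nltk

def rb_cc_examples(text):
    hits = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1.startswith("RB") and t2 == "CC":
                hits.append((w1, w2, sentence))
    return hits

# e.g. rb_cc_examples(open("FrenchCorpus.txt").read())
```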

Other source-language distinguishing frequent tokens included:

  • Russian: anyone, though, suddenly, drink (allowing one to indulge a national stereotype 😉 )
  • French: resumed, towards, thousand (apparently related to the denomination of francs)
  • German: nodded
  • English: presently, sense (common or otherwise)

Regarding the document metrics, the following were discriminative:

  • Russian: ratio of finite verbs (higher), ARI readability score (lower)
  • French: ratio of nouns (higher), ARI readability score (higher), ratio of conjunctions (higher)
  • German: ratio of prepositions (lower)

Testing the waters

The document-level trained model was then tested on an unseen corpus of contemporary literary texts. The model managed only 43% accuracy, compared with the 67% cross-validation result; however, this was still above the baseline of 25%. A training set comprising a larger and more diverse range of texts might result in a more robust classification model.

Final thoughts

This study focused on detecting patterns in literary translated text indicative of the source language. The study identified a number of such effects, in terms of ratios of part-of-speech combinations, frequencies of individual common words and frequencies of content words. A classifier was trained which performed well on the training set but exhibited a lower accuracy on an unseen test set of fresh literary translations.

Of course, this work is still at a relatively nascent stage. The coarse-grained nature of the corpus (only five texts per language) meant that any features learned could be heavily biased towards those particular texts, as seen in the evaluation results on the unseen hold-out set. A larger set of source texts would have resulted in a more generalizable model.

An expansion on this theme, currently under peer review, focuses on an enlarged feature set and brings syntactic parse features into play on a larger set of contemporaneous translations (8 languages, ca. 400 texts), which will hopefully capture a deeper sense of structural transfer from the source language.

Content words have been ignored in this study as it was believed that they tended to capture topical distinctions rather than stylistic idiosyncrasies. This distinction may be somewhat crude, as it can be difficult to separate topical norms, cultural norms and genuine trends of linguistic transfer.

Obviously, mentions of Muscovites and cossacks might lead us towards Russian as a likely source language for a translation, but these features are not robust, as plenty of Anglophone authors (for example) may also write novels set in those regions. Likewise with the higher frequency of drink and snow. These can reflect socio-cultural norms, perhaps more frequent in texts from a particular literary tradition but do they represent true source language effects?

Unlikely.

Perhaps applications of topic modelling techniques such as LDA will shed light on topical clusters related to culture and tradition in literary corpora.
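For what it's worth, here is a minimal gensim sketch of the kind of topic modelling that might be applied; this is illustrative only and was not part of the study:

```python
# Illustrative sketch (not part of the study): LDA topic model over a literary corpus with gensim.
from gensim import corpora, models

def topics_from_documents(tokenised_docs, num_topics=10):
    """tokenised_docs: list of documents, each a list of lower-cased word tokens."""
    dictionary = corpora.Dictionary(tokenised_docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare and ubiquitous words
    bows = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary, passes=10)
    return lda.print_topics()
```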

Takeaways:

This experiment sought to shed light on source-language-specific traits in translated text. A number of these traits were found, although cross-corpus testing indicates that some may be tied more to the particular literary works examined than to the source language.

  • Replicability: A comparable experiment was carried out by Klaussner (2014), who examined a completely distinct set of parallel literary translations. Some differences there were the use of POS trigram features, a number of which were found to be indicative of original English. Contractions were also identified as markers of source language, together with ratios of conjunctions, average word length and type-token ratio.
  • Comparability: One interesting additional step taken by Klaussner (2014) which was also done by Baroni (2006) was to present humans with the task of classifying whether a text was translated or original. Baroni (2006) found that the machine was generally more consistently accurate than the average human, although one of the ten human evaluators (an expert in translation studies) outperformed the machine. Klaussner found that human evaluators performed the task of translation vs. original classification with ca. 70% accuracy on seventeen excerpts of translated and original text, with a Kappa score of 0.406, indicating moderate agreement.
  • Transferability: A hot topic in the machine translation and computational linguistics community at the moment is quality estimation, roughly “automatically detecting how bad a machine translation is and how much it might cost a human to fix it”. Approaches similar to those used here could be used to determine how stylistically similar a translation is to a corpus of translations from the same source language, one line of thinking being that the more stylistically similar to the source a translation is, the more post-editing it might require.
  • Expandability: Another area of research seeking to identify textual interference features is the field of native language detection. This seeks to identify native language influences on an author’s L2, with various applications for second language learning and author profiling. An interesting experiment could compare non-native writing and translation from the same L1 to investigate similarities/differences related to transfer.

 Further reading/References:

Linguist list post about translationese

Baroni, M., & Bernardini, S. (2006). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing.

Kurokawa, D., Goutte, C., & Isabelle, P. (2009). Automatic detection of translated text and its impact on machine translation. In Proceedings of MT Summit XII.

Klaussner, C., Lynch, G., & Vogel, C. (2014). Following the trail of source languages in literary translations. In Research and Development in Intelligent Systems XXXI (pp. 69-84). Springer International Publishing.

Koppel, M., & Ordan, N. (2011, June). Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1318-1326). Association for Computational Linguistics.

Lynch, G., & Vogel, C. (2012). Towards the automatic detection of the source language of a literary translation. In 24th International Conference on Computational Linguistics (p. 775).

van Halteren, H. (2008). Source language markers in EUROPARL translations. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 937-944). Association for Computational Linguistics.


Text Time Round AKA In Search of Lost Time(stamps)

After a short interlude, a tale of how the author got his diachronic text classification groove back, following on from an attempt to date the Donation of Constantine a few years previously.

Time isn’t holding us
Time isn’t there for us

Talking Heads
Once in a Lifetime

Introduction

Language is in constant flux. New words are coined at a speed which official lexicographers can struggle to keep up with, and older ones gradually fall out of favour. Culturomics is a loosely defined mashup of data science and social science which uses large databases of scanned and OCR’ed book text to investigate cultural phenomena through language usage. These large textual databases allow for hitherto impossible queries over our mass cultural heritage.

Late last year I attended a presentation by Dr. Terrence Szymanski of the Insight Centre for Data Analytics in UCD. He presented preliminary work on a very interesting problem in text analytics, namely the chronological dating of a text snippet.

This work was related to a shared task at the SemEval 2015 workshop. For the uninitiated, a shared task is a form of academic Hunger Games where different teams battle it out to obtain the best performance on a set academic challenge, normally using a shared dataset.

Philosophical questions about such methods of engaging in research aside, a shared task does have advantages, namely a constrained set of parameters and data, and, to the victor, the spoils (usually only honour and glory, but sometimes prizes). This methodology has been adopted by the data science challenge site Kaggle, where companies offer challenges and datasets for real monetary reward.

With my interest piqued by the description of the challenge, and having encountered the same subject in the doctoral work of a former colleague at TCD, we set to work on a text analytics system which could tackle this particular problem.

Background

Delving into the literature, it seemed there were three main schools of thought on the subject of diachronic text classification:

The first was likely born out of research by corpus linguists and dealt with the self-same task of dating a text chronologically, but focused on what I like to refer to as document-level metrics: sentence length, word length, readability measures and lexical richness measures such as type-token ratio. This group looked at large traditional language corpora such as the British National Corpus or the Brown Corpus (Stajner 2011).

The second faction concerned itself with smaller, more quotidian matters, such as the temporal ordering of works (e.g. Early < Middle < Late period) by a particular author or group of authors, rather than labelling a work with a specific year of composition.

Work by Whissell (1996) looked at the lyrical style of Beatles compositions and found them to be

“less pleasant, less active, and less cheerful over time”

Other works focused on ranking Shakespearean drama and Yeats plays by composition order; common traits were the small-scale nature of the corpora and the focus on creative works in the humanities (Stamou 2008).

The third and final group of studies had a different focus, namely the quantification of word meaning change over time, which could then be used to infer temporal period (Mihalcea 2012).

Take for example the Google Ngrams plot for the word hipster.

[Figure: Google Ngrams frequency plot for the word “hipster”]

Three extracts from the corpus below show the shift in meaning for this term over time. The word originated during the Beat Generation as a slightly derogatory term for a (generally Caucasian) scenester who made it his business to hang out with jazz musicians and imitate their sartorial style.

As the Fifties drew to a close, this bohemian counterculture archetype was replaced by hippies, draft dodgers and free-love enthusiasts, but the term has enjoyed a surprising renaissance since the 2000s, as evidenced in the third extract.

Examining the word windows in each era, we might expect our ’50s hipster to collocate with jazz, music, instruments, groove and the like, while the hipster of today has different bedfellows: irony, Pabst Blue Ribbon, thrift stores, fixies, etc.

  1. (1956) The Story of Jazz, p 223: The hipster, who played no instrument, fastened onto this code, enlarged it, and became more knowing than his model.
  2. (1988) Interview with Norman Mailer: Well, you would say that hipsters do this in a vacuum. I don’t. It’s just that a hipster’s notion of morality is so complex. A great many people hate hip because it poses a threat to them. (Kerouac, Beat Generation, Jazz)
  3. (2009) Time Magazine: Hipsters are the friends who sneer when you cop to liking Coldplay. They’re the people who wear t-shirts silk-screened with quotes from movies you’ve never heard of and the only ones in America who still think Pabst Blue Ribbon is a good beer.”
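A quick, hypothetical way to check that intuition about shifting collocates would be to pull word windows around hipster from date-stamped slices of a corpus:

```python
# Hypothetical sketch: collect words occurring within +/- 5 tokens of a target word.
from collections import Counter

def collocates(tokens, target="hipster", window=5):
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w.lower() for w in context)
    return counts

# Compare collocates(tokens_1950s).most_common(20) against collocates(tokens_2000s).most_common(20)
```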

Task:

 http://alt.qcri.org/semeval2015/task7/ 

  • Assign a date range to short news text
  • Dates range from 1700 to 2014
  • Training set contains 3100 texts, 225k words (71 words per text)

Features

Our approach to the problem fits squarely in the first camp: given a snippet of text, can we estimate its date of production? In reality, an exact date match was not expected (although the system often came close!), and the results were evaluated with respect to temporal spans of 6, 12, 20 and 50 years.

The text was represented using a number of main feature types:

  • Character n-grams (d_w_i)
  • POS tag n-grams (DT_NNP)
  • Word n-grams (the_man_who)
  • Syntactic and phrase-structure n-grams: slices of a sentence (S -> NP VP) and also terminal nodes (N -> cat)

The syntactic n-grams contained information about semantic role (subject/object of a sentence) in addition to part-of-speech and terminal information.
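A rough sketch of how the first three feature types might be extracted for a single text (purely illustrative; the actual system relied on Stanford CoreNLP for tagging and parsing):

```python
# Illustrative sketch: character, word and POS-tag n-grams (n <= 3) for a single text.
# The actual system used Stanford CoreNLP for tagging and parsing.
from collections import Counter
import nltk

def ngrams(items, n):
    return zip(*(items[i:] for i in range(n)))

def ngram_features(text, max_n=3):
    features = Counter()
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    for n in range(1, max_n + 1):
        features.update("c:" + "".join(g) for g in ngrams(text, n))     # character n-grams
        features.update("w:" + "_".join(g) for g in ngrams(tokens, n))  # word n-grams
        features.update("p:" + "_".join(g) for g in ngrams(tags, n))    # POS n-grams
    return features
```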

Two main approaches were taken to generate features. One set of features was generated from the shared task textual corpus itself, and the other set was taken from the Google Syntactic N-grams corpus, a date-tagged corpus of n-grams extracted from millions of books.

External corpus features

The first step required calculating a number of word probabilities given a year, then multiplying these together and taking the year with the maximum probability, making the naïve assumption that years are uniformly distributed in the corpus.

p(“hipster”|1956) = 0.2

p(“hipster”|2009) = 0.3

p(“Pabst Blue Ribbon”|1956) = 0.0

p(“Pabst Blue Ribbon”|2009) = 0.2

The example (2-word) document contains both hipster and Pabst Blue Ribbon.

p(“hipster Pabst Blue Ribbon”|1956) ~= 0.2 * 0.0 = 0.0

p(“hipster Pabst Blue Ribbon”|2009) ~= 0.3 * 0.2 = 0.06

In reality, we used the log probability and normalized it into the range (0, 1).

This feature set worked fairly well for classification, giving reasonable cross-validation results on the training set using a Naïve Bayes classifier over the word probabilities in each document.

An improvement was obtained when these probabilities were used as features for a Support Vector Machine classifier: 309 features were generated for each text, namely the normalized log probabilities of that text having been written in each of the years for which documents were present in the training set.
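A minimal sketch of how such per-year log-probability features might be computed; this is hypothetical code, with an add-one-smoothed unigram model and min-max normalisation standing in for whatever was actually used:

```python
# Hypothetical sketch: one feature per candidate year = normalised log P(document | year),
# using an add-one-smoothed unigram model per year.
import math
from collections import Counter

def year_logprob_features(doc_tokens, year_to_tokens):
    """year_to_tokens: dict mapping each training year to the tokens observed in that year."""
    logprobs = {}
    for year, toks in year_to_tokens.items():
        counts = Counter(toks)
        total, vocab = sum(counts.values()), len(counts)
        logprobs[year] = sum(
            math.log((counts[w] + 1) / (total + vocab)) for w in doc_tokens
        )
    # min-max normalise so the year scores for this document lie in the range (0, 1)
    lo, hi = min(logprobs.values()), max(logprobs.values())
    return {year: (lp - lo) / (hi - lo) if hi > lo else 0.5 for year, lp in logprobs.items()}
```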

Internal corpus features

The other feature set was generated from the corpus itself. The entire training corpus was tagged with the Stanford CoreNLP tools and a number of feature types were employed: n-grams of length <= 3 over words, characters and POS tags.

The feature set generated was large, with 11,109 features in total. Reducing the feature set size using feature selection improved the classification results.
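As an illustrative stand-in for the Information Gain ranking discussed below, mutual information (its closest scikit-learn analogue) could be used to rank and prune the feature set; a hypothetical sketch:

```python
# Illustrative sketch: rank features by mutual information (a stand-in for Information Gain)
# and keep only the top k columns.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def top_features(X, y, feature_names, k=500):
    """X: document-feature matrix; y: year labels; feature_names: column names of X."""
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    ranked = sorted(zip(selector.scores_, feature_names), reverse=True)
    return selector.transform(X), [name for _, name in ranked[:k]]
```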

Special features

The most highly ranked features in the full feature set according to the Information Gain metric were POS tags and character unigrams. These included the NN tag for common nouns, the full stop (.) and other features from the grammatical parses (such as ROOT -> S), which are likely proxy features for sentence length.

A number of letter unigrams (i,a,e,n,s,t,l,o) were in the top 20 most discriminating features.

Reasons for the prevalence of these characters are currently difficult to articulate. The English language contains an uneasy marriage of Latinate and Germanic vocabulary, and a shift in usage could manifest itself in the frequencies of certain letters changing. A change in verbal form or orthography (will -> going to, ’d -> -ed), for example, could change the ratio of character n-grams in documents.

Words

Below is a list of the fifty most distinctive word n-gram features from the feature set:

the, a, . the, in, is, on, said, it, its, and, of, of the u.s., president, government, has, american, today, king, majesty, united, international, the united, the said, would, to, it is, messrs, minister, national, as, letters, special, china, official, on the, public, central, economic, not, more, mr, in the, million, can, was, recent, prince, chinese, dollars, talks, Russian, she, group, south, that, the king

The system identifies a number of adjectives (national, international, economic, public, central, official) which rise in frequency during the 20th century, as well as words related to world leaders (president, government, king, majesty, minister) and dominant countries (China, Russia, the United States, American).

Various arcana such as “the said gentleman” and Messrs were on the wane as the nineteenth century drew to a close.

[Figure: feature frequencies across the 19th and 20th centuries]

 And to the winner, the spoils…..

Our system, trained on the full set of 11,000+ features, was entered into one subtask of the shared task and obtained the best overall result of the three competing systems. Full details on accuracy and the task results can be found in the papers.

Special mention must be given to the USAAR-CHRONOS team, who expertly hacked the diachronic text classification task, using Google queries for the texts to assign date information based on metadata extracted from the source. Well played, sirs!

Post-game analysis

Although the character n-gram features performed excellently in the task, it can be difficult to interpret these results in terms of the linguistic evolution of style. The prevalence of the period character and other sentence-boundary features may be due to the fact that the system has identified sentence length as a discriminatory feature by proxy, although we did not measure it explicitly.

A number of competing systems took metrics such as sentence length and lexical richness measures into account and these may indeed be useful for future experimentation, in concert with existing features.

Another approach taken by a participating team was to extract epoch-specific named entities from the text and to use date occurrence information from an external database such as Wikipedia or DBpedia to assign dates to these texts. A processing framework to handle this could be an excellent addition to the classifier approach undertaken here, and future work evaluating both approaches back-to-back would be useful.

Time after time

 When the dust settles after the battle, it is important to take stock of what has been learned.

  • A task like this is a great introduction to a research question and often necessitates getting up to speed on a particular topic in a very short space of time.
  • My Two Cents: If you have the technical know-how, it can often save time in the long run to implement your own text concordancing tools rather than relying on a mix of off-the-shelf packages roughly cobbled together.
  • Going to departmental seminars increases the chance of serendipitously sparking a fruitful collaboration.
  • Machine-optimised textual features, although useful for classification tasks, are not necessarily the most intuitive for human interpretation.
  • Character n-grams do have a certain degree of “black magic” about them and are not all equally useful (Sapkota 2015), although their flexibility captures syntactic shifts (gaps between words, period frequency as a proxy for sentence length), morphological and orthographic shifts (’d -> -ed, -ing) and semantics (short words).
  • More focus in future studies should be given to variables such as sentence length, type-token ratio and other statistics computed over an entire text.

References and Further Reading:

F. Frontini, G. Lynch, and C. Vogel. Revisiting the Donation of Constantine. In Proceedings of AISB 2008, pages 1–9, 2008. (Earlier blog post on this work here)

Stamou, C. Stylochronometry: Stylistic Development, Sequence of Composition, and Relative Dating. In Literary and Linguistic Computing, pages 181–199, 2008.

Forsyth, R, Stylochronometry with substrings, or: A poet young and old. In Literary and Linguistic Computing, 14(4), 467-478, 1999.

S. Stajner and R. Mitkov. Diachronic stylistic changes in British and American varieties of 20th century written English language. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage, RANLP, 2011.

Y. Goldberg and J. Orwant. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Proceedings of *SEM 2013, pages 241–247, 2013.

R. Mihalcea and V. Nastase. Word epoch disambiguation: Finding how words change over time. In Proceedings of ACL 2012, 2012.

O. Popescu and C. Strapparava. Semeval-2015 task 7: Diachronic text evaluation. In Proceedings of SemEval 2015, 2015.

Szymanski, Terrence, and Gerard Lynch. “UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams.” In Proceedings of SemEval 2015, 2015.

Upendra Sapkota, Steven Bethard, Manuel Montes, Thamar Solorio. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In Proceedings of NAACL HLT, 2015.

Whissell, Cynthia. “Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon.” Computers and the Humanities 30.3, 1996: 257-265.