In which the author investigates the uncanny style of translated text using machine learning and finds, among other stylistic features, evidence of national stereotypes in literature.
Poetry is what gets lost in translation
Reading a poem in translation…is like kissing a woman through a veil.
When reading a translation, particularly a literary one, one often becomes aware of a certain unheimlich nature to the prose, as if an unknown force were dragging down on the sentence structure and creating an eerie hint of disfluency. The old cliché speaks of meaning and feeling being lost in translation, but we can also view a translation as having imperceptibly gained a certain something. Thus, the textual style of a translation is neither that of the source language nor that of the target, but a language apart, often referred to as the third code. Translation studies scholars also use the term translationese to describe the subset of a language consisting solely of translations into that language, and computational linguists have become curious about the properties of this uncanny dialect.
A large proportion of translation-related research in computational linguistics focuses on training machines to translate. Once researchers (Kurokawa et al. 2009) figured out that the direction of a parallel corpus could be useful for MT (a corpus translated in the DE-EN direction for DE-EN translation, for example), attention returned to translationese and translation direction detection, a subject which had attained something of an academic “cult following”, mostly among translation studies scholars.
During my doctoral work I focused heavily on the stylistic properties of translations, and one of these properties concerned the source language traces found in texts. These traces are especially noticeable when one is familiar with the source language in question:
e.g. “This sentence like German reads”
The task of detecting the features which illustrate this was an interesting challenge for machine learning tools. A fine-grained problem such as this one locates itself within the realm of stylistic classification, alongside questions such as native language identification, personality detection, sentiment analysis, gender detection, temporal period classification and others.
As the existing literature focused on commonly used corpora such as EUROPARL (van Halteren 2008) and large multi-lingual newspaper collections (Koppel and Ordan 2011), we decided to examine literary text, an oft-neglected genre in traditional text analytics. The first port of call for literary texts is usually Project Gutenberg, although this usually limits one to texts from a specific historical period. In order to keep all confounding factors constant, the texts were all drawn from 19th century prose translations. The source languages examined were Russian, German, French and original English, and five novels were obtained to represent each set.
In order to negate any effect of individual translatorial or authorial style, no translator or author was repeated in the corpus. This work represented the beginning of my analysis of the dichotomy of textual features. N-grams, or sequences of characters, words and parts of speech, are a potent force in text analytics. On the other hand, a text is made up of more than discrete sequences of characters, and another set of features can capture larger deviations of style. This set of metrics includes readability scores, used to measure textual complexity, along with metrics such as type-token ratio and the ratio of various parts of speech to total words.
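To make the second feature family more concrete, here is a minimal sketch of two such document-level metrics: type-token ratio and the Automated Readability Index (ARI). The tokenizer and sentence splitter are simple regex stand-ins of my own choosing, not the tooling used in the study, and the part-of-speech ratios are left out because they need a tagger.

```python
import re

def tokenize(text):
    """Lowercased word tokens; a deliberately simple regex tokenizer (assumption)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())

def type_token_ratio(text):
    """Distinct word types divided by total word tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ari(text):
    """Automated Readability Index:
    4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Both functions work on raw text, so they can be applied per chunk and the results fed to a classifier alongside, or instead of, n-gram counts.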
With the corpus assembled, random chunks of text were extracted from each work. Each textual chunk consisted of two hundred kilobytes of text, and this chunk was further divided into five equal sections. Each source language was represented by five works, which are listed in the paper.
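The chunking step described above could be sketched roughly as follows. The function name, the reading of "two hundred kilobytes" as 200,000 bytes and the use of a seeded random generator are all assumptions made for illustration.

```python
import random

def random_chunk(text, chunk_bytes=200_000, n_sections=5, seed=None):
    """Pick one random contiguous chunk of `chunk_bytes` bytes from `text`
    and split it into `n_sections` equally sized sections."""
    data = text.encode("utf-8")
    if len(data) < chunk_bytes:
        raise ValueError("text is shorter than the requested chunk size")
    rng = random.Random(seed)
    start = rng.randint(0, len(data) - chunk_bytes)
    chunk = data[start:start + chunk_bytes]
    size = chunk_bytes // n_sections  # any remainder bytes are dropped
    # errors="ignore" discards multi-byte characters cut at a section boundary.
    return [chunk[i * size:(i + 1) * size].decode("utf-8", errors="ignore")
            for i in range(n_sections)]
```

With five works per language and five sections per work, this yields twenty-five sections per source language.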
Using the eighteen document metrics only, the system's accuracy was relatively low. Given that the classification problem was a four-way affair and all classes were balanced, an accuracy of 67% was obtained using a Support Vector Machine classifier and ten-fold cross-validation. By comparison, using word unigram features only, with feature selection performed within the cross-validation folds, the system reported accuracies of nearly 100 percent, dropping off at the top 100 word features.
The most highly ranked features in the word unigram set consisted, unsurprisingly, of content words. Words such as francs, paris, rue and monsieur characterized translations from French, with German texts talking of berlin, and characters named von <something>. The Russian translations were similarly marked with content words such as cossack and names such as Anton, Olenin and the like.
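Doing feature selection inside the folds matters because selecting on the full dataset lets the test texts influence which words are kept. The sketch below shows the idea in dependency-free Python. It swaps the Support Vector Machine for a simple nearest-centroid classifier and ranks words by a crude spread-of-document-frequency score, so it illustrates the procedure and does not reproduce the original setup. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def folds(n_items, n_folds):
    """Yield (train_idx, test_idx); item i is held out in fold i % n_folds."""
    for f in range(n_folds):
        yield ([i for i in range(n_items) if i % n_folds != f],
               [i for i in range(n_items) if i % n_folds == f])

def select_features(docs, labels, k):
    """Top-k words by spread of per-class document frequency (a stand-in score)."""
    class_sizes = Counter(labels)
    per_class = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_class[label].update(set(doc))
    vocab = set().union(*(c.keys() for c in per_class.values()))
    def spread(word):
        rates = [per_class[c][word] / class_sizes[c] for c in class_sizes]
        return max(rates) - min(rates)
    return sorted(vocab, key=lambda w: (-spread(w), w))[:k]

def vectorize(doc, features):
    """Relative frequency of each selected word in a tokenized document."""
    counts, total = Counter(doc), len(doc) or 1
    return [counts[f] / total for f in features]

def cross_validated_accuracy(docs, labels, k=100, n_folds=10):
    correct = 0
    for train, test in folds(len(docs), n_folds):
        tr_docs = [docs[i] for i in train]
        tr_labels = [labels[i] for i in train]
        # Selection sees the training fold only, never the held-out texts.
        feats = select_features(tr_docs, tr_labels, k)
        sums = defaultdict(lambda: [0.0] * len(feats))
        sizes = Counter(tr_labels)
        for doc, label in zip(tr_docs, tr_labels):
            sums[label] = [a + b for a, b in zip(sums[label], vectorize(doc, feats))]
        centroids = {c: [x / sizes[c] for x in s] for c, s in sums.items()}
        for i in test:
            v = vectorize(docs[i], feats)
            pred = min(centroids, key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
            correct += pred == labels[i]
    return correct / len(docs)
```

Here `docs` is a list of token lists and `labels` the matching source languages. The round-robin split is not stratified, so a real experiment would want the fold assignment to keep the classes balanced.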
To test the robustness of the classifier, the top 200 features in a mixed set of bigrams, POS bigrams and word unigrams were selected, and all nouns were removed. Using the remaining fifty features and a Simple Logistic Regression classification algorithm, the sparse feature set managed an accuracy result of 85.5%.
A little untoward
The prepositions toward and towards were found to be discriminatory for texts translated from German, and this was where a confounding factor in the data came into play. The translations of the German texts had all been published in the US, where the term toward was more prevalent than towards. This particular trait is therefore probably tied to the variety of English used by the translators, and is perhaps not associated with source language as strongly as other frequent word tokens.
A contraction in terms
Contractions were found to be discriminatory: translations from Russian had higher frequencies of the contractions it’s and that’s than translations from French, which reported higher frequencies of the non-contracted forms. In a contradictory fashion, translations from Russian reported a higher frequency of both I am and I’m than the other texts, while French and German reported higher frequencies of the non-contracted forms.
Exact reasons for this behaviour remain difficult to pinpoint. On the one hand, both French and German express the first person form of be as two words (ich bin and je suis), which could push the translated text towards the expanded form. On the other hand, a Russian-speaking conference attendee who saw the work at Coling did not think the Russian examples were related to any Russian-specific transfer, and the source of this phenomenon remains to be investigated.
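For readers who want to check contraction behaviour in their own corpora, a rough sketch of the counting might look like this. The list of pairs, the per-1000-token normalisation and the function name are assumptions chosen for illustration. Note also that it’s can stand for it has as well as it is, which this simple count ignores.

```python
import re

# Contracted form -> expanded form (illustrative subset, assumption).
PAIRS = {"it's": "it is", "that's": "that is", "i'm": "i am"}

def contraction_rates(text, per=1000):
    """Occurrences of each contracted and expanded form per `per` word tokens."""
    # Normalise case and typographic apostrophes before matching.
    norm = text.lower().replace("’", "'")
    n_tokens = len(re.findall(r"[a-z]+(?:'[a-z]+)?", norm))
    rates = {}
    for short, full in PAIRS.items():
        for form in (short, full):
            hits = len(re.findall(r"\b" + re.escape(form) + r"\b", norm))
            rates[form] = per * hits / n_tokens if n_tokens else 0.0
    return rates
```

Comparing these rates across the per-language subcorpora gives the kind of contrast described above.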
The adverbial conjunction
The translations from French reported a higher frequency of the POS n-gram RB-CC, that is, an adverb followed by a coordinating conjunction. The excerpts further below show occurrences of this n-gram extracted from the French translation corpus, based on the following simplified text search:
grep "ly and" FrenchCorpus.txt
We can even see that in a number of cases the coordinating conjunction joins pairs of adverbs (RB CC RB) together, which may represent a grammatical structure more common in the original French.
…No head was raised more proudly and more radiantly…
…an offer which she eagerly and gratefully accepted…
…unceremoniously and with no notice at all…
…But after this I mean to live simply and to spend nothing…
…I placed myself blindly and devotedly at your service…
…Outwardly and in the eyes of the world
…They had parted early and she was returning home…
…as the English law protects equally and sternly the religions of the Indian people…
…vain attempts of dress to augment it, was peculiarly and purely Grecian…
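The same surface heuristic can be tightened a little in Python to separate plain adverb-conjunction pairs from the adverb-conjunction-adverb pattern. This is only a sketch: it relies on the -ly suffix, not on real POS tags, so it misses adverbs such as well and wrongly accepts words such as family. The pattern and function names are made up for illustration.

```python
import re

# "-ly and" approximates RB-CC; a following "-ly" word approximates RB-CC-RB.
RB_CC = re.compile(r"\b(\w+ly) and\b", re.IGNORECASE)
RB_CC_RB = re.compile(r"\b(\w+ly) and (?:more )?(\w+ly)\b", re.IGNORECASE)

def adverb_conjunction_hits(text):
    """Return (adverbs preceding 'and', pairs of adverbs joined by 'and')."""
    return RB_CC.findall(text), RB_CC_RB.findall(text)
```

Running it over each subcorpus and dividing by the token count gives a quick sanity check on the tagger-based frequencies.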
Other source-language distinguishing frequent tokens included:
- Russian: anyone, though, suddenly, drink (allowing one to indulge a national stereotype 😉 )
- French: resumed, towards, thousand (apparently related to the denomination of francs)
- German: nodded
- English: presently, sense (common or otherwise)
Regarding the document metrics, the following were discriminant:
- Russian: Ratio of finite verbs (higher), ARI readability score (lower)
- French: Ratio of nouns (higher), ARI readability score (higher), ratio of conjunctions (higher)
- German: Ratio of prepositions (lower)
Testing the waters
The document-level trained model was tested on an unseen corpus of contemporary literary texts. The model managed only 43% accuracy, compared with the 67% cross-validation result, although this was still above the baseline of 25%. A training set comprising a larger and more diverse range of texts may result in a more robust classification model.
This study focused on detecting patterns in literary translated text indicative of source language. The study identified a number of such effects in terms of ratios of different part-of-speech combinations, frequencies of individual common words and frequencies of content words. A classifier was trained which performed well on the training set but exhibited a lower accuracy on an unseen test set of fresh literary translations.
Of course, this work is still at a relatively nascent stage. The coarse-grained nature of the corpus (five texts per language only) meant that any features learned could be heavily biased towards those particular texts, as seen in the evaluation results on the hold-out unseen set. A larger set of source texts would likely have resulted in a more generalizable model.
An expansion on this theme, currently under peer review, focuses on an expanded feature set and brings syntactic parse features into play on a larger set (8 languages, ca. 400 texts) of contemporaneous translations, which can hopefully capture a deeper sense of structural transfer from the source language.
Content words have been ignored in this study as it was believed that they tended to capture topical distinctions rather than stylistic idiosyncrasies. This distinction may be somewhat crude, as it can be difficult to separate topical norms, cultural norms and trends of linguistic transfer.
Obviously, mentions of Muscovites and cossacks might lead us towards Russian as a likely source language for a translation, but these features are not robust, as plenty of Anglophone authors (for example) may also write novels set in those regions. The same goes for the higher frequency of drink and snow. These can reflect socio-cultural norms, perhaps more frequent in texts from a particular literary tradition, but do they represent true source language effects?
Perhaps applications of topic modelling methods such as LDA (Latent Dirichlet Allocation) will shed light on topical clusters related to culture and tradition in literary corpora.
This experiment sought to shed light on source-language specific traits in translated text. A number of these traits were found, although cross-corpus testing indicates that some may be more specific to the particular literary works examined than to the source language.
- Replicability: A comparable experiment was carried out by Klaussner et al. (2014), who examined a completely distinct set of parallel literary translations. Some differences here were the use of POS trigram features, a number of which were found to be indicative of original English. Contractions were also identified as markers of source language, together with ratios of conjunctions, average word length and type-token ratio.
- Comparability: One interesting additional step taken by Klaussner et al. (2014), which was also done by Baroni and Bernardini (2006), was to present humans with the task of classifying whether a text was translated or original. Baroni and Bernardini (2006) found that the machine was generally more consistently accurate than the average human, although one of the ten human evaluators (an expert in translation studies) outperformed the machine. Klaussner et al. found that human evaluators performed the task of translation vs. original classification with ca. 70% accuracy on seventeen excerpts of translated and original text, with a Kappa score of 0.406, indicating moderate agreement.
- Transferability: A hot topic in the machine translation and computational linguistics community currently is the idea of quality estimation, roughly “Automatically detecting how bad a machine translation is and how much it might cost a human to fix it”. Approaches similar to those used here could be used to determine how similar a translation is stylistically to a corpus of translations from the same source language, with one train of thought imagining that the more stylistically similar to the source a translation is, the more post-editing it requires.
- Expandability: Another area of research seeking to identify textual interference features is the field of native language detection. This seeks to identify native language influences on an author’s L2, with various applications for second language learning and author profiling. An interesting experiment could compare non-native writing and translation from the same L1 to investigate similarities/differences related to transfer.
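As a footnote to the Comparability point, agreement between several human judges is often summarised with a kappa statistic. The source does not say which variant produced the 0.406 figure, so the sketch below uses Fleiss' kappa purely as one plausible choice for a multi-rater setting. The function name and input layout are my own.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings (at least two), and more than one category must be in use."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(counts[0]))]
    p_chance = sum(p * p for p in p_cat)
    return (p_observed - p_chance) / (1 - p_chance)
```

For a translated-versus-original task, each row would hold two counts per excerpt: how many judges said "translated" and how many said "original".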
Linguist list post about translationese
Baroni, M., & Bernardini, S. (2006). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing.
Kurokawa, D., Goutte, C., & Isabelle, P. (2009). Automatic detection of translated text and its impact on machine translation. In Proceedings of MT Summit XII.
Klaussner, C., Lynch, G., & Vogel, C. (2014). Following the trail of source languages in literary translations. In Research and Development in Intelligent Systems XXXI (pp. 69-84). Springer International Publishing.
Koppel, M., & Ordan, N. (2011, June). Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 1318-1326). Association for Computational Linguistics.
Lynch, G., & Vogel, C. (2012). Towards the automatic detection of the source language of a literary translation. In 24th International Conference on Computational Linguistics (p. 775).
van Halteren, H. (2008). Source language markers in EUROPARL translations. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 937-944). Association for Computational Linguistics.