Lá “Fail”-a Pádraig* – Paddy’s Day Postscript

On the challenges of spoken language identification in a live setting…

* Lá Fhéile Pádraig – St Patrick’s Day

The current President of Ireland (Uachtarán na hÉireann) is a poet, university lecturer and former senator from Galway named Michael D. Higgins.

On the Irish national holiday on March 17th, he delivered his annual address in Irish, the national and first official language of the Irish State.

This address was uploaded to YouTube, where the captions were generated by automatic speech recognition (ASR). YouTube has provided this service for a number of years now, and although the results are not always stellar, they are useful input for content-based video indexing, topic modelling and other natural language processing techniques where a word-for-word transcription is not required.

Of course, the YouTube subtitling ASR system used was an English language one, which resulted in subtitles such as:

s a Lolla pottery longer talk to me to Kayla

which should have been:

“Is é Lá ‘le Pádraig an lá go dtagaimid le chéile…”
St Patrick’s Day is the day that we come together.

And the rather more embarrassing:

cock merely she can’t play in his scaly

which actually reads:

“…imirceach míle sé chéad bliain ó shin…”
Immigrants one thousand six hundred years ago

More fun mis-transcriptions here, eerily reminiscent of a recent controversy in the Northern Irish Assembly.

Spoken language recognition is a trickier beast than its written-language counterpart, although the technology exists and is no doubt used for more nefarious purposes than this one. Whether it belongs in the pipeline for YouTube uploads is perhaps not a question for today’s post.

As far as I am aware, Google does not currently deploy an Irish-language ASR system at all, although their Google Translate offering for Irish is not bad (at least in the Irish-English direction), if maybe not fit for official purposes. Official documentation of the Irish State, and also a subset of EU documentation, must by law be translated into Irish.

Furthermore, with regard to Irish-language speech recognition in general, there does not appear to be any great interest in academic circles either, as alluded to by Judge et al. (2012), who note:

“In the area of automated speech recognition there has been no development to date for Irish, however many of the resources which have been developed for synthesis are also crucial for speech recognition and to this extent the foundations for this aspect of technological development are being laid”

As the language is currently spoken on a daily basis by only an ever-dwindling number of people, the demand for such a system could be rather low. Although there are plenty of Irish-language recordings in the wild which could be used to build one, the differences in dialect and pronunciation between the regions where Irish is still spoken could make training an ASR system difficult. Nevertheless, digitisation and the creation of such systems could be a very valid step in the preservation of the language. The Phonetics and Speech Laboratory at TCD also has extensive experience in Irish language technology systems, although synthesis is the prevailing modality there also.

So cad a cheapann tú (what do you reckon) Google? Maybe by next Paddy’s Day, you could train up a small Irish-language speech recognition system and give Michael D. the subtitling he deserves!

Either that or upload a manual translation in advance, but sure where’s the fun in that.

Is maith an scéalaí an aimsir.
(lit) Time is a good storyteller.
Time will tell.

References

John Judge, Ailbhe Ní Chasaide, Rose Ní Dhubhda, Kevin P. Scannell, and Elaine Uí Dhonnchadha. An Ghaeilge sa Ré Dhigiteach – The Irish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Springer, Heidelberg, New York, Dordrecht, London, September 2012. Georg Rehm and Hans Uszkoreit (series editors)


Sorry for your Troubles : Hiberno-English and a history of euphemism.

In which the author gives out* about the Hiberno-English tendency towards euphemism

*tabhairt amach
(lit) to give out
to complain about something

Níl aon tinteán mar do thinteán féin.
(lit) There is no fireside like your own fireside
There’s no place like home
Irish proverb

A topical post for the day that’s in it (An lá atá inniu ann – given the special occasion of this day, being St Patrick’s Day).

I was reminded recently about the Irish skill at avoiding the more delicate subjects of conversation, despite our being the supposed keepers of the gift of the gab.

Take two of the more turbulent periods in 20th century Irish (and world) history: the Second World War and the conflict in Northern Ireland.

In Hiberno-English and historically in Ireland, these are referred to as The Emergency and The Troubles respectively. The latter in particular sounds to me like a very understated description of a 30-year-long armed conflict claiming over three thousand lives, with the former evoking a much less serious event than the deadliest conflict in the history of humanity.

Of course, Ireland’s neutrality and the ever-present loss in translation between the Irish language and English may have influenced the nomenclature in the case of The Emergency at least, although other neutral countries did not skirt the issue in their native language. Neutral Switzerland referred to the period as Grenzbesetzung (Border Occupation) 1939–45, although their closer proximity to matters at hand may have played a role in this description.

Jennifer O’Connell writes in the Irish Times (paywall) about the supposed Irish humility and love of the word sorry, remarking that:

That’s because when an American – or an English person or an Australian – says “sorry”, they usually mean they are sorry. Out of the mouth of an Irish person, ‘sorry’ can mean anything from “get out of my way” to “I didn’t hear you” to “I’m sad for you”. It is rarely used to denote an actual apology, probably because we are seldom wrong.

At the same time, the long-windedness of Irish English can appear deferential, along with its habit of leading in the negative, as in:

You wouldn’t fancy a pint now, would ya?

You wouldn’t be wanting to head out out, would ya?

The Irish language’s lack of words for yes and no gave Hiberno-English its delightful ’tis as an affirmation, and a conspicuous absence of the word yes, allowing Irish politicians to skirt delicate issues with ease.

Another unfriendly Hiberno-English euphemism that has risen in recent years is the teeth-gritting epithet non-national, a softened foreigner who has probably been resident in the Auld Sod for more years than he or she cares to remember and could even hold citizenship of the Emerald Isle but happens to have a place of birth which is not the Land of Saints and Scholars.

Indeed, the term has been ridiculed in both traditional and online media, where it has been read as denoting a form of statelessness akin to that of Tom Hanks’ character in The Terminal. That was of course not the intended meaning; the term is an abbreviation of the Irish Immigration Bureau’s officialese non-EEA-national.

At risk of sounding like a typical Irish begrudger, I’d like to say my piece by reminding readers that, despite its fondness for euphemism and other foibles, Hiberno-English has given the world the judged-Nobel-worthy poetry and prose of Yeats, Heaney, Shaw and Beckett, the equally-Nobel-worthy-IMHO writings of Joyce, Synge and Swift, the words smithereens and phoney and, through the Anglicisation of my own Gaelic family name O’Loinsigh, a handy term for mob murder found in many languages worldwide!

Happy Paddy’s Day everyone!

Do Androids Dream of Eclectic Tweets* : A Brief Report on the PROSECCO Code Camp on Computational Creativity

On how we got computationally creative in Coimbra…

Nenhuma ideia brilhante consegue entrar em circulação se não agregando a si qualquer elemento de estupidez.

No intelligent idea can gain general acceptance unless some stupidity is mixed in with it.

Fernando Pessoa (1888-1935)

*Nod to Tony Veale’s RobotComix

In January of this year I made my (long overdue) first ever trip to the beautiful land of Portugal, for the inaugural Code Camp on Computational Creativity which was generously sponsored by the EU PROSECCO COST Action.
The camp took place in the picturesque locale of Coimbra, the Oxbridge of Portugal and home to one of the most venerable universities in the world (established in 1290).
In this lush setting of the Cognitive and Media Systems Group of the Department of Informatics Engineering, Faculty of Sciences and Technologies, fuelled by an abundance of midday wine and pastéis de nata provided by the ever generous Prof. Amílcar Cardoso and his local team, we set about the task of creating creative Twitterbots, intelligent machines displaying a flair for the linguistic.


A wide range of interesting bots were brought to life with the help of computational creativity experts including, among others, Dr. Tony Veale of UCD, Dr. Simon Colton, Games By Angelina founder Michael Cook of Imperial and Goldsmiths, the Right Honourable Dr. Graeme Ritchie from Aberdeen and an expansive cast of mentors named after Tintin characters such as Captain Haddock, Nestor, and Snowy.

Among the camp attendees were a smattering of computational linguists, some computational creativity researchers, a handful of games designers and even a few visual artists to complete the rich milieu of motivated disciples of the cult of computational creativity. Many of those in attendance (myself included) appeared to have come to the field through a side-project or after-hours guilty pleasure in tandem with their fulltime gig, which made for a very pleasant camp atmosphere.

The bots ranged from an automated riddle generator, a rap-battle bot, a computationally creative movie article tagline generator, a “call and response” conversation bot, a “Cards Against Humanity” interactive bot to a sad-sack distressed self-pitying bot (our team’s contribution) (Code camp Twitter list here).
One unique element of the camp proceedings was the focus around a shared knowledge base, the so-called NOC list provided by Tony Veale. This was an exhaustively hand-collated list of properties and relationships for a number of personages, both fictional and non-fictional.
This resource enabled the bots to demonstrate human-like metaphorical capabilities.

Some reflections

In addition to being my first time in Portugal, this was also my first hackathon, and I very much underestimated the amount of effort required to get a creative Twitterbot off the ground in a few days. The team behind the Movazine bot had a particularly challenging task at hand, bringing together a number of rather complex NLP frameworks to come up with their witty analogies. The Pattern library from Tom De Smedt helped immensely, however, as did the pizza and cake infusion provided by the local organisers.

A possible next step in the Twitterbot hierarchy of evolution, a development from mere generation (v1.0) and our next-generation bots who use knowledge bases (v2.0) would be bots which can skilfully combine dynamic information from Twitter (similar to the Convo bot pair) and other social streams with user-curated information such as the NOC list or perhaps dynamic but structured data sources such as Freebase or ConceptNet, inserting themselves into live conversations with wild abandon. Feedback from participants in the camp indicated that bots incorporating other modalities such as image and video would be of great interest for a future event in this space.

Spare ideas

During the brainstorming process, I came up with some bot ideas which in the end were not developed. Hopefully someday I’ll get to create them, but if not, I’d be happy if anyone wanted to take them on board.

ObitUBot

A bot that tweets short Twitter obituaries for inanimate objects such as the floppy disk, or short-lived former countries/empires, e.g.

In Memoriam : The floppy disk, 1960-2010, purveyor of bits, facilitator of data transfer, not actually as floppy as the name suggests…

This could be based on a number of Wikipedia Category pages such as:

http://en.wikipedia.org/wiki/Category:Legacy_hardware
http://en.wikipedia.org/wiki/Category:Former_countries_in_Europe
http://en.wikipedia.org/wiki/Category:History_of_telecommunications

and some interesting textual templates mined from real world obituaries, topped off with some NLP wizardry.
The bot could tweet a link to the Wikipedia article referenced within each tweet, raising awareness of arcane European principalities while providing humorous insights into same.
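
As a rough illustration of the template idea, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the record is entered by hand rather than mined from Wikipedia, the two templates are invented rather than taken from real obituaries, and the length check simply counts characters against the 140-character limit that applied at the time of the camp (ignoring Twitter's link shortening).

```python
import random

# Hypothetical, hand-entered record; a real bot might mine entries like this
# from the Wikipedia category pages listed above.
SUBJECTS = [
    {
        "name": "The floppy disk",
        "born": 1960,
        "died": 2010,
        "roles": ["purveyor of bits", "facilitator of data transfer"],
        "url": "http://en.wikipedia.org/wiki/Floppy_disk",
    },
]

# Invented templates standing in for patterns mined from real obituaries.
TEMPLATES = [
    "In Memoriam: {name}, {born}-{died}, {roles}. {url}",
    "RIP {name} ({born}-{died}): {roles}. {url}",
]

MAX_TWEET_LENGTH = 140  # simplification: plain character count


def compose_obituary(subject, rng=random):
    """Fill a randomly chosen template with the subject's details."""
    template = rng.choice(TEMPLATES)
    tweet = template.format(
        name=subject["name"],
        born=subject["born"],
        died=subject["died"],
        roles=", ".join(subject["roles"]),
        url=subject["url"],
    )
    if len(tweet) > MAX_TWEET_LENGTH:
        raise ValueError("obituary is too long to tweet")
    return tweet


print(compose_obituary(SUBJECTS[0]))
```

The NLP wizardry would go into choosing the roles and the templates; the plumbing itself is little more than string formatting.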

Talking Bot My Generation:

This bot is one which I am very much hoping to put together at some point, inspired by the current metaphorical comparison iteration of Tony Veale’s Metaphor Magnet Bot. It can be loosely conceived as something along the lines of an automated literary or film critic.
Preloaded with a hand-curated list of 20th century cultural luminaries (and not-so-luminaries) separated by decade and category (film, literature, popular/highbrow, actor, musician), this bot would tweet pop-literary-critical musings such as:

If #DonDeLillo is like the #JamesJoyce for the #BabyBoomers, then what is his A Portrait of the Artist as a Young Man?

#MickeyRourke; The Yuppie #BradPitt or more of a Reagan-era #ChristianBale?

Is #TheSunAlsoRises #TheBonfireoftheVanities for the #GreatestGeneration, or more of a Thirties #LessThanZero?
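
A minimal sketch of how such musings might be generated, assuming a tiny hand-curated list. The `Figure` record, the decade assigned to each name (the decade of a well-known work or of peak prominence) and the single question template are all illustrative choices of mine, not part of any existing bot.

```python
import random
from collections import namedtuple

# Hypothetical record type: hashtag-ready name, category, decade of prominence.
Figure = namedtuple("Figure", ["name", "category", "decade"])

# Tiny illustrative list; a real bot would need a far larger curated one.
FIGURES = [
    Figure("JamesJoyce", "literature", 1920),
    Figure("DonDeLillo", "literature", 1980),
    Figure("MickeyRourke", "actor", 1980),
    Figure("BradPitt", "actor", 1990),
    Figure("ChristianBale", "actor", 2000),
]


def candidate_pairs(figures):
    """All ordered pairs that share a category but come from different decades."""
    return [
        (a, b)
        for a in figures
        for b in figures
        if a.category == b.category and a.decade != b.decade
    ]


def compose_musing(figures, rng=random):
    """Ask whether one figure is another decade's version of a peer."""
    a, b = rng.choice(candidate_pairs(figures))
    return "Is #{} the #{} of the {}s?".format(a.name, b.name, a.decade)


print(compose_musing(FIGURES))
```

Restricting pairs to a shared category keeps the comparisons at least superficially plausible; the wit would have to come from richer knowledge about each figure.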

Further resources:

Tutorial by Tony Veale on Creative Twitterbots

http://www.slideshare.net/kimveale/tutorial-on-creative-twitterbots

Mike Cook’s musings on Twitterbots and cultural taboos

http://www.gamesbyangelina.org/2014/09/appreciating-bots/

Guardian article on computational creativity, which gives a shout out to MetaphorMagnet, comparing one of its tweets to early Emily Dickinson poetry.
http://www.theguardian.com/books/2014/nov/11/can-computers-write-fiction-artificial-intelligence

Podcast interview with a veteran of creative Twitterbots, tinysubversions’ Darius Kazemi

http://www.theguardian.com/technology/audio/2014/dec/17/darius-kazemi-bot-tech-weekly-podcast

The Lines of Others: Stylometry on film.

In which the author analyses a question of authorship in the movies…

I remember watching the excellent 2006 Oscar-winning German-language film Das Leben Der Anderen (The Lives of Others) some time after release on DVD, and couldn’t help thinking of an application of textual stylometry. A crucial early scene (spoiler alert!) shows the East German Secret Police’s finest minds attempting to identify a mystery journalistic voice publishing polemics detailing the high suicide rate under the DDR regime in Der Spiegel, at the time a (West) German current affairs publication.

Having obtained copies of the typewritten drafts of the articles in question, they hit upon the bright idea of analysing the stylistic fingerprints of the typewriter used by our mystery Robespierre of the Karl Marx Allee, as all typewriters in the DDR had to be registered upon purchase. Upon narrowing it down to two particular models, the Stasi are suddenly stumped as they cannot find a record of these models on file, and there the trail suddenly goes eiskalt (ice cold).

At this stage in the film they resort to their good old-fashioned “soft-assets”: their manned listening posts, who are busy keeping tabs on all and sundry. It is probably fortunate, however, that they didn’t have a little more expertise in the science of stylometry, or Florian Henckel von Donnersmarck’s feature debut could have taken a rather different direction after the second act.

Had they been aware of the body of research on the stylistic fingerprint of an author’s text, the typewriter studies could have been shelved and the focus instead could have been placed on the article text itself, comparing the function word frequencies, punctuation and prose forms therein to profiles of any prominent possible subversives of the day, including Sebastian Koch’s agitator/dramatist Georg Dreyman.

Authorship : A potted history

The question of textual authorship has interested scholars since the Classical Era, and this rich tradition has informed modern stylometric methods to a significant extent. Authorship attribution seeks to identify the authors of unknown texts, usually in comparison with known attributions, and many successful methods focus on stylistic aspects of textual production such as vocabulary choices, punctuation and textual complexity.

The first modern mechanized application of literary stylometry was the Mosteller and Wallace attribution of the Federalist Papers in the 1960s, an authorship attribution whodunit with roots in the nascent United States of America.

The 85 papers, the majority of which have uncontested authorship, were written to argue for the ratification of the US Constitution. However, in the case of a number of the papers, the true authorship had been lost in the mists of time.
Mosteller and Wallace’s analysis identified trends in the function word distribution of a number of the anonymous papers which matched those already attributed to James Madison.
A selection of the function words examined is shown below. Interestingly, these are the words which would normally be discarded when doing topic-based text classification.

a, do, is, or, this, all, down, it, our, to, also, even, its, shall, up,
an, every, may, should, upon, and, for, more,
so, was, any, from, must, some, were, are, had
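
To make the idea concrete, here is a minimal sketch of function-word profiling in Python. It is not Mosteller and Wallace’s Bayesian method: it simply computes the rate per 1,000 tokens of a handful of the words above and picks the candidate whose profile is closest by mean absolute difference. The function names and the toy snippets are made up for the example.

```python
import re
from collections import Counter

# A subset of the function words listed above.
FUNCTION_WORDS = ["a", "all", "also", "an", "and", "any", "do", "even",
                  "every", "for", "from", "may", "must", "shall", "should",
                  "some", "to", "upon", "was", "were"]


def profile(text):
    """Rate per 1,000 tokens of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [1000.0 * counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Mean absolute difference between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def attribute(disputed, candidates):
    """Name of the candidate whose known text is closest to the disputed one."""
    target = profile(disputed)
    return min(candidates,
               key=lambda name: distance(target, profile(candidates[name])))


# Toy, made-up snippets standing in for texts of known authorship.
known = {
    "author_a": "Upon the whole, it may be said upon reflection that every plan must serve all.",
    "author_b": "On the whole, it should also be said that some plans were to serve any and all.",
}
disputed = "It should also be said that some were to do so."
print(attribute(disputed, known))
```

With texts this short the result means nothing, of course; the approach only becomes informative with thousands of words per author.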

This study has been replicated a number of times since the original in 1963, with work by Fung (2003) verifying the Madison attribution using a support vector machine classifier and the same set of 70 function words.

On a similar topic, a controversial study by Smith (2008) investigated the authorship of the American Declaration of Independence and postulated that Thomas Paine, of Rights of Man fame, may have had a more significant hand in the drafting of the document. According to the (British) author, this view ruffled many feathers in the US academic establishment.

The good gentlemen of the Staatssicherheit need not have worried about the applicability of these methods to other languages either: recent work by Eder (2011) investigates the efficacy of authorship attribution features across a number of languages, including German.

Und so endet die Geschichte.
(And so the story ends.)

References:

Eder, M. (2011). Style markers in authorship attribution. Studies in Polish Linguistics, 6, 99-114.

http://www.wuj.pl/UserFiles/File/SPL%206/6-SPL-Vol-6.pdf

Smith, P. W., & Rickards, D. A. (2008). The authorship of the American Declaration of Independence. In AISB 2008 Convention Communication, Interaction and Social Intelligence (Vol. 1, p. 19).

Fung, G. (2003). The disputed Federalist Papers: SVM feature selection via concave minimization. In Proceedings of the 2003 Conference on Diversity in Computing (pp. 42-46). ACM.

Mosteller, F., & Wallace, D. L. (1963). Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58(302), 275-309.

Faraway – So Close

On distant reading, millions of books and verifying what we know (and don’t know) in computational stylometry.

Wenn du eine weise Antwort verlangst, musst du vernünftig fragen.
If you crave wise answers, then you must ask reasonable questions
Johann Wolfgang von Goethe

I attended a very interesting talk by Prof. Gerhard Lauer, Chair of Germanistik at Göttingen University at the TCD Long Room Hub Institute in Dublin on Wednesday entitled:

“Reading with Machines : Towards computational literary criticism”

In his talk, he delivered a fascinating treatise on computers as a research tool in the humanities and indeed research tools in general throughout history, from Leeuwenhoek’s portable microscope and Blumenbach’s anthropological taxonomy via TEI, stylometry and a wide range of related digital humanities topics and case studies.
The large number of humanists (mostly German literary scholars) in attendance appeared particularly interested in Prof. Lauer’s own work on classifying the Germanic literary canon using stylometry.

An excellent question was raised by Prof. Jürgen Barkhoff, fellow Germanist and Director of the Long Room Hub at TCD, paraphrased by me as:

The computer can tell us what we already know, vis-a-vis the canon of Kleist and why Kaethchen is different to his other works, but when analyzing millions of books, how can we possibly verify computational analyses on that scale?

This question from an eminent humanist raises a valid point. So far, digital stylometry has sought to verify facts that we already know about literary works: that these texts were all written by the same author or by authors of the same gender, or that they belong to the Enlightenment or Realism movements. But when we are engaging with thousands and indeed millions of texts, the whole literary canon of a culture for example, how can we even begin to determine whether the computational analysis holds water?

I was reminded, both implicitly and explicitly in the talk, of Prof. Matthew Jockers, whom I had seen speaking in the same venue a number of years ago, on comparing the themes of the literary canon of Ireland with the canon of North American and British literature. These large-scale stylometric analyses attempt to synthesise hundreds of years of literary studies into a single dendrogram or cluster graph, and can still be verified by consensus in the literature. But how would one verify a comparison of literary canons across multiple cultures over multiple centuries?

During my brief but very pleasant chat with Prof. Lauer afterwards, he was delighted to inform me upon discovering my own academic lineage that it was in fact the computer scientists at Göttingen who were making a big push towards digital humanities collaborative research, rather than the humanities department, who in Göttingen at least were rather more traditional in their outlook.

It was very refreshing to see so many interested and engaged humanists, both digital and traditional, at such an event, and this bodes well for the future of digital humanities scholarship here in Ireland.