Sunday, Feb 18, 2018 - 00:20 SGT
Posted By: Gilbert


Not sure why we still bother with the local CNY movies - blatant product placement aside, it's basically one public service announcement skit after another... with obligatory anthropomorphic God(dess) of Fortune (remember last year?). As our subreddit has it, the best bits came right at the end, and mainly because it was, well, finishing up.

Babel Nowadays

Tamil too often gets the short end of the stick here
(Though there's more than enough to go around)
(Source: todayonline.com)

I vaguely recall being fascinated by AltaVista Babel Fish a long time ago (it appears, sadly, to have completely vanished, a by-product of having the misfortune to have been palmed off on Yahoo!) [Edit Feb 20: never neglect the obvious: babelfish.com] - type in a phrase, and it'd spit a translation back out, just like that! Back-translation readily exposed its many limitations, but I was easily impressed back then.

The torch has - as with so many online services - been passed to the behemoth that is Google for some time now, and they have to their credit not been resting on their laurels. They've transitioned from more straightforward statistical methods* (shout-out to our old n-gram based input system here), to a deep neural network-based system. As a recent presentation shows, they've not been alone in pursuing this direction.

You'd not be wrong if you guessed what this entails - gobs of data - but there's admittedly technique further involved. Google's 2016 paper on their Neural Machine Translation (NMT) System (open-source version available) describes the usage of Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) with all the optional extras, in an encoder-decoder setup as with Deepfakes. This has lately been upgraded to support direct multilingual translation (i.e. without going through English, as is traditional), by simply including the target language as an input token, leaving all else untouched.
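To make the "target language as an input token" trick concrete, here is a minimal sketch of the preprocessing step only. The `<2xx>` token spelling, the function names and the whitespace tokenisation are all assumptions for illustration, not Google's actual pipeline; the point is simply that the model itself is untouched and only the input sequence changes.

```python
from typing import List, Tuple

# Hypothetical set of supported target languages (illustrative only).
SUPPORTED_TARGETS = {"en", "de", "ja", "zh"}


def add_target_token(source_tokens: List[str], target_lang: str) -> List[str]:
    """Prepend an artificial token naming the desired output language.

    The "<2xx>" spelling is an assumption made for this sketch.
    """
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target language: {target_lang}")
    return [f"<2{target_lang}>"] + list(source_tokens)


def build_training_pair(src: str, tgt: str, target_lang: str) -> Tuple[List[str], List[str]]:
    """Build one (input, output) example for a shared multilingual model.

    Whitespace splitting stands in for a real subword tokeniser.
    """
    return add_target_token(src.split(), target_lang), tgt.split()


if __name__ == "__main__":
    # Same German source, two different requested targets.
    print(add_target_token("Guten Morgen".split(), "en"))
    print(add_target_token("Guten Morgen".split(), "ja"))
```

Since the encoder and decoder are shared across every language pair, the same source sentence can in principle be steered to a different output language just by swapping that first token.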

[*N.B. A small diversion on the history of machine translation here. Statistical methods were, probably understandably, viewed with heavy suspicion when they first emerged in the Eighties; Mallaby's More Money Than God recounts the opposition faced by Brown and Mercer, when they first applied the Expectation-Maximization (EM) algo to the task. Jelinek, who later employed the duo at IBM, would counter with his famous retort that "every time I fire a linguist, my system's performance improves"; also, Brown & Mercer would later jump to Renaissance Technologies, where they made like a bazillion bucks, so I suppose they had the last laugh too.]

But a little pulling back of the curtain here. Although official announcements might give the vibe that machine translation is a solved problem, slightly more involved inspection reveals that NMT is not quite up to actual human translators yet... and by some distance. Douglas Hofstadter provides an analysis with just such a conclusion in The Atlantic [Hacker News commentary] which covers most of the bases. Borrowing his German example (translations only):

Hofstadter


After the defeat, many professors with Pan-Germanistic leanings, who by that time constituted the majority of the faculty, considered it pretty much their duty to protect the institutions of higher learning from "undesirables." The most likely to be dismissed were young scholars who had not yet earned the right to teach university classes. As for female scholars, well, they had no place in the system at all; nothing was clearer than that.

Google NMT

After the lost war, many German-National professors, meanwhile the majority in the faculty, saw themselves as their duty to keep the universities from the "odd"; Young scientists were most vulnerable before their habilitation. And scientists did not question anyway; There were few of them.

Obviously (and unlike the example kindly supplied in Google's research blog), the nuance is well off at a minimum, before going into actual errors in conveying meaning. Hofstadter explains the context behind choices such as "undesirables" rather than "odd", and "Pan-Germanistic" instead of "German-National", but the most interesting miss here was perhaps the feminine suffix "-in", which completely threw the final sentence off. Skipping to the Chinese example, which I can independently corroborate, the translation of "他仍兼管研究生" as "He still holds the post of graduate student." (output slightly altered from Hofstadter's version; seems like Google has been doing some updating) is indeed egregiously wrong - "He still supervises [his] graduate students" would be the right translation as Hofstadter notes, although it can be noted that changing a single character (to "他仍兼研究生") would bail NMT out.

Which returns us to the fundamental complaint about the ongoing A.I. boom - there's scant actual intelligence, as understood in the popular sense, involved. Some of the issues Hofstadter highlighted are perhaps to be expected, if the NMT implementation operates on a sentence level, as is suggested by the paper (in which evaluation was performed on isolated single sentences). As such, even the simplest agreement and reconciliation of terms between sentences would be absent! In this light, it is perhaps a wonder that paragraph-length texts are even comprehensible.
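A toy sketch of what sentence-level operation would imply, assuming (as the paper's evaluation suggests, but does not confirm for the production system) that each sentence is handled in isolation. The naive splitter and the `translate_sentence` callable are hypothetical stand-ins; no real translation API is involved.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_paragraph(text: str, translate_sentence: Callable[[str], str]) -> str:
    """Translate a paragraph one sentence at a time.

    translate_sentence is a hypothetical stand-in for the model; note that
    it only ever receives a single sentence, never its neighbours.
    """
    return " ".join(translate_sentence(s) for s in split_sentences(text))


if __name__ == "__main__":
    seen: List[str] = []

    def recording_stub(sentence: str) -> str:
        # Records exactly what the "model" gets to look at per call.
        seen.append(sentence)
        return sentence.upper()

    print(translate_paragraph("She is a scholar. They had no place.", recording_stub))
    print(seen)
```

Under this setup, nothing in the second call knows who "They" refers to, which is exactly the kind of cross-sentence reconciliation that would go missing.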

Despite there clearly being a lot left to be done, I do disagree with Hofstadter on one point. In his wrap-up, he hopes that true translation, artistic translation, by machines, will not be possible soon. But why? Should not the wisdom of the world be available to all, regardless of what tongue they were born with, and despite what gods might fear? Why should Westerners have to wait to read Jin Yong, for example, when Gu Long borrowed unstintingly from James Bond? There are at least two objections I can muster, the first of which is the impossibility of perfect translation - wordplay, for one, carries badly. And then there are the near-ineffables, like sonder...

The second would be the impact on minor languages. It is plausible that multilingualism would become rare, in a world where everyone has their own private translator - why agonize over learning Mandarin (as many local students may question), when one can just translate to it on demand? But then again, it's a trade-off really; if this opens communication between all and sundry - given how much woe lack of mutual intelligibility has historically caused - the loss could be worth it...

My personal suspicion here would be that of a future closing of the circle - while connectionism is currently king, the potential of re-integrating symbolic and domain knowledge has yet to be realized. This may perhaps become more apparent when the shortcomings of "just pump in more data!" become harder to brush aside as saturation approaches, since it's unlikely - to me at least - that current architectures are the be-all and end-all on this.

The Age-Old Debate

Guy pays, or split the bill? Our resident alpha male CS prof has the answer, which has as usual sparked lively discourse amongst local netizens (as with his thoughts on the labour situation, which seems to echo latest policy).

Anyway, my two satoshis, for the record:
  1. I do agree: guy pays. However, the bigger issue is probably whether this even becomes a point of contention. It's not a problem if the lady wants to pay her share, but it's a problem if such a minor thing manages to drive a wedge (then again, I may just be conditioned - my hamsters never pay)

  2. A related observation: while academia leans left in general, the hard sciences (and math) do seem to be more conservative than the humanities. The balance in the force has got to be maintained, I suppose


Copyright © 2006-2019 GLYS. All Rights Reserved.