The Spanish Ministry of Industry recently missed a blunder committed by an automated translation engine and published a press release on its website in which the name of one of its own, “Dolores del Campo”, appeared in English as the puzzling “it is pain of field.”
However amusing and ludicrous the translation may seem, it reveals more than a simple computational error. The absurd mistranslation resulted from the engine’s inability to recognize that Dolores del Campo was a proper noun. Ironically, Euronews, while reporting on the toe-curling mishap, also contained an odd translation error linked to the minister’s pronouns. When referring to Dolores del Campo, the article (shown below) stated that she “saw its name translated into the English version as ‘it is pain of field’”, where it should have read “her name”.
Handling proper nouns and correctly detecting gender are fundamental technological challenges for Machine Translation (MT). Typical MT systems focus on accuracy and fluency, and many algorithms can produce good translations of complex idioms and colloquial language. For example, Google Translate renders the Spanish “¿Qué pasa tío?” in English as “What’s up, man?” (although it does not manage the reverse). Where traditional MT systems tend to fall flat, however, is in dealing with content words that carry important information, such as names. Name mistranslation occurs when an algorithm comes across an unknown word, is trained on noisy parallel data, or mistakes a name such as Dolores del Campo for a common noun. In effect, translating proper names often requires different methods and approaches from those used for other types of words.
A first step towards solving this issue is to look at Named Entity Recognition (NER). NER is an information extraction technique that tags sequences of words in a text that refer to “real-world entities”, such as people, organizations or places. Automatically detecting and labeling entities can be useful for companies that generate large amounts of data, for instance news and publishing houses. Hierarchical categorization of news stories and smooth, recommendation-based content discovery can both be achieved through NER. Its ability to single out proper names can also be applied in the translation process.
By replacing a proper name with a temporary placeholder (a symbol that is later replaced by a value or string), an MT system can be trained to detect the placeholder and leave it untouched. During post-processing, the placeholder is then replaced by the original name. At PangeaMT, our technical research team is using Named Entity Recognition to detect figures and numerical expressions in Chinese and convert them to Arabic numerals. Our team also works on the anonymization of proper names and locations to comply with GDPR and client requirements, which can also be achieved through NER.
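The placeholder idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PangeaMT's implementation: the name list stands in for the output of a real NER model, `toy_translate` is a hypothetical word-by-word glossary standing in for an MT engine, and the `__NE0__` placeholder format is invented for the example.

```python
# Assumption: in a real pipeline these spans would come from an NER model.
KNOWN_NAMES = ("Dolores del Campo",)

# Assumption: a tiny Spanish-to-English glossary standing in for an MT engine.
GLOSSARY = {
    "dolores": "pains",
    "del": "of the",
    "campo": "field",
    "habló": "spoke",
    "ayer": "yesterday",
}


def toy_translate(text):
    """Translate word by word; unknown tokens (such as placeholders) pass through."""
    return " ".join(GLOSSARY.get(word.lower(), word) for word in text.split())


def mask_names(text, names=KNOWN_NAMES):
    """Replace each known name with a numbered placeholder and record the mapping."""
    mapping = {}
    for name in names:
        if name in text:
            placeholder = f"__NE{len(mapping)}__"
            mapping[placeholder] = name
            text = text.replace(name, placeholder)
    return text, mapping


def unmask_names(text, mapping):
    """Post-processing step: put the original names back."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text


def translate_with_placeholders(text):
    masked, mapping = mask_names(text)
    return unmask_names(toy_translate(masked), mapping)


source = "Dolores del Campo habló ayer"
print(toy_translate(source))                # pains of the field spoke yesterday
print(translate_with_placeholders(source))  # Dolores del Campo spoke yesterday
```

In this toy the placeholder survives only because the glossary does not know it. A real MT model would have to be trained or configured to copy such tokens through unchanged.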
However, Named Entity Recognition does not answer the question of gender detection in Machine Translation. At present, MT engines determine gender based on the context of a word in a sentence. This can be tricky, since the majority of current systems translate sentences in isolation, which means that important gender-related cues elsewhere in the text can be missed. What’s more, when unable to disambiguate a sentence, the MT system will default to the most likely output based on the data it is trained on.
This has prompted concerns that automated systems reflect certain asymmetries and prejudices in society. For example, when translating a neutral word such as “nurse” from English into a more strongly gender-inflected language such as Spanish, the automated output is more inclined to select the feminine “enfermera”. This is because the frequency of “enfermera” in the text corpora tends to exceed that of the masculine form “enfermero”.
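The defaulting behaviour described above can be illustrated with a deliberately simplified sketch. The corpus counts are invented for the example and do not come from any real corpus, and `translate_nurse` is a hypothetical helper. Real MT systems score whole sentences with a learned model instead of looking up a single word.

```python
from collections import Counter

# Assumption: made-up frequencies, used only to show the mechanism.
CORPUS_COUNTS = Counter({"enfermera": 900, "enfermero": 100})

MASCULINE_CUES = {"he", "him", "his"}
FEMININE_CUES = {"she", "her", "hers"}


def translate_nurse(sentence, counts=CORPUS_COUNTS):
    """Choose a Spanish form for 'nurse' using cues in this sentence only."""
    words = {word.strip(".,!?").lower() for word in sentence.split()}
    if words & MASCULINE_CUES:
        return "enfermero"
    if words & FEMININE_CUES:
        return "enfermera"
    # No cue in the sentence: fall back to the most frequent form in the data.
    return max(("enfermera", "enfermero"), key=lambda form: counts[form])


print(translate_nurse("The nurse said he was tired."))  # enfermero
print(translate_nurse("The nurse arrived."))            # enfermera
```

Because the second sentence is handled in isolation, a cue in a neighbouring sentence would never reach the function, and the frequency-based default wins.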
Another problem that reinforces this machine bias is that algorithms mostly default to masculine pronouns, as these are over-represented in the large text corpora they are trained on. A paper published on Cornell University’s arXiv repository took an array of job positions from the U.S. Bureau of Labor Statistics (BLS) and built a list of sentences of the form “He/She is an engineer” in languages with gender-neutral pronouns, such as Chinese and Hungarian. The authors then used the Google Translate API to translate the sentences into English. The results demonstrated a strong tendency towards male defaults, especially in fields with an unbalanced gender distribution, such as STEM jobs.
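The counting step of such a probe might look like the sketch below. It does not call the Google Translate API: the English outputs are hypothetical and hard-coded for illustration, and the function names are assumptions made for this example, not taken from the paper.

```python
from collections import Counter


def pronoun_gender(sentence):
    """Classify an English sentence by its first word: 'he', 'she', or anything else."""
    words = sentence.split()
    first = words[0].lower() if words else ""
    return {"he": "male", "she": "female"}.get(first, "neutral")


def tally_pronouns(translations):
    """Count how often translated outputs start with a male, female or neutral subject."""
    return Counter(pronoun_gender(sentence) for sentence in translations)


# Assumption: invented outputs standing in for real MT responses.
hypothetical_outputs = [
    "He is an engineer",
    "He is a doctor",
    "She is a nurse",
    "They are a teacher",
]

print(tally_pronouns(hypothetical_outputs))
# Counter({'male': 2, 'female': 1, 'neutral': 1})
```

Comparing the resulting proportions per occupation against labour statistics is what lets a study of this kind distinguish a male default from a mere reflection of the workforce.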
On November 27th, Reuters reported that Google had removed gender pronouns from its phrase suggestion feature for Gmail. Smart Compose, a Natural Language Generation (NLG) feature, studies patterns and relationships between words from emails and web pages, amongst other sources, in order to predict and construct a likely sentence suggestion. In January, a company scientist detected the bias when he typed “I am meeting an investor next week,” and Smart Compose suggested “Do you want to meet him?” instead of “her”. After past backlash over its other AI tools, notably Google Photos mistakenly classifying black people as gorillas, Google opted for a quick solution to Smart Compose’s apparent gender bias.
However, although it is a conservative approach, Google’s decision to simply remove gender pronouns from Smart Compose’s suggestions does not necessarily debias the algorithms its AI tools rely upon. Other tech giants, such as Microsoft, have also removed gendered pronouns from LinkedIn’s Smart Replies feature. Yet today’s machine learning systems are anything but autonomous. Rather, they are merely projections of the data that humans have fed them. Simply blocking pronouns to work around poor algorithmic judgments therefore does not resolve the larger problem, especially if AI systems are making decisions in domains such as employment, healthcare and the legal sector.
Technology is indeed a double-edged sword: there is a real risk of unconsciously increasing bias and discrimination through technology if we do not design algorithms with an awareness of gender and race to begin with. Innovating and retraining MT models on refreshed data sets is crucial to the development of the AI industry. The research and development of name-aware Machine Translation, such as the work at the City University of New York, is a step in the right direction, as are models like the one supported by the Swiss National Science Foundation (SNSF), in which the algorithm tracks information contained elsewhere in the text, such as gender cues.
However, unlike tech companies and providers of machine translation technologies, freelance translators and translation companies do have a golden ticket. A crucial step in avoiding perplexing errors such as “pain of field” is to ensure that a document goes through a post-editing and revision phase. Whether a text is translated automatically or by a human, quality control plays a fundamental role in any professional translation service. Take a look at our 12 tips for translators to provide quality translations.