Text was long considered relatively safe from adversarial attacks: a malicious agent can make minute adjustments to an image or a sound waveform, but it can’t alter a word by, say, 1%. Yet Prof. Alex Dimakis of Texas ECE and his collaborators have investigated a potential threat to text-comprehension AIs.
The research was led by UT student Qi Lei and collaborators at IBM Research and Amazon. The study was published in SysML 2019 and covered by Nature News.
Previous attacks have looked for synonyms of certain words that would leave the text’s meaning unchanged but could lead a deep-learning algorithm to, say, classify spam as safe, fake news as real or a negative review as positive.
Testing every synonym for every word would take forever, so the researchers designed an attack that first detects which words the text classifier is relying on most heavily when deciding whether something is malicious. It tries a few synonyms for the most crucial word, determines which one sways the filter’s judgement in the desired (malicious) direction, changes it and moves to the next most important word. The team also did the same for whole sentences.
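The word-level procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the keyword-weight classifier (`toy_spam_score`), the synonym table, the 0.5 decision threshold and the leave-one-out importance measure are all hypothetical stand-ins chosen to keep the example self-contained. Word importance is also computed once, before any substitutions are made.

```python
from typing import Callable, Dict, List

# Hypothetical toy classifier: a higher score means "more likely spam".
# The weights are invented purely for illustration.
WEIGHTS: Dict[str, float] = {
    "free": 0.4, "winner": 0.3, "prize": 0.15,
    "costless": 0.25, "complimentary": 0.1, "champion": 0.05,
}

def toy_spam_score(words: List[str]) -> float:
    """Sum the keyword weights, capped at 1.0."""
    return min(1.0, sum(WEIGHTS.get(w, 0.0) for w in words))

def word_importance(words: List[str],
                    score: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out proxy: how far the score drops when a word is removed."""
    base = score(words)
    return [base - score(words[:i] + words[i + 1:]) for i in range(len(words))]

def greedy_attack(words: List[str],
                  score: Callable[[List[str]], float],
                  synonyms: Dict[str, List[str]],
                  threshold: float = 0.5) -> List[str]:
    """Swap the most important words first until the score falls below threshold."""
    words = list(words)
    importance = word_importance(words, score)
    # Visit word positions from most to least important.
    order = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)
    for i in order:
        if score(words) < threshold:
            break  # the classifier has already been flipped
        candidates = synonyms.get(words[i], [])
        if not candidates:
            continue
        # Try each synonym and keep the one that lowers the score the most.
        best = min(candidates,
                   key=lambda c: score(words[:i] + [c] + words[i + 1:]))
        if score(words[:i] + [best] + words[i + 1:]) < score(words):
            words[i] = best
    return words

# Hypothetical synonym table and input text.
SYNONYMS = {
    "free": ["costless", "complimentary"],
    "winner": ["champion"],
    "prize": ["reward"],
}
original = ["free", "prize", "for", "every", "winner"]
adversarial = greedy_attack(original, toy_spam_score, SYNONYMS)
print(original, toy_spam_score(original))
print(adversarial, toy_spam_score(adversarial))
```

In this toy run the attack replaces only the two highest-weight words and leaves the rest of the text untouched, which mirrors the idea of flipping a decision while changing as few words as possible.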
A previous attack tested by other researchers reduced classifier accuracy from higher than 90% to 23% for news, 38% for e-mail and 29% for Yelp reviews. The latest algorithm reduced filter accuracy to 17%, 31% and 30%, respectively, for the three categories: lower for news and e-mail and roughly level for Yelp reviews, while replacing many fewer words. The words that filters rely on are not those humans might expect: you can flip their decisions by changing things such as ‘it is’ to ‘it’s’ and ‘those’ to ‘these’. “When we deploy these AIs and we have no idea what they’re really doing, I think it’s a little scary,” Dimakis says.
Making such tricks public is common practice, but it can also be controversial: in February 2019, research lab OpenAI in San Francisco, California, declined to release an algorithm that fabricates realistic articles, for fear it could be abused. But the authors of the SysML paper also show that their adversarial examples can be used as training data for text classifiers, to fortify the classifiers against future ploys. “By making our attack public,” Dimakis says, “we’re also making our defence public.”