Sayyy whatttt?: Researchers analyze strange human tweets to build better AI

Twitter is a weird place with a language all its own. A new study uses those tweets, chock-full of misspellings and slang, to improve artificial intelligence.

Human language includes many intricacies. The emphasis of an individual syllable, the peculiar lilt of a given word, the tone of voice: all of these provide unique clues to the intended meaning of a sentence and, at times, even the mood of the speaker. The plot thickens and the risk of indeterminacy only increases with written language, sans the sights and sounds of verbal communication. A new study analyzed a veritable treasure trove of written language, tweets, to give artificial intelligence (AI) better insight into mankind's mysterious linguistic ways. The study, titled "Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings," was published in PLOS ONE earlier this week.

Balance and stretched language

Overall, the researchers analyzed about one billion tweets sent out over an eight-year span. Within this mixed bag o' tweets, the team specifically looked at the stretchiness and balance of the language. What does that mean? "Heyyy" is a commonly stretched word with little balance, because all of the stretch lands on a single letter, whereas "hahahaha" spreads its stretch evenly, making it a balanced stretched word. This stretch and balance add intrinsic meaning and sentiment to the typed language.
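To make the idea concrete, here is a toy sketch of how one might measure "stretch" and "balance" for a single word. This is not the paper's actual two-parameter method (the study's measures also handle multi-letter repeated units like "haha", which this single-letter version misses); it is a simplified illustration, with balance computed as the normalized entropy of where the extra letters landed.

```python
import math
import re

def stretch_profile(word):
    """Toy measure: how stretched a word is, and how evenly the stretch
    is spread across its letters. A simplified sketch, NOT the study's
    actual metrics; it only detects single-letter runs."""
    # Split the word into runs of a repeated character,
    # e.g. "heyyy" -> [("h", 1), ("e", 1), ("y", 3)]
    runs = [(m.group(1), len(m.group(0))) for m in re.finditer(r"(.)\1*", word)]
    kernel = "".join(ch for ch, _ in runs)   # "heyyy" -> "hey"
    extras = [n - 1 for _, n in runs]        # extra repeats per run
    stretch = sum(extras)                    # total stretched characters
    if stretch == 0 or len(runs) < 2:
        return kernel, stretch, 0.0
    # Balance as normalized entropy of the stretch distribution:
    # 0.0 = all stretch on one letter ("heyyy"),
    # closer to 1.0 = stretch spread across letters ("yeeessss")
    probs = [e / stretch for e in extras if e > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    balance = entropy / math.log(len(runs))
    return kernel, stretch, balance

print(stretch_profile("heyyy"))     # ('hey', 2, 0.0) -- unbalanced
print(stretch_profile("yeeessss"))  # ('yes', 5, ~0.61) -- more balanced
```

Under this toy definition, "heyyy" scores zero balance while "yeeessss" scores around 0.6, matching the intuition the researchers describe.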

“Written communication has recently begun encoding new forms of expression, including the emotional emphasis delivered by stretching words out,” said Chris Danforth, professor of Mathematics & Statistics in the Vermont Complex Systems Center and member of the research team behind the study.

Tweaking the training

To compensate for the lack of accompanying cues in written language, humans use all of the arsenal available on our limited keyboards. We incorporate ALL CAPS to add volume to otherwise muted words. We pretend to keyboard smash to illustrate frustration, sending out a potpourri of nonsense into the digital void. We do… other things.

“With so much communication happening electronically these days, we’re all trying to find ways to convey emotion through text. Emojis are helping, but the visual effect of 30 consecutive vowels in a curse word turns a bland profanity into a form of art,” Danforth said.

Current AI systems are often trained on the unsullied language of textbooks and evening news bulletins. A peek behind the curtain, so to speak, at the unpolished verbiage of social media is doubly befogging for a machine learning system still green to human language. Tweets represent "non-fictional data" and provide a look into the unkempt underbelly of human linguistics.

"Many machine learning tools in Natural Language Processing are based on niche corpora like online movie reviews or news articles. You won't find 'ahhhhgggggggg!!!!' in the Wall Street Journal. AI trained on more realistic expression has a better shot at understanding intent," Danforth said.
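One common way practitioners handle forms like "ahhhhgggggggg!!!!" before feeding social-media text to an NLP model is to collapse long character runs to a canonical length, so all variants of a stretched word map to the same token. The sketch below is a generic preprocessing trick, not a method from the study (and it deliberately discards exactly the emphasis signal the researchers argue is meaningful), with a hypothetical helper name.

```python
import re

def normalize_stretch(text, max_run=2):
    """Collapse runs of more than max_run repeated characters, so that
    "ahhhhgggggggg!!!!" and "ahhgg!!" become the same string.
    Hypothetical helper illustrating a common social-media
    preprocessing step, not the study's approach."""
    # (.)\1{2,} matches any character repeated 3+ times in a row;
    # replace the run with exactly max_run copies of that character
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, text)

print(normalize_stretch("ahhhhgggggggg!!!!"))  # -> "ahhgg!!"
print(normalize_stretch("hello"))              # -> "hello" (unchanged)
```

The trade-off is plain: normalization makes vocabulary tractable, but throws away the stretch-and-balance information this study suggests carries real sentiment.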

Building smarter algorithms

A simple scroll through someone's text messages or Twitter feed quickly illustrates that the standard language of text and tweet varies considerably from, say, the front page of a newspaper. Increasing AI's familiarity with our everyday communication, and giving it a more comprehensive grasp of intent, is key to a host of applications moving forward.

“Dictation software, suggested completions, and autocorrect all rely on smart algorithms capable of predicting what characters to print next. They are generally quite bad, language is hard, but a proper taxonomy of realistic emotional expression will help,” Danforth said.

Perhaps, as AI gains a better understanding of our communication habits, it will spawn its own linguistic response, throwing an indeterminate wrench into the proverbial gears of machine learning.

Source: TechRepublic