by Thomas Bosshard
The next frontier of artificial intelligence is natural language processing. A lot of effort is now going into improving how machines understand human writing and speech. Anyone who has ever tried Google Translate knows how difficult processing human language can be. Siri fails are still common and a frequent source of online ridicule. Some of the machine-generated responses can be rather absurd. For example, in response to the question “I have a gambling problem. Who should I call?”, Siri lists the telephone numbers of casinos in the area, and in response to concerns about having an alcohol problem it lists the numbers of liquor shops. However, if the absurdity of the response can somehow be fed back into the system, at some point down the line it may learn what the intended meaning of the request is.
In school, a popular stereotype to explain differences in talent was always: you’re either good at the natural sciences or at languages. Rarely was anyone talented in both. However, programming, a discipline of computer science, is not very different from using a language. Like the native language that we speak and write in every day, programming languages also consist of a set of rules for combining symbols. In speech and writing, the symbols are the words. Grammar is the set of rules we apply to string words together to form sentences. Native speakers are usually not aware they’re applying rules because they have internalized them. When we try our hand at a foreign language, we recall the rules explicitly from memory and apply them in a more conscious manner.
Linguists call the rules to form sentences syntax. In programming, syntax serves the same function. A program’s source code must have a correct syntax for it to work. Lots of coding terminology is shared with or borrowed from formal linguistics. Take for example semantics, the study of meaning in sentences and phrases. Because sentences that are grammatically correct (or “syntactically well formed”) can be nonsense, such as Noam Chomsky’s famous example “Colorless green ideas sleep furiously” (1957), knowing the rules or having internalized them like most native speakers of a language is not enough. To communicate as well as understand a message, we also need to know what the words mean. And this is what makes natural language processing difficult. Words can have lots of different meanings. The word “swing”, for example, can be a noun or a verb. It can have a neutral connotation, as in a swing in a playground or the swing style of jazz music. Or it can have a more sinister or shady connotation, as in having a swing at someone. Considering that each word in the English language has about 2-3 synonyms on average, interpreting meaning is unfortunately not as easy as it seems.
Therefore, meaning heavily depends on context. The phrase “avoid biting dogs” is ambiguous. To interpret it correctly, we need to consider the phrases that came before or after it. If these were about mean dogs, then we can be fairly sure about how the phrase was meant. However, they could have also been about mean people, which changes the meaning completely. Linguists call this discourse analysis: We need to look at what comes before and after a sentence to be able to understand it. In other words, we need to apply a more data-driven approach when it comes to semantics.
Natural language processing also applies discourse analysis. It also assumes that sentences are related to each other and identifies hierarchical relations between them. However, artificially intelligent devices like Siri do not have much discourse to analyze. Users only feed them utterances of a few words that can be easily misunderstood. In other words, there is practically no context from which they can gather how a message was meant. In cases of ambiguity or multiple possible meanings, the response would likely be wrong and need to be corrected. For example, in response to the request “I want to swing”, a device could locate playgrounds, trees, swinging clubs or dancehalls in the area. However, some AI tools already have the “intelligence” to ask a follow-up question about how something was meant, thus establishing the context needed to provide a useful response. By engaging a user in short discourse, unintended meanings can be ruled out.
The age and location of the user can be further contextual hints as to which meanings are likely. The same applies to previous history. If the user in the above Siri fail had been a frequent visitor of online casinos, it might have been easier to “guess” the meaning with this contextual information.
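The interplay between contextual hints and follow-up questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sense list for “swing”, the context tags and the `interpret` function are all hypothetical and are not taken from Siri or any real assistant. The sketch scores each possible meaning by how many context tags support it, and asks the user whenever the context does not single out one meaning.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    label: str                               # one possible meaning of the request
    hints: set = field(default_factory=set)  # context tags that support this meaning


# Hypothetical sense inventory for the request "I want to swing".
SWING_SENSES = [
    Sense("playground", {"child", "park", "family"}),
    Sense("dance hall", {"jazz", "music", "dance"}),
    Sense("swinging club", {"adult", "nightlife"}),
]


def interpret(senses, context_tags):
    """Return ("answer", label) if the context picks out one sense,
    otherwise ("ask", question) to request clarification from the user."""
    # Score each sense by the number of context tags it shares with the user.
    scores = {s.label: len(s.hints & context_tags) for s in senses}
    best = max(scores.values())
    candidates = [label for label, score in scores.items() if score == best]

    # No supporting context, or a tie: ask instead of guessing.
    if best == 0 or len(candidates) > 1:
        return ("ask", "Did you mean: " + ", ".join(candidates) + "?")
    return ("answer", candidates[0])


if __name__ == "__main__":
    print(interpret(SWING_SENSES, set()))              # no context available
    print(interpret(SWING_SENSES, {"jazz", "music"}))  # history suggests dancing
```

With an empty context the sketch asks which meaning was intended. With tags such as "jazz" and "music", which could stand in for previous queries, it settles on the dance hall without asking.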
The intelligence of a system therefore highly depends on the information that is fed into it. Common contextual signals that are used as input for automated responses are previous queries and demographic or geographic information. However, not everyone is willing to share this information with companies that use AI technology. Many people do not like being tracked or wish to retain their privacy rather than have machines “see through” them – a concern that is often mentioned as being a scary prospect for the future.
As with any system, the output can only be as good as the input. And the more input, the more accurate the output. Therefore, any successful attempt to deliver personalized responses requires a certain amount of input, or context information. Many applications, such as Interactive Advisor, ask a few questions about a person or group of persons to establish a profile. The profile is then matched with an offering.
Because of all the opinions voiced on social media, “opinion mining” is a popular application of natural language processing that is gaining a lot of traction. It rests on the premise that brands just have to dig into enough data to find out what the public thinks of their products. This is called sentiment analysis.
Sentiment analysis has gone beyond like/dislike categorizations and can come up with quite differentiated views on things or people. But here is also where the limits of natural language processing lie. For example, Twitter is full of sarcastic statements, and so are product reviews on Amazon. Take a statement like: “I love it when it snows in April”. How can a machine know whether the user meant this literally or sarcastically? In a face-to-face conversation, we have a lot of context to draw from, not to mention the non-verbal cues in a person’s voice or facial expression. So unless an algorithm has the benefit of an added emoji or #sarcasm, it is faced with a difficult choice.
Not surprisingly, computational linguists have come up with various models of sarcasm that are supposed to help algorithms interpret sarcasm properly, e.g. by juxtaposing a positive statement (“I love”) with an undesirable state (“snow in April”) or activity (“I love shoveling snow”). The only challenge is to identify what an “undesirable state” is. This is where knowledge databases come in. Whereas the human brain “knows” from previous experience that snow in April is an unexpected event and thus undesirable, an algorithm must find this information in a so-called “world knowledge” database (read our blogpost on world knowledge databases).
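The juxtaposition idea can be illustrated with a toy Python sketch. It is not one of the published sarcasm models: the word list `POSITIVE_CUES`, the phrase list `UNDESIRABLE` (a tiny hypothetical stand-in for a world knowledge database) and the function `looks_sarcastic` are assumptions made for this example. The sketch flags a statement when a positive cue appears together with a phrase listed as undesirable, or when the writer added an explicit #sarcasm marker.

```python
import re

# Assumed positive cue words; a real system would use a sentiment lexicon.
POSITIVE_CUES = {"love", "adore", "enjoy"}

# Hypothetical stand-in for a "world knowledge" database:
# states or activities that most people consider undesirable.
UNDESIRABLE = ["snows in april", "snow in april", "shoveling snow"]


def looks_sarcastic(text):
    """Flag text that pairs a positive cue with an undesirable state,
    or that carries an explicit #sarcasm marker."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z']+", lowered))

    has_positive = bool(words & POSITIVE_CUES)
    has_undesirable = any(phrase in lowered for phrase in UNDESIRABLE)
    has_marker = "#sarcasm" in lowered

    return has_marker or (has_positive and has_undesirable)


if __name__ == "__main__":
    for sentence in [
        "I love it when it snows in April",
        "I love shoveling snow",
        "I love sunny days in April",
    ]:
        print(sentence, "->", looks_sarcastic(sentence))
```

The sketch is only as good as its list of undesirable states. Anything missing from that list is read literally, which is exactly the gap a world knowledge database is meant to close.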
World knowledge databases basically consist of schematic knowledge, much like the schemas that we use to categorize everyday situations. We all know these are mostly overgeneralizations or stereotypes, but they help us explain the world around us, and we count on others to understand them. So although there are probably enough people who can also see positive aspects of snow in April, most people don’t. However, in general, people do not like being on the receiving end of a stereotype; they like to see themselves as unique individuals. But stereotyping is what machines do. Therefore, unless users feed an intelligent machine with lots of information on themselves, they can only ever expect stereotypical or wrong responses from it.