3 Reasons why Text Analysis isn't like Data Analysis

The words “text analysis” have the same ring to them as “data analysis”; obviously it's the word “analysis”. But it is a significant mistake to think that text analysis is like data analysis. Text is a form of data, but it isn't the same as traditional numeric data.

Here are 3 reasons why text is different from traditional numeric data.

(1) We need to throw away some text data to get a better analysis of the remaining text data.

Language has a lot of different types of words. Common words such as “the”, “at” or “this” help provide a narrative structure when we speak or write, yet they can be impediments to the analysis of text. For instance, if we want to analyze common phrases in a body of text, so-called “n-grams”, we won't find much value in the trigram “at this the” or “I go when”. Yet these types of trigrams will appear as very common when you analyze text. That's why there is the concept of “stop” words: words that we remove from the text before we analyze it, so that more meaningful combinations of words can be seen. What we may want to see are phrases like “local grocery store”, but unless we remove stop words these phrases are a long way down the list of trigrams. Stop words are a balancing act, because a stop word list that is too long may remove meaningful information. You may also want to tailor the stop words to the particular type of text you are analyzing, for instance customer reviews versus film reviews. A final consideration is that common words do contain information; they are called “function words” and give an insight into the narrative structure of the text. The word “because” is a good example: it may be used as a stop word, yet it obviously shows a certain sort of narrative structure in the text.
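The effect of stop words on trigram counts can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the sample text and the tiny `STOP_WORDS` list are invented for the example, and trigrams are counted within each sentence so that they don't run across sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list. Real lists are much longer
# and should be tailored to the type of text being analyzed.
STOP_WORDS = {"i", "to", "the", "near", "a", "at", "this"}

def sentences(text):
    """Split text into sentences on ., ! and ? (a rough approximation)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_trigrams(text, stop_words=frozenset(), n=3):
    """Count trigrams per sentence after dropping tokens in stop_words."""
    counts = Counter()
    for sentence in sentences(text):
        tokens = [t for t in tokenize(sentence) if t not in stop_words]
        # zip over three offset views yields each run of three tokens
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts.most_common(n)

# Invented sample text for illustration only.
sample = (
    "I went to the local grocery store. I went to the bank. "
    "I went to the park near the local grocery store. I went to the cinema."
)

print(top_trigrams(sample))              # stop words kept
print(top_trigrams(sample, STOP_WORDS))  # stop words removed
```

With the stop words kept, the most frequent trigrams are built from words like “I”, “to” and “the”; once they are removed, “local grocery store” rises to the top. Adding a word such as “went” to the list would change the result again, which is the balancing act described above.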

(2) One means one but “duck” can mean many different things.

In a survey, if a respondent enters a value in response to a question, for example how many cars they own, it has a single meaning. If they respond with “2”, it means they have 2 cars: not 3 or 4, not 1 or 0, but exactly 2. The value 2 doesn't change according to where the respondent lives, nor is it influenced by the response to a previous question. Numbers have no context.

Consider the word “duck”. The first meaning is a bird, more specifically a waterfowl of the family Anatidae. The second meaning is to avoid something, to “duck for cover”. The third is the failure of a batsman to score during their innings at cricket: “to be out for a duck”. The fourth is a term of endearment in parts of the United Kingdom; if someone calls you “duck” they're trying to be nice to you, they are not telling you to avoid something. Finally, the fifth meaning returns to an aquatic theme: “duck” can also mean to push someone underwater. I think there may be even more meanings ascribed to the word “duck” than I have listed.

This property of words having many meanings is called polysemy. The meaning of the word “duck” depends on the context it is used in, whereas 2 means 1+1 or 3-1; it doesn't have any other meaning unless you get fancy and change the base. To differentiate “duck” the verb from “duck” the noun requires some sophistication in the analysis of text.

There is also the fact that some words are only used by some people. If I were to say to you “my daughter is mardy”, unless you were from the more northern parts of the United Kingdom you probably would not know what I was talking about. “Mardy” is colloquial for bad-tempered and complaining. It can be a noun too: to have a “mardy” is to have a temper tantrum. Groups of people often use very specialized language that outsiders don't understand.

In terms of emotional response, the word “abortion” can evoke radically different emotions according to your beliefs. Language is fuzzy, and seeking precision when performing text analysis is a risky strategy because of concepts such as idiom and polysemy. Words can have radically different meanings in different types of text, such as social media, open ends or email. We all adapt the meaning of words according to a myriad of variables.
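To make the role of context concrete, here is a toy sketch that guesses which sense of “duck” is meant by counting overlapping cue words in the sentence. The cue word sets in `SENSE_CUES` are invented for this illustration; a real system would derive them from a dictionary or a labelled corpus, and would need far more than word overlap.

```python
import re

# Hypothetical cue words for three senses of "duck" (illustration only).
SENSE_CUES = {
    "bird": {"pond", "feathers", "quack", "fed", "bread", "swam"},
    "avoid": {"cover", "punch", "head", "low", "thrown"},
    "cricket score": {"batsman", "innings", "out", "bowled", "wicket"},
}

def guess_sense(sentence, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is no basis for a guess.
    return best if scores[best] > 0 else None

print(guess_sense("We fed bread to the duck on the pond"))
print(guess_sense("The batsman was out for a duck in the first innings"))
print(guess_sense("Duck!"))
```

The last call returns `None`: without surrounding context there is nothing to decide between the senses, which is exactly the difficulty a bare number like 2 never presents.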

(3) What lies beneath.

We know that with numeric data there can be underlying patterns, and we can use techniques such as Cluster Analysis or Factor Analysis to understand the hidden structure of numeric data. It's tempting to treat text in the same way, to force it into existing analytical frameworks and pretend it's numeric. It's a false approach. There are techniques such as Latent Dirichlet Allocation that will reveal the underlying structure of text data, but they aren't remotely similar to the analyses used on numeric data. Traditional data analysis tools can make assumptions about the data they are applied to which simply don't hold up for text. You have to use the right tools for the right type of data. If you see an easy way to use traditional numeric analyses on text, then it probably isn't working the way you think it is.
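One way to see how text strains the assumptions of numeric tools is to look at what a simple word-count table actually contains. The sketch below is an illustration under my own assumptions, not a method from this article: it represents four invented sentences as a document-by-word count matrix (a “bag of words”) and measures how many cells are zero.

```python
import re
from collections import Counter

# Invented example documents.
docs = [
    "The local grocery store was closed",
    "My daughter is mardy today",
    "He was out for a duck",
    "The duck swam across the pond",
]

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter of word frequencies per document.
counts = [Counter(tokenize(d)) for d in docs]

# The vocabulary is every distinct word seen in any document.
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are words, cells are non-negative counts.
matrix = [[c[word] for word in vocabulary] for c in counts]

cells = len(docs) * len(vocabulary)
zeros = sum(row.count(0) for row in matrix)
print(f"{len(docs)} documents x {len(vocabulary)} words")
print(f"share of zero cells: {zeros / cells:.0%}")
```

Even in this tiny example most cells are zero and every value is a small whole number, and the table only gets wider and emptier as documents are added. That is a very different shape from the dense, roughly continuous measurements that techniques like Factor Analysis are usually applied to.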

Text analysis is still a relatively new field, and it is important to understand that text isn't numeric data. New techniques are available that can reveal hidden patterns in text or show differences in how text is used. These techniques are generally not the same as the analyses used on numeric data, and that's a good thing.

A mardy duck might not be what you think it is.


The Nature of Artificial Intelligence

What is Artificial Intelligence? There seems to be a growing discussion about what Artificial Intelligence (AI) is. One thing we do know is that it is artificial; usually this is taken to mean computer-based. The intelligence part is a bit more amorphous. What is intelligent behavior? If we can identify that, we may get a bit closer to understanding what AI is.

What IQ tests measure is not intelligence; it’s some sort of rough test of abilities that may or may not be useful in life. So psychometrics offers no help in defining AI. We need a more fundamental definition of what intelligent behavior is.

From the 1950s onwards, a theory about the development of children’s cognitive skills became prominent. The author of this theory was Jean Piaget. Many people referred to him as a psychologist, but that is not what he would have called himself. Piaget called himself a “genetic epistemologist”, which is a complicated way of saying that he studied the development of knowledge. Piaget was trained as a biologist, and in the early part of his career he studied water snails in a lake in Switzerland. He then turned to studying children, initially his own. While much of Piaget's theory of children's cognitive development has long since been discarded, there are fundamental elements of his approach that can help us understand the concept of Artificial Intelligence. Piaget retained his roots in biology; he was interested in all forms of intelligent behavior, be it in a plant, a child or a water snail.

For Piaget, “the nature of intelligence is adaptation”. When he studied children he developed theories that stressed the constant change and adaptation of the internal mental structures that encoded their knowledge. In Piaget’s theory, mental structures need to maintain a certain state of stability, so they have to adapt to new information and evolve in complexity to maintain it. These knowledge structures allowed children to react to new situations by using their previous knowledge of the world. They could adapt mentally to new experiences.

Piaget's idea of the adaptation of knowledge structures was not restricted to humans. He saw all animals as having “biological knowledge” of their environment, which enables them to adapt to changes in it. This idea is elucidated in a conversation with a French journalist, Jean-Claude Bringuier.

Piaget: I am convinced there is no sort of boundary between the living and the mental or between the biological and the psychological. From the moment an organism takes account of a previous experience and adapts to a new situation, that very much resembles psychology.

Bringuier: For instance, when sunflowers turn towards the sun, that's psychology?

Piaget: I think, in fact, it is behaviour.

This gives us a definition of intelligence that we can apply to AI. For something to be defined as intelligent it must adapt. This implies that the intelligence must be constantly active.

AI is not analytical; it is adaptive. So a system that monitors advert placement based on CPM and changes that placement according to results is intelligent: it is an AI. Using a deep learning algorithm to identify the content of images is not AI; it's analytical. The deep learning algorithm may have been inspired by neural network theory, but that does not make it AI. AI is a dynamic behavior, not a static process.
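As a rough sketch of what “adaptive” means in the advert example, the code below keeps a running estimate of how well each placement performs, mostly uses the best one, and periodically tries the others so that it notices when results change. Everything here is an assumption made for illustration: the class `AdaptivePlacer`, the placement names and the payoff numbers are hypothetical, the payoff stands in for whatever result metric the system monitors, and the periodic exploration rule is just one simple choice (a deterministic relative of the epsilon-greedy strategy).

```python
class AdaptivePlacer:
    """Chooses an advert placement and keeps adapting to observed results."""

    def __init__(self, placements, explore_every=10, step=0.2):
        self.placements = list(placements)
        self.explore_every = explore_every  # try an alternative every N decisions
        self.step = step                    # weight given to the newest result
        self.estimate = {p: 0.0 for p in self.placements}
        self.decisions = 0

    def best(self):
        """Placement with the highest current estimate."""
        return max(self.placements, key=self.estimate.get)

    def choose(self):
        self.decisions += 1
        if self.decisions % self.explore_every == 0:
            # Periodic exploration: cycle through all placements in turn.
            turn = self.decisions // self.explore_every
            return self.placements[turn % len(self.placements)]
        return self.best()

    def record(self, placement, result):
        # Move the estimate part of the way towards the latest result, so
        # recent experience outweighs old experience.
        self.estimate[placement] += self.step * (result - self.estimate[placement])


def simulate(placer, payoff, rounds):
    """Feed the placer a fixed (hypothetical) payoff per placement."""
    for _ in range(rounds):
        placement = placer.choose()
        placer.record(placement, payoff[placement])


placer = AdaptivePlacer(["banner", "sidebar"])
simulate(placer, {"banner": 5.0, "sidebar": 2.0}, rounds=200)
print(placer.best())  # preferred placement while the banner performs well
simulate(placer, {"banner": 1.0, "sidebar": 2.0}, rounds=200)
print(placer.best())  # preferred placement after the banner's results drop
```

The point of the sketch is the loop, not the arithmetic: the system never stops updating, so when the banner's results fall its preference shifts to the sidebar. A model that is trained once and then only applied has no such loop.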

Using this approach we can see that an AI is constantly learning and adapting to experience. Consequently, it seems that much of what we call “AI” is analytics, not intelligence. There is nothing wrong with analytics, but it isn't AI.

We should adapt our definition of AI.