The words “text analysis” have the same ring to them as “data analysis”, and obviously it's the word “analysis” that does it. But it is a significant mistake to think that text analysis is like data analysis. Text is a form of data, but it isn't the same as traditional numeric data.
Here are 3 reasons why text is different from traditional numeric data.
(1) We need to throw away some text data to get a better analysis of the remaining text data.
Language has a lot of different types of words. Common words such as “the”, “at” or “this” help provide a narrative structure when we speak or write, yet they can be impediments to the analysis of text. For instance, if we want to analyse common phrases in a body of text, so-called “ngrams”, we won't find much value in a trigram such as “at this the” or “I go when”. Yet these types of trigrams will appear as very common when you analyse text. That's why there is the concept of “stop” words: words that we remove from the text before we analyse it, so that more meaningful combinations of words can be seen. What we may want to see are phrases like “local grocery store”, but unless we remove stop words these phrases are a long way down the list of trigrams.
Stop words are a balancing act, because a stop word list that is too long may remove meaningful information. You may also want to tailor the stop words to the particular type of text you are analysing, for instance customer reviews versus film reviews. A final consideration is that common words do contain information: they are called “function words” and give an insight into the narrative structure of the text. The word “because” is a good example of a word that may be used as a stop word yet obviously signals a certain sort of narrative structure in the text.
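To make the effect of stop words on trigrams concrete, here is a minimal sketch in Python. The stop word list, the toy text and the function names are all made up for illustration; a real analysis would use a longer list tailored to the type of text.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list (an assumption for this example only).
STOP_WORDS = frozenset({"i", "to", "the", "at", "this", "my", "is", "when"})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def trigrams(tokens):
    """Return every run of three consecutive tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def top_trigrams(text, stop_words=frozenset(), n=3):
    """Count trigrams after dropping any token found in stop_words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(trigrams(tokens)).most_common(n)


# Toy text. Note that this simple approach lets trigrams run across
# sentence boundaries, which a more careful analysis might not allow.
text = (
    "I go to the local grocery store. I go to the market. "
    "I go to the park when I go to the gym. "
    "My local grocery store is small."
)

print(top_trigrams(text))              # stop words kept
print(top_trigrams(text, STOP_WORDS))  # stop words removed
```

On this toy text the most frequent raw trigrams are built from stop words such as “I go to”, while after filtering “local grocery store” rises to the top of the list.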
(2) One means one but “duck” can mean many different things.
In a survey, if a respondent enters a value in response to a question, for example how many cars they own, it has a single meaning. If they respond with “2”, it means they have 2 cars: not 3 or 4, not 1 or 0, but exactly 2. The value 2 doesn't change according to where the respondent lives, nor is it influenced by the response to a previous question. Numbers have no context.
Consider the word “duck”. The first meaning is a bird, more specifically a waterfowl of the family Anatidae. The second meaning is to avoid something, to “duck for cover”. The third is the failure of a batsman to score during their innings at cricket, “to be out for a duck”. The fourth is a term of endearment in parts of the United Kingdom: if someone calls you “duck” they're trying to be nice to you, they are not telling you to avoid something. Finally, the fifth meaning returns to an aquatic theme: “duck” can also mean to push someone underwater. I think there may be even more meanings ascribed to the word “duck” than I have listed.
This property of words having many meanings is called polysemy. The meaning of the word “duck” depends on the context it is used in, whereas 2 means 1+1 or 3-1; it doesn't have any other meaning unless you get fancy and change the base. Even differentiating “duck” the verb from “duck” the noun requires some sophistication in the analysis of text.
There is also the fact that some words are only used by some people. If I were to say to you “my daughter is mardy”, unless you were from the more northern parts of the United Kingdom you probably would not know what I was talking about. “Mardy” is colloquial for bad tempered and complaining. It can be a noun too: to have a “mardy” is to have a temper tantrum. Groups of people often use very specialized language that outsiders don't understand.
In terms of emotional response, the word “abortion” can evoke radically different emotions depending on your beliefs. Language is fuzzy, and seeking precision when performing text analysis is a risky strategy because of concepts such as idiom and polysemy. Words can have a radically different meaning when used in different types of text, such as social media, open-ended survey responses or email. We all adapt the meaning of words according to a myriad of variables.
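As a rough illustration of how context can be used to pick between the meanings of “duck”, the sketch below scores each sense by how many of its cue words appear in the sentence. This is a heavily simplified, dictionary-overlap approach written for this article, not a production method; the sense labels, the cue words and the function name `guess_sense` are all hypothetical.

```python
import re

# Hand-written cue words for each sense of "duck" (illustrative assumptions).
DUCK_SENSES = {
    "bird": {"pond", "quack", "feathers", "bird", "waterfowl", "swam", "lake"},
    "avoid": {"cover", "punch", "head", "low", "dodge"},
    "cricket score": {"batsman", "innings", "cricket", "out", "wicket"},
    "term of endearment": {"thanks", "hello", "alright", "love"},
    "push underwater": {"pool", "underwater", "swimming", "held"},
}


def guess_sense(sentence, senses=DUCK_SENSES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears, i.e. the context gives no help.
    """
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(cues & context) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("The batsman was out for a duck in the first innings"))
print(guess_sense("A duck swam across the pond"))
print(guess_sense("Alright duck, thanks for popping round"))
print(guess_sense("We saw a duck yesterday"))
```

The last sentence shows the limits of the idea: with no helpful context the function returns `None`, and a different set of cue words would give different answers, which is exactly the fuzziness described above.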
(3) What lies beneath.
We know that with numeric data there can be underlying patterns; we can use techniques such as Cluster Analysis or Factor Analysis to understand the hidden structure of numeric data. It's tempting to treat text in the same way, to force it into existing analytical frameworks and pretend it's numeric. That approach is a mistake. There are techniques such as Latent Dirichlet Allocation that will reveal the underlying structure of text data, but they aren't remotely similar to the analyses used on numeric data. Traditional data analysis tools can make assumptions about the data they are applied to, for example that values are continuous and roughly normally distributed, which simply don't hold up for text. You have to use the right tools for the right type of data. If you see an easy way to use traditional numeric analyses on text, then it probably isn't working the way you think it is.
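To give a flavour of how different Latent Dirichlet Allocation is from a technique like Factor Analysis, here is a toy sketch of one common way to fit it, collapsed Gibbs sampling, in plain Python. Everything specific in it is an assumption made for illustration: the function names, the tiny pre-tokenized documents, the number of topics and the `alpha` and `beta` smoothing values. For real work you would reach for an established library rather than this sketch.

```python
import random


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over pre-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: topics per document, words per topic, words in each topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by giving every word occurrence a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                old = assignments[d][i]
                # Remove this occurrence from the counts...
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                # ...then resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1

    # Smoothed estimates: topic mix per document, word mix per topic.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(count + beta) / (topic_total[k] + vocab_size * beta) for count in topic_word[k]]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


def top_words(vocab, phi, n=3):
    """List the n most probable words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)[:n]]
        for row in phi
    ]


docs = [
    ["duck", "pond", "bird", "duck"],
    ["cricket", "batsman", "duck", "innings"],
    ["pond", "bird", "water", "duck"],
    ["batsman", "innings", "cricket", "wicket"],
]
vocab, theta, phi = lda_gibbs(docs)
print(top_words(vocab, phi))
print(theta)
```

Note what comes out: each document is described as a mixture of topics and each topic as a probability distribution over words. The sampler is random, so the topics found depend on the seed and the settings, and on a corpus this small they should not be read as meaningful.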
Text analysis is still a relatively new field, and it is important to understand that text isn't numeric data. New techniques are available that can reveal hidden patterns in text or show differences in how text is used. These techniques are generally not the same as the analyses used on numeric data, and that's a good thing.
A mardy duck might not be what you think it is.