
Clouds, clouds, and more clouds

There are at least eleven kinds of clouds: cirrus, cirrocumulus, cirrostratus, altocumulus, altostratus, cumulonimbus, cumulus, nimbostratus, stratocumulus, small Cu, and stratus. But this article is not about those kinds of clouds. Of course there are other kinds of clouds, like iCloud, Google Cloud, Azure Cloud, Amazon Cloud, and the list goes on. But this article is not about those clouds either. This article is about text analytics.

Clouds and Text Analytics


The picture above was generated by R as a Word Cloud. It is equivalent to a frequency distribution (see Figure 1), where the size of the characters comprising a word corresponds to its frequency count: the word “icloud” occurs many times in the text, while the word “people” (inside the “d” in “icloud”) occurs less frequently. In Figure 1, “speed” occurs most frequently and is a key concept in flight dynamics and in the corresponding models.

Figure 1. Frequency distribution of our set of texts

Flashback


Many people believe that text analytics (defined below) is rather new to the scene. On the contrary, I was performing text analytics in 1997, or 20 years ago! I did it as part of my doctoral research, which presented a new learning theory for mathematics, particularly for calculus: “How Students Make Meaning in a Calculus Course.”

Text Analytics


Text analytics is the process of deriving high-quality information from unstructured text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. It is often used to measure customer opinions, analyze product reviews and feedback, provide search facilities, and perform sentiment analysis and entity modeling to support fact-based decision making.

Cognitive Linguistics


Cognitive linguistics (CL) refers to a school of thought within general linguistics that interprets language in terms of the concepts that underlie its forms. These concepts are sometimes universal, sometimes specific to a particular tongue. CL provides a forum for linguistic research of all kinds on the interaction between cognition and language.
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language variety. In CL they are used for the same reasons, except that how the rules correspond to the underlying concepts is also examined.
In my doctoral research, the corpus comprised student journals, transcribed interviews (individual and focus groups), exam essays, and anything else recorded by students pertaining to calculus. The conceptual backdrop was Symbolic Interactionism (calculus is rich in symbols), but that’s a story for another day.
Back then, I used the Non-numerical Unstructured Data Indexing, Searching, and Theorizing (QSR NUD•IST) computer software to perform the text analytics. It was cutting-edge software for its time, but if only I had had R.

R and Text Analytics


As this is not a tutorial on how to perform text analytics with R, I will spare you a lot of code and address capabilities instead. To generate the Word Clouds in this article I used several libraries, not all of which are strictly for text analytics: wordcloud2, yaml, NLP, tm, SnowballC, and ggplot2. I will only discuss text analytics packages here.

Text Mining (tm) Package


The tm or Text Mining package in R has 60 functions for performing text analytics, ranging from reading text documents and compiling a corpus to cleaning the text and pulling out meaningful information. By cleaning, I mean removing punctuation, removing common words that do not add value (an, the, etc.), converting uppercase to lowercase, and identifying word combinations or phrases (word cloud, marginal profit, etc.). For example, findFreqTerms finds frequent terms in a document-term or term-document matrix, counting terms that fall between lower and upper bound frequency values, with infinity (Inf) allowed for no upper bound.
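As a minimal sketch of this workflow (using a made-up three-sentence corpus, not the flight-dynamics texts discussed later), the tm calls might look like this:

```r
library(tm)  # text mining framework

# A tiny hypothetical corpus, just to illustrate the calls
texts <- c("The speed of the aircraft increased.",
           "Speed and lift are key concepts.",
           "Lift depends on speed and wing area.")
corpus <- Corpus(VectorSource(texts))

# Basic cleaning: lowercase, strip punctuation, drop stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Build a document-term matrix and pull out frequent terms
dtm <- DocumentTermMatrix(corpus)
findFreqTerms(dtm, lowfreq = 3, highfreq = Inf)  # "speed" occurs in all three documents
```

Here lowfreq = 3 keeps only terms appearing at least three times, and highfreq = Inf sets no upper bound.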

Natural Language Processing (NLP)


NLP provides 23 functions for performing text analytics, most of which are for text annotation and token tagging. For example, tokenization is the process of breaking a text string up into words, phrases, symbols, or other meaningful elements called tokens. This can be accomplished by returning the sequence of tokens, or the corresponding spans (character start and end positions).
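A short sketch of span-based tokenization with the NLP package (the sentence here is invented for illustration):

```r
library(NLP)

s <- as.String("Text analytics turns words into data.")

# whitespace_tokenizer returns Span objects (character start/end positions)
spans <- whitespace_tokenizer(s)

# Indexing the String by the spans recovers the tokens themselves
tokens <- s[spans]
tokens  # six tokens; the final period stays attached to "data."
```

The companion wordpunct_tokenizer would split the trailing punctuation into its own token.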

koRpus


koRpus is a diverse collection of 88 functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type-token ratio, HD-D/vocd-D, MTLD), and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). For example, although automatic POS tagging, lemmatization, and hyphenation are remarkably accurate, the algorithms usually produce some errors. If you want to correct for these flaws, correct.tag can be of help, because it might prevent you from introducing new errors.
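A rough sketch of computing one lexical-diversity index with koRpus, assuming the package's built-in tokenizer rather than an external POS tagger (the sample sentence is invented; recent koRpus versions put English support in a separate koRpus.lang.en package):

```r
library(koRpus)
library(koRpus.lang.en)  # English language support, assumed installed

# Tokenize a character object directly (format = "obj"), no TreeTagger needed
tagged <- tokenize("The quick brown fox jumps over the lazy dog.",
                   format = "obj", lang = "en")

# Type-token ratio, one of several lexical diversity indices in the package
TTR(tagged)
```

Richer indices like MTLD(tagged) follow the same pattern on the tokenized object.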

RTextTools


RTextTools is a machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes 23 functions and nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation.
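A hedged sketch of the RTextTools workflow, with invented topic labels and a toy set of ten strings (real use needs far more data than this):

```r
library(RTextTools)

# Toy data: ten short strings with made-up labels (1 = flight, 2 = statistics)
texts  <- c("stall speed lift drag", "wing lift speed", "drag thrust lift",
            "regression model data", "data model variance", "model fit data",
            "lift wing aircraft", "thrust drag speed", "data variance fit",
            "aircraft wing drag")
labels <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)

# Build the document-term matrix and a train/test container
dtm       <- create_matrix(texts, language = "english", removeStopwords = TRUE)
container <- create_container(dtm, labels, trainSize = 1:8, testSize = 9:10,
                              virgin = FALSE)

# Train and apply one of the nine algorithms (SVM here)
model   <- train_model(container, "SVM")
results <- classify_model(container, model)
results  # predicted labels and probabilities for the two held-out documents
```

Swapping "SVM" for "MAXENT", "GLMNET", and so on, then combining the outputs, is how the ensemble classification works.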

Other R Packages


Other packages include (these are only a few I found most useful):

textir is a suite of tools for text and sentiment mining.

textcat provides support for n-gram based text categorization.

corpora offers utility functions for the statistical analysis of corpus frequency data.

tidytext provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools.

boilerpipeR helps with the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library.

tau contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.

Rstem (available from Omegahat) is an alternative interface to a C version of Porter's word stemming algorithm.

SnowballC provides exactly the same API as Rstem, but uses a slightly different design of the C libstemmer library from the Snowball project. It also supports two more languages.

Example


As a short example using a PC, I saved the corpus folder to my C: drive and read it in with the following code chunk:

library(tm)

cname <- file.path("C:/Users/Strickland/Documents/", "texts")
dir(cname)

docs <- Corpus(DirSource(cname))
summary(docs)
inspect(docs)

Within the corpus are two of my textbooks, converted to text files (Mechanics_of_Flight.txt and Data_Analytics.txt). There are 1,794,798 characters in the corpus. Using the tools mentioned above, I removed punctuation, removed special characters, removed unnecessary words, and converted uppercase to lowercase. I also combined words that should stay together, for example "data analytics" as "data_analytics" and "predictive models" as "predictive_models". I also removed common word endings (e.g., “ing”, “es”, “s”) and stripped unnecessary whitespace from the documents.
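The cleaning steps just described can be sketched as a chain of tm_map transformations; the phrase-combining step is a plain gsub substitution. A one-sentence stand-in corpus is used here so the chunk runs on its own:

```r
library(tm)
library(SnowballC)  # supplies the stemmer behind stemDocument

# Stand-in for the corpus of textbook files read in earlier
docs <- Corpus(VectorSource("Data analytics relies on predictive models, not guessing."))

docs <- tm_map(docs, content_transformer(tolower))        # uppercase to lowercase
docs <- tm_map(docs, removePunctuation)                   # punctuation and special characters
docs <- tm_map(docs, removeWords, stopwords("english"))   # unnecessary words

# Combine words that should stay together (after punctuation removal,
# so the underscores survive)
joinPhrase <- content_transformer(function(x, from, to) gsub(from, to, x, fixed = TRUE))
docs <- tm_map(docs, joinPhrase, "data analytics", "data_analytics")
docs <- tm_map(docs, joinPhrase, "predictive models", "predictive_models")

docs <- tm_map(docs, stemDocument)     # strip common word endings
docs <- tm_map(docs, stripWhitespace)  # collapse extra whitespace

as.character(docs[[1]])
```

The order matters: lowercasing must happen before the phrase substitutions, and the substitutions before stemming.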

Eventually, I told R to treat my preprocessed documents as text documents and proceeded to create a document-term matrix, which enabled me to organize the terms by their frequency.
Next, I checked some of the frequency counts. There are a lot of terms, so I just examined some of the most and least frequently occurring words, as well as the frequency of frequencies and the correlations. Finally, I put the most frequently occurring words (frequencies of 150 or more) in a data frame and plotted the frequency distribution (see Figure 1) and the word cloud shown in Figure 2.
Figure 2. Word cloud generated by R using wordcloud2
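That last step can be sketched as follows: build a frequency data frame from the document-term matrix, then hand it to ggplot2 and wordcloud2. A toy corpus stands in for the two textbooks, and the cutoff is lowered from the 150 used for Figure 1 to fit the toy data:

```r
library(tm)
library(ggplot2)
library(wordcloud2)

# Toy corpus standing in for the preprocessed textbook corpus
corpus <- Corpus(VectorSource(c("speed lift drag speed",
                                "speed wing lift",
                                "model data speed")))
dtm <- DocumentTermMatrix(corpus)

# Term frequencies, sorted, then placed in a data frame
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf   <- data.frame(word = names(freq), freq = freq)

# Frequency distribution (the article used a cutoff of 150; 2 fits this toy data)
ggplot(subset(wf, freq >= 2), aes(x = reorder(word, -freq), y = freq)) +
  geom_col() +
  labs(x = "Word", y = "Frequency")

# Word cloud: wordcloud2 expects a data frame of words and frequencies
wordcloud2(wf)
```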

Conclusion


QSR NUD•IST was an expensive piece of software (even by 1997 standards) and could not handle large corpora the way R can. R, on the other hand, with at least 40 packages for performing various aspects of text mining, is open source (translated: "free"), and I have not found a size limitation on the corpus to date.
