Clouds, clouds, and more clouds
There are at least eleven kinds of clouds: cirrus, cirrocumulus, cirrostratus, altocumulus, altostratus, cumulonimbus, cumulus, nimbostratus, stratocumulus, small Cu, and stratus. But this article is not about those kinds of clouds. Of course there are other kinds of clouds, like iCloud, Google Cloud, Azure Cloud, Amazon Cloud, and the list goes on. But this article is not about those clouds either. This article is about text analytics.
Clouds and Text Analytics
The picture above was generated by R as a Word Cloud. It is equivalent to a frequency distribution (see Figure 1), where the size of the characters comprising a word corresponds to its frequency count: the word “icloud” occurs many times in the text, while the word “people” (inside the “d” in “icloud”) occurs far less often. In Figure 1, “speed” occurs most frequently and is a key concept in flight dynamics and in the corresponding models.
Figure 1. Frequency distribution of our set of texts
Many people believe that text analytics (defined below) is rather new to the scene. On the contrary, I was performing text analytics in 1997, or 20 years ago! I did it as part of my Doctoral research, which presented a new learning theory for mathematics, particularly for calculus: “How Students Make Meaning in a Calculus Course.”
Text analytics is the process of deriving high-quality information from unstructured text. High-quality information is typically derived by discerning patterns and trends through means such as statistical pattern learning. It is often used to measure customer opinions, analyze product reviews and feedback, provide search facilities, and support fact-based decision making through sentiment analysis and entity modeling.
Cognitive linguistics (CL) refers to a school of thought within general linguistics that interprets language in terms of the concepts that underlie its forms. These concepts are sometimes universal, sometimes specific to a particular tongue. CL provides a forum for linguistic research of all kinds on the interaction between cognition and language.
In linguistics, a corpus (plural corpora), or text corpus, is a large and structured set of texts (usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language. In CL they are used for the same reasons, except that how the rules correspond to the underlying concepts is also examined.
In my doctoral research, the corpus comprised student journals, transcribed interviews (individual and focus groups), exam essays, and anything else recorded by students pertaining to calculus. The conceptual backdrop was Symbolic Interactionism—calculus is rich in symbols—but that’s a story for another day.
Back then, I used the Non-numerical Unstructured Data Indexing, Searching, and Theorizing (QSR NUD•IST) computer software to perform the text analytics. It was cutting-edge software for its time, but if only I had had R.
R and Text Analytics
As this is not a tutorial on how to perform text analytics with R, I will spare you a lot of code and address capabilities instead. To generate the Word Clouds in this article I used several libraries, not all of which are strictly for text analytics: wordcloud2, yaml, NLP, tm, SnowballC, and ggplot2. I will only discuss text analytics packages here.
Text Mining (tm) Package
The tm or Text Mining package in R has 60 functions for performing text analytics, ranging from reading text documents and compiling a corpus to cleaning the text and pulling out meaningful information. By cleaning, I mean removing punctuation, removing common words that do not add value (an, the, etc.), converting to lowercase, and identifying word combinations or phrases (word cloud, marginal profit, etc.). For example, findFreqTerms finds frequent terms in a document-term or term-document matrix, counting terms that fall between lower and upper bound values, including infinity (Inf) for no upper bound.
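A minimal sketch of that cleaning workflow, using a tiny in-memory corpus (the sample sentences are illustrative, not from my actual corpus):

```r
library(tm)

texts <- c("Data analytics is fun. Data analytics uses data!",
           "The speed of the model matters; speed is key.")
docs <- VCorpus(VectorSource(texts))

# Clean: lowercase, strip punctuation, drop common stopwords ("an", "the", ...)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, stripWhitespace)

dtm <- DocumentTermMatrix(docs)
# Terms occurring at least twice, with no upper bound (Inf)
findFreqTerms(dtm, lowfreq = 2, highfreq = Inf)
```

The same lower/upper-bound pattern works on a term-document matrix built with TermDocumentMatrix.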
Natural Language Processing (NLP)
NLP provides 23 functions for performing text analytics, most of which are for text annotation and token tagging. For example, tokenization is the process of breaking a text string up into words, phrases, symbols, or other meaningful elements called tokens. This can be accomplished by returning either the sequence of tokens or the corresponding spans (character start and end positions).
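For instance, NLP ships with simple span tokenizers; here is a sketch using its wordpunct_tokenizer, which returns the spans first and then, by subscripting, the tokens themselves:

```r
library(NLP)

# Wrap the text as an NLP String so spans can index into it
s <- as.String("Word clouds show term frequency.")

spans <- wordpunct_tokenizer(s)  # Span objects: character start/end positions
spans
s[spans]                         # the corresponding sequence of tokens
```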
koRpus is a diverse collection of 88 functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type-token ratio, HD-D/vocd-D, MTLD), and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). For example, although automatic POS tagging, lemmatization, and hyphenation are remarkably accurate, the algorithms usually do produce some errors. If you want to correct these flaws, correct.tag can help, since it might keep you from introducing new errors.
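A sketch of that correction workflow, using koRpus's built-in tokenizer (which does not require TreeTagger); the row index and replacement tag here are purely illustrative:

```r
library(koRpus)
library(koRpus.lang.en)  # English language support package

# Tokenize a string directly (format = "obj" means "not a file path")
tagged <- tokenize("The models converge quickly.", format = "obj", lang = "en")
taggedText(tagged)       # inspect the tokens and their (rough) tags

# If a tag is wrong, fix it in place rather than re-tagging by hand:
tagged <- correct.tag(tagged, row = 1, tag = "DT")
```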
RTextTools is a machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes 23 functions and nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation.
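The typical RTextTools pipeline is create_matrix, create_container, train_model, classify_model. A minimal sketch with toy data (the texts, labels, and train/test split are illustrative):

```r
library(RTextTools)

texts  <- c("great product", "bad service", "love it", "terrible",
            "excellent value", "awful experience", "very good",
            "poor quality", "fantastic", "horrible")
labels <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)

# Build a document-term matrix, then a container with a train/test split
dtm <- create_matrix(texts, language = "english", removeStopwords = FALSE)
container <- create_container(dtm, labels, trainSize = 1:8,
                              testSize = 9:10, virgin = FALSE)

# Train one of the nine ensemble algorithms and classify the held-out texts
model   <- train_model(container, "SVM")
results <- classify_model(container, model)
head(results)
```

Swapping "SVM" for "MAXENT", "GLMNET", and so on is how the package makes algorithm comparison easy.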
Other R Packages
Other packages include (these are only a few I thought were most useful):
textir is a suite of tools for text and sentiment mining.
textcat provides support for n-gram based text categorization.
corpora offers utility functions for the statistical analysis of corpus frequency data.
tidytext provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools.
boilerpipeR helps with the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library.
tau contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.
Rstem (available from Omegahat) is an alternative interface to a C version of Porter's word stemming algorithm.
SnowballC provides exactly the same API as Rstem, but uses a slightly different design of the C libstemmer library from the Snowball project. It also supports two more languages.
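As a quick illustration of the stemming that Rstem and SnowballC provide, SnowballC's wordStem() reduces each word to its stem:

```r
library(SnowballC)

# Porter-style stemming of English words
wordStem(c("meaning", "models", "analytics"), language = "english")
```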
As a short example using a PC, I saved the corpus folder to my C: drive using the following code chunk:
cname <- file.path("C:/Users/Strickland/Documents/", "texts")
dir(cname)
docs <- Corpus(DirSource(cname))
summary(docs)
inspect(docs)
Within the corpus are two of my textbooks, converted to text files (Mechanics_of_Flight.txt and Data_Analytics.txt). There are 1,794,798 characters in the corpus. Using the tools mentioned above I removed punctuation, special characters, and unnecessary words, and converted uppercase to lowercase. I also combined words that should stay together, like "data analytics" as "data_analytics" and "predictive models" as "predictive_models", for example. I also removed common word endings (e.g., “ing”, “es”, “s”), and stripped unnecessary whitespace from the documents.
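One way to do the phrase-combining and ending removal with tm, continuing from the `docs` corpus loaded in the chunk above (the helper name is my own):

```r
library(tm)

# A content transformer that substitutes one pattern for another
toUnderscore <- content_transformer(function(x, from, to) gsub(from, to, x))

# Keep multiword terms together before building the matrix
docs <- tm_map(docs, toUnderscore, "data analytics", "data_analytics")
docs <- tm_map(docs, toUnderscore, "predictive models", "predictive_models")

# Strip common word endings via stemming, then clean up whitespace
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
```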
Eventually, I told R to treat my preprocessed documents as text documents, and proceeded to create a document term matrix, which enabled me to organize the terms by their frequency.
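In code, that step looks roughly like this (a sketch, assuming `docs` is the cleaned corpus from the chunks above):

```r
library(tm)

# Treat the preprocessed documents as plain text documents
docs <- tm_map(docs, PlainTextDocument)

# Build the document-term matrix and order terms by frequency
dtm  <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
```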
Next, I checked some of the frequency counts. There are a lot of terms, so I just checked some of the most and least frequently occurring words, as well as the frequency of frequencies and correlations. Finally, I put the most frequently occurring words (frequencies of 150 or more) in a data frame and plotted the frequency distribution (see Figure 1) and the word cloud shown in Figure 2.
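Those checks and plots can be sketched as follows, assuming `dtm` and the sorted `freq` vector from the preceding steps (the term "speed", the 0.8 correlation limit, and the 150 cutoff come from this article; the column names are illustrative):

```r
head(freq, 10)                            # most frequent terms
tail(freq, 10)                            # least frequent terms
table(freq)                               # frequency of frequencies
findAssocs(dtm, "speed", corlimit = 0.8)  # terms correlated with "speed"

# Data frame of the most frequent words (150 or more occurrences)
wf <- data.frame(word = names(freq), freq = freq)

library(ggplot2)
ggplot(subset(wf, freq >= 150), aes(reorder(word, -freq), freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

library(wordcloud2)
wordcloud2(subset(wf, freq >= 150))
```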
Figure 2. Word cloud generated by R using wordcloud2
QSR NUD•IST was an expensive piece of software (even by 1997 standards) and did not handle large corpora as well as R does. On the other hand, R, with at least 40 packages for performing various aspects of text mining, is open source (translated: "free"), and I have not yet found a size limitation on the corpus.