Clouds, clouds, and more clouds
There are at least ten principal kinds of clouds: cirrus, cirrocumulus, cirrostratus, altocumulus, altostratus, cumulonimbus, cumulus, nimbostratus, stratocumulus, and stratus. But this article is not about those kinds of clouds. Of course, there are other kinds of clouds, like iCloud, Google Cloud, Azure Cloud, Amazon Cloud, and the list goes on. But this article is not about those clouds either. This article is about text analytics.
Clouds and Text Analytics
The picture above was generated by R as a word cloud. It is equivalent to a frequency distribution (see Figure 1), where the size of the characters in a word corresponds to its frequency count: the word “icloud” occurs many times in the text, while the word “people” (inside the “d” in “icloud”) occurs less frequently. In Figure 1, “speed” occurs most frequently; it is a key concept in flight dynamics and in the corresponding models.
Figure 1. Frequency distribution of our set of texts
Flashback
Many people believe that text analytics (defined below) is rather new to the scene. On the contrary, I was performing text analytics in 1997, fully 20 years ago! I did it as part of my doctoral research, which presented a new learning theory for mathematics, particularly for calculus: “How Students Make Meaning in a Calculus Course.”
Text Analytics
Text analytics is the process of deriving high-quality information from unstructured text. High-quality information is typically derived by discerning patterns and trends through means such as statistical pattern learning. Text analytics is often used to measure customer opinions, analyze product reviews and feedback, provide search facilities, and support sentiment analysis and entity modeling for fact-based decision making.
Cognitive Linguistics
Cognitive linguistics (CL) refers to a school of thought within general linguistics that interprets language in terms of the concepts that underlie its forms. These concepts are sometimes universal, sometimes specific to a particular tongue. CL presents a forum for linguistic research of all kinds on the interaction between cognition and language.
In linguistics, a corpus (plural corpora) or text corpus is a large, structured set of texts (usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language domain. In CL they are used for the same reasons, except that how the rules correspond to the underlying concepts is also examined.
In my doctoral research, the corpus comprised student journals, transcribed interviews (individual and focus groups), exam essays, and anything else recorded by students pertaining to calculus. The conceptual backdrop was Symbolic Interactionism (calculus is rich in symbols), but that's a story for another day.
Back then, I used the Non-numerical Unstructured Data Indexing, Searching, and Theorizing (QSR NUD•IST) software to perform the text analytics. It was cutting-edge software for its time, but if only I had had R.
R and Text Analytics
As this is not a tutorial on how to perform text analytics with R, I will spare you a lot of code and address capabilities instead. To generate the Word Clouds in this article I used several libraries, not all of which are strictly for text analytics: wordcloud2, yaml, NLP, tm, SnowballC, and ggplot2. I will only discuss text analytics packages here.
Text Mining (tm) Package
The tm (Text Mining) package in R has 60 functions for performing text analytics, ranging from reading text documents and compiling a corpus to cleaning the text and pulling out meaningful information. By cleaning, I mean removing punctuation, removing common words that do not add value ("an", "the", etc.), converting uppercase to lowercase, and identifying word combinations or phrases ("word cloud", "marginal profit", etc.). For example, findFreqTerms finds frequent terms in a document-term or term-document matrix, counting terms between lower and upper bound values, with infinity (Inf) meaning no upper bound.
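As a minimal sketch, assuming dtm is a DocumentTermMatrix like the one built in the Example section below:

library(tm)
# Terms occurring at least 150 times, with no upper bound on the count
findFreqTerms(dtm, lowfreq = 150, highfreq = Inf)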
Natural Language Processing (NLP)
NLP provides 23 functions for performing text analytics, most of which are for text annotation and token tagging. For example, tokenization is the process of breaking a text string into words, phrases, symbols, or other meaningful elements called tokens. This can be accomplished by returning either the sequence of tokens or the corresponding spans (character start and end positions).
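A minimal sketch using the package's built-in wordpunct_tokenizer, which returns spans; indexing a String object with those spans yields the tokens themselves:

library(NLP)
s <- as.String("Text analytics derives high-quality information from unstructured text.")
spans <- wordpunct_tokenizer(s)  # Span objects: character start and end positions
s[spans]                         # the token sequence itself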
koRpus
koRpus is a diverse collection of 88 functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type-token ratio, HD-D/vocd-D, MTLD), and readability (e.g., Flesch, SMOG, LIX, Dale-Chall). Although automatic POS tagging, lemmatization, and hyphenation are remarkably accurate, the algorithms usually produce some errors. If you want to correct for these flaws, correct.tag can help, since it might prevent you from introducing new errors.
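A hedged sketch of the lexical diversity side (sample.txt is a hypothetical plain-text file; recent koRpus releases load language support from add-on packages such as koRpus.lang.en):

library(koRpus)
library(koRpus.lang.en)                        # English support is a separate package
tagged <- tokenize("sample.txt", lang = "en")  # built-in tokenizer and basic tagger
lex.div(tagged)                                # TTR, MTLD, HD-D/vocd-D, and more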
RTextTools
RTextTools is a machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes 23 functions and nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation.
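A hedged sketch of a typical RTextTools workflow, patterned after the USCongress example that ships with the package (the train/test split here is illustrative):

library(RTextTools)
data(USCongress)                              # sample data bundled with the package
dtm <- create_matrix(USCongress$text, language = "english",
                     removeNumbers = TRUE, stemWords = TRUE)
container <- create_container(dtm, USCongress$major,
                              trainSize = 1:4000, testSize = 4001:4449,
                              virgin = FALSE)
models <- train_models(container, algorithms = c("SVM", "MAXENT"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)  # precision, recall, ensemble agreement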
Other R Packages
Other packages include (these are only a few I thought were most useful):
textir is a suite of tools for text and sentiment mining.
textcat provides support for n-gram based text categorization.
corpora offers utility functions for the statistical analysis of corpus frequency data.
tidytext provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools.
boilerpipeR helps with the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library.
tau contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.
Rstem (available from Omegahat) is an alternative interface to a C version of Porter's word stemming algorithm.
SnowballC provides exactly the same API as Rstem, but uses a slightly different design of the C libstemmer library from the Snowball project, and it supports two more languages (see the short sketch after this list).
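A minimal SnowballC sketch:

library(SnowballC)
# Stems each word: "running" -> "run", "flights" -> "flight", "models" -> "model"
wordStem(c("running", "flights", "models"), language = "english")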
Example
As a short example using a PC, I saved the corpus folder to my C: drive and loaded it with the following code chunk:

library(tm)
cname <- file.path("C:/Users/Strickland/Documents/", "texts")
dir(cname)                        # list the files in the corpus folder
docs <- Corpus(DirSource(cname))  # build a tm corpus from the directory
summary(docs)                     # one entry per document
inspect(docs)                     # show document metadata and content
Within the corpus are two of my textbooks, converted to text files (Mechanics_of_Flight.txt and Data_Analytics.txt). There are 1,794,798 characters in the corpus. Using the tools mentioned above, I removed punctuation, special characters, and unnecessary words, and converted uppercase to lowercase. I also combined words that should stay together, such as "data analytics" as "data_analytics" and "predictive models" as "predictive_models". Finally, I removed common word endings ("ing", "es", "s") and stripped unnecessary whitespace from the documents.
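A hedged sketch of those preprocessing steps with tm (the phrase-joining transformer is my own illustration, not a tm built-in):

docs <- tm_map(docs, removePunctuation)                  # punctuation and special characters
docs <- tm_map(docs, content_transformer(tolower))       # uppercase to lowercase
docs <- tm_map(docs, removeWords, stopwords("english"))  # unnecessary words
joinPhrase <- content_transformer(function(x, from, to) gsub(from, to, x, fixed = TRUE))
docs <- tm_map(docs, joinPhrase, "data analytics", "data_analytics")
docs <- tm_map(docs, joinPhrase, "predictive models", "predictive_models")
docs <- tm_map(docs, stemDocument)                       # drop endings such as "ing", "es", "s"
docs <- tm_map(docs, stripWhitespace)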
Eventually, I told R to treat my preprocessed documents as text documents and proceeded to create a document-term matrix, which enabled me to organize the terms by their frequency.
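In tm terms, that step looks something like this sketch:

docs <- tm_map(docs, PlainTextDocument)  # treat preprocessed documents as plain text
dtm <- DocumentTermMatrix(docs)          # rows = documents, columns = terms
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # terms ordered by frequency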
Next, I checked some of the frequency counts. There are a lot of terms, so I examined only some of the most and least frequently occurring words, along with the frequency of frequencies and term correlations. Finally, I put the most frequently occurring words (frequencies of 150 or more) in a data frame and plotted the frequency distribution (see Figure 1) and the word cloud shown in Figure 2.
Figure 2. Word cloud generated by R using wordcloud2
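A hedged sketch of those final steps, reusing the dtm and freq objects from above (the term "speed" and the cutoff of 150 come from the discussion earlier):

head(freq, 10)                             # most frequent terms
tail(freq, 10)                             # least frequent terms
table(freq)                                # frequency of frequencies
findAssocs(dtm, "speed", corlimit = 0.6)   # terms correlated with "speed"

wf <- data.frame(word = names(freq), freq = freq)
library(ggplot2)
ggplot(subset(wf, freq >= 150), aes(reorder(word, -freq), freq)) +
  geom_col() +
  labs(x = NULL, y = "Frequency")          # Figure 1
library(wordcloud2)
wordcloud2(subset(wf, freq >= 150))        # Figure 2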
Conclusion
QSR NUD•IST was an expensive piece of software (even by 1997 standards) and did not handle large corpora the way R does. R, on the other hand, with at least 40 packages for performing various aspects of text mining, is open source (translation: "free"), and to date I have not found a size limitation on the corpus.