For this exploratory analysis, we will parse the en_US.blogs.txt text document, analyze the frequencies of different N-grams, and outline how we will accomplish the goals of the Data Science Capstone project.
The data that we will use for this analysis can be found at heliohost.org, and was made available through the Johns Hopkins Data Science Capstone project.
The system will read the text and create unigram, bigram, trigram, and quadgram tables for our analysis. While the text is read, it is cleaned (punctuation other than the single quote is removed, all characters are lowercased, unknown characters and numbers are removed, and profanity is filtered out).
Pseudo mapper and reducer functions are used to parse the cleaned data in case the resources on the computer are limited. The mapper simply creates the N-grams and the reducer sorts them and computes their frequencies.
The text to parse should be in the following directory under your current working directory: “data/en_US”.
We will start by performing some basic analysis on all three files (blogs, news, and tweets) under the directory, but first we need to load the paths.
library(stringdist)  # string distance utilities
library(dplyr)       # group_by/summarize used in the reducer
library(tm)          # corpus cleaning (tm_map transformations)
library(RWeka)       # NGramTokenizer used in the mapper
# Raw input corpora
blogsFilePath <- "data/en_US/en_US.blogs.txt"
newsFilePath <- "data/en_US/en_US.news.txt"
tweetsFilePath <- "data/en_US/en_US.twitter.txt"
# Cleaned text and intermediate mapper output
cleanedFilePath <- "data/cleanedtext.txt"
uniMapperOutput <- "data/output/uniMapperOutput.txt"
biMapperOutput <- "data/output/biMapperOutput.txt"
triMapperOutput <- "data/output/triMapperOutput.txt"
quadMapperOutput <- "data/output/quadMapperOutput.txt"
# Final N-gram frequency tables
unigramModel <- "data/output/unigram.txt"
bigramModel <- "data/output/bigram.txt"
trigramModel <- "data/output/trigram.txt"
quadgramModel <- "data/output/quadgram.txt"
# Profanity lexicon used to filter the text
profanityPath <- "data/lexicons/profanityEn_US.txt"
profanity <- read.csv(profanityPath, header=FALSE, stringsAsFactors=FALSE, strip.white=TRUE)
Now let’s get some basic stats about the files.
Blog lines and number of words:
blogs <- readLines(blogsFilePath)
length(blogs) # Number of lines
## [1] 899288
wordsPerRow <- sapply(gregexpr("[[:alpha:]]+", blogs), function(x) sum(x > 0))
sum(wordsPerRow) # Number of words (not cleaned)
## [1] 38171210
News lines and number of words:
news <- readLines(newsFilePath)
length(news) # Number of lines
## [1] 77259
wordsPerRow <- sapply(gregexpr("[[:alpha:]]+", news), function(x) sum(x > 0))
sum(wordsPerRow) # Number of words (not cleaned)
## [1] 2676409
Tweets lines and number of words:
tweets <- readLines(tweetsFilePath)
length(tweets) # Number of lines
## [1] 2360148
wordsPerRow <- sapply(gregexpr("[[:alpha:]]+", tweets), function(x) sum(x > 0))
sum(wordsPerRow) # Number of words (not cleaned)
## [1] 30657929
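For additional context, the on-disk size of each file can be checked as well (a quick sketch using base R and the paths defined above):
# Approximate file sizes in megabytes
round(file.info(c(blogsFilePath, newsFilePath, tweetsFilePath))$size / 1024^2, 1)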
The following code chunk cleans the data and writes the cleaned text to a file. Since this process takes a long time, the cleaned text is saved to disk for later analysis.
# Build a corpus from the blog text and free the raw vector
corp <- Corpus(VectorSource(blogs))
rm(blogs)
# Cleaning steps: numbers, punctuation, extra whitespace, case, profanity
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, profanity[,1])
# Extract the cleaned text and write it to disk for later reuse
text <- unlist(sapply(corp, `[`, "content"))
rm(corp)
fileConn <- file(cleanedFilePath)
writeLines(text, fileConn)
close(fileConn)
rm(text)
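The cleaning steps listed earlier also mention removing unknown characters, which the chunk above does not do explicitly. A minimal sketch (not part of the original pipeline) that strips non-ASCII characters from the raw lines before the corpus is built:
# Drop characters that cannot be represented in ASCII; assumes the raw lines are UTF-8
blogs <- iconv(blogs, from = "UTF-8", to = "ASCII", sub = "")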
The mapper and reducer functions are used to extract the different N-grams.
mapper <- function(inputFile, outputFile, gramLength){
  input <- file(inputFile, open = "r")
  output <- file(outputFile, open = "w")
  delim <- " \\t\\r\\n.!?,;\"()"
  # Header line; the count column is filled in by the reducer
  writeLines(paste("token", "count", sep = "\t"), output)
  # Tokenize one line at a time to keep memory usage low
  while(length(line <- readLines(input, n = 1, warn = FALSE)) > 0){
    tokens <- NGramTokenizer(line, Weka_control(min = gramLength, max = gramLength, delimiters = delim))
    writeLines(tokens, output)
    rm(tokens)
  }
  close(input)
  close(output)
}
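To illustrate what the mapper emits, here is NGramTokenizer applied to a single line for the bigram case (this example is not from the original post):
# Example of what the mapper writes for one cleaned line (bigram case)
NGramTokenizer("this is a test", Weka_control(min = 2, max = 2, delimiters = " \\t\\r\\n.!?,;\"()"))
# Expected result: "this is" "is a" "a test"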
reducer <- function(inputFile){
  input <- read.csv(inputFile, sep = "\t")
  # Each token occurrence counts once; group identical tokens and sum their counts
  input$count <- rep(1, times = length(input[, 1]))
  result <- input %>%
    group_by(token) %>%
    select(token, count) %>%
    summarize(count = sum(count)) %>%
    arrange(desc(count))
  rm(input)
  result
}
The following code chunks build the unigram, bigram, trigram, and quadgram tables, save them to files, and time each step.
# Unigrams
ptm <- proc.time()
mapper(cleanedFilePath, uniMapperOutput, 1)
result <- reducer(uniMapperOutput)
write.table(result, file=unigramModel, sep="\t")
proc.time() - ptm
# Bigrams
ptm <- proc.time()
mapper(cleanedFilePath, biMapperOutput, 2)
result <- reducer(biMapperOutput)
write.table(result, file=bigramModel, sep="\t")
proc.time() - ptm
# Trigrams
ptm <- proc.time()
mapper(cleanedFilePath, triMapperOutput, 3)
result <- reducer(triMapperOutput)
write.table(result, file=trigramModel, sep="\t")
proc.time() - ptm
# Quadgrams
ptm <- proc.time()
mapper(cleanedFilePath, quadMapperOutput, 4)
result <- reducer(quadMapperOutput)
write.table(result, file=quadgramModel, sep="\t")
proc.time() - ptm
rm(result)
After the N-grams are created, we can analyze the frequencies to help choose a cutoff for sampling from the text, which keeps the creation of the model efficient.
For exploratory analysis, we display the count of unique tokens in each N-gram table and then plot its frequency distribution.
Load the unigram table and count the unique tokens.
unigram <- read.table(unigramModel, sep="\t", stringsAsFactors = FALSE)
unigramUnique <- length(unigram[,1])
unigramUnique
## [1] 479900
Calculate the percentage frequency of each token.
unigramCount <- sum(unigram$count)
unigram$percent <- unigram$count / unigramCount
Plot the distribution of the log counts to get the shape of the frequency distribution (the red line is the empirical density of the log counts).
unigram$log <- log(unigram$count)
plot(unigram$log,dlnorm(unigram$log,mean(unigram$log),sd(unigram$log)),type="l",xlab="Unigram Log Values",ylab="Log of Frequencies")
lines(density(unigram$log),col="red")
The following code chunk shows how many of the most frequent unique tokens are needed to cover 90% of the unigram occurrences.
# Number of top rows whose cumulative percent stays below the threshold
findCoverage <- function(ngramTable, threshold){
  uniqueGramCount <- 0
  for(i in 1:length(ngramTable[,1])){
    if(sum(ngramTable[1:i, "percent"]) < threshold){
      uniqueGramCount <- i
    } else {
      break
    }
  }
  uniqueGramCount
}
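Note that the loop above recomputes the running sum on every iteration, so it scales poorly with table size. An equivalent one-pass version using cumsum (a sketch with an illustrative name, not from the original analysis):
# Same result as findCoverage: count the rows whose cumulative percent is below the threshold
findCoverageFast <- function(ngramTable, threshold){
  sum(cumsum(ngramTable$percent) < threshold)
}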
uniqueGramCount <- findCoverage(unigram, 0.9)
uniqueGramCount/length(unigram[,1])
## [1] 0.01533444
The most frequent 7359 tokens provide 90% coverage of the unigram. To get an idea of how much of the corpus we need to parse to reach 90% coverage, we can shuffle (unsort) the unigram and then find out how many tokens cover 90%. To reduce processing time and to exclude infrequent unique tokens, we first remove grams with a low frequency.
unigram <- unigram[unigram$count > 20,]
unigram <- unigram[sample(nrow(unigram)),]
uniqueGramCount <- findCoverage(unigram, 0.9)
uniqueGramCount/length(unigram[,1])
## [1] 0.8899181
This suggests that once we have collected 35546 unique tokens, we have enough coverage to make the Next Word prediction efficient using just a unigram.
Load the bigram table and count the unique tokens.
bigram <- read.table(bigramModel, sep="\t", stringsAsFactors = FALSE)
bigramUnique <- length(bigram[,1])
bigramUnique
## [1] 6614742
Calculate the percentage frequency of each token.
bigramCount <- sum(bigram$count)
bigram$percent <- bigram$count / bigramCount
Plot the distribution of the log counts to get the shape of the frequency distribution (the red line is the empirical density of the log counts).
bigram$log <- log(bigram$count)
plot(bigram$log,dlnorm(bigram$log,mean(bigram$log),sd(bigram$log)),type="l",xlab="Bigram Log Values",ylab="Log of Frequencies")
lines(density(bigram$log),col="red")
The following code chunk shows how many of the most frequent unique tokens are needed to cover 90% of the bigram occurrences. Low-frequency bigrams are removed first to reduce processing time.
bigram <- bigram[bigram$count > 10,]
uniqueGramCount <- findCoverage(bigram, 0.9)
uniqueGramCount/length(bigram[,1])
## [1] 1
All 298497 remaining tokens are used in the coverage calculation, which indicates that bigram frequencies are spread far more thinly across unique tokens than unigram frequencies. To get an idea of how much of the corpus we need to parse to reach 90% coverage, we can shuffle (unsort) the bigram and then find out how many tokens cover 90%.
bigram <- bigram[sample(nrow(bigram)),]
uniqueGramCount <- findCoverage(bigram, 0.9)
uniqueGramCount/length(bigram[,1])
## [1] 1
This suggests that we need roughly 298497 unique tokens to have enough coverage to make the Next Word prediction efficient using just a bigram.
The model will use a table with N-grams as rows and candidate next words as columns, holding the probabilities used to predict the three most likely next words. Each probability is a conditional probability: the probability that the next word is x given that the preceding N-gram y has been seen.
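For a bigram, that conditional probability is estimated as P(x | y) = count("y x") / count(y). A minimal sketch of such a lookup against freshly loaded unigram and bigram tables (the function name is illustrative, and it assumes bigram tokens are space-separated words as produced by the mapper):
# Maximum-likelihood estimate of P(nextWord | prevWord) from the frequency tables
nextWordProb <- function(prevWord, nextWord, unigram, bigram){
  bigramCount <- bigram$count[bigram$token == paste(prevWord, nextWord)]
  prevCount   <- unigram$count[unigram$token == prevWord]
  if(length(bigramCount) == 0 || length(prevCount) == 0) return(0)
  bigramCount / prevCount
}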
Since the system will present the next word as a list of the three most likely candidates, perplexity will be used to measure how well the model is working.
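For reference, perplexity is the exponential of the average negative log probability that the model assigns to a held-out word sequence. A minimal sketch (the function name and the probability vector are illustrative assumptions):
# Perplexity of a sequence, given the per-word probabilities assigned by the model
perplexity <- function(wordProbs){
  exp(-mean(log(wordProbs)))
}
# Example: a model that assigns every word probability 0.1 has perplexity 10
perplexity(rep(0.1, 5))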
Justin Nafe July 20th, 2015