Project 2 - ST 558 - CM Heubusch
July 3, 2020
Introduction
Purpose of Analysis
The Online News Popularity data set describes two years’ worth of articles that were published on Mashable. The purpose of this analysis is to create linear and ensemble models by which we can best predict the number of shares an article on the website will receive. Mashable staff and other online publishers may be interested in this data, as it could help them determine what kind of content will be most popular.
Methods Used
For this project, I first examined the data with summary statistics, looking for correlations between the target variable of shares and the predictor variables. These helped determine which variables I wanted to experiment with in my linear models. Though it was not explicitly part of the class material, I also created a separate model based upon which variables were significant in the full main-effects fit. I then created a random forest as my ensemble model (originally via the caret package, though I ultimately used the randomForest package for speed, as described below). Per the class notes, the random forest method gives us a less correlated set of bagged tree predictions, ensuring that a hypothetical really strong predictor doesn't wind up in every bootstrap tree. With this method, we might NOT use all predictors - instead, it'll be "a random subset of predictors for each bootstrap sample/tree fit".
Describing the Data & Variables
In total, the data set consists of 39,644 observations, with 61 columns of different variables. In examining the variables, I was particularly intrigued by the following, simply based upon my limited knowledge of the media industry:
- n_tokens_title - “number of words in the title”
- n_tokens_content - “number of words in the content”
- num_hrefs - “number of links” in the article
- num_imgs - “number of images”
- num_videos - “number of videos”
- average_token_length - average word length in the article
- num_keywords - how many keywords are included in the metadata, an important factor for search engine optimization
- data_channel_is_* variables - six binary variables, each indicating whether the observation belongs to a particular channel. Each article appears to belong to at most one of these six channels; articles with none of the flags set presumably belong to no channel or to a smaller channel not accounted for by these binary variables
- global_rate_positive_words - “rate of positive words in the content,” or rather, the ratio of positive:total words in the article
- global_rate_negative_words - “rate of negative words in the content,” the negative-word counterpart to global_rate_positive_words
The weekday_is_ variables will be our means of creating seven separate reports. They are binary variables, so their value is either 0 (No, not published on that day of the week) or 1 (Yes, published that day). The related is_weekend variable is 1 if the article was published on either Saturday or Sunday.
Reading in the Dataset
newsData <- read.csv(file="/Users/christinemarieheubusch/Project-2/OnlineNewsPopularity/OnlineNewsPopularity.csv")
#Removing URL and timedelta columns, since these variables are non-predictive.
newsData <- newsData %>% select(-url, -timedelta)
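For reference, the rest of this report relies on several packages; the lines below are a minimal sketch of the library calls, assuming they were loaded in a setup chunk that is not shown here.
library(tidyverse)    #dplyr, tidyr, ggplot2, tibble, and the pipe
library(GGally)       #ggpairs()
library(corrplot)     #correlation plots
library(caret)        #train(), trainControl()
library(randomForest) #randomForest()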
Creating New Column Using Mutate
I was intrigued by the data_channel_is_ columns in the set, so I decided to figure out a way to combine these columns into a single channel column with the mutate() function. Working off a StackOverflow example, I used dplyr::case_when to assign values based upon the values in the six data_channel_is_ columns.
I then realized that not every article was associated with a channel, so I used the replace_na() function from the tidyr package to replace NAs with "Other". (This is an assumption on my end - it's possible that some articles are simply not assigned to any channel.)
I finished this by converting the values of the channel column to factors. (The url column was already removed earlier, and I kept the original data_channel_is_ columns alongside the new one, since several of them are used again in the models below.) With the new channel column, the dataset now has 60 columns, and I can use channel as a categorical variable.
newsData <- newsData %>%
mutate(channel = case_when(
data_channel_is_bus == 1 ~ "Business",
data_channel_is_entertainment == 1 ~ "Entertainment",
data_channel_is_lifestyle == 1 ~ "Lifestyle",
data_channel_is_socmed == 1 ~ "SocialMedia",
data_channel_is_tech == 1 ~ "Tech",
data_channel_is_world == 1 ~ "World"
))
newsData$channel <- replace_na(newsData$channel, "Other")
newsData$channel <- as.factor(newsData$channel) #Converting to factor
view(newsData)
#Filtering for a single weekday. I did not manage to automate my reports, so I generated the seven weekday reports manually by changing this filter.
newsData <- newsData %>% filter(weekday_is_friday==1)
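As an aside (not part of the original workflow), the weekday_is_* indicators could also be collapsed into a single factor, mirroring the channel column built above; the lines below are a hedged sketch that uses a separate object name so it does not disturb the filtered data.
#Hypothetical alternative: one weekday factor instead of seven binary columns (illustration only)
newsDataWithWeekday <- newsData %>%
  mutate(weekday = case_when(
    weekday_is_monday    == 1 ~ "Monday",
    weekday_is_tuesday   == 1 ~ "Tuesday",
    weekday_is_wednesday == 1 ~ "Wednesday",
    weekday_is_thursday  == 1 ~ "Thursday",
    weekday_is_friday    == 1 ~ "Friday",
    weekday_is_saturday  == 1 ~ "Saturday",
    weekday_is_sunday    == 1 ~ "Sunday"
  ))
#filter(newsDataWithWeekday, weekday == "Friday") would match the filter above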
Per the project directions, I began by splitting the data, using sample(), with 70% of the data going to the training set (4,662 observations, newsDataTrain) and 30% going to the test set (1,999 observations, newsDataTest).
set.seed(789) #Setting seed to make it reproducible
train <- sample(1:nrow(newsData), size=nrow(newsData)*0.7)
test <- dplyr::setdiff(1:nrow(newsData), train)
newsDataTrain <- newsData[train,]
newsDataTest <- newsData[test,]
#view(newsDataTrain)
#view(newsDataTest)
Now let’s see if any of the variables that intrigued me should be included in my multiple linear regression.
Summarizations
Calculating Summary Statistics for Variables
I then used the summary() function to calculate summary statistics for each of the quantitative variables in the dataset, including Min, 1st Quartile, Median, Mean, 3rd Quartile, and Max. For the one categorical variable I created (channel), summary() instead reports the frequency of each channel.
summary(newsData)
I further explored the data by looking at correlation plots of the variables I had expressed interest in previously. I divided the variables into different chunks to make the plots easier to view. I excluded the weekday_is variables, as I felt they would only be used for the final creation of the reports, as opposed to being involved in the model. I used the code from our notes (Module 7 - Summarizing Data/Quantitative Summaries) for this purpose.
Creating First Correlation Plot with First 10 Variables
newsCorrelation1 <- cor(newsDataTrain[,c(1:10, 59)])
corrplot(newsCorrelation1, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation1, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Creating Second Correlation Plot with Other Variables
newsCorrelation2 <- cor(newsDataTrain[,c(11:20, 59)])
corrplot(newsCorrelation2, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation2, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Creating Third Correlation Plot
newsCorrelation3 <- cor(newsDataTrain[,c(21:28, 59)]) #excluding "weekday_is variables"
corrplot(newsCorrelation3, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation3, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Creating Fourth Correlation Plot
newsCorrelation4 <- cor(newsDataTrain[,c(38:47, 59)]) #excluding "weekday_is" variables
corrplot(newsCorrelation4, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation4, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Creating Fifth Correlation Plot
newsCorrelation5 <- cor(newsDataTrain[,c(48:58, 59)]) #excluding "weekday_is" variables
corrplot(newsCorrelation5, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation5, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Creating Extra Correlation Plot
I also created an extra correlation plot, to examine some variables that I had taken an interest in, where I thought there could be relationships between the predictor variables. I thought these would be of interest, as perhaps the number of keywords in the metadata would affect likelihood of someone finding the article through SEO (and then sharing it); or maybe social media users would gravitate towards longer/shorter titles, longer/shorter articles, longer/short words, articles with more negative words, and/or articles that have more links, images, or videos.
newsCorrelation6 <- cor(newsDataTrain[,c("num_keywords", "n_tokens_title", "n_tokens_content", "num_hrefs", "num_imgs", "num_videos", "average_token_length", "global_rate_positive_words", "global_rate_negative_words", "shares")])
corrplot(newsCorrelation6, type="upper", method="number", tl.pos="lt", number.cex=0.5)
corrplot(newsCorrelation6, type="lower", add=TRUE, tl.pos="n", number.cex=0.5)
Unfortunately, looking at these six plots, I do not see a substantial relationship between any one variable and the target variable of shares - those low values all appear faded on the correlation plots, representing weak relationships. However, in my final plot I do note some correlation between other variables, suggesting that there may be interactions between them. For example, the highest correlation coefficient is r=0.45, between num_hrefs and n_tokens_content. This does make practical sense; the more words an article has, the more likely it is to include more links. Perhaps less intuitive is the relationship between num_imgs and num_hrefs, with the second-highest correlation coefficient of r=0.36.
I then created a boxplot showing the shares, classified by the categorical variable of channel that I had created. The summary data (which included all channels) indicated that the median number of shares was around 1400 shares per article, though numerous outliers skew this significantly, so that the mean was 3436 shares. Looking at the plots, there do not seem to be dramatic differences in the data based upon which channel an article has been classified under.
channelsBoxplot <- ggplot(newsDataTrain, aes(x=channel, y=shares))
channelsBoxplot + geom_boxplot()
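Because shares is so heavily right-skewed, a log-scaled version of the same boxplot (an optional variant, not part of the original analysis) can make any channel-to-channel differences easier to see:
#Optional variant: log10 y-axis reduces the visual influence of the outliers
channelsBoxplot + geom_boxplot() + scale_y_log10()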
Creating ggpairs Plots
I also created two sets of ggpairs plots, to look at possible relationships in a different way. I again used the variables I looked at in the sixth correlation plot from above.
newsDataTrain %>% select("num_keywords", "n_tokens_title","n_tokens_content", "num_hrefs", "num_imgs", "shares") %>% ggpairs()
newsDataTrain %>% select("num_videos", "average_token_length", "num_keywords", "global_rate_positive_words", "global_rate_negative_words", "shares") %>% ggpairs()
Modeling
Multiple Linear Regression
To begin, I wanted to establish a better understanding of what the model would look like if I were to incorporate ALL main effects, excluding any interactions or quadratic terms.
Creating Model with All Main Effects
dataFitAll <- lm(shares~., data=newsDataTrain)
#summary(dataFitAll)
The Adjusted R-squared value is a very low 0.03558, but the overall F-test has a very small p-value; the model as a whole appears significant, even though it explains little of the variation in shares. With an alpha value of 0.01, the following variables are considered significant:
- n_tokens_title
- num_hrefs
- num_imgs
- average_token_length
- kw_avg_max
- kw_min_avg
- kw_max_avg
- kw_avg_avg
- LDA_00
- global_subjectivity
- min_positive_polarity
- data_channel_is_lifestyle
- data_channel_is_entertainment
- data_channel_is_bus
- data_channel_is_socmed
- data_channel_is_tech
- data_channel_is_world
However, since we have not focused on this tactic for selecting variables, I decided to forgo this particular approach and focus instead upon the correlations that I found in my summarization above. Though all of these correlations were very weak, I am going to experiment with several different combinations of the variables that exhibited the highest correlations with our target variable of shares. I tried not to include variables that were related or otherwise correlated with one another, such as min_negative_polarity or max_negative_polarity; in this case, I instead just chose to include avg_negative_polarity. (A short code sketch for producing this kind of ranking appears after the list below.)
The variables, in descending order of absolute correlation with shares, were as follows:
- LDA_03, with a correlation coefficient of 0.07
- avg_negative_polarity, -0.06
- kw_avg_max, 0.06
- LDA_02, -0.05
- data_channel_is_world, -0.04
- num_hrefs, 0.032
- average_token_length, -0.030
- global_subjectivity, 0.030
- LDA_04, -0.030
- num_videos, 0.028
- num_imgs, 0.023
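For reference, a ranking like the one above can be produced directly; the lines below are a small sketch (not part of the original code) using only the numeric columns of the training set.
#Correlation of every numeric column with shares, strongest (in absolute value) first;
#constant columns (e.g., the weekday indicators after filtering) give NA and are dropped by sort()
numericCols <- sapply(newsDataTrain, is.numeric)
shareCors <- cor(newsDataTrain[, numericCols])[, "shares"]
head(sort(abs(shareCors), decreasing=TRUE), 12) #the top entry is shares itself (r = 1)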
Creating Linear Regression Models
I created linear regression models, adding one variable at a time. (Note that dataFit3 repeats avg_negative_polarity rather than adding a new term; duplicated terms in a formula are collapsed, so dataFit3 is identical to dataFit2, which is why their fit statistics below match exactly. The repeated term carries through the later fits without affecting them.)
dataFit1 <- lm(shares~LDA_03,
data=newsDataTrain)
dataFit2 <- lm(shares~LDA_03 +
avg_negative_polarity,
data=newsDataTrain)
dataFit3 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity,
data=newsDataTrain)
dataFit4 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max,
data=newsDataTrain)
dataFit5 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02,
data=newsDataTrain)
dataFit6 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
num_hrefs,
data=newsDataTrain)
dataFit7 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
num_hrefs +
average_token_length,
data=newsDataTrain)
dataFit8 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
data_channel_is_world +
num_hrefs +
average_token_length +
global_subjectivity,
data=newsDataTrain)
dataFit9 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
data_channel_is_world +
num_hrefs +
average_token_length +
global_subjectivity +
LDA_04,
data=newsDataTrain)
dataFit10 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
data_channel_is_world +
num_hrefs +
average_token_length +
global_subjectivity +
LDA_04 +
num_videos,
data=newsDataTrain)
dataFit11 <- lm(shares~LDA_03 +
avg_negative_polarity +
avg_negative_polarity +
kw_avg_max +
LDA_02 +
data_channel_is_world +
num_hrefs +
average_token_length +
global_subjectivity +
LDA_04 +
num_videos +
num_imgs,
data=newsDataTrain)
Choosing the Best Linear Regression Model
I then compared these 12 models, using Adjusted R-squared and AIC.
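As a quick reminder of what these two statistics measure (standard definitions, not results from this analysis):
$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}, \qquad AIC = 2k - 2\ln(\hat{L}),$$
where n is the number of observations, p the number of predictors, k the number of estimated parameters, and L-hat the maximized likelihood. Higher adjusted R-squared and lower AIC are better.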
Comparing Fit Statistics
compareAdjR2 <- function(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, dfall) {
adjR2stats <- data.frame(fitStat="Adjusted R Square",
dataFit1=round(summary(df1)$adj.r.squared, 5),
dataFit2=round(summary(df2)$adj.r.squared, 5),
dataFit3=round(summary(df3)$adj.r.squared, 5),
dataFit4=round(summary(df4)$adj.r.squared, 5),
dataFit5=round(summary(df5)$adj.r.squared, 5),
dataFit6=round(summary(df6)$adj.r.squared, 5),
dataFit7=round(summary(df7)$adj.r.squared, 5),
dataFit8=round(summary(df8)$adj.r.squared, 5),
dataFit9=round(summary(df9)$adj.r.squared, 5),
dataFit10=round(summary(df10)$adj.r.squared, 5),
dataFit11=round(summary(df11)$adj.r.squared, 5),
dataFitall=round(summary(dfall)$adj.r.squared, 5))
}
compareAdjR2results <- compareAdjR2(dataFit1, dataFit2, dataFit3, dataFit4, dataFit5, dataFit6, dataFit7, dataFit8, dataFit9, dataFit10, dataFit11, dataFitAll)
compareAdjR2results
## fitStat dataFit1 dataFit2 dataFit3 dataFit4 dataFit5 dataFit6
## 1 Adjusted R Square 0.01801 0.01822 0.01822 0.01896 0.02049 0.02093
## dataFit7 dataFit8 dataFit9 dataFit10 dataFit11 dataFitall
## 1 0.02168 0.0225 0.02261 0.02305 0.02306 0.03558
compareAIC <- function(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, dfall) {
aicStats <- data.frame(fitStat="AIC",
dataFit1=round(AIC(df1), 5),
dataFit2=round(AIC(df2), 5),
dataFit3=round(AIC(df3), 5),
dataFit4=round(AIC(df4), 5),
dataFit5=round(AIC(df5), 5),
dataFit6=round(AIC(df6), 5),
dataFit7=round(AIC(df7), 5),
dataFit8=round(AIC(df8), 5),
dataFit9=round(AIC(df9), 5),
dataFit10=round(AIC(df10), 5),
dataFit11=round(AIC(df11), 5),
dataFitall=round(AIC(dfall), 5))
}
compareAICresults <- compareAIC(dataFit1, dataFit2, dataFit3, dataFit4, dataFit5, dataFit6, dataFit7, dataFit8, dataFit9, dataFit10, dataFit11, dataFitAll)
compareAICresults
## fitStat dataFit1 dataFit2 dataFit3 dataFit4 dataFit5 dataFit6
## 1     AIC 82765.59 82765.76 82765.76 82763.76 82758.51 82757.72
## dataFit7 dataFit8 dataFit9 dataFit10 dataFit11 dataFitall
## 1 82755.66 82754.32 82754.84 82754.07 82755.01 82741.24
Of all the models, dataFitAll has the highest Adjusted R-squared value (0.03558) and the lowest AIC (82741.24).
Of the models I built from the correlation list, the one with the most variables (dataFit11) has the highest Adjusted R-squared value (0.02306), although its AIC (82755.01) is actually a hair higher than dataFit10's (82754.07).
Creating Multiple Linear Regression Model Using the 16 Significant Predictive Variables
Though not covered in the class notes, I decided to experiment with a slightly different approach. Working off the model that fits all main effects, I tried to reduce the number of variables using backwards selection - starting with all variables, and then narrowing them down based upon p-value.
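Before the hand-built model below, it is worth noting that this kind of backward elimination could also be automated. The lines that follow are only a hedged sketch (not run for this report): base R's step() eliminates terms by AIC rather than p-value, and the redundant columns (the constant weekday indicators after filtering, plus the channel factor, which duplicates the data_channel_is_ binaries) are dropped first to reduce the chance of a rank-deficient fit.
#Hypothetical automated backward elimination by AIC (illustration only)
trainReduced <- newsDataTrain %>% select(-starts_with("weekday_is"), -is_weekend, -channel)
fullFit <- lm(shares~., data=trainReduced)
backwardFit <- step(fullFit, direction="backward", trace=0)
formula(backwardFit) #which variables survive the elimination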
dataFitSignif16 <- lm(shares~n_tokens_title +
num_imgs +
average_token_length +
kw_avg_max +
kw_max_avg +
kw_min_avg +
kw_avg_avg +
LDA_00 +
global_subjectivity +
min_positive_polarity +
data_channel_is_lifestyle +
data_channel_is_entertainment +
data_channel_is_bus +
data_channel_is_socmed +
data_channel_is_tech +
data_channel_is_world,
data=newsDataTrain)
summary(dataFitSignif16)$adj.r.square
## [1] 0.03487538
AIC(dataFitSignif16)
## [1] 82711.44
Incorporating only these 16 variables into the model gives an adjusted R-squared of 0.03488, slightly below the all-inclusive dataFitAll model, but it yields a better AIC of 82711.44, lower than dataFitAll's.
Now, I'm going to remove the two variables that are NOT significant from dataFitSignif16 and fit another linear regression.
dataFitSignif14 <- lm(shares~n_tokens_title +
kw_avg_max +
kw_max_avg +
kw_min_avg +
kw_avg_avg +
LDA_00 +
global_subjectivity +
min_positive_polarity +
data_channel_is_lifestyle +
data_channel_is_entertainment +
data_channel_is_bus +
data_channel_is_socmed +
data_channel_is_tech +
data_channel_is_world,
data=newsDataTrain)
summary(dataFitSignif14)$adj.r.square
## [1] 0.03478996
AIC(dataFitSignif14)
## [1] 82709.81
However, even though the result indicates that all of the remaining variables are significant at the alpha=0.05 level, this model reduced the Adjusted R-squared slightly, to 0.03479 (its AIC of 82709.81 is marginally lower, but the difference is negligible). I will not reduce the model any further and would prefer to use the previous model, dataFitSignif16, which has the higher Adjusted R-squared value.
Comparing Root MSE
While fit statistics are one way to determine which model is best, prediction error is actually preferred. After looking at several options, I decided to use cross-validation, per the HW 9 prediction key from Avy. With this approach, which uses 10 folds, I was able to calculate the root mean square error (RMSE) for the two models on the training set. RMSE measures the typical difference between predicted values and the actual observed values; when comparing models, a lower RMSE means more effective prediction.
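For reference, the quantity being computed is
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$
the square root of the average squared difference between predicted and observed shares.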
control <- trainControl(method="cv", number=10)
fit11Train <- train(as.formula(dataFit11), data=newsDataTrain, method="lm", trControl=control)
fit11Train$results$RMSE
## [1] 7410.049
fitSignif16Train <- train(as.formula(dataFitSignif16), data=newsDataTrain, method="lm", trControl=control)
fitSignif16Train$results$RMSE
## [1] 7319.897
I then did the same with the Test set.
control <- trainControl(method="cv", number=10)
fit11Test <- train(as.formula(dataFit11), data=newsDataTest, method="lm", trControl=control)
fit11Test$results$RMSE
## [1] 7971.706
fitSignif16Test <- train(as.formula(dataFitSignif16), data=newsDataTest, method="lm", trControl=control)
fitSignif16Test$results$RMSE
## [1] 8109.797
Looking at the results, the fitSignif16 model has the lower RMSE on the training set (about 7,320 vs. 7,410), although its RMSE on the test set is slightly higher than fit11's (about 8,110 vs. 7,972). The two are close; on balance, dataFitSignif16 is the linear model I chose to carry forward.
Creating Ensemble Model with Random Forest
Since we're focusing on prediction accuracy for our models, I decided to use the random forest method for my ensemble learning method. As noted in my introduction, the random forest method tends to be a superior version of the bagging method for bootstrap aggregation, as it helps avoid having a single strong predictor dominate every bootstrap tree, keeping the trees from being highly correlated. For this purpose, I was originally going to use the caret package, as demonstrated in Homework 12; however, the computation time was VERY long, and I found the randomForest() function from the randomForest package to be more efficient.
#trctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
#set.seed(1)
#rfNewsDataFitCaret <- train(shares~., data=newsDataTrain, method="rf",
#                            trControl=trctrl,
# preProcess = c("center", "scale"))
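If one wanted to stay within caret, the cost can be cut considerably by fixing the tuning grid and the number of trees; the commented lines below are a hedged sketch (the specific values are illustrative, not tuned).
#Hypothetical faster caret version: a single mtry value, fewer trees, 5-fold CV
#rfFitCaretFast <- train(shares~., data=newsDataTrain, method="rf",
#                        trControl=trainControl(method="cv", number=5),
#                        tuneGrid=data.frame(mtry=floor((ncol(newsDataTrain)-1)/3)),
#                        ntree=200)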
As suggested in the notes (and confirmed by this Duke ArcToolbox Fit Random Forest Model resource), the mtry option limits the number of predictors considered at each split, with the default for regression being the total number of predictor variables divided by three.
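One small caveat (my own note, not from the original report): in the fit below, mtry=ncol(newsDataTrain)/3 also counts the shares response column, so it slightly overstates p/3; a minimal sketch of the stricter calculation:
#p/3 counting predictors only (excludes the shares response column)
p <- ncol(newsDataTrain) - 1
mtryDefault <- floor(p/3)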
rfNewsDataFit <- randomForest(shares~., data=newsDataTrain, mtry=ncol(newsDataTrain)/3, ntree=200, importance=TRUE)
rfNewsDataPred <- predict(rfNewsDataFit, newdata=newsDataTest)
summary(rfNewsDataPred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 940 2081 2924 3747 4405 27240
fit16NewPred <- predict(dataFitSignif16, newdata=newsDataTest)
summary(fit16NewPred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 286.7 2152.4 2772.6 3268.1 3810.8 29318.3
Comparing Two Models on Test Data Set
I compared the random forest model to our linear model from before, dataFitSignif16. To compare the two models and determine which is preferable, I calculated the RMSE values on the test dataset that I created earlier.
rfRMSE <- sqrt(mean((rfNewsDataPred-newsDataTest$shares)^2))
rfRMSE
## [1] 8949.476
datafit16RMSE <- sqrt(mean((fit16NewPred-newsDataTest$shares)^2))
datafit16RMSE
## [1] 8874.317
Both models have very similar RMSE values. A lower RMSE is preferred; in this case, the linear regression model, dataFitSignif16, is considered a better fit than the random forest option.
Automation & Conclusions
I was unable to complete the automation component of this assignment. Instead, I knit 7 separate reports. I included conclusions about each of the sets on my Project-2 GitHub Pages site.
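For what it's worth, one way the seven reports could have been automated is with parameterized R Markdown, rendering the same .Rmd once per weekday column; the sketch below is hypothetical (the file name and the weekday parameter are assumptions, and the .Rmd would need to filter on params$weekday instead of a hard-coded column).
#Hypothetical automation of the seven weekday reports
library(rmarkdown)
weekdayCols <- c("weekday_is_monday", "weekday_is_tuesday", "weekday_is_wednesday",
                 "weekday_is_thursday", "weekday_is_friday", "weekday_is_saturday",
                 "weekday_is_sunday")
for (day in weekdayCols) {
  render("Project2Report.Rmd",                   #hypothetical file name
         output_file=paste0(day, "_report.md"),
         params=list(weekday=day))
}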