Project 2 - Trial & Error While Examining Mashable Data
This project is not my proudest moment - admittedly, I struggled with multiple elements. The most difficult component for me was the final automation of the report. I watched the Module 8 video multiple times, paid attention to activity on the forums, and did about five hours of research through StackExchange and other sources, but I’m afraid that this component entirely evaded me. My understanding is that there needs to be a separate script - perhaps a loop - in the README.md file, which utilizes the apply() function to scan through each of the columns that starts with “weekday_is_”, with each one outputting to a separate GitHub document that could then be linked to within the README. I am also under the impression that several changes must be made to the YAML header to set the parameters. Putting these steps together proved to be a major sticking point for me, and I’m afraid I had to generate the reports manually, filtering by columns - not the ideal solution I was hoping for. In retrospect, I should have requested help earlier on this issue. I felt foolish for not understanding the concept, but it was pure hubris on my part to think I would just “figure it out”.
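For reference, here is a minimal sketch of how I now believe the automation could be wired together. It is untested against my actual project: the template name `project2.Rmd`, the `day` parameter, and the output file names are all assumptions for illustration, and the Rmd file would need a matching `params` entry in its YAML header.

```r
# Sketch only: "project2.Rmd", the `day` parameter, and the output file
# names are hypothetical. The Rmd's YAML header would need something like:
#   params:
#     day: "monday"
# and the analysis would filter with, for example:
#   dayData <- newsData[newsData[[paste0("weekday_is_", params$day)]] == 1, ]

days <- c("monday", "tuesday", "wednesday", "thursday",
          "friday", "saturday", "sunday")

# One row per report: the day to analyze and the file it should produce.
reports <- data.frame(
  day = days,
  output_file = paste0(days, "Analysis.md"),
  stringsAsFactors = FALSE
)

# Render a single report, passing the day through to the Rmd as a parameter.
render_report <- function(day, output_file, input = "project2.Rmd") {
  rmarkdown::render(
    input = input,
    output_format = "github_document",
    output_file = output_file,
    params = list(day = day)
  )
}

# Render only when the template and the rmarkdown package are available,
# so this sketch does nothing harmful when run elsewhere.
if (file.exists("project2.Rmd") &&
    requireNamespace("rmarkdown", quietly = TRUE)) {
  invisible(mapply(render_report, reports$day, reports$output_file))
}
```

Each generated `.md` file could then be linked from the README by name. Using `mapply()` over the table of days and file names is just one option; a plain `for` loop would work equally well.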
Second, I struggled with fully understanding the concepts behind variable selection for our linear model. In Homework 9, I had tried to examine p-values and utilize stepwise regression to choose the best variables, but this was outside the scope of our coursework. When comparing candidate models for this project, I found that the adjusted R-squared values were extraordinarily low, but I chose the model I thought would be most satisfactory. In hindsight, I wish I had created criteria for identifying outliers and then removed them, as these likely hampered my ability to get a better fit. Next time, I would like to examine the possibility of including interactions and/or quadratic terms in my equations. Additionally, I would like to explore other ways to conduct the linear regression analysis, so as to select a model more effectively. I was pleased with the efficiency of the randomForest() function for the ensemble model, but believe I could have gone further with this. I would also have liked to make more use of the channel variable I had created, to group the results of the predictions by the channel the articles were classified under. And last but not least, I wish I had devised a loop for my fit statistics for the linear model!
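As a note to my future self, the fit-statistics loop could look roughly like the sketch below. It runs on simulated stand-in data rather than the Mashable data, the column names `x1`, `x2`, and `shares` are placeholders, and the three candidate formulas (main effects, an interaction, and a quadratic term) are only examples of the kinds of models I would want to compare.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; column names are hypothetical placeholders.
news <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
news$shares <- 3 + 2 * news$x1 + 0.5 * news$x1^2 +
  news$x1 * news$x2 + rnorm(n)

# Candidate models: main effects only, with an interaction, with a quadratic.
candidates <- list(
  main_effects = shares ~ x1 + x2,
  interaction  = shares ~ x1 * x2,
  quadratic    = shares ~ x1 + I(x1^2) + x2
)

# Fit each candidate and collect the same statistics for every model.
fit_stats <- do.call(rbind, lapply(names(candidates), function(nm) {
  fit <- lm(candidates[[nm]], data = news)
  data.frame(
    model  = nm,
    adj_r2 = summary(fit)$adj.r.squared,  # adjusted R-squared
    aic    = AIC(fit),                    # Akaike information criterion
    rmse   = sqrt(mean(residuals(fit)^2)) # training RMSE
  )
}))

fit_stats
```

Adding another candidate is then a one-line change to the `candidates` list. The RMSE here is computed on the training data only; comparing models on a held-out test set would be the fairer check.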
One of my major takeaways from this project is that there are no perfect solutions when it comes to finding a model that best suits your data. There are many approaches to fitting a model, and interpretation is largely subjective and must be kept in the context of the industry one is working within. Additionally, automation should be a means by which we can easily create multiple reports at once, reducing or entirely eliminating the need for tedious adjustments to our code along the way; knitting each of my reports manually was certainly a potent reminder of this.
To view my project, please visit this link.
I look forward to putting together a strong final project over the coming weeks!