#Load {tidymodels}
#Put this line under library(tidyverse)
library(tidymodels)
# Set a seed for reproducibility
# Change the number if you'd like (recommended)
set.seed(123)
#Mark your rows as belonging to training or test (temporary) sets
#Change "data" to the name of your data frame.
#Change the proportion if you'd like.
#The hold out sets need enough rows for you to
#test/assess models on, but any observations there
#are observations you don't get to "learn" from.
data_splits <- initial_split(data, prop = 0.75)
#Extract the training observations
train <- training(data_splits)
#We'll split these observations again
temp <- testing(data_splits)
#Again, change the seed if you like
set.seed(456)
#Split the hold-out observations into validation
#and test sets
temp_splits <- initial_split(temp, prop = 0.5)
validation <- training(temp_splits)
test <- testing(temp_splits)
Competition Assignment 2
In your first competition assignment, you joined our In-Class Kaggle Competition, downloaded the data for the competition, read it into a Quarto Notebook, and wrote a first draft of a statement of purpose for an analytics project. You’ll add to that work here.
Re-open the Quarto Notebook that contains your Statement of Purpose from Competition Assignment 1.
Re-run the code necessary to read in your data files.
Use the following code to split the data coming from data_train.csv into three sets: train, validation, and test.
There’s a better way to achieve this three-set split using initial_validation_split(), but something better is coming later, so initial_split() is all we need for now.
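If you'd like to see what the two-stage split does to the row counts, here is a minimal sketch on made-up data. The `toy_` object names and the 1,000-row data frame are assumptions for illustration only; the proportions and seeds mirror the ones used in the assignment code.

```r
library(tidymodels)

# Hypothetical stand-in for your competition data: 1,000 rows, two columns
toy_data <- tibble::tibble(
  x = rnorm(1000),
  y = rnorm(1000)
)

# First split: 75% training, 25% held out
set.seed(123)
toy_splits <- initial_split(toy_data, prop = 0.75)
toy_train <- training(toy_splits)
toy_temp <- testing(toy_splits)

# Second split: divide the held-out rows evenly into validation and test
set.seed(456)
toy_temp_splits <- initial_split(toy_temp, prop = 0.5)
toy_validation <- training(toy_temp_splits)
toy_test <- testing(toy_temp_splits)

# Share of the original rows that ended up in each set
split_sizes <- c(
  train = nrow(toy_train),
  validation = nrow(toy_validation),
  test = nrow(toy_test)
)
split_sizes / nrow(toy_data)
```

With these proportions, roughly three quarters of the rows land in the training set and the remaining quarter is shared about equally between validation and test.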
- Remember that the validation and test sets should stay hidden until later parts of the analytics project.
Conduct an exploratory analysis on the training data (train).
Similar to what you’ve been seeing and doing in class, your exploratory analysis should mix both code and text. I encourage you to look at previous notebooks for examples of how to do this as well as for examples of different data visualization techniques. In building your submission, it is important to keep only working code in your Quarto Notebook, and to keep only relevant plots and summary statistics. Remember, your goal in this project is to build a model to predict a variable of interest. The majority of your explorations should be attempts at finding predictors which are associated with that response variable.
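As one possible starting point for looking at associations between predictors and the response, here is a small sketch. The data frame `toy_train` and its columns (`size`, `neighborhood`, `price`) are made up for illustration; swap in your own `train` data and the actual column names from the competition.

```r
library(tidyverse)

# Hypothetical stand-in for `train`; replace with your own training data
set.seed(1)
toy_train <- tibble(
  size = runif(200, 50, 250),
  neighborhood = sample(c("A", "B", "C"), 200, replace = TRUE),
  price = 1000 * size + rnorm(200, sd = 20000)
)

# Numeric predictor against a numeric response: scatterplot
p_scatter <- ggplot(toy_train, aes(x = size, y = price)) +
  geom_point(alpha = 0.5) +
  labs(x = "Size", y = "Price", title = "Price against size")

# Categorical predictor against a numeric response: side-by-side boxplots
p_box <- ggplot(toy_train, aes(x = neighborhood, y = price)) +
  geom_boxplot() +
  labs(x = "Neighborhood", y = "Price", title = "Price by neighborhood")

p_scatter
p_box
```

Each plot should be followed by a sentence or two in your notebook describing what it suggests about that predictor's relationship with the response.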
Focus your efforts here on summary statistics and data visualization. You may also look for issues such as missing data. You do not, however, need to do any feature engineering at this point (although, to win the competition, you’ll almost certainly need to use those techniques).
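A minimal sketch of summary statistics and a missing-data check might look like the following. Again, `toy_train` and its columns are hypothetical placeholders, not the competition data.

```r
library(tidyverse)

# Hypothetical stand-in for `train`, with a few missing values
toy_train <- tibble(
  size = c(120, 85, NA, 200, 150),
  neighborhood = c("A", "B", "B", NA, "C"),
  price = c(250000, 180000, 210000, 400000, NA)
)

# Count the missing values in every column
missing_counts <- toy_train %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Summary statistics for the response, ignoring missing values
price_summary <- toy_train %>%
  summarise(
    n = n(),
    mean_price = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    sd_price = sd(price, na.rm = TRUE)
  )

missing_counts
price_summary
```

Note that `n()` counts all rows, including those where the response is missing, so report the missing counts alongside any statistics computed with `na.rm = TRUE`.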
- Once you are done, render your Quarto document to HTML and submit both your Quarto and HTML files using the Competition Assignment 2 folder in BrightSpace. As a reminder, your submission should look like a partial report, including only the Statement of Purpose and Exploratory Data Analysis sections. Your report will mix text and code like you’ve seen, and built, in our class notebooks. All of your code should come with context. Be sure to answer the questions “What do the outputs mean and why do we care?”
– Dr. G