Competition Assignment 2
In your first competition assignment, you joined our In-Class Kaggle Competition, downloaded the data for the competition, read it into an Quarto Notebook, and wrote a first draft of a statement of purpose for an analytics project. You’ll add to that work here.
Re-open the Quarto Notebook that contains your Statement of Purpose from Competition Assignment 1.
Re-run the code necessary to read in your data files.
Write code to split the data coming from
data_train.csv
into three sets:train
,validation
andtest
. You’ll need two separate calls toinitial_split()
for this. Be sure to set a seed, withset.seed()
before each call.
- Reach out on Slack if you have any questions about how to do this!
- Remember that the
validation
andtest
sets should stay hidden until later parts of the analytics project. Conduct an exploratory analysis on the training data (train
).
Similar to what you’ve been seeing and doing in class, your exploratory analysis should mix both code and text. I encourage you to look at previous notebooks for examples of how to do this as well as for examples of different data visualization techniques. In building your submission, it is important to keep only working code in your Quarto Notebook, and to keep only relevant plots and summary statistics. Remember, your goal in this project is to build a model to predict a variable of interest. The majority of your explorations should be attempts at finding predictors which are associated with that response variable.
Focus your efforts here on summary statistics and data visualization. You may also look for issues such as missing data. You do not, however, need to do any feature engineering at this point (although, to win the competition, you’ll almost certainly need utilize those techniques).
- Once you are done, render your Quarto document to HTML and submit both your Quarto and HTML file using the Competition Assignment 2 folder in BrightSpace. As a reminder, your submission should look like a partial report, including only the Statement of Purpose and Exploratory Data Analysis sections. Your report will mix text and code like you’ve seen, and built, in our class notebooks. All of your code should come with context. Be sure to answer the questions “What do the outputs mean and why do we care?”.
– Dr. G