MAT434: Homework 5
We’re about to enter the statistical modeling portion of our course. Our first stop will be to look at classification schemes appropriate for binary (two-class) classification problems. We’ll be working with a synthetic data set on the fictitious Spaceship Titanic! Like its namesake vessel, the Spaceship Titanic encountered tragedy when several of its passengers were warped to an alternate dimension during flight! We’ll try to build a model that can help us understand who was transported. It will help if you come to class already familiar with the data set, so you’ll do some exploration in this homework assignment.
Complete the following:
Open RStudio and use File -> Recent Projects... to select the R Project that manages your GitHub repository. Confirm that you want to open your project.
Use File -> New File -> Quarto Document... to create a new Quarto notebook. Fill in the fields in the box that opens, using a meaningful file name, like Spaceship Titanic Rescue. Confirm all of your choices and click the button to create the file.
In the setup code chunk near the top of the notebook, add code to load the {tidyverse} and {tidymodels} packages. You may also want to load {kableExtra} to nicely format any tabular output, and {patchwork} for easy organization of plots.
Navigate to the location of a copy of the data set here. Copy the URL from your web browser.
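As a rough sketch of the package-loading step, the body of the setup chunk might look something like the following. It assumes all four packages are already installed; the comments are illustrative only.

```r
# Setup chunk: load the packages used throughout the notebook
library(tidyverse)   # data wrangling and plotting (readr, dplyr, ggplot2, ...)
library(tidymodels)  # modeling tools, including rsample for data splitting
library(kableExtra)  # optional: nicer formatting for tabular output
library(patchwork)   # optional: arranging several plots together
```

If one of these calls fails with a "there is no package called" error, the package likely needs to be installed first with install.packages().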
Back in RStudio, in that same setup code chunk, read in the Spaceship Titanic data using read_csv() with the URL to the raw data as the only argument to the function. Be sure that you surround the URL with quotes and that you store the result in a named object, perhaps spaceship_data.
Do some basic exploration of the data set. How many rows are there? How many columns? Are there missing values? Try using inline R code if you can.
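The reading and exploration steps above could be sketched as follows. So that the sketch runs on its own, it reads a tiny made-up stand-in table rather than the real file; the column names other than Transported and all of the values are assumptions for illustration. The names n_rows, n_cols, and missing_by_column are likewise hypothetical.

```r
library(tidyverse)

# Toy stand-in so this sketch runs without the real file; the values are made up.
# With the real data, the call would instead look like:
#   spaceship_data <- read_csv("<URL to the raw data, in quotes>")
spaceship_data <- read_csv(
  I("PassengerId,HomePlanet,Age,Transported
0001_01,Europa,39,FALSE
0002_01,Earth,24,TRUE
0003_01,Europa,,FALSE
0004_01,,16,TRUE"),
  show_col_types = FALSE
)

n_rows <- nrow(spaceship_data)  # number of rows (passengers)
n_cols <- ncol(spaceship_data)  # number of columns (variables)

# Number of missing values in each column
missing_by_column <- spaceship_data %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

glimpse(spaceship_data)
missing_by_column

# In the notebook text, inline R code such as `r n_rows` inserts the
# computed value directly into a sentence when the document is rendered.
```

Storing the counts in named objects makes them easy to reference from inline R code, so the written summary stays correct if the data changes.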
Use the initial_split(), training(), and testing() functions to split your data into training and test sets. Don’t forget to set a seed with set.seed(), and you’ll likely want to use the strata argument of initial_split() to ensure proportional representation of the groups corresponding to the Transported column in both sets.
Conduct an exploratory data analysis on your training data.
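A minimal sketch of the splitting step described above, under a few assumptions: the data here is a made-up stand-in for spaceship_data, the seed value and the 75% training proportion are arbitrary choices, and the object names data_splits, train, and test are hypothetical.

```r
library(tidyverse)
library(tidymodels)

# Toy stand-in for spaceship_data; the values are made up.
set.seed(434)
spaceship_data <- tibble(
  Age = round(runif(200, min = 1, max = 80)),
  Transported = rep(c(TRUE, FALSE), times = 100)
)

# Setting a seed makes the random split reproducible
set.seed(434)
data_splits <- initial_split(spaceship_data, prop = 0.75, strata = Transported)

train <- training(data_splits)  # rows used for exploration and model fitting
test <- testing(data_splits)    # rows held back for final evaluation

# Compare the share of transported passengers in each set
train %>% count(Transported) %>% mutate(prop = n / sum(n))
test %>% count(Transported) %>% mutate(prop = n / sum(n))
```

Stratifying on Transported keeps the class proportions in the two sets close to those of the full data set, which matters most when one class is much rarer than the other.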
When you are done, render your document and use the Git tab in the top right pane of RStudio to Pull, Commit, and Push your new notebook and changes out to your remote repository.
We’ll pick up from here at our next class meeting.
Stop by my office if you have any questions or need help.