We’re about to enter the statistical modeling portion of our course. Our first stop will be to look at classification schemes appropriate for binary (two-class) classification problems. We’ll be working with a synthetic data set on the fictitious Spaceship Titanic! Like its namesake vessel, the Spaceship Titanic encountered tragedy when several of its passengers were warped to an alternate dimension during flight! We’ll try to build a model that can help us understand who was transported. It will help if you come to class already familiar with the data set, so you’ll do some exploration in this homework assignment.

Complete the following:

  1. Open RStudio and use File -> Recent Projects... to select the R Project which is managing your GitHub repository. Confirm that you want to open your project.

  2. Use File -> New File -> R Markdown... to create a new R Markdown notebook. Fill in the fields in the box that opens, using a meaningful file name, like Spaceship Titanic Rescue. Confirm all of your choices and click the button to create the file.

  3. In the setup code chunk near the top of the notebook, add code to load the {reticulate} library. You may also want to load {kableExtra} to nicely format any tabular output.

  4. Add a Python code chunk to your notebook and import {pandas} and any {plotnine} functions you plan to use (you’ll add to this list as you are constructing your notebook). You’ll want to import train_test_split from sklearn.model_selection here as well.

  5. Navigate to the location of a copy of the data set here. Copy the URL from your web browser.

  6. Back in RStudio, in the Python code chunk you added in step 4, read in the Spaceship Titanic data using pd.read_csv() with the URL to the raw data as the only argument to the function. Be sure that you surround the URL with quotes and that you store the result in a named object, perhaps spaceship_data.

  7. Do some basic exploration of the data set. How many rows are there? How many columns? Are there missing values? Try using inline R code if you can.

  8. Use the train_test_split() function to split your data into training and test sets. Don’t forget to set a seed with the random_state argument, and you’ll likely want to use the stratify argument to ensure proportional representation of the groups corresponding to the Transported column in both sets.

  9. Conduct an exploratory data analysis on your training data.

  10. When you are done, use the Git tab in the top right pane of RStudio to Pull, Commit, Push your new notebook and changes out to your remote repository.
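
If it helps to see how steps 4, 6, 7, and 8 fit together, here is a minimal sketch of one possible Python chunk. It is not the only way to do this. Because the data URL is something you copy yourself in step 5, the sketch builds a tiny stand-in data frame instead of calling pd.read_csv(); the stand-in's rows, the 80/20 split, and the seed value are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption for illustration only). In your notebook you
# would instead call pd.read_csv() with the quoted raw-data URL from step 5.
spaceship_data = pd.DataFrame({
    "PassengerId": [f"{i:04d}_01" for i in range(1, 21)],
    "Age": [24.0, None, 31.0, 45.0, 19.0] * 4,
    "Transported": [True, False] * 10,
})

# Step 7: basic exploration -- dimensions and missing values per column
n_rows, n_cols = spaceship_data.shape
missing_per_column = spaceship_data.isna().sum()
print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)

# Step 8: split into training and test sets.
# random_state fixes the seed so the split is reproducible;
# stratify keeps the Transported proportions the same in both sets.
train, test = train_test_split(
    spaceship_data,
    test_size=0.2,        # assumed proportion; choose what suits your analysis
    random_state=434,     # arbitrary seed chosen for this sketch
    stratify=spaceship_data["Transported"],
)

# Check that both sets have the same share of transported passengers
print(train["Transported"].mean(), test["Transported"].mean())
```

With your real data, swap the stand-in for your pd.read_csv() call and keep the rest as a starting point. Objects created in Python chunks can be reached from R through {reticulate}'s py object (for example py$spaceship_data), which is one way to approach the inline R code suggested in step 7.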

We’ll pick up from here at our next class meeting. Stop by my office if you have any questions or need help.