MAT 434: Introduction and What to Expect

Dr. Gilbert

January 6, 2025

What Are We Here For?

Software Installation

Where is everyone at?

  • R?
  • RStudio?
  • Git?

I’ll help you complete the installations throughout today’s discussion.

Syllabus

Major Highlights from the Syllabus: I’ll ask you to read the syllabus, but the most important items are on the following slides.

Instructor and Office Hours

  • Instructor: Dr. Adam Gilbert

    • e-mail address: a.gilbert1@snhu.edu

    • Office: Robert Frost Hall, Room 311

    • Office Hours:

      • Mondays 9:30am - 10:30am
      • Thursdays 9:00am - 11:00am
      • Fridays 10:00am - noon

Required Resources

First and foremost…everything is free!

Grading Scheme

Grade Item | Value
Participation | 10%
Homework (~6) | 25%
Competition Assignments (~6) | 40%
GitHub Pages Portfolio | 5%
Project | 20%

Explanations of Grade Items

  • Participation: Come to class and contribute actively
  • Homework: These are mostly on-ramping assignments to get you up to speed with our software. We’ll have about six.
  • Competition Assignments: We’ll learn classification by doing classification. We have six planned assignments associated with a modeling competition on price ranges for homes.
  • GitHub Pages Portfolio: I want to help you all build a professional, outward-facing portfolio to share your work with potential employers or graduate schools. You’ll be able to share pieces of work there from this course and from others, and you can easily add a link to the portfolio on your resume. My hope is that you’ll use this beyond just our class.
  • Project: A final course project spanning our last three weeks together.

Competition?

  • Hosted at Kaggle
  • Closed to only students in our course
  • You’ll need a free account – link in Slack
  • Your grade is not tied to your finishing place
  • The competition aspect is friendly – please keep it that way (although good-natured heckling and trash talk are fine as long as all parties approve).
  • Former MAT 300 students, what would you share with our new friends about the competition?

Brightspace

  • Weekly Announcement
  • Assignments
  • Gradebook
  • Go to the webpage for everything else

Course Webpage

I’ve built a webpage to organize our course content.

  • Syllabus

  • Tentative timeline

    • Truly tentative – we can slow down, speed up, swap out topics, etc.
    • Links to short “explainers” to look over before each class meeting so that we can spend our time in class working with data
    • Links to assignments – you can see everything now.

What’s Class Like?

  • The beginning of the semester will include about three weeks of on-ramping to make sure we’re all up to speed with R, Quarto, {tidymodels}, and GitHub
  • My hope is that we’ll generally spend class time actively working with data rather than listening to lectures
  • Ideally, you’ll collaborate with one another

A Note on My Approach to Class

  • I’m open to change in all of my courses.
  • If the structure isn’t working for you, let’s chat and see what changes we can make to improve your experience.
  • If you don’t want to tell me in person, leave an anonymous note under my office door.

My goal in this course is for all of you to learn as much about classification and statistical modeling as possible – we can’t achieve that if you don’t feel like you are benefiting from our class meetings.

AI Use

  • While we will be interfacing with data by running code, MAT 434 is not a programming course

  • Because of this, I’ll encourage you to work with your favorite AI as an assistant

    • Don’t have the AI do the work for you
    • Do have the AI help you fix broken code
    • Do have the AI help you “trick out” your plots
    • Do have the AI help you improve your document formatting

Summary: “Do this for me…” requests are against the rules, but “How do I…”, “Help me fix this…”, or “Help me make this better…” requests are all encouraged. This goes for R code, markdown and formatting in Quarto, and your writing.

A Road Map to Our Semester

  • I’ve planned for us to discuss a lot of material in MAT 434
  • We do have the ability to slow down, change focus, swap topics in or out, etc.
  • What follows is a very generic road map of what we will discuss. Starting now.

What Are We Doing?

  • Artificial Intelligence (AI)?
  • Data science?
  • Machine learning?
  • Statistical learning?

Background For Working With Data

  • What is data?
  • How to work with data: Don’t spend it all in one place!
  • How do we visualize and story-tell with data?
  • What if my data is messy? (Spoiler Alert: It will be!)

Ethics, Data Use, and Models

  • There’s probably room for a full course on this topic

  • Here are a couple of things to keep in mind

    • Models should do no harm, especially with historically marginalized groups or vulnerable populations
    • Our models are trained on historical data in order to make future predictions or decisions
    • If our historical data has biases, then we are at risk of training a model to be biased in the same ways
    • Check out this blog post by Simon P. Couch on fair machine learning with {tidymodels}

What Foundational Statistics Knowledge Do I Need?

Very little, actually

  • Some intuition about randomness, noisy data, and uncertainty

  • Approximate Confidence Intervals: \(\left(\text{point estimate}\right) \pm 2\cdot \left(\text{standard error}\right)\)

  • Basic Hypothesis Testing: \(p < \alpha\) means the data are incompatible with the null (skeptical) hypothesis at significance level \(\alpha\)

    • For us, this will generally mean that a predictor variable provides explanatory value
    • …and this is really only relevant for our first classification model
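
A quick worked example of the two ideas above, using made-up numbers (the estimate, standard error, and \(p\)-value here are purely illustrative and do not come from any course data). Suppose a point estimate is \(0.62\) with a standard error of \(0.04\). The approximate confidence interval is then

\[
0.62 \pm 2\cdot\left(0.04\right) = 0.62 \pm 0.08 \quad\Longrightarrow\quad \left(0.54,\ 0.70\right)
\]

Similarly, if a test for a predictor returns \(p = 0.003\) and we chose \(\alpha = 0.05\), then \(p < \alpha\), so we would treat that predictor as providing explanatory value.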

Classification

  • What is classification?
  • How do we assess model performance?
  • Model classes

    • Logistic regression
    • Support vector machines
    • Nearest neighbor classifiers
    • Decision trees
    • Bagged trees
    • Random forests
    • Gradient boosted trees
  • Feature engineering

    • Principal components analysis
    • Text-based features
  • Deep Learning and Basic Neural Networks

Homework (Part I)

Homework 1: Finish the software installation (due at the beginning of Wednesday’s class) – come see me if you need help

Read Chapter 1 (pages 1 - 14) of An Introduction to Statistical Learning (ISLR), or watch the corresponding videos from the textbook authors (the first two videos in the playlist).

  • Discusses three arenas where statistical learning is applied

    • Regression, Classification, and Unsupervised Learning
    • Our focus is classification, but knowing about all three will help you grasp what we are trying to do in our class

Homework (Part II)

(Recommended) Work through the Topic 2 and Topic 3 Tutorial notebooks for an intro to R and how to compute summary statistics. Before Wednesday of Week 2, work through Topic 4 on data visualization.

Stop by my office (Robert Frost Hall, Room 311), say hi, and let’s briefly chat about the following:

  • Why are you taking this course?
  • What do you hope to get out of it?
  • What contexts do you want us to pull data from in our course? (animal physiology, psychology, medicine, business/economics, sports, etc.)

Next Time…

  • Our first GitHub Repo
  • Creating an R Project
  • Working with data in R
  • Mixing narration and code with Quarto
  • Pushing changes to your repo