Welcome to INFO 4940/5940: Applied Machine Learning

Lecture 1

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

August 27, 2024

Agenda

  • Introductions
  • What is machine learning?
  • Software
  • What is INFO 4940?
  • This week’s tasks

Staff intros

Meet the instructor

Dr. Benjamin Soltoff

Lecturer in Information Science

Gates Hall 216

Headshot of Dr. Benjamin Soltoff

Meet the course team

TAs

  • Saif M.
  • Sam G.
  • Yuhan T.

Meet each other!

  • Name
  • Major
  • The last show you watched

Time: 2 minutes

What is machine learning?

What is machine learning? (2024 edition)

Your turn

How are statistics and machine learning related?

How are they similar? Different?

Time: 3 minutes

The “whole” game

A better way to think about it

A flowchart diagramming the machine learning operations lifecycle, including collecting data, understanding and cleaning data, training and evaluating models, deploying model, and monitoring model.
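
The lifecycle in the flowchart can be sketched end to end in a few lines. This is a minimal illustration, not the course's workflow (which uses R and tidymodels): it is written in plain Python with only the standard library, the data are simulated, and every function name (`collect_data`, `clean_data`, `split_data`, `train`, `rmse`, `needs_retraining`) is hypothetical. Deployment is represented only by reusing the fitted `model` on new rows.

```python
import random
import statistics


def collect_data(n=200, seed=1, intercept=2.0):
    """Collect: simulate (x, y) pairs from y = 3x + intercept + noise (synthetic data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + intercept + rng.gauss(0, 1)))
    return rows


def clean_data(rows):
    """Understand and clean: drop rows with a missing value."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def split_data(rows, prop=0.8, seed=1):
    """Hold out a test set so evaluation uses data the model has not seen."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * prop)
    return shuffled[:cut], shuffled[cut:]


def train(rows):
    """Train: least-squares fit of a line with one predictor."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def rmse(model, rows):
    """Evaluate: root mean squared error of the model's predictions."""
    slope, intercept = model
    return (sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)) ** 0.5


def needs_retraining(model, new_rows, baseline_rmse, tolerance=1.5):
    """Monitor: flag the model when error on new data grows well past the baseline."""
    return rmse(model, new_rows) > tolerance * baseline_rmse


train_rows, test_rows = split_data(clean_data(collect_data()))
model = train(train_rows)
test_rmse = rmse(model, test_rows)

# Simulate data collected after deployment, where the relationship has shifted.
drifted_rows = collect_data(seed=2, intercept=10.0)
print(model, test_rmse, needs_retraining(model, drifted_rows, test_rmse))
```

The tolerance of 1.5 is an arbitrary choice for illustration; in practice the monitoring rule depends on the application.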

How is machine learning used in practice?

How we will do this

R and RStudio

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

Major differences between R and Python

                     | R                                                   | Python
Syntax               | Functional language                                 | Object-oriented language
Statistical learning | Developed by statisticians for statistical analysis | Meh
Machine learning     |                                                     |
Deep learning        |                                                     |
Visualization        | {ggplot2}                                           | {matplotlib} + others
Package management   | CRAN                                                | pip/virtualenv/PyPI/Anaconda
Speed                | Somewhat slower                                     | Somewhat faster
Community            | Academia and industry                               | Larger (general-purpose programming language)

tidymodels

Hex logos for tidymodels, rsample, parsnip, recipes, tune, and yardstick

tidymodels.org

  • {tidymodels} is a collection of packages for modeling and machine learning using {tidyverse} principles
  • All packages share an underlying philosophy and a common grammar

Who is this class for?

Armando

A professional headshot for Armando with a neutral background. Armando is in his early 20s. Created by DALL·E.

  • PhD student in political science
  • Feels comfortable with regression and econometric methods in R
  • Will be using text classification on a large corpus for his dissertation
  • Seeks a reproducible workflow to train and evaluate his models

Palmer

A professional headshot for Palmer with a neutral background. Palmer is biracial (Black and Indian descent) and 24 years old. Created by DALL·E.

Karen

A professional headshot for Karen with a neutral background. Karen is Caucasian, 22 years old, a student at Cornell University, wearing glasses, and in natural lighting. Created by DALL·E.

  • Fourth-year undergraduate student in information science, concentrating in data science
  • Took INFO 3950 (Data Analytics for Information Science) last year
  • Wants to learn how to use machine learning models in production, and combine her theoretical knowledge with practical applications

Chen

A professional headshot for Chen with a neutral background. Chen is of Chinese descent, 22 years old, uses a wheelchair for mobility, and has a confident expression that reflects her strength. Created by DALL·E.

  • Born and raised in Shenzhen, China
  • Information science major, plans to apply for industry positions
  • Completed a summer internship at Dow Chemical and saw the analytics team was using R for predictive modeling
  • Wants to learn more about machine learning and how to apply it to real-world problems

Course overview

Homepage

https://info4940.infosci.cornell.edu/

  • All course materials
  • Links to Canvas, GitHub, RStudio Workbench, etc.
  • Let’s take a tour!

Course toolkit

All linked from the course website:

Important

Make sure you can access RStudio before class on Thursday.

Activities: Prepare, Participate, Practice, Perform

  • Prepare: Introduce new content and prepare for class by completing the readings

  • Participate: Attend and actively participate in class, office hours, and team meetings

  • Practice: Practice applying ML techniques and computing with application exercises during class, graded for completion

  • Perform: Put together what you’ve learned to analyze real-world data

    • Homework assignments x 6(-ish) (individual)
    • Team project

Activities: Participate

Preparing for and participating in class

Not preparing for class, not actively participating

Cadence

  • Application exercises: Complete by the end of the day
  • HWs: Posted Friday morning, due the following Wednesday at 11:59pm
  • Project: Deadlines throughout the semester, with some class time dedicated to working on them, and most work done in teams outside of class

Grading

Category              | Percentage
Homework              | 50%
Project               | 40%
Application Exercises | 10%

See course syllabus for how the final letter grade will be determined.
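
As a rough illustration of how the weights above combine, the sketch below computes a weighted overall percentage. The function name `weighted_score` and the example scores are hypothetical, and the result is only a numeric percentage: the syllabus, not this sketch, determines the final letter grade.

```python
# Category weights taken from the grading table.
WEIGHTS = {
    "homework": 0.50,
    "project": 0.40,
    "application_exercises": 0.10,
}


def weighted_score(scores):
    """Combine per-category percentages (0-100) into one weighted percentage."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must have exactly the categories in WEIGHTS")
    return sum(WEIGHTS[category] * scores[category] for category in WEIGHTS)


# Hypothetical scores: 0.50 * 90 + 0.40 * 85 + 0.10 * 100 = 45 + 34 + 10 = 89
example = weighted_score(
    {"homework": 90, "project": 85, "application_exercises": 100}
)
print(round(example, 2))
```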

15-minute rule


Support

  • Attend office hours
  • Ask and answer questions on the discussion forum
  • Reserve email for questions on personal matters and/or grades
  • Read the course support page

Announcements

  • Posted on Canvas (Announcements tool); be sure to check regularly (or forward announcements to your email)
  • I’ll assume that you’ve read an announcement by the next “business” day

Diversity + inclusion

  • I want you to feel like you belong in this class and are respected
  • We are committed to full inclusion in education for all persons
  • If you feel that we have failed to meet these goals, please either let us know or report it, and we will address the issue

Accessibility

I want this course to be accessible to students of all abilities. Please feel free to let me know if there are circumstances affecting your ability to participate in class.

Course policies

Late work, waivers, regrades policy

  • We have policies!
  • Read about them on the course syllabus and refer back to them when you need to

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Homework must be completed individually. You may not directly share answers or code with others; however, you are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source

  • All code must be written by you, the human being

Generative AI

  • Use generative AI to facilitate, rather than hinder, learning

  • GAI tools for reference purposes

    How do I make a scatterplot using ggplot2 in R?

  • 🤔 GAI tools for writing my code/analysis

    • You may use GAI tools to assist in writing code in this class

    • You may not make use of the technology as a substitute for critical thinking

    • I reserve the right to orally assess any student on their submissions to verify they meet the learning objectives for the assignment

  • GAI tools for narrative

  • You are ultimately responsible for the work you turn in; it should reflect your understanding of the course content

Academic integrity

  1. A student shall in no way misrepresent his or her work.
  2. A student shall in no way fraudulently or unfairly advance his or her academic position.
  3. A student shall refuse to be a party to another student’s failure to maintain academic integrity.
  4. A student shall not in any other manner violate the principle of academic integrity.

Most importantly!

Ask if you’re not sure if something violates a policy!

Application exercise

  • What do you hope to learn from this class?
  • Based on the syllabus and current list of topics, what are you most excited about doing in this course?
  • What do you think is currently missing from the class that should be added (e.g. topics, assignments, techniques)? Are there certain things you want reduced and/or eliminated to make additional space for other topics?

Discuss with your peers, then submit your individual responses.

Time: 8 minutes

Wrap-up

Before Thursday

A film recommendation (nine years late…)

A poster for the film The Descendants featuring Evie, Mal, Carlos, and Jay.