Identifying and collecting data

Lecture 8

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

September 24, 2024

Announcements

  • Grading homework 01
  • Are you still taking this class?

Learning objectives

  • Review techniques for collecting and importing data
  • Identify common sources of data for data science projects
  • Explore data collection strategies for the team project

Reading data into R

Reading rectangular data

  • readr:
    • Most commonly: read_csv()
    • Maybe also: read_tsv(), read_delim(), etc.
  • readxl: read_excel()
  • arrow: read_arrow(), read_parquet()
  • haven: read_sas(), read_sav(), read_dta()
  • googlesheets4: read_sheet()
  • data.table: fread()
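
As a minimal sketch of the most common case above, the following writes a tiny CSV to a temporary file and reads it back with read_csv(). The file contents and column names (id, name, score) are made up for illustration; declaring col_types up front is one reasonable choice, not a requirement.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
# (the data are invented for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ada,90.5", "2,Grace,88"), csv_path)

# Declare column types explicitly rather than relying on type guessing
scores <- read_csv(
  csv_path,
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    score = col_double()
  )
)

scores
```

The other readers listed above follow a similar pattern: a file path in, a data frame out, with optional arguments to control how columns are parsed.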

Databases

Structured Query Language (SQL)

  • PostgreSQL, MariaDB, SQL Server, Snowflake, BigQuery, SQLite, DuckDB, etc.
  • Access data on disk instead of in memory
  • Export data to disk only when necessary

{DBI} + {dbplyr}
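
A minimal sketch of the {DBI} + {dbplyr} workflow, assuming the RSQLite, dplyr, and dbplyr packages are installed. It uses an in-memory SQLite database and the built-in mtcars data purely as stand-ins; a real project would connect to its own database. The query is translated to SQL and run by the database, and rows are only pulled into R when collect() is called.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database used only as a stand-in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# A lazy reference to the remote table; no rows are loaded yet
cars_db <- tbl(con, "mtcars")

# dplyr verbs are translated to SQL by dbplyr
avg_mpg <- cars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  arrange(cyl)

# Inspect the generated SQL
show_query(avg_mpg)

# Bring the (small) result into memory only when it is needed
result <- collect(avg_mpg)

dbDisconnect(con)
```

Keeping the summary step in the database and collecting only the result is one way to follow the "export only when necessary" advice above.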

Application programming interface (API)

  • Representational State Transfer (REST)
  • Uniform Resource Locator (URL)
  • Use R package implementation (if available)
  • If not, write your own API wrapper with {httr2}
  • Be prepared for significant data wrangling and rectangling
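
A rough sketch of what a hand-written wrapper and the follow-up rectangling might look like, assuming the httr2, tibble, and tidyr packages are installed. The base URL, endpoint path, query parameters, and response fields are all hypothetical; a real API's documentation determines these. The request is built but not sent, and a hand-written list stands in for the parsed JSON response.

```r
library(httr2)
library(tibble)
library(tidyr)

# Hypothetical base URL, used only for illustration
base_url <- "https://api.example.com"

# Build (but do not send) a request to a hypothetical search endpoint
build_search_request <- function(query, page = 1) {
  request(base_url) |>
    req_url_path_append("v1", "search") |>
    req_url_query(q = query, page = page) |>
    req_user_agent("info4940-demo (hypothetical)") |>
    req_retry(max_tries = 3)
}

req <- build_search_request("cornell")
req$url

# With a real API, the request would be performed and parsed like this:
# resp <- req_perform(req)
# parsed <- resp_body_json(resp)

# Stand-in for a parsed JSON response (invented fields)
parsed <- list(
  list(id = 1L, title = "First result"),
  list(id = 2L, title = "Second result")
)

# Rectangle the nested list: one row per item, one column per field
results <- tibble(item = parsed) |>
  unnest_wider(item)

results
```

Real responses are usually more deeply nested than this, so several rounds of unnesting are often needed before the data are rectangular.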

Screen scraping

Data collection

Define the goal

  • What is the problem you are trying to solve?
  • What do your sponsors want to know?
  • How will you deploy your results?
  • Aim for concrete goals
  • Identify clear stopping conditions (e.g. time/financial constraints)
  • Prepare to adjust/abandon the goal as you determine what is feasible

Sources of data

Application exercise

ae-07

  • Work in groups of 3-4 individuals
  • Share your project topic ideas and pick one to focus on for class today
  • Develop a data collection strategy
    • What variables do you need?
    • What are their units of analysis? How will they be measured?
    • How will you obtain these variables?

Wrap-up

Recap

  • R works with lots of different data files/structures
  • Finding good data is hard (and sometimes 💰💰💰)
  • Don’t hesitate to reach out for assistance finding appropriate measures