Project management

Lecture 11

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2025

September 30, 2025

Announcements

Learning objectives

  • Identify project-oriented workflows for machine learning projects
  • Define reproducible computing environments
  • Utilize {renv} and uv to maintain project dependencies
  • Implement version control with Git and GitHub
  • Structure machine learning projects for collaboration and reproducibility
  • Brainstorm project ideas

Technical pain points for collaboration

  • File management
  • Versioning and reproducibility
  • Different toolsets
  • Knowledge transfer

Project-oriented workflows

Adopt a project-oriented workflow

Why

  • Work on more than 1 thing at a time

  • Collaborate, communicate, distribute

  • Start and stop

How

  • Dedicated directory

  • Positron workspace

  • RStudio Project

  • Git repo, probably syncing to a remote

What goes in a project?

Everything related to the project

  • Data
  • Code
  • Documentation
  • Output files
  • Reports
  • Presentations

Do you know where your files are?

Working directory vs. home directory

  • Working directory is associated with a specific process or running application
  • Home directory is a static, persistent thing
    • Differs across users
    • Differs across devices

Working directory ≠ home directory
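
A minimal Python sketch of the distinction, using only the standard library. The temporary directory is a stand-in for "somewhere else on disk" and is not part of the lecture material.

```python
import os
import tempfile
from pathlib import Path

home_before = Path.home()   # tied to the user account
start_dir = Path.cwd()      # tied to this running process

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)               # the process moves somewhere else
    moved_dir = Path.cwd()      # the working directory follows the process
    home_after = Path.home()    # the home directory does not change
    os.chdir(start_dir)         # move back before the temp dir is removed

print("working directory moved:", moved_dir != start_dir)
print("home directory moved:   ", home_after != home_before)
```

The working directory changes as soon as the process moves; the home directory stays the same for a given user on a given device.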

Practice safe paths

  • Use the {here} (R) or pyhere (Python) package to build paths inside a project.

  • Leave working directory at top-level at all times, during development.

  • Absolute paths are formed at runtime.

here::here()

library(here)
here()
## [1] "/Users/bcs88/Projects/info-4940/course-site"

Build a file path

here("slides/extras/awesome.txt")
## [1] "/Users/bcs88/Projects/info-4940/course-site/slides/extras/awesome.txt"
cat(readLines(here("slides/extras/awesome.txt")))
## OMG this is so awesome!

What if we change the working directory?

setwd(here("slides"))
getwd()
## [1] "/Users/bcs88/Projects/info-4940/course-site/slides"
cat(readLines(here("slides/extras/awesome.txt")))
## OMG this is so awesome!
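
The same idea can be sketched in Python with pathlib alone. This loosely mirrors the idea behind {here}; it is not the implementation of {here} or the pyhere API. The helper names `find_project_root` and `project_path`, and the choice of `.git` as the root marker, are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found at or above {start}")


def project_path(*parts: str, start: Optional[Path] = None) -> Path:
    """Build an absolute path inside the project, whatever the working directory is."""
    root = find_project_root(start if start is not None else Path.cwd())
    return root.joinpath(*parts)
```

Calling `project_path("slides", "extras", "awesome.txt")` from the project root or from any subdirectory should give the same absolute path, because the path is anchored to the root at runtime rather than to the working directory.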

Filepaths and Quarto documents

Users/
  bcs88/
    supreme-court/
      .git/
      data/
        scotus.csv
      analysis/
        exploratory-analysis.qmd
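      final-report.qmd

  • Working directory is the project root (/Users/bcs88/supreme-court/)
  • When rendered, a .qmd file assumes its own directory is the working directory
  • read_csv("data/scotus.csv") works from the project root but fails when analysis/exploratory-analysis.qmd is rendered
  • read_csv(here("data/scotus.csv")) works in both cases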
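
A small Python sketch of why the relative path is fragile. It rebuilds the directory layout above inside a temporary directory (the file contents are placeholders, not real data) and checks whether `data/scotus.csv` can be found from two different working directories.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout from the slide; the CSV content is a placeholder.
    root = Path(tmp).resolve() / "supreme-court"
    (root / "data").mkdir(parents=True)
    (root / "analysis").mkdir()
    (root / "data" / "scotus.csv").write_text("placeholder\n")

    original_dir = Path.cwd()
    results = {}
    for label, workdir in [("project root", root), ("analysis/", root / "analysis")]:
        os.chdir(workdir)
        results[label] = {
            # Resolved against the current working directory
            "relative": Path("data/scotus.csv").exists(),
            # Resolved against the project root, like here("data/scotus.csv")
            "anchored": (root / "data" / "scotus.csv").exists(),
        }
    os.chdir(original_dir)  # leave the temp dir before it is removed

for label, found in results.items():
    print(label, found)
```

The relative lookup succeeds only when the working directory is the project root; the root-anchored lookup succeeds from both locations.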

Positron vs. RStudio

  • RStudio detects the Quarto file location and interactively executes code cells in that context
  • Positron always interactively executes Quarto files in the context of the project root directory

Version control with Git

File tracking

What files to commit

  • Source files (.R, .py, .qmd, etc.)
  • Data files
  • Output files (PDFs, HTML, images, etc.)

What files to not commit

  • Temporary files
  • Log files
  • Files with private details
  • Any file larger than 100 megabytes (GitHub blocks pushes that contain them)
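
One way to check the size guideline before committing is to scan the project for large files. The sketch below is an illustration only: the function name `oversized_files` is hypothetical, and treating "100 megabytes" as 100 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "100 megabytes" is read as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root: Path, limit: int = LIMIT_BYTES) -> list[Path]:
    """Return files under `root` larger than `limit` bytes, skipping .git/."""
    found = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue  # Git's own storage is not something you commit
        if path.is_file() and path.stat().st_size > limit:
            found.append(path)
    return sorted(found)
```

Running `oversized_files(Path.cwd())` from the project root lists candidates for .gitignore or Git LFS.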

.gitignore

  • Hidden plain-text file stored in the repository
  • Tells Git which files/directories to ignore

Git LFS

Synchronizing work with Git

Every time you sit down to work:

  1. Before you code, pull the latest code from the repo server (pull/sync)
  2. Code or edit files
  3. Create a commit of your changes (stage & commit)
  4. Pull the latest commits (versions) from the server (pull/sync)
  5. Resolve any merge conflicts (resolve, stage, & commit)
  6. Push commit(s) (push)
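
The routine above can be sketched as a small Python wrapper around the Git command line. The function names are hypothetical, and staging everything with `git add --all` is an assumption; in practice you may prefer to stage files selectively. By default the sketch only prints the commands.

```python
import subprocess

# Step 1: run before you start coding.
START_OF_SESSION = [["git", "pull"]]


def end_of_session(message: str) -> list[list[str]]:
    """Steps 3, 4, and 6: commit, pull again, then push."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "pull"],
        ["git", "push"],
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops at the first failing command, so a pull that
            # ends in a merge conflict halts the sequence before the push.
            subprocess.run(command, check=True)
```

Step 5 stays manual: if the second pull reports a conflict, resolve it, stage, and commit before pushing.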

Merge conflicts

Conflicts generally arise when two people have changed the same lines in a file, or if one developer deleted a file while another developer was modifying it.

Source: Atlassian

When you pull your team member’s commits into your local repository, it may result in a conflict!

Types of merges

  1. Different files
  2. Same text file, different lines
  3. Same text file, same lines

Set rebase to FALSE

git config --global pull.rebase false

Merging with Git – different files

Git merges automatically (no human intervention required)!

Merging with Git – same file, different lines

Git merges automatically (no human intervention required)!

Merging with Git – same file, same lines

Git cannot merge automatically. Developer must resolve conflict!

Resolving merge conflicts

  1. Look for conflict markers

    <<<<<<<, =======, >>>>>>>

  2. Pick the lines you want to keep. Remove markers.

    Caution: Carefully study each line. You may need to combine the lines, not just pick one!

  3. Stage, commit, push
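
Step 1 can be partly automated. The following is a rough Python sketch, with a hypothetical function name, that reports lines starting with a conflict marker. It uses the zoo title example from this section as input.

```python
MARKERS = ("<<<<<<<", "=======", ">>>>>>>")


def find_conflict_markers(text: str) -> list[tuple[int, str]]:
    """Return (line number, marker) for every line that starts with a marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(MARKERS):
            hits.append((number, line[:7]))
    return hits


example = """<<<<<<< (your work)
  <title>Zoo</title>
=======
  <title>Ithaca Zoo</title>
>>>>>>> (server)
"""

for number, marker in find_conflict_markers(example):
    print(f"line {number}: {marker}")
```

A check like this only finds the markers. It can also flag legitimate lines that happen to begin with seven equals signs, and deciding which lines to keep remains a human judgment.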

<<<<<<< (your work)
  <title>Zoo</title>
=======
  <title>Ithaca Zoo</title>
>>>>>>> (server)

Reproducible computing environments

Why reproducible computing environments?

Differences in:

  • Operating systems
  • File structure
  • Installed software
  • Software versions
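
A quick way to see these differences is to have each team member print a summary of their own environment. This is a minimal standard-library sketch; the function name and the choice of which packages to report are assumptions.

```python
import platform
from importlib import metadata


def describe_environment(packages: list[str]) -> dict[str, str]:
    """Summarize the OS, the Python version, and selected package versions."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info
```

For example, `describe_environment(["pandas"])` run on two laptops may return different values for every key, which is the problem that environment managers are meant to address.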

Tools for reproducible computing environments

  • {renv} for R projects
  • uv for Python projects
  • Docker for containerized environments

Maintaining reproducible computing environments

# R, via {renv}
renv::snapshot(type = "all")  # save the state of your R packages
renv::restore()               # restore the state of your R packages

# Python, via uv (run in a terminal)
uv add <package>              # add a package to the environment
uv sync                       # synchronize the environment

After pulling from the remote repo, you may need to synchronize your environment if a team member has changed it.

Integrating multiple languages

[Image: cartoon of the R and Python logos as cheerful characters on a bright yellow background. Generated using ChatGPT.]

Why use multiple languages?

  • Different languages have different strengths
  • Different languages have different ecosystems
  • Different team members have different language preferences

Interoperability tips

  • Identify which languages will be used for discrete tasks
  • Implement work using language-specific scripts or Quarto documents
  • Keep computing environments up-to-date
  • Import and export data using common formats (CSV, Feather, etc.)
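
The last tip can be sketched in Python with the standard library's csv module. The helper names are hypothetical and the rows in any call are illustrative; the point is that a CSV file written here can be read from R with read_csv(), and the reverse.

```python
import csv
from pathlib import Path


def write_rows(path: Path, rows: list[dict]) -> None:
    """Write a list of dictionaries to a CSV file with a header row."""
    fieldnames = list(rows[0])
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path: Path) -> list[dict]:
    """Read a CSV file back into a list of dictionaries."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

CSV carries no type information, so every value comes back as a string. Formats such as Feather keep column types, at the cost of needing an extra library in each language.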

Knowledge transfer

Difficulties in knowledge transfer

  • People change jobs frequently
  • People forget things over time
  • Context changes over time

Mitigating knowledge transfer issues

  • Use version control
  • Use reproducible computing environments
  • Document everything
    • README files
    • Informative commit messages
    • Code comments
    • Adhere to consistent style guides
    • Avoid overreliance on generative AI and your memory

Don’t be a jackass

Team conflicts

  • Conflict is healthy (if managed well)
  • Establish expectations through your team contract
  • Resolve conflict - don’t run from it

Project grade adjustments

At the end of the project, we will review each team member’s contributions to the project.

  • Confidential peer evaluations
  • Git history of contributions
  • If students do not contribute equitably, their project grade may be adjusted downwards
  • If students are penalized because of a slacking team member, their project grade may be adjusted upwards
  • Students will not be rewarded for doing more than their fair share

Penalties for failing to contribute

I am serious about every member of the team equitably contributing to the project. Students who fail to contribute equitably may receive up to a 100% deduction on their project grade.

Identify a project topic

📝 Brainstorm project topics

Instructions

Briefly share your project topic ideas. Select one to flesh out now.

  • State the objectives of the project
  • Identify the hypothetical stakeholders
  • Identify the target variable(s) - how will you measure the outcome of interest?
  • Identify at least three sources of data you might use to measure the target variable(s) and relevant predictors, and how you would obtain them (e.g. downloadable file, database, API, web scraping)

Wrap-up

Recap

  • Adopt a project-oriented workflow
  • Utilize version control with Git and GitHub
  • Keep reproducible computing environments up-to-date
  • Integrate multiple programming languages
  • Mitigate knowledge transfer issues with documentation
  • Contribute equitably to team projects