AE 19: Retrieval Augmented Generation (RAG)

Application exercise (R and Python)

Modified: November 6, 2025

15_coding-assistant

# Task ------------------------------------------------------------------------
library(ellmer)

# **Step 1:** Run the code below as-is to try the task without any extra
# context. How does the model do? Can you run the function? Does it give you the
# weather? Does it know enough about the {weathR} package to complete the task?
#
# **Step 2:** Now, let's add some context. Head over to the GitHub repo for
# {weathR} (link in `docs.R.md`). Copy the project description from the
# `README.md` and paste it into the `docs.R.md` file.
#
# **Step 3:** Uncomment the extra lines to include these docs in the prompt and
# try again.

chat <- chat("anthropic/claude-3-7-sonnet-20250219", echo = "output")

chat$chat(
  ## Extra context from package docs
  # brio::read_file(here::here("_exercises/15_coding-assistant/docs.R.md")),
  ## Task prompt
  paste(
    "Write a simple function that takes latitude and longitude as inputs",
    "and returns the weather forecast for that location using the {weathR}",
    "package. Keep the function concise and simple and don't include error",
    "handling or data re-formatting. Include documentation in roxygen2 format,",
    "including examples for NYC and Atlanta, GA."
  )
)
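
# Results ----------------------------------------------------------------------

# Put the result from the model in the code block below to try it out (this
# assumes you have the {weathR} package installed).
library(weathR)
# ...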
# %% setup
import chatlas
import dotenv
from pyhere import here

dotenv.load_dotenv()

# %% [markdown]
# **Step 1:** Run the code below as-is to try the task without any extra
# context. How does the model do? Can you run the function? Does it give you the
# weather? Does it know enough about the NWS Python package to complete the
# task?
#
# **Step 2:** Now, let's add some context. Head over to the GitHub repo for the
# NWS package (link in `docs.py.md`). Copy the project description from the
# `README.md` and paste it into `docs.py.md`.
#
# **Step 3:** Uncomment the extra lines to include these docs in the prompt and
# try again.

# %% task
chat = chatlas.ChatAuto("anthropic/claude-3-7-sonnet-20250219")

chat.chat(
    ## Extra context from package docs
    # here("_exercises/15_coding-assistant/docs.py.md").read_text(),
    ## Task prompt
    "Write a simple function that takes latitude and longitude as inputs "
    "and returns the weather forecast for that location using the NWS "
    "package. Keep the function concise and simple and don't include error "
    "handling or data re-formatting. Include a short docstring, including "
    "including examples for NYC and Atlanta, GA.",
)


# %% [markdown]
# Put the result from the model in code block below to try it out.

# %% results
import NWS as weather
# ...

16_rag

#+ setup
library(ragnar)

# Step 1: Read, chunk and create embeddings for "R for Data Science" ----------

#' This example is based on https://ragnar.tidyverse.org/#usage.
#'
#' The first step is to crawl the R for Data Science website to find all the
#' pages we'll need to read in.
#'
#' Then, we create a new ragnar document store that will use OpenAI's
#' `text-embedding-3-small` model to create embeddings for each chunk of text.
#'
#' Finally, we read each page as markdown, use `markdown_chunk()` to split that
#' markdown into reasonably sized chunks, and insert each chunk into the vector
#' store. That insertion step automatically sends the chunk text to OpenAI to
#' create the embedding, and ragnar stores the embedding alongside the original
#' text of the chunk.

#+ create-store

base_url <- "https://r4ds.hadley.nz"
pages <- ragnar_find_links(base_url, children_only = TRUE)

store_location <- here::here("_exercises/16_rag/r4ds.ragnar.duckdb")

store <- ragnar_store_create(
  store_location,
  title = "R for Data Science",
  # Need to start over? Set `overwrite = TRUE`.
  # overwrite = TRUE,
  embed = \(x) embed_openai(x, model = "text-embedding-3-small")
)

cli::cli_progress_bar(total = length(pages))
for (page in pages) {
  cli::cli_progress_update(status = page)

  chunks <- page |>
    read_as_markdown() |>
    # The next step breaks the markdown into chunks. This is where you have the
    # most control over what content is grouped together for embedding and later
    # retrieval. Feel free to experiment with settings in `?markdown_chunk()`.
    markdown_chunk()

  ragnar_store_insert(store, chunks)
}
cli::cli_progress_done()

ragnar_store_build_index(store)
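
#' Want more or less text per chunk? `markdown_chunk()` takes tuning
#' arguments. A minimal sketch, assuming the `target_size` and `target_overlap`
#' arguments described in the ragnar docs (double-check `?markdown_chunk` for
#' your installed version):

#+ chunking-options
# read_as_markdown(pages[[1]]) |>
#   markdown_chunk(target_size = 800, target_overlap = 0.5)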

# Step 2: Inspect your document store -----------------------------------------

#' Now that we have the vector store, what chunks are surfaced when we ask a
#' question? To find out, we'll use the ragnar store inspector app and an
#' example question.
#'
#' Here's a question someone might ask an LLM. Copy the task markdown to use in
#' the ragnar store inspector app.

#+ inspect-store
task <- r"--(
Could someone help me filter one data frame by matching values in another?

I’ve got two data frames with a common column `code`. I want to keep rows in `data1` where `code` exists in `data2$code`. I tried using `filter()` but got no rows back.

Here’s a minimal example:

```r
library(dplyr)

data1 <- data.frame(
    closed_price = c(49900L, 46900L, 46500L),
    opened_price = c(51000L, 49500L, 47500L),
    adjust_closed_price = c(12951L, 12173L, 12069L),
    stock = as.factor(c("AAA", "AAA", "AAC")),
    date3 = as.factor(c("2010-07-15", "2011-07-19", "2011-07-23")),
    code = as.factor(c("AAA2010", "AAA2011", "AAC2011"))
)

data2 <- data.frame(
    code = as.factor(c("AAA2010", "AAC2011")),
    ticker = as.factor(c("AAA", "AAM"))
)
```

What I tried:

```r
price_code <- data1 %>% filter(code %in% data2)
```

This returns zero rows. What’s the simplest way to do this?
)--"

ragnar_store_inspect(store)
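
#' You don't have to use the app: you can also retrieve chunks for a query
#' programmatically. A minimal sketch using `ragnar_retrieve()` (see
#' `?ragnar_retrieve` for the full set of arguments):

#+ retrieve-chunks
ragnar_retrieve(store, task, top_k = 5)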


# Step 3: Use document store in a chatbot --------------------------------------

#' Finally, ragnar provides a special tool that attaches to an ellmer chat
#' client and lets the model retrieve relevant chunks from the vector store on
#' demand. Run the code below to launch a chatbot backed by all the knowledge in
#' the R for Data Science book. Paste the task markdown from above into the chat
#' and see how the chatbot uses the retrieved chunks to improve its answer, or
#' ask it your own questions about R for Data Science.

#+ chatbot

library(ellmer)

chat <- chat(
  name = "openai/gpt-4.1-nano",
  system_prompt = r"--(
You are an expert R programmer and mentor. You are concise.

Before responding, retrieve relevant material from the knowledge store. Quote or
paraphrase passages, clearly marking your own words versus the source. Provide a
working link for every source you cite.
  )--"
)

# Attach the retrieval tool to the chat client. You can choose how many chunks
# or documents are retrieved each time the model uses the tool.
ragnar_register_tool_retrieve(chat, store, top_k = 10)

live_browser(chat)
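
# Prefer the console? You can also send the task directly to the chat client
# instead of launching the browser app:
# chat$chat(task)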
# %%
import chatlas
import dotenv
from pyhere import here

dotenv.load_dotenv()

# %% [markdown]
# Python has a plethora of options for working with knowledge stores
# ([llama-index](https://docs.llamaindex.ai/en/stable/),
# [pinecone](https://docs.pinecone.io/reference/python-sdk), etc.). It doesn’t
# really matter which one you choose, but given its popularity, maturity, and
# simplicity, let’s demonstrate with the
# [`llama-index`](https://docs.llamaindex.ai/en/stable/) library.
#
# With `llama-index`, it’s easy to create a knowledge store from a wide variety
# of input formats, such as text files, [web
# pages](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/),
# and [much more](https://pypi.org/project/llama-index-readers-markitdown/).
#
# For this task, I've downloaded the notebook files in the [Polars
# Cookbook](https://github.com/escobar-west/polars-cookbook) and converted them
# to markdown. This snippet will ingest those markdown files, embed them,
# and create a vector store `index` that is ready for
# [retrieval](https://posit-dev.github.io/chatlas/misc/RAG.html#retrieve-content).
#
# Creating the vector store index can take a while, so we write it to disk to
# persist between sessions.

# %%
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

polars_cookbook = here("data/polars-cookbook")
docs = SimpleDirectoryReader(polars_cookbook).load_data()
index = VectorStoreIndex.from_documents(docs)

index.storage_context.persist(
    persist_dir=here("_exercises/16_rag/polars_cookbook_index")
)

# %% [markdown]
# With our `index` now available on disk, we’re ready to implement
# `retrieve_polars_knowledge()` – a function that retrieves relevant content
# from our Polars Cookbook knowledge store based on the user query.

# %%
from llama_index.core import StorageContext, load_index_from_storage

index_polars_cookbook = here("_exercises/16_rag/polars_cookbook_index")
storage_context = StorageContext.from_defaults(persist_dir=index_polars_cookbook)
index = load_index_from_storage(storage_context)


def retrieve_polars_knowledge(query: str) -> str:
    """
    Retrieve relevant content from the Polars Cookbook knowledge store based on
    the user query.

    Parameters
    ----------
    query : str
        The user query to search for relevant Polars knowledge.

    Returns
    -------
    str
        The retrieved excerpts, concatenated into a single string.
    """
    retriever = index.as_retriever(similarity_top_k=5)
    nodes = retriever.retrieve(query)
    return "\n\n".join(f"<excerpt>{x.text}</excerpt>" for x in nodes)


# %% [markdown]
# This particular implementation retrieves the top 5 most relevant chunks
# from the `index` based on the user query, but you can adjust the number of
# results by changing the `similarity_top_k` parameter. There’s no magic number
# for this parameter, but `llama-index` defaults to 2, so you may want to
# increase it if you find that the retrieved content is too sparse or not
# relevant enough.
#
# Let's try this out now with a task:

# %%
task = """
How do I find all rows in a DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?

Example 1: the following DataFrame, which I group by ['Sp','Mt']:

```
Sp Mt Value count
0 MM1 S1 a 2
1 MM1 S1 n **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **5**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
```

Expected output: get the result rows whose count is max in each group, like:

```
1 MM1 S1 n **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **5**
8 MM4 S2 uyi **7**
```
"""

retrieve_polars_knowledge(task)

# %% [markdown]
# Finally, we can plug this retrieval function into a chatlas chatbot. Copy
# the task from the previous block and paste it into the chatbot to see how it
# works!

# %%
chat = chatlas.ChatAuto("openai/gpt-4.1-nano")

chat.register_tool(retrieve_polars_knowledge)

chat.app()
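
# %% [markdown]
# Prefer to stay in the notebook? As in the coding-assistant exercise, you can
# also send the task directly to the chat client instead of launching the app:

# %%
# chat.chat(task)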
