AE 19: Retrieval Augmented Generation (RAG)
Application exercise
15_coding-assistant

R

# Task ------------------------------------------------------------------------
library(ellmer)
# **Step 1:** Run the code below as-is to try the task without any extra
# context. How does the model do? Can you run the function? Does it give you the
# weather? Does it know enough about the {weathR} package to complete the task?
#
# **Step 2:** Now, let's add some context. Head over to the GitHub repo for
# {weathR} (link in `docs.R.md`). Copy the project description from the
# `README.md` and paste it into the `docs.R.md` file.
#
# **Step 3:** Uncomment the extra lines to include these docs in the prompt and
# try again.
chat <- chat("anthropic/claude-3-7-sonnet-20250219", echo = "output")
chat$chat(
  ## Extra context from package docs
  # brio::read_file(here::here("_exercises/15_coding-assistant/docs.R.md")),
  ## Task prompt
  paste(
    "Write a simple function that takes latitude and longitude as inputs",
    "and returns the weather forecast for that location using the {weathR}",
    "package. Keep the function concise and simple and don't include error",
    "handling or data re-formatting. Include documentation in roxygen2 format,",
    "including examples for NYC and Atlanta, GA."
  )
)

Python

# %% setup
import chatlas
import dotenv
from pyhere import here
dotenv.load_dotenv()
# %% [markdown]
# **Step 1:** Run the code below as-is to try the task without any extra
# context. How does the model do? Can you run the function? Does it give you the
# weather? Does it know enough about the NWS Python package to complete the
# task?
#
# **Step 2:** Now, let's add some context. Head over to the GitHub repo for the
# NWS package (link in `docs.py.md`). Copy the project description from the
# `README.md` and paste it into `docs.py.md`.
#
# **Step 3:** Uncomment the extra lines to include these docs in the prompt and
# try again.
# %% task
chat = chatlas.ChatAuto("anthropic/claude-3-7-sonnet-20250219")
chat.chat(
    ## Extra context from package docs
    # here("_exercises/15_coding-assistant/docs.py.md").read_text(),
    ## Task prompt
    "Write a simple function that takes latitude and longitude as inputs "
    "and returns the weather forecast for that location using the NWS "
    "package. Keep the function concise and simple and don't include error "
    "handling or data re-formatting. Include a short docstring, including "
    "examples for NYC and Atlanta, GA.",
)
# %% [markdown]
# Put the result from the model in code block below to try it out.
# %% results
import NWS as weather
# ...

16_rag

R

#+ setup
library(ragnar)
# Step 1: Read, chunk and create embeddings for "R for Data Science" ----------
#' This example is based on https://ragnar.tidyverse.org/#usage.
#'
#' The first step is to crawl the R for Data Science website to find all the
#' pages we'll need to read in.
#'
#' Then, we create a new ragnar document store that will use OpenAI's
#' `text-embedding-3-small` model to create embeddings for each chunk of text.
#'
#' Finally, we read each page as markdown, use `markdown_chunk()` to split that
#' markdown into reasonably-sized chunks, and insert each chunk into the vector
#' store. That insertion step automatically sends the chunk text to
#' OpenAI to create the embedding, and ragnar stores the embedding alongside the
#' original text of the chunk.
#+ create-store
base_url <- "https://r4ds.hadley.nz"
pages <- ragnar_find_links(base_url, children_only = TRUE)
store_location <- here::here("_exercises/16_rag/r4ds.ragnar.duckdb")
store <- ragnar_store_create(
  store_location,
  title = "R for Data Science",
  # Need to start over? Set `overwrite = TRUE`.
  # overwrite = TRUE,
  embed = \(x) embed_openai(x, model = "text-embedding-3-small")
)
cli::cli_progress_bar(total = length(pages))
for (page in pages) {
  cli::cli_progress_update(status = page)

  chunks <- page |>
    read_as_markdown() |>
    # The next step breaks the markdown into chunks. This is where you have the
    # most control over what content is grouped together for embedding and later
    # retrieval. Feel free to experiment with settings in `?markdown_chunk()`.
    markdown_chunk()

  ragnar_store_insert(store, chunks)
}
cli::cli_progress_done()
ragnar_store_build_index(store)
# Step 2: Inspect your document store -----------------------------------------
#' Now that we have the vector store, which chunks are surfaced when we ask a
#' question? To find out, we'll use the ragnar store inspector app and an
#' example question.
#'
# Here's a question someone might ask an LLM. Copy the task markdown to use in
# the ragnar store inspector app.
#+ inspect-store
task <- r"--(
Could someone help me filter one data frame by matching values in another?
I’ve got two data frames with a common column `code`. I want to keep rows in `data1` where `code` exists in `data2$code`. I tried using `filter()` but got no rows back.
Here’s a minimal example:
```r
library(dplyr)
data1 <- data.frame(
  closed_price = c(49900L, 46900L, 46500L),
  opened_price = c(51000L, 49500L, 47500L),
  adjust_closed_price = c(12951L, 12173L, 12069L),
  stock = as.factor(c("AAA", "AAA", "AAC")),
  date3 = as.factor(c("2010-07-15", "2011-07-19", "2011-07-23")),
  code = as.factor(c("AAA2010", "AAA2011", "AAC2011"))
)

data2 <- data.frame(
  code = as.factor(c("AAA2010", "AAC2011")),
  ticker = as.factor(c("AAA", "AAM"))
)
```
What I tried:
```r
price_code <- data1 %>% filter(code %in% data2)
```
This returns zero rows. What’s the simplest way to do this?
)--"
ragnar_store_inspect(store)
# Step 3: Use document store in a chatbot --------------------------------------
#' Finally, ragnar provides a special tool that attaches to an ellmer chat
#' client and lets the model retrieve relevant chunks from the vector store on
#' demand. Run the code below to launch a chatbot backed by all the knowledge in
#' the R for Data Science book. Paste the task markdown from above into the chat
#' and see how the chatbot uses the retrieved chunks to improve its answer, or
#' ask it your own questions about R for Data Science.
#+ chatbot
library(ellmer)
chat <- chat(
  name = "openai/gpt-4.1-nano",
  system_prompt = r"--(
You are an expert R programmer and mentor. You are concise.
Before responding, retrieve relevant material from the knowledge store. Quote or
paraphrase passages, clearly marking your own words versus the source. Provide a
working link for every source you cite.
)--"
)
# Attach the retrieval tool to the chat client. You can choose how many chunks
# or documents are retrieved each time the model uses the tool.
ragnar_register_tool_retrieve(chat, store, top_k = 10)
live_browser(chat)

Python

# %%
import chatlas
import dotenv
from pyhere import here
dotenv.load_dotenv()
# %% [markdown]
# Python has a plethora of options for working with knowledge stores
# ([llama-index](https://docs.llamaindex.ai/en/stable/),
# [pinecone](https://docs.pinecone.io/reference/python-sdk), etc.). It doesn’t
# really matter which one you choose, but due to its popularity, maturity, and
# simplicity, let’s demonstrate with the
# [`llama-index`](https://docs.llamaindex.ai/en/stable/) library.
#
# With `llama-index`, it’s easy to create a knowledge store from a wide variety
# of input formats, such as text files, [web
# pages](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/),
# and [much more](https://pypi.org/project/llama-index-readers-markitdown/).
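#
# For instance, here's a minimal sketch of pulling a web page into llama-index
# documents (this assumes the optional `llama-index-readers-web` package is
# installed; the URL below is just a placeholder):
#
# ```python
# from llama_index.readers.web import SimpleWebPageReader
#
# # Fetch the page and convert its HTML into plain-text documents
# web_docs = SimpleWebPageReader(html_to_text=True).load_data(
#     ["https://example.com/some-page"]
# )
# ```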
#
# For this task, I've downloaded the notebook files in the [Polars
# Cookbook](https://github.com/escobar-west/polars-cookbook) and converted them
# to markdown. This snippet will ingest those markdown files, embed them,
# and create a vector store `index` that is ready for
# [retrieval](https://posit-dev.github.io/chatlas/misc/RAG.html#retrieve-content).
#
# Creating the vector store index can take a while, so we write it to disk to
# persist between sessions.
# %%
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
polars_cookbook = here("data/polars-cookbook")
docs = SimpleDirectoryReader(polars_cookbook).load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(
    persist_dir=here("_exercises/16_rag/polars_cookbook_index")
)
# %% [markdown]
# With our `index` now available on disk, we’re ready to implement
# `retrieve_polars_knowledge()` – a function that retrieves relevant content
# from our Polars Cookbook knowledge store based on the user query.
# %%
from llama_index.core import StorageContext, load_index_from_storage
index_polars_cookbook = here("_exercises/16_rag/polars_cookbook_index")
storage_context = StorageContext.from_defaults(persist_dir=index_polars_cookbook)
index = load_index_from_storage(storage_context)
def retrieve_polars_knowledge(query: str) -> str:
    """
    Retrieve relevant content from the Polars Cookbook knowledge store based on
    the user query.

    Parameters
    ----------
    query : str
        The user query to search for relevant Polars knowledge.
    """
    retriever = index.as_retriever(similarity_top_k=5)
    nodes = retriever.retrieve(query)
    return "\n\n".join([f"<excerpt>{x.text}</excerpt>" for x in nodes])
# %% [markdown]
# This particular implementation retrieves the top 5 most relevant documents
# from the `index` based on the user query, but you can adjust the number of
# results by changing the `similarity_top_k` parameter. There’s no magic number
# for this parameter, but `llama-index` defaults to 2, so you may want to
# increase it if you find that the retrieved content is too sparse or not
# relevant enough.
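#
# For example, a quick sketch (reusing the `index` loaded above) that pulls
# more chunks per query:
#
# ```python
# # Retrieve the 10 most similar chunks instead of the default
# retriever = index.as_retriever(similarity_top_k=10)
# nodes = retriever.retrieve("How do I pivot a polars DataFrame?")
# ```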
#
# Let's try this out now with a task:
# %%
task = """
How do I find all rows in a DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame, which I group by ['Sp','Mt']:
```
Sp Mt Value count
0 MM1 S1 a 2
1 MM1 S1 n **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **5**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
```
Expected output: get the result rows whose count is max in each group, like:
```
1 MM1 S1 n **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **5**
8 MM4 S2 uyi **7**
```
"""
retrieve_polars_knowledge(task)
# %% [markdown]
# Finally, we can plug this retrieval function into a chatlas chatbot. Copy
# the task from the previous block and paste it into the chatbot to see how it
# works!
# %%
chat = chatlas.ChatAuto("openai/gpt-4.1-nano")
chat.register_tool(retrieve_polars_knowledge)
chat.app()

Acknowledgments

- Materials derived in part from Programming with LLMs and licensed under a Creative Commons Attribution 4.0 International (CC BY) License.