llm-join is an R package designed to use Large Language Models (LLMs, such as GPT-5, Claude, DeepSeek, etc.) for fuzzy joining of dataframes. When the key columns of two dataframes have spelling differences, are in different languages, or cannot be matched exactly, llm-join can automatically generate prompts and utilize LLMs to assist in high-quality joining.
You can install the development version of llmjoin from GitHub with:
devtools::install_github("evanliu3594/llmjoin")Please note that all information is stored strictly locally in your system configuration (run
tools::R_user_dir("llmjoin", "config")to see the full path), and is never uploaded or shared.
library(llmjoin)
# OpenAI
set_llm(provider = "openai", key = "your-api-key")
# Claude (Anthropic)
set_llm(provider = "claude", key = "your-api-key")
# Gemini (via OpenAI-compatible endpoint)
set_llm(provider = "gemini", key = "your-api-key")
# Custom endpoint (Ollama, DeepSeek, Kimi, Grok, Qwen, etc.)
set_llm(provider = "openai",
url = "https://your-custom-endpoint/v1/chat/completions",
key = "your-api-key",
model = "your-model-name")Below examples used Deepseek-V4-Flash.
Match numeric month codes to month names — the LLM understands that “01” means January.
x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))
llm_join(x, y, key1 = "id", key2 = "month")
# month id value amount
# 1 Feb 02 20 200
# 2 January 01 10 100
# 3 <NA> 04 40 NAMatch approximate or differently-formatted numeric identifiers — the LLM handles rounding, unit conversion, and format differences.
left <- data.frame(
weight_kg = c(1.0, 2.5, 5.0),
product = c("Widget", "Gadget", "Thing")
)
right <- data.frame(
weight_lb = c("2.2 lb", "5.5 lb", "11 lb"),
price = c(4.99, 9.99, 19.99)
)
llm_join(left, right, key1 = "weight_kg", key2 = "weight_lb")
# weight_lb weight_kg product price
# 1 11 lb 5.0 Thing 19.99
# 2 2.2 lb 1.0 Widget 4.99
# 3 5.5 lb 2.5 Gadget 9.99Match country names to ISO codes — the LLM bridges different naming conventions, languages, and abbreviations.
left <- data.frame(
country = c("China", "United States", "Germany", "日本"),
sales = c(1500, 3200, 2100, 800)
)
right <- data.frame(
code = c("CN", "US", "DE", "JP"),
region = c("Asia", "Americas", "Europe", "Asia")
)
llm_join(left, right, key1 = "country", key2 = "code")
# code country sales region
# 1 CN China 1500 Asia
# 2 DE Germany 2100 Europe
# 3 JP 日本 800 Asia
# 4 US United States 3200 Americasx <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))
joint_prompt(unique(x["id"]), unique(y["month"])) |> writeClipboard()Paste the prompts to ask your LLM model, and copy the answer, going back to R and continue run:
joint <- parse_joint(readr::clipboard(), key1 = "id", key2 = "month")
Reduce(\(x, y) merge(x, y, all.x = TRUE), list(x, joint, y))
# month id value amount
# 1 Feb 02 20 200
# 2 January 01 10 100
# 3 <NA> 04 40 NAMIT License