High-Performance Open-Source Archive
keys that look identical but aren’t
Diagnoses join keys at the character level before the join runs, then repairs them.
Most join failures come down to keys that look identical but aren’t:
a trailing space, a stray uppercase letter, a zero-width Unicode
character. R compares the raw bytes and reports a clean miss.
join_spy() inspects the key columns first, names the exact
values that won’t match and why, predicts the result size for each join
type, and hands you the code to fix them.
library(joinspy)
# inspect the keys before joining
join_spy(orders, customers, by = "customer_id")
# fix whitespace, case, and invisible characters
repaired <- join_repair(orders, customers, by = "customer_id")orders <- data.frame(
id = c("A", "B ", "c", "D"),
amount = c(100, 200, 300, 400),
stringsAsFactors = FALSE
)
customers <- data.frame(
id = c("A", "B", "C", "E"),
name = c("Alice", "Bob", "Carol", "Eve"),
stringsAsFactors = FALSE
)
join_spy(orders, customers, by = "id")
#> -- Join Diagnostic Report --
#> Match rate (left): 25%
#>
#> Issues Detected:
#> ! "B " has trailing whitespace (would match "B")
#> ! "c" vs "C" — case mismatch
#> x "D" has no match in right tableEach key column is checked for leading and trailing whitespace,
case-only differences between the two tables, invisible Unicode
(zero-width spaces, BOMs, non-breaking spaces), mixed encodings, empty
strings that match each other but not NA, type and
factor-level mismatches, floating-point keys that compare unequal, and
near-matches within a Levenshtein distance of two for likely typos.
join_repair() applies the fixes;
dry_run = TRUE previews them without touching the data, and
suggest_repairs() prints the R code instead of running it,
so you can paste it into a script you control.
join_repair(orders, customers, by = "id", dry_run = TRUE)
repaired <- join_repair(orders, customers, by = "id",
standardize_case = "upper")
suggest_repairs(join_spy(orders, customers, by = "id"))
#> x$id <- trimws(x$id)
#> x$id <- toupper(x$id)
#> y$id <- toupper(y$id)A report also estimates the row count for each join type from the key multiplicities, so a row explosion is visible before it happens:
report <- join_spy(orders, customers, by = "id")
report$expected_rows
#> inner_join: 1
#> left_join: 4
#> right_join: 4
#> full_join: 7join_strict() performs the join and holds it to the
relationship you declare. Declare 1:1 and it errors if any
key is duplicated on either side, before producing a result. The output
row count follows from the constraint rather than being discovered
afterward.
# left keys must be unique, right may repeat
join_strict(products, line_items, by = "product_id", expect = "1:n")The levels are "1:1", "1:n",
"n:1", and "n:m".
detect_cardinality() reports the actual relationship if you
want to check first.
join_explain() works on the result, accounting for the
row count against the two inputs:
result <- merge(orders, customers, by = "id", all.x = TRUE)
join_explain(result, orders, customers, by = "id", type = "left")
#> Result has same row count as left table
#> ! 3 left key(s) have no match in right tableleft_join_spy(), inner_join_spy(), and the
rest run the diagnostic, perform the join, and attach the report as an
attribute. check_cartesian() flags many-to-many keys that
would multiply the row count, and analyze_join_chain()
traces a multi-step A-B-C sequence. Inputs dispatch to their native
engine: tibbles use dplyr, data.tables use data.table, plain data frames
use base merge(), and the class survives the
diagnose-repair-join cycle.
install.packages("joinspy") # CRAN
install.packages("pak") # development version
pak::pak("gcol33/joinspy")| Package | Focus |
|---|---|
| dplyr 1.1+ | Cardinality checks via relationship argument |
| powerjoin | Configurable join checks and key preprocessing |
| joyn | Per-row match-status reporting variable |
| tidylog | Logs row count changes after joins |
joinspy works on the keys themselves: whitespace, case,
encoding, typos, and type mismatches. It points to the specific values
that failed, says why, and can fix them.
“Software is like sex: it’s better when it’s free.” — Linus Torvalds
I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.
If this package saved you some time, buying me a coffee is a nice way to say thanks.
MIT (see the LICENSE.md file)
@software{joinspy,
author = {Colling, Gilles},
title = {joinspy: Diagnostic Tools for Data Frame Joins},
year = {2025},
url = {https://github.com/gcol33/joinspy}
}
Need mirroring services?
Contact our team at info@vpspulse.com.
Mirror powered by VPSpulse
Infrastructure sponsored by VPSPulse & Secure Payments by ArionPay.