VPSPulse Mirrors

High-Performance Open-Source Archive

README

joinspy

keys that look identical but aren’t

CRAN status CRAN downloads Monthly downloads R-CMD-check Codecov test coverage License: MIT

Diagnoses join keys at the character level before the join runs, then repairs them.

Most join failures come down to keys that look identical but aren’t: a trailing space, a stray uppercase letter, a zero-width Unicode character. R compares the raw bytes and reports a clean miss. join_spy() inspects the key columns first, names the exact values that won’t match and why, predicts the result size for each join type, and hands you the code to fix them.

library(joinspy)

# inspect the keys before joining
join_spy(orders, customers, by = "customer_id")

# fix whitespace, case, and invisible characters
repaired <- join_repair(orders, customers, by = "customer_id")

What the diagnostic catches

orders <- data.frame(
  id     = c("A", "B ", "c", "D"),
  amount = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  id   = c("A", "B", "C", "E"),
  name = c("Alice", "Bob", "Carol", "Eve"),
  stringsAsFactors = FALSE
)

join_spy(orders, customers, by = "id")
#> -- Join Diagnostic Report --
#> Match rate (left): 25%
#>
#> Issues Detected:
#>   ! "B " has trailing whitespace (would match "B")
#>   ! "c" vs "C" — case mismatch
#>   x "D" has no match in right table

Each key column is checked for leading and trailing whitespace, case-only differences between the two tables, invisible Unicode (zero-width spaces, BOMs, non-breaking spaces), mixed encodings, empty strings that match each other but not NA, type and factor-level mismatches, floating-point keys that compare unequal, and near-matches within a Levenshtein distance of two for likely typos.

Repair

join_repair() applies the fixes; dry_run = TRUE previews them without touching the data, and suggest_repairs() prints the R code instead of running it, so you can paste it into a script you control.

join_repair(orders, customers, by = "id", dry_run = TRUE)

repaired <- join_repair(orders, customers, by = "id",
                        standardize_case = "upper")

suggest_repairs(join_spy(orders, customers, by = "id"))
#> x$id <- trimws(x$id)
#> x$id <- toupper(x$id)
#> y$id <- toupper(y$id)

Predicting result size

A report also estimates the row count for each join type from the key multiplicities, so a row explosion is visible before it happens:

report <- join_spy(orders, customers, by = "id")
report$expected_rows
#> inner_join: 1
#> left_join:  4
#> right_join: 4
#> full_join:  7

Enforcing cardinality

join_strict() performs the join and holds it to the relationship you declare. Declare 1:1 and it errors if any key is duplicated on either side, before producing a result. The output row count follows from the constraint rather than being discovered afterward.

# left keys must be unique, right may repeat
join_strict(products, line_items, by = "product_id", expect = "1:n")

The levels are "1:1", "1:n", "n:1", and "n:m". detect_cardinality() reports the actual relationship if you want to check first.

Explaining a join that already ran

join_explain() works on the result, accounting for the row count against the two inputs:

result <- merge(orders, customers, by = "id", all.x = TRUE)
join_explain(result, orders, customers, by = "id", type = "left")
#> Result has same row count as left table
#> ! 3 left key(s) have no match in right table

Drop-in join wrappers

left_join_spy(), inner_join_spy(), and the rest run the diagnostic, perform the join, and attach the report as an attribute. check_cartesian() flags many-to-many keys that would multiply the row count, and analyze_join_chain() traces a multi-step A-B-C sequence. Inputs dispatch to their native engine: tibbles use dplyr, data.tables use data.table, plain data frames use base merge(), and the class survives the diagnose-repair-join cycle.

Installation

install.packages("joinspy")            # CRAN

install.packages("pak")                # development version
pak::pak("gcol33/joinspy")

Documentation

Package Focus
dplyr 1.1+ Cardinality checks via relationship argument
powerjoin Configurable join checks and key preprocessing
joyn Per-row match-status reporting variable
tidylog Logs row count changes after joins

joinspy works on the keys themselves: whitespace, case, encoding, typos, and type mismatches. It points to the specific values that failed, says why, and can fix them.

Support

“Software is like sex: it’s better when it’s free.” — Linus Torvalds

I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}

Need mirroring services?
Contact our team at info@vpspulse.com.

Mirror powered by VPSpulse

Infrastructure sponsored by VPSPulse & Secure Payments by ArionPay.