The ebrahim.gof package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.
Goodness-of-fit testing is crucial in logistic regression to assess whether the fitted model adequately describes the data. The most commonly used test is the Hosmer-Lemeshow test, but it has known limitations: its result depends on how the observations are grouped, and it can have low power.
The Ebrahim-Farrington test addresses these limitations by using a modified Pearson chi-square statistic based on Farrington’s (1996) theoretical framework, but simplified for practical implementation with binary data.
The main function ef.gof() performs the goodness-of-fit test:
# Simulate binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- plogis(linpred) # Convert to probabilities
y <- rbinom(n, 1, prob)
# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)
# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
#>                 Test Test_Statistic   p_value
#> 1 Ebrahim-Farrington      -1.250567 0.9344997
For binary data with automatic grouping, the Ebrahim-Farrington test statistic is:
\[Z_{EF} = \frac{T_{EF} - (G - 2)}{\sqrt{2(G-2)}}\]
Where:
- \(T_{EF}\) is the modified Pearson chi-square statistic
- \(G\) is the number of groups
- The test statistic \(Z_{EF}\) follows a standard normal distribution under \(H_0\)
The null hypothesis is that the model fits the data adequately.
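As a rough illustration of the formula, the sketch below backs \(T_{EF}\) out of the \(Z_{EF}\) value printed earlier (G = 10) and converts it to p-values using base R only. It assumes an upper-tail test, and it assumes that the chi-square reference compares \(T_{EF}\) with a chi-square distribution on \(G - 2\) degrees of freedom. Neither assumption is taken from the package source, so this is a consistency check under stated assumptions rather than the package's implementation, and the variable names are illustrative.

```r
# Values taken from the printed ef.gof() result (G = 10)
z_ef <- -1.250567   # reported Test_Statistic
G    <- 10          # number of groups

# Invert Z_EF = (T_EF - (G - 2)) / sqrt(2 * (G - 2)) to recover T_EF
t_ef <- z_ef * sqrt(2 * (G - 2)) + (G - 2)

# Upper-tail p-values under the two reference distributions (assumed forms)
p_normal <- pnorm(z_ef, lower.tail = FALSE)
p_chisq  <- pchisq(t_ef, df = G - 2, lower.tail = FALSE)

round(c(T_EF = t_ef, p_normal = p_normal, p_chisq = p_chisq), 4)
```

Under these assumptions the chi-square p-value comes out at about 0.934, in line with the p-value printed by ef.gof(), and the normal p-value at about 0.894.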
The number of groups \(G\) can affect the test’s performance:
# Test with different numbers of groups
group_sizes <- c(4, 8, 10, 15, 20)
results <- data.frame(
  Groups = group_sizes,
  P_value = sapply(group_sizes, function(g) {
    ef.gof(y, predicted_probs, G = g)$p_value
  })
)
print(results)
#>   Groups   P_value
#> 1      4 0.7449740
#> 2      8 0.3317745
#> 3     10 0.9344997
#> 4     15 0.7347885
#> 5     20 0.3532473
Let’s compare the Ebrahim-Farrington test with the traditional Hosmer-Lemeshow test:
# Hosmer-Lemeshow test (requires ResourceSelection package)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  library(ResourceSelection)
  # Perform both tests
  ef_result <- ef.gof(y, predicted_probs, G = 10)
  hl_result <- hoslem.test(y, predicted_probs, g = 10)
  # Compare results. The EF statistic is a standardized value and the HL
  # statistic is a chi-square value, so compare the p-values, not the statistics.
  comparison <- data.frame(
    Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
    P_value = c(ef_result$p_value, hl_result$p.value),
    Test_Statistic = c(ef_result$Test_Statistic, hl_result$statistic)
  )
  print(comparison)
} else {
  cat("ResourceSelection package not available for comparison\n")
}
#> ResourceSelection 0.3-6   2023-06-27
#>                         Test   P_value Test_Statistic
#>           Ebrahim-Farrington 0.9344997      -1.250567
#> X-squared    Hosmer-Lemeshow 0.9431075       2.855296
Version 2.0.0 turns the package into a full goodness-of-fit toolkit. The Directed EF (DEF) test concentrates power on calibration-curve shape directions; def.ensemble.gof() combines the DEF bases via the Cauchy combination test; and run.all.gof() runs a whole battery of GOF tests at once.
# Directed Ebrahim-Farrington test (takes the fitted model)
def.gof(model) # default poly3 basis
#> Test Basis Test_Statistic df Method
#> 1 Directed Ebrahim-Farrington poly3 0.1333555 2.058835 satterthwaite
#> p_value
#> 1 0.9394923
def.gof(model, basis = "ensemble") # combine all three bases (Cauchy)
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel 3 0.8921128
# Ensemble of the three DEF bases
def.ensemble.gof(model)
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel 3 0.8921128
def.ensemble.gof(model, add_ef = TRUE) # add the omnibus EF
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel+EF 4 0.9070102
run.all.gof() returns one tidy data frame, one row per test:
run.all.gof(model)
#> Test Family Statistic df p_value
#> 1 Pearson Global 500.189779915 498.000000 0.46398355
#> 2 Deviance Global 548.361194487 498.000000 0.05868791
#> 3 Osius-Rojek Standardized 0.124150785 NA 0.90119589
#> 4 Copas-RSS Standardized 0.001485027 1.000000 0.99881512
#> 5 Information-Matrix Global 0.017967854 2.000000 0.99105631
#> 6 HL Partition 2.855296455 8.000000 0.94310750
#> 7 HL-equalwidth Partition 4.860031216 8.000000 0.77242612
#> 8 Pigeon-Heyse Partition 2.877438508 9.000000 0.96895634
#> 9 EF Standardized -1.250566694 8.000000 0.93449972
#> 10 EF-normal Standardized -1.250566694 8.000000 0.89445370
#> 11 DEF.poly2 Directed 0.061762126 1.076818 0.82265331
#> 12 DEF.poly3 Directed 0.133355497 2.058835 0.93949233
#> 13 DEF.stukel Directed 0.319113754 2.002282 0.83134362
#> 14 Stukel Directed 0.030567470 2.000000 0.98483247
#> 15 Tsiatis Covariate-space 5.778041795 9.000000 0.76191087
#> 16 Xie Covariate-space 8.311608210 8.500000 0.45352565
#> 17 Pulkstenis-Robinson Covariate-space NA NA NA
#> 18 Ensemble.Vote(3DEF) Ensemble NA NA 0.89211275
#> 19 Ensemble.Univ(3DEF+EF) Ensemble NA NA 0.90701023
#> Note
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10 normal reference (thesis)
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17 no categorical covariate
#> 18 CCT
#> 19 CCT
Add include_slow = TRUE to also run the opt-in slow tests (le Cessie, the GAM-based tests, Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power test), or pass tests = c("EF", "DEF.poly3", "HL") to run a chosen subset.
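The two ensemble rows in the run.all.gof() output can be checked by hand. The sketch below applies an equal-weight Cauchy combination test to the p-values reported in that table. The helper name cct_pvalue is hypothetical, and the equal weighting is an assumption rather than something taken from the package source.

```r
# Equal-weight Cauchy combination test (CCT) for a vector of p-values.
# cct_pvalue() is an illustrative helper, not part of ebrahim.gof.
cct_pvalue <- function(p) {
  stopifnot(all(p > 0 & p < 1))
  # Map each p-value to a standard Cauchy variate and average them
  t_stat <- mean(tan((0.5 - p) * pi))
  # Upper-tail probability of the standard Cauchy distribution
  pcauchy(t_stat, lower.tail = FALSE)
}

# p-values of DEF.poly2, DEF.poly3 and DEF.stukel from the run.all.gof() table
p_def <- c(0.82265331, 0.93949233, 0.83134362)
p_ef  <- 0.93449972   # omnibus EF row

cct_pvalue(p_def)            # three DEF bases
cct_pvalue(c(p_def, p_ef))   # three DEF bases plus EF
```

With these inputs the two calls give roughly 0.892 and 0.907, which agrees with the two Ensemble rows.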
Most GOF tests for logistic regression are
partition-based (they group the data and compare
observed with expected counts), and that is the family
ef.gof() and def.gof() belong to. A key
property of the directed tests is that they gain power
without inflating the type I error rate. In a Monte
Carlo study (n = 500, 1000 replications, α = 0.05), the partition tests
compare as follows:
| Test | Size (null) | Power: quadratic | Power: wrong link |
|---|---|---|---|
| Hosmer–Lemeshow (decile) | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | 0.058 | 0.480 | 0.218 |
| Tsiatis | 0.056 | 0.574 | 0.162 |
| Xie | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | 0.066 | 0.767 | 0.468 |
DEF and its vote ensemble are the most powerful tests in the family while keeping their size near the nominal 0.05 (0.060 and 0.066), so the gain does not come from being liberal. On the wrong-link misfit their power (0.404 and 0.468) is roughly double that of Hosmer–Lemeshow, Tsiatis, and Xie (0.147 to 0.244).
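To put the table in context, the following sketch computes the power ratios on the wrong-link misfit and the Monte Carlo uncertainty attached to an estimated size at 1000 replications. The numbers are copied from the table; the 95% band uses a normal approximation to the binomial, which is an assumption of this sketch and not part of the original study.

```r
# Sizes and wrong-link power copied from the table
size <- c(HL_decile = 0.060, HL_equalwidth = 0.053, Pigeon_Heyse = 0.035,
          EF = 0.058, Tsiatis = 0.056, Xie = 0.042,
          DEF_poly3 = 0.060, DEF_vote = 0.066)
power_link <- c(HL_decile = 0.179, HL_equalwidth = 0.244, Pigeon_Heyse = 0.133,
                EF = 0.218, Tsiatis = 0.162, Xie = 0.147,
                DEF_poly3 = 0.404, DEF_vote = 0.468)

# Power of DEF (poly3) relative to every test in the table
round(power_link["DEF_poly3"] / power_link, 2)

# Monte Carlo standard error of an estimated size when the true size is 0.05
reps  <- 1000
alpha <- 0.05
mc_se <- sqrt(alpha * (1 - alpha) / reps)   # about 0.0069
band  <- alpha + c(-1, 1) * qnorm(0.975) * mc_se

# Approximate 95% band, and which estimated sizes fall inside it
round(band, 4)
size >= band[1] & size <= band[2]
```

The band runs from roughly 0.036 to 0.064, so an estimated size within about 0.014 of 0.05 is compatible with a correctly sized test at this number of replications.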
Partition-based tests are intuitive and work for sparse data and continuous covariates (where the Pearson and deviance chi-square tests fail), but their result depends on the grouping choice and the simpler members (HL) can have low power. DEF keeps the intuitive fitted-probability grouping but directs the test at calibration-curve shapes, which is why it tops the table without losing size control.
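The grouping idea behind partition-based tests can be sketched in a few lines of base R. The example below computes a plain Hosmer-Lemeshow-type statistic by hand on simulated data. It is not the modified statistic \(T_{EF}\) used by ef.gof(), and all object names (x_sim, y_sim, G_sim, and so on) are illustrative.

```r
set.seed(1)                      # arbitrary seed for the illustration
n_obs <- 400
x_sim <- rnorm(n_obs)
y_sim <- rbinom(n_obs, 1, plogis(-0.3 + 0.9 * x_sim))
fit   <- glm(y_sim ~ x_sim, family = binomial())
p_hat <- fitted(fit)

G_sim <- 10
# Partition observations into G groups by quantiles of the fitted probability
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = G_sim + 1))
grp    <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

observed <- tapply(y_sim, grp, sum)      # observed events per group
expected <- tapply(p_hat, grp, sum)      # expected events per group
n_g      <- tapply(p_hat, grp, length)   # group sizes
p_bar    <- expected / n_g               # mean fitted probability per group

# Hosmer-Lemeshow-type statistic, referred to chi-square with G - 2 df
hl_stat <- sum((observed - expected)^2 / (n_g * p_bar * (1 - p_bar)))
hl_p    <- pchisq(hl_stat, df = G_sim - 2, lower.tail = FALSE)
c(statistic = hl_stat, p_value = hl_p)
```

Changing G_sim, or switching from quantile breaks to equal-width breaks, changes the partition and therefore the statistic, which is the grouping dependence described above.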
Note: as of 2.0.0, ef.gof() defaults to the chi-square reference (method = "chisq"); use method = "normal" for the version 1.0.0 behaviour.
Let’s examine the power of the test to detect model misspecification:
# Function to simulate power under model misspecification
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100, G = 10) {
  rejections_ef <- 0
  rejections_hl <- 0
  for (i in seq_len(n_sims)) {
    # Generate data with quadratic term (true model)
    x <- runif(n, -2, 2)
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    # Fit misspecified linear model (omitting quadratic term)
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    # Ebrahim-Farrington test
    ef_test <- ef.gof(y, pred_probs, G = G)
    if (ef_test$p_value < 0.05) rejections_ef <- rejections_ef + 1
    # Hosmer-Lemeshow test (if available)
    if (requireNamespace("ResourceSelection", quietly = TRUE)) {
      hl_test <- ResourceSelection::hoslem.test(y, pred_probs, g = G)
      if (hl_test$p.value < 0.05) rejections_hl <- rejections_hl + 1
    }
  }
  power_ef <- rejections_ef / n_sims
  power_hl <- if (requireNamespace("ResourceSelection", quietly = TRUE)) {
    rejections_hl / n_sims
  } else {
    NA
  }
  return(list(power_ef = power_ef, power_hl = power_hl))
}
# Calculate power for different sample sizes
sample_sizes <- c(200, 500, 1000)
power_results <- data.frame(
  n = sample_sizes,
  EbrahimFarrington_Power = sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_ef
  })
)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  power_results$HosmerLemeshow_Power <- sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_hl
  })
}
print(power_results)
#>      n EbrahimFarrington_Power HosmerLemeshow_Power
#> 1  200                    0.06                 0.12
#> 2  500                    0.14                 0.14
#> 3 1000                    0.20                 0.22
For datasets with grouped observations (multiple trials per covariate pattern), you can use the original Farrington test:
# Simulate grouped data
set.seed(456)
n_groups <- 30
m_trials <- sample(5:20, n_groups, replace = TRUE)
x_grouped <- rnorm(n_groups)
prob_grouped <- plogis(0.2 + 0.8 * x_grouped)
y_grouped <- rbinom(n_groups, m_trials, prob_grouped)
# Create data frame and fit model
data_grouped <- data.frame(
  successes = y_grouped,
  trials = m_trials,
  x = x_grouped
)
model_grouped <- glm(
  cbind(successes, trials - successes) ~ x,
  data = data_grouped,
  family = binomial()
)
predicted_probs_grouped <- fitted(model_grouped)
# Original Farrington test for grouped data
result_grouped <- ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL # No automatic grouping for original test
)
print(result_grouped)
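#>                  Test Test_Statistic   p_value
#> 1 Farrington-Original      -1.476122 0.9300444
In summary, ef.gof() runs in two modes:
- Ebrahim-Farrington test (G specified): binary data with automatic grouping.
- Original Farrington test (m provided, G = NULL): grouped data with multiple trials per covariate pattern.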
Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. Journal of the Royal Statistical Society. Series B (Methodological), 58(2), 349-360.
Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. Master’s Thesis, Alexandria University.
Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.
Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941
The Ebrahim-Farrington test provides a powerful and practical tool for assessing goodness-of-fit in logistic regression, particularly for binary data and sparse datasets. Its simplified implementation makes it accessible for routine use while maintaining strong theoretical foundations.