Comparing Quantiles of Two Groups with Bootstrap

Zeynel Cebeci, A. Firat Ozdemir, Engin Yildiztepe

5 May 2025

1 Introduction
2 Required Packages and Data Sets
- 2.1 Install and load the package groupcompare
- 2.2 Data sets
3 Data Visualization
4 Resampling Statistics with Bootstrap
5 Results and Interpretation
6 Other Tests

1 Introduction

This vignette is a guide to comparing the quantiles of two groups with the bootstrap, using the groupcompare package in R. The bootstrap is a statistical technique that repeatedly resamples a dataset with replacement to estimate the sampling distribution of a statistic. It is particularly useful for assessing the accuracy and variability of estimates when the underlying distribution is unknown.

In experiments involving two groups, it is often essential to compare their distributions to determine if there are significant differences. Conventional parametric tests may not always be appropriate, especially when the data does not meet the assumptions of normality or homogeneity of variances. In such cases, non-parametric approaches like the bootstrap provide a robust alternative.

Quantiles are cut points that divide the ordered data into parts containing equal proportions of the observations, such as the median (50th percentile) or the quartiles (25th and 75th percentiles). Comparing the quantiles of two groups can reveal differences in their central tendency, spread, and overall distribution. This vignette demonstrates how to implement bootstrap methods for quantile comparison using the bootstrap function of the groupcompare package in R. By following the outlined steps, users will be able to perform rigorous statistical analyses and draw meaningful conclusions about their data.
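To make the idea of a quantile difference concrete, the following minimal sketch uses only base R. The data are hypothetical (not the vignette's dataset), and the direction of the difference and the quantile definition (R's default) are assumptions made for illustration; the package's own calculation may differ.

```r
# Hypothetical example data: two small groups (not the vignette's dataset)
set.seed(1)
x <- rnorm(40, mean = 50, sd = 2)
y <- rnorm(40, mean = 45, sd = 4)

probs <- c(0.25, 0.50, 0.75)

# Sample quantiles of each group, using R's default quantile definition
qx <- quantile(x, probs)
qy <- quantile(y, probs)

# Differences between matching quantiles (group x minus group y)
qdiff <- qx - qy
names(qdiff) <- c("P25", "P50", "P75")
qdiff
```

Each element of qdiff is a point estimate only; the bootstrap is what attaches a measure of uncertainty to it.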

2 Required Packages and Data Sets

2.1 Install and load the package groupcompare

The latest version of the package can be installed from CRAN with the following command:

install.packages("groupcompare", dependencies=TRUE)

If you have already installed ‘groupcompare’, you can load it into your R session with the following command:

library("groupcompare")

2.2 Data sets

The dataset to be analyzed can be in wide format, where the values for Group 1 and Group 2 are stored in two separate columns, or in long format, where the values are in the first column and the group names are in the second column.

In the following code chunk, a dataset named ds1 is created using the ghdist function, which simulates data from the G&H distribution. The dataset contains two groups named grp1 and grp2, each consisting of 50 observations. For grp1 the location is 50 and the scale is 2, and both the skewness (g) and kurtosis (h) arguments are zero, so the simulated data are intended to follow a normal distribution. For grp2 the location is 45, the scale is 4, and g is set to 0.8, so the simulated data are intended to be right-skewed. The two groups are therefore expected to differ in location, spread, and shape.

The generated dataset is in wide format; it is then converted to long format with the wide2long function to create the dataset ds2. As the output shows, in long format the first column contains the observed values and the second column contains the group names or codes. Different groups can be created by changing the location, scale, skewness, and kurtosis parameters.

 set.seed(12) # For reproducibility
 grp1 <- ghdist(50, 50, 2, g=0, h=0)
 grp2 <- ghdist(50, 45, 4, g=0.8, h=0)
 ds1 <- data.frame(grp1=grp1, grp2=grp2)
 head(ds1)

##       grp1     grp2
## 1 47.03886 44.83214
## 2 53.15434 44.56903
## 3 48.08651 47.20590
## 4 48.15999 65.17133
## 5 46.00472 42.15702
## 6 49.45541 48.99942

 # Data in long format
 ds2 <- wide2long(ds1)
 head(ds2)

##        obs group
## 1 47.03886  grp1
## 2 53.15434  grp1
## 3 48.08651  grp1
## 4 48.15999  grp1
## 5 46.00472  grp1
## 6 49.45541  grp1
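For readers who want to see what the wide-to-long conversion involves, the sketch below reproduces the same layout with base R's stack() and unstack(). It is an illustration on a small hypothetical data frame and is not the implementation of wide2long; the column names obs and group are set by hand to mirror the output above.

```r
# Hypothetical wide data with the same column names as ds1
wide <- data.frame(grp1 = c(47.0, 53.2, 48.1),
                   grp2 = c(44.8, 44.6, 47.2))

# stack() puts all values in the first column and the
# originating column name in the second column
long <- stack(wide)
names(long) <- c("obs", "group")  # rename to mirror the layout shown above
long

# unstack() reverses the operation and returns the wide layout
back <- unstack(long, obs ~ group)
back
```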

3 Data Visualization

Before running statistical tests, it is good practice to visualize the data to gain insight into its structure and distribution. When two groups are compared with parametric tests such as the t-test, visualization gives a first indication of whether the assumptions of the test are met. The bivarplot function in the following code chunk facilitates the examination and comparison of group data using various plots.

bivarplot(ds2)

4 Resampling Statistics with Bootstrap

The bootstrap function of the package, used in the following code chunk, compares the groups in the dataset at the chosen percentiles and returns the results as a list. Each item in the list is a matrix containing the lower and upper confidence limits for the difference between the two groups at one percentile (here P25, P50, and P75).

results <- bootstrap(ds2, statistic=calcquantdif, alpha=0.05, R=300)
str(results)

## List of 3
##  $ P25: num [1:4, 1:2] 3.79 3.64 4 3.94 6.01 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "normal" "basic" "percent" "bca"
##   .. ..$ : chr [1:2] "lower" "upper"
##  $ P50: num [1:4, 1:2] 2.96 3.36 2.63 2.63 6.04 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "normal" "basic" "percent" "bca"
##   .. ..$ : chr [1:2] "lower" "upper"
##  $ P75: num [1:4, 1:2] -1.3612 -1.8858 0.0336 -0.1663 3.6686 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "normal" "basic" "percent" "bca"
##   .. ..$ : chr [1:2] "lower" "upper"

results

## $P25
##            lower    upper
## normal  3.793402 6.011652
## basic   3.642290 5.807309
## percent 3.997745 6.162763
## bca     3.939585 5.906732
## 
## $P50
##            lower    upper
## normal  2.955944 6.040281
## basic   3.362186 6.364663
## percent 2.631563 5.634040
## bca     2.628061 5.618521
## 
## $P75
##               lower    upper
## normal  -1.36120314 3.668603
## basic   -1.88579802 2.273762
## percent  0.03363719 4.193197
## bca     -0.16627292 3.887789

In the code chunk above, ds2 is the dataset in long format, in which the group names are located in the second column. The mandatory argument statistic is the name of the function that calculates and returns the statistic to be bootstrapped; in this example, calcquantdif calculates and returns the differences between the group quantiles. Of the remaining arguments, alpha sets the Type I error level (alpha=0.05 gives 95% confidence intervals), and R sets the number of bootstrap replicates.

5 Results and Interpretation

In each matrix of the results object shown above, the row names normal, basic, percent, and bca indicate the type of confidence interval, computed with the Normal, Basic, Percentile, and BCa (bias-corrected and accelerated) methods, respectively.
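To show where such limits come from, the following base R sketch resamples two hypothetical groups and derives the normal, basic, and percentile limits from the bootstrap replicates using the usual textbook formulas. It is not the package's implementation: the data, the helper med_diff, and the choice of R are assumptions made for illustration, the quantile interpolation may differ from what the package uses, and BCa is omitted because it needs additional bias-correction and acceleration estimates.

```r
set.seed(123)
# Hypothetical data standing in for two groups
x <- rnorm(50, mean = 50, sd = 2)
y <- rnorm(50, mean = 45, sd = 4)

alpha <- 0.05  # Type I error level; 1 - alpha is the confidence level
R <- 2000      # number of bootstrap replicates

# Statistic of interest: difference between the group medians
med_diff <- function(a, b) median(a) - median(b)
t0 <- med_diff(x, y)  # estimate from the observed data

# Resample each group with replacement and recompute the statistic
t_star <- replicate(R, med_diff(sample(x, replace = TRUE),
                                sample(y, replace = TRUE)))

z <- qnorm(1 - alpha / 2)
q <- quantile(t_star, c(alpha / 2, 1 - alpha / 2), names = FALSE)
bias <- mean(t_star) - t0

ci <- rbind(
  normal  = (t0 - bias) + c(-1, 1) * z * sd(t_star),  # bias-corrected estimate +/- z * SE
  basic   = c(2 * t0 - q[2], 2 * t0 - q[1]),          # replicate quantiles reflected around t0
  percent = q                                         # replicate quantiles used directly
)
colnames(ci) <- c("lower", "upper")
ci
```

Because the basic interval reflects the percentile limits around the observed estimate, the two always have the same width. The same pattern is visible in the P50 rows of the output above.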

The list of confidence intervals calculated above for the percentile differences can be converted into a data frame with the ci2df function, as shown below. This makes it easier to check the results and to prepare the output for a report.

# Arrange the results as a data frame
ci2df(results)

##             normal          basic       percent            bca
## P25  [3.793,6.012]  [3.642,5.807] [3.998,6.163]   [3.94,5.907]
## P50   [2.956,6.04]  [3.362,6.365] [2.632,5.634]  [2.628,5.619]
## P75 [-1.361,3.669] [-1.886,2.274] [0.034,4.193] [-0.166,3.888]

Choosing which confidence interval to use is crucial for interpreting the results accurately. The following points can guide the choice:

- Larger sample sizes generally provide more reliable estimates and narrower confidence intervals.
- Understanding the distribution of your data helps in choosing an appropriate confidence interval. For instance, if the data are normally distributed, the CIs computed with the normal method are recommended.
- Depending on whether the aim is a broad overview or a detailed analysis, one might choose wider or narrower confidence intervals.
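A common way to read such intervals is to check whether they contain zero: an interval for a difference that excludes zero suggests a difference at the chosen level. The sketch below applies this check to the P75 limits, which are copied from the output shown earlier. The decision rule is a general convention and is stated here as an assumption, not as a feature of the package.

```r
# Limits for the P75 difference, copied from the output shown earlier
p75 <- matrix(c(-1.36120314, -1.88579802, 0.03363719, -0.16627292,
                 3.668603,    2.273762,   4.193197,    3.887789),
              ncol = 2,
              dimnames = list(c("normal", "basic", "percent", "bca"),
                              c("lower", "upper")))

# TRUE where the interval contains zero
contains_zero <- p75[, "lower"] <= 0 & p75[, "upper"] >= 0
contains_zero
```

For P75, the normal, basic, and bca intervals contain zero while the percentile interval does not, so the conclusion depends on the method chosen. For P25 and P50, all four intervals lie above zero.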

6 Other Tests

In addition to bootstrap, performing permutation tests can be useful for validation of the results when comparing the quantiles of two groups. For this purpose, the permtest function in the groupcompare package can be used. For details, you can refer to the usage documentation of the package as well as the vignette titled Quantiles Comparison of Two Groups with Permutation Tests.
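The logic of a permutation test can be sketched in base R as follows. This is a generic illustration on hypothetical data and does not use or reproduce the permtest function; the statistic (difference of medians), the number of permutations n_perm, and the two-sided p-value formula are assumptions chosen for the example.

```r
set.seed(42)
# Hypothetical data standing in for two groups
x <- rnorm(30, mean = 50, sd = 2)
y <- rnorm(30, mean = 48, sd = 2)

pooled <- c(x, y)
n_x <- length(x)
observed <- median(x) - median(y)

n_perm <- 2000
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), n_x)  # random relabelling of the groups
  median(pooled[idx]) - median(pooled[-idx])
})

# Two-sided p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

Unlike the bootstrap, which resamples within each group to estimate uncertainty, the permutation test shuffles the group labels to build the distribution of the statistic under the hypothesis of no group difference.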

While confidence intervals for the differences between percentiles have been calculated here, significant differences in other group statistics can be examined by passing a different function name to the statistic argument. For example, when the calcstatdif function is assigned to the statistic argument in the example above, bootstrap confidence intervals can be calculated for the differences between the means, medians, IQRs, and variances of the two groups.
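As a rough illustration of the kind of quantities involved, the base R sketch below computes the differences between the means, medians, IQRs, and variances of two hypothetical groups. The helper stat_diffs is hypothetical and is not the package's calcstatdif, whose signature and return value may differ.

```r
# Hypothetical helper (not the package's calcstatdif):
# differences of four summary statistics, group a minus group b
stat_diffs <- function(a, b) {
  c(mean   = mean(a)   - mean(b),
    median = median(a) - median(b),
    iqr    = IQR(a)    - IQR(b),
    var    = var(a)    - var(b))
}

# Hypothetical example data
x <- c(48, 50, 51, 49, 52, 50)
y <- c(44, 47, 41, 49, 45, 46)

stat_diffs(x, y)
```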