% \VignetteEngine{knitr::knitr}
% \VignetteEncoding{UTF-8}
% \VignetteIndexEntry{Introduction to ReplicationSuccess}
% \VignetteDepends{knitr}

\documentclass[a4paper, 11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{graphics}
\usepackage{amsmath, amssymb}
 \usepackage[round]{natbib}
\bibliographystyle{apalikedoiurl}
\usepackage{doi}
\usepackage{color}
\newcommand{\hl}{\textcolor{black}}
\newcommand{\soutr}[1]{\textcolor{red}{\xout {\textcolor{black}{#1}}}}
%\usepackage{todonotes}

%% do not reload the xcolor package, otherwise the colors defined by knitr
%% are purged
% \usepackage[dvipsnames]{xcolor}

%% could alternatively load xcolor and redefine knitr color (however, not robust
%% to changes of knitr)
% \definecolor{fgcolor}{rgb}{0.345, 0.345, 0.345}

% own commands
%% custom commands
\newcommand{\eg}{\textit{e.\,g.\,}} % e.g.
\newcommand{\ie}{\textit{i.\,e.\,}} % i.e.
\newcommand{\cf}{\textit{cf\,}} % cf
\DeclareMathOperator{\Nor}{N} % Normal
\DeclareMathOperator{\Var}{Var} % Variance
\DeclareMathOperator{\E}{\mathsf{E}} % Expectation
\DeclareMathOperator{\Cov}{Cov} % Covariance
\DeclareMathOperator{\Corr}{Corr} % Correlation
\DeclareMathOperator{\se}{se} % Standard error
\newcommand{\dmin}{d_{\tiny \mbox{min}}}

% margins %
\usepackage[a4paper, total={6.5in, 10in}]{geometry}

% title, author, date, etc.
\title{\textbf{Design and Analysis of Replication Studies with \texttt{ReplicationSuccess}}}
\author{\textbf{Leonhard Held, Charlotte Micheloud, Samuel Pawel, Felix Hofmann, Florian Gerber} \\
Epidemiology, Biostatistics and Prevention Institute (EBPI) \\
Center for Reproducible Science (CRS) \\
University of Zurich, Switzerland}
\date{Package version \Sexpr{packageVersion(pkg = "ReplicationSuccess")}}

% hyperref options
\definecolor{darkcoral}{rgb}{0.8, 0.36, 0.27}
\definecolor{darkcyan}{rgb}{0.0, 0.55, 0.55}
\usepackage{hyperref}
\hypersetup{
  bookmarksopen=true,
  breaklinks=true,
  pdftitle={ReplicationSuccess},
  pdfauthor={Leonhard Held, Charlotte Micheloud, Samuel Pawel},
  colorlinks=true,
  linkcolor=darkcoral,
  anchorcolor=black,
  citecolor=darkcyan,
  urlcolor=darkcoral,
}

\begin{document}

<< "knitr-options", echo = FALSE >>=
library(knitr)
opts_chunk$set(size = "small",
               fig.height = 4,
               fig.align = "center",
               cache = FALSE,
               message = FALSE,
               warning = FALSE)

@

\maketitle


\begin{center}
\begin{minipage}{0.65\textwidth}
  \rule{\textwidth}{0.4pt}
  {\small
  \centering \textbf{Abstract} \\
  This vignette provides an overview of the \textsf{R} package
  \texttt{ReplicationSuccess}. The package contains utilities for the design and
  analysis of replication studies. Traditional methods based on statistical
  significance and confidence intervals,
  % as well as more recently developed methods such as the sceptical $p$-value
  % \citep{Held2020} are included.
  as well as the sceptical $p$-value \citep{Held2020}
  are included.
  The functionality of the package is illustrated using data sets from four
  large-scale replication projects which also come with the package.
  }
  \rule{\textwidth}{0.4pt}
\end{minipage}
\end{center}

\section{Introduction}
Over the course of the last decade, the number of replication studies conducted
has increased substantially. This development was mainly driven by
the so-called ``replication crisis'' in the social and life sciences.
However, there is no consensus on which statistical approach
should be used to assess whether a replication study successfully
replicated an original discovery. Moreover, the statistical considerations
in the design of the replication study differ depending on the chosen
approach for its analysis.

The \textsf{R} package \texttt{ReplicationSuccess} provides
functionality to analyse and plan replication studies in several ways.
Specifically, functions for analysis, power, and sample size calculations based
% on statistical significance, as well as based on more recent methods, such as
% the sceptical $p$-value \citep{Held2020}, are included.
on statistical significance and confidence intervals, as well as on more recent
methods, such as the sceptical $p$-value \citep{Held2020}, are included.
This vignette illustrates the usage of the package with data sets
from four large-scale replication projects that are also included in the
package.

\texttt{ReplicationSuccess} was first created to provide software for computing
the sceptical $p$-value and related power and sample size calculations
\citep{Held2020}. Methods to compute the $p$-value for intrinsic
credibility \citep{Held2019a}, the harmonic mean $\chi^2$-test \citep{Held2020b},
forecasting of replication studies \citep{Pawel2020}, and interim analyses of
replication studies \citep{MicheloudHeld2021} were added subsequently. Recently, substantial
changes to many of the existing functions were made due to two recalibrations of
the sceptical $p$-value approach \citep{Held_etal2022, Held_etal2022-Fram}.
Use \texttt{news(package = "ReplicationSuccess")} to see a history of the
changes. The development version of \texttt{ReplicationSuccess} is available on
GitHub (\url{https://github.com/crsuzh/ReplicationSuccess/}) and the stable version
is available on CRAN (\url{https://cran.r-project.org/web/packages/ReplicationSuccess/}).

\subsection{Statistical framework}
\texttt{ReplicationSuccess} assumes a simple but general statistical framework:
The (suitably transformed) effect estimates $\hat{\theta}_o$ and $\hat{\theta}_r$
from original (subscript $o$) and replication (subscript $r$) study are assumed
to be normally distributed around the unknown effect size $\theta$.
Their variances are equal to their squared standard
errors $\sigma_o^2$ and $\sigma_r^2$ which are assumed to be known.
The same framework is also common in meta-analysis and can for example be
applied to mean differences, odds ratios (log transformation), or correlation
coefficients (Fisher $z$-transformation).

Many of the functions in the package take unitless quantities as input:
the $z$-values $z_o = \hat{\theta}_o/\sigma_o$ and $z_r = \hat{\theta}_r/\sigma_r$,
the relative effect size $d = \hat{\theta}_r/\hat{\theta}_o$
(or shrinkage $s = 1 - d$), and the variance ratio $c = \sigma^2_o/\sigma^2_r$.
The squared standard errors are usually inversely proportional to
the sample size of each study, $\sigma^2_o = \kappa^2/n_o$ and
$\sigma^2_r = \kappa^2/n_r$ for some unit variance $\kappa^2$. The variance
ratio can then be identified as the relative sample size $c = n_r/n_o$.
This may require some adaptation for certain types of effect sizes.
Computation of these quantities from real data will be illustrated below.



\section{Data sets}
\texttt{ReplicationSuccess} includes data from four replication projects with
a ``one-to-one'' design (\ie one replication for one original study), and from
one replication project with multiple replications for one original study.
The four replication projects with a one-to-one design are:

\begin{itemize}
\item \textbf{Reproducibility Project: Psychology:}
In the \emph{Reproducibility Project: Psychology} 100 replications of studies from
the field of psychology were conducted \citep{Opensc2015}.
The original studies were published in three major psychology journals in
2008. Only the study pairs of the ``meta-analytic subset'' are included here,
which consists of 73 studies where the standard error of the Fisher $z$-transformed
effect estimates can be computed \citep{Johnson2016}.

\item \textbf{Experimental Economics Replication Project:}
This project attempted to replicate 18 experimental economics studies published
between 2011 and 2015 in two high impact economics journals \citep{Camerer2016}.
For this project, a \emph{prediction market} was also conducted in order to estimate
peer beliefs about whether a replication would yield a statistically
significant result. Prediction markets are a tool to aggregate the beliefs of market
participants regarding the probability of an investigated outcome and they have
been used successfully in numerous domains, \eg sports and politics
\citep{Dreber2015}. The estimated peer beliefs are also included for each study
pair.

\item \textbf{Social Sciences Replication Project:}
This project involved 21 replications of studies in the social sciences
published in the journals \emph{Nature} and \emph{Science} between 2010 and 2015
\citep{Camerer2018}. As in the experimental economics replication project,
a prediction market to estimate peer beliefs about the replicability of the
original studies was conducted and the resulting belief estimates are also
provided in the package. In this project, the replications were conducted
in two stages. In stage 1, the replication studies had 90\% power to detect
75\% of the original effect estimate. Data collection was stopped if a
two-sided $p$-value $< 0.05$ and an effect in the same direction as the original
were found. If not, data collection was continued in stage 2 to have 90\%
power to detect 50\% of the original effect size for the first and second
data collection pooled.

\item \textbf{Experimental Philosophy Replicability Project:}
In this project, 40 replications of experimental philosophy studies were
carried out. The original studies had to be published between 2003 and 2015
in one of 35 journals in which experimental philosophy research is usually
published (a list defined by the coordinators of this project) and they had
to be listed on the experimental philosophy page of Yale University
\citep{Cova2018}. The package includes the subset of 31 study pairs for which
effect estimates on the correlation scale as well as effective sample sizes
for both the original and the replication study were available.
\end{itemize}

In these data sets, effect estimates are provided as correlation coefficients
($r$), as well as Fisher $z$-transformed correlation coefficients
($\hat{\theta} = \text{tanh}^{-1}(r)$). In the descriptive analysis of data from
replication projects it has become common practice to transform effect sizes to
the correlation scale, because correlations are bounded to the interval between
minus one and one and thus easy to compare and interpret. Design and statistical
analysis, on the other hand, are then usually carried out on a scale where the
distribution of the estimates is well approximated by a normal distribution.
For correlation coefficients this is the case after applying the Fisher
$z$-transformation, which leads to their variance asymptotically being only a
function of the study sample size $n$, \ie $\Var(\hat{\theta}) = 1/(n - 3)$
\citep{Fisher1921}.
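
As a small illustration with made-up numbers, a correlation of $r = 0.3$ from a
study with $n = 50$ participants translates to the Fisher $z$ scale as follows
(base \textsf{R} only, no package functions required).

<< "fisherz-illustration" >>=
## hypothetical example: correlation r = 0.3 based on n = 50 participants
r <- 0.3
n <- 50
theta <- atanh(r)      ## Fisher z-transformed effect estimate
se <- 1/sqrt(n - 3)    ## approximate standard error on the Fisher z scale
c(estimate = theta, se = se)
@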


The data can be loaded with the command \texttt{data("RProjects")}. For a
description of the variables see the documentation with \texttt{?RProjects}.
An extended version of the Social Sciences Replication Project including the
details of stages one and two can be loaded with \texttt{data("SSRP")}.
It is a good idea to first compute the unitless quantities $z_o$, $z_r$ and
$c$, since most functions of the package use them as input.
% We also use the function \texttt{z2p} to compute the one-sided
% $p$-values for original and replication study. As all original estimates are
% positive, we specify the argument \texttt{alternative} to \texttt{"greater"}.

<< "data-loading" >>=
library(ReplicationSuccess)
data("RProjects")
str(RProjects, width = 80, strict.width = "cut")

## computing zo, zr, c
RProjects$zo <- with(RProjects, fiso/se_fiso)
RProjects$zr <- with(RProjects, fisr/se_fisr)
RProjects$c <- with(RProjects, se_fiso^2/se_fisr^2)
@

Note that each variable ending with an \texttt{o} is associated with the original
study,
while each variable ending with an \texttt{r} is associated with the replication
study.
Plotting the original versus the replication effect estimate on the correlation scale
paints the following picture.

<< "plot-projects", fig.height = 6 >>=
## plots of effect estimates

par(mfrow = c(2, 2), las = 1, pty = "s",
    mar = c(4, 2.5, 2.5, 1))
for (p in unique(RProjects$project)) {
    data_project <- subset(RProjects, project == p)
    significant <- ifelse(data_project$pr1 < 0.025, "#DF536BCC", "#00000080")
    plot(rr ~ ro, data = data_project, ylim = c(-0.5, 1), col = significant,
         xlim = c(-0.5, 1), main = p, xlab = expression(italic(r)[o]),
         cex = 0.7, pch = 19, ylab = expression(italic(r)[r]))
    legend("topleft", legend = "replication significant", cex = 0.8, pch = 20,
           col = "#DF536BCC", bty = "n")
    abline(h = 0, lty = 2)
    abline(a = 0, b = 1, col = "grey")
}
@

In most cases the replication estimate is smaller than the corresponding
original estimate. Furthermore, a substantial number of the replication estimates
do not achieve statistical significance at the one-sided 2.5\% level, while almost
all original estimates do.

The fifth data set comes from the following project:
\begin{itemize}
\item \textbf{Protzko et al. (2020)}:
This data set originates from a prospective replication project involving four
laboratories \citep{Protzko2020}.
Each of them conducted four original studies and for each original
study a replication study was carried out within the same lab (self-replication)
and by the other three labs (external-replication). Most studies used simple
between-subject designs with two groups and a continuous outcome so that for
each study, an estimate of the standardized mean difference (SMD) could be
computed from the group means, group standard deviations, and group sample
sizes. For studies with covariate adjustment and/or binary outcomes, effect
size transformations as described in the supplementary material of \citet{Protzko2020}
were used to obtain effect estimates and standard errors on SMD scale.
\end{itemize}
%
The data set can be loaded with the command \texttt{data("protzko2020")}.
The forest plots of the effect estimates for the first four experiments look as
follows.

<<forestPlots, fig.height = 5>>=
data("protzko2020")

## forest plots of effect estimates
parOld <- par(mar = c(5, 8, 4, 2), mfrow = c(2, 2))
experiments <- unique(protzko2020$experiment)[1:4]
for (ex in experiments) {
  ## compute CIs
  dat <- subset(protzko2020, experiment == ex)
  za <- qnorm(p = 0.975)
  plotDF <- data.frame(lower = dat$smd - za*dat$se,
                       est = dat$smd,
                       upper = dat$smd + za*dat$se)
  colpalette <- c("#000000", "#1B9E77", "#D95F02")
  cols <- colpalette[dat$type]
  yseq <- seq(1, nrow(dat))

  ## forest plot
  plot(x = plotDF$est, y = yseq, xlim = c(-0.15, 0.8),
       ylim = c(0.8*min(yseq), 1.05*max(yseq)), type = "n",
       yaxt = "n", xlab = "Effect estimate (SMD)", ylab = "")
  abline(v = 0, col = "#0000004D")
  arrows(x0 = plotDF$lower, x1 = plotDF$upper, y0 = yseq, angle = 90,
         code = 3, length = 0.05, col = cols)
  points(y = yseq, x = plotDF$est, pch = 20, lwd = 2, col = cols)
  axis(side = 2, at = yseq, las = 1, labels = dat$type, cex.axis = 0.85)
  title(main = ex)
}
@


\section{Design and analysis of replication studies}
Although a replication study needs to be planned and conducted before its results
can be analysed, we first discuss the different analysis approaches,
because the chosen analysis strategy has a substantial impact on
the design of a replication study.
In the design phase of a replication study, we then focus only on
the determination of the sample size.

\subsection{Statistical significance}
\paragraph{Analysis}
The most commonly used approach is to declare a replication successful if both the
original and the replication study achieve statistical significance in the same
direction. The significance level is conventionally chosen as 0.05 for
two-sided $p$-values and 0.025 for one-sided $p$-values. Working with
one-sided $p$-values is usually simpler since the effect direction is automatically
taken into account, whereas with two-sided $p$-values one also has to check
whether the original and replication estimates go in the same direction.
% There are some variations of this approach, for example,
% \citet{Camerer2016} only assessed whether the replication effect is significant in
% the same direction, but not whether the original effect shows the same significance
% status.
For the four one-to-one replication projects
we can simply check whether the one-sided $p$-values
(in the positive direction) of the original and replication studies are both below
the conventional threshold of 0.025.

<< >>=
for (p in unique(RProjects$project)) {
    data_project <- subset(RProjects, project == p)
    significant_O <- data_project$po1 < 0.025
    significant_R <- data_project$pr1 < 0.025
    success <- significant_O & significant_R
    cat(paste0(p, ": \n"))
    cat(paste0(round(mean(significant_O)*100, 1), "% original studies significant (",
               sum(significant_O), "/", length(significant_O), ")\n"))
    cat(paste0(round(mean(significant_R)*100, 1), "% replications significant (",
               sum(significant_R), "/", length(significant_R), ")\n"))
    cat(paste0(round(mean(success)*100, 1),
               "% both studies significant in the same direction (",
               sum(success), "/", length(success), ")\n \n"))
}
@

Despite its appealing simplicity, assessing replication success with statistical
significance is often criticized.
For example, non-significant replication results are expected if the original
finding was a false positive,
% (\eg with 95\% probability if the two-sided
% significance level is 5\%), on the other hand,
but they can also be caused by low power of the replication study
\citep{Goodman1992}.
On the other hand, statistical significance can still be achieved for a
replication effect estimate that is much smaller than the one from the
original study, provided its standard error is small enough (\eg because of a
very large replication sample size).

\paragraph{Design}
Selecting the same sample size in the replication study as in the original study may
lead to a severely underpowered design and as a result, true effects may not be
detected.
To ensure that the replication study reliably detects true effects,
the studies should be well-powered. In classical sample size planning,
usually a clinically relevant effect is specified and the sample size is then
determined so that it can be detected with a certain power. Luckily, in the
replication setting the clinically relevant effect does not need to be specified but
can be replaced with the effect estimate from the original study.
However, the standard sample size calculation approach is not well suited,
because it ignores the uncertainty of the original effect estimate.

One way of tackling this issue is to use a Bayesian approach, incorporating
the original estimate and its precision into a design prior that is used for power
calculations. This corresponds to the concept of ``predictive power'' and generally
leads to larger sample sizes than the standard method.
In practice, however, more ad hoc approaches are often used. For instance, the
original estimate is simply shrunk by an (arbitrary) amount, \eg it was halved in the
Social Sciences Replication Project, and standard sample size calculations are
then carried out.


Using the function \texttt{sampleSizeSignificance}, it is
straightforward to plan the sample size of the replication study with
the approaches just mentioned. The argument \texttt{designPrior}
allows sample size planning to be based on classical power,
ignoring the uncertainty of the original estimate (\texttt{"conditional"}), or on
predictive power (\texttt{"predictive"}). Moreover, ad hoc shrinkage
can be specified with the argument \texttt{shrinkage}. Note that the
function \texttt{sampleSizeSignificance}, as well as most of the other
functions in the package, takes $z$-values (and not $p$-values) as
arguments. The transformation from $p$- to $z$-values and vice versa can
easily be done with the functions \texttt{p2z} and \texttt{z2p}.
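
For instance, the conventional one-sided threshold of 0.025 corresponds to a
$z$-value of about 1.96:

<< "p2z-z2p-example" >>=
## converting a one-sided p-value to a z-value and back
p2z(p = 0.025, alternative = "one.sided")
z2p(p2z(p = 0.025, alternative = "one.sided"), alternative = "one.sided")
@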

The following code shows a few examples. Note that the function returns the required
relative sample size $c = n_r/n_o$, \ie the factor by which the sample size of
the original study needs to be multiplied for the replication study.
<< >>=
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05, designPrior = "conditional")
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05, designPrior = "predictive")
sampleSizeSignificance(zo = 2.5, power = 0.8, level = 0.05, designPrior = "conditional",
                       shrinkage = 0.25)
@
The power of the replication study for a fixed relative sample size $c$
can be calculated with the function \texttt{powerSignificance}.
Figure \ref{fig:powerSignificance} shows the power to achieve significance in the
replication as a function of either the (one-sided) $p$-value or the $z$-value
of the original study. If the original estimate was just significant at the 0.025 level,
the probability of significance in the replication is only about 0.5 for both conditional
and predictive power.
This result was already noted by \citet{Goodman1992},
yet many practitioners of statistics still find it counterintuitive, because they
confuse Type-I error rates with replication probabilities. Thus, for the replication
to achieve significance with high probability, the sample size needs to be
increased compared to the original study if the evidence for the original discovery
was only weak or moderate (Figure \ref{fig:sampleSizeSignificance}).
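
The power for a just significant original result ($p_o = 0.025$) can be checked
numerically; the following minimal illustration uses the default settings of
\texttt{powerSignificance}, which, as in Figure~\ref{fig:powerSignificance},
correspond to the one-sided 2.5\% level and equal sample sizes.

<< "powerSignificance-example" >>=
## power for a just significant original result (po = 0.025)
powerSignificance(zo = p2z(p = 0.025, alternative = "one.sided"),
                  designPrior = "conditional")
powerSignificance(zo = p2z(p = 0.025, alternative = "one.sided"),
                  designPrior = "predictive")
@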



\begin{figure}[!h]
<< "plot-powerSignificance", echo = FALSE, fig.height = 4 >>=
po <- seq(0.0001, 0.05, 0.0001)/2

## plot power
plot(po, powerSignificance(zo = p2z(po, alternative = "one.sided"),
                           designPrior = "conditional")*100,
     type = "l", ylim = c(0, 100), lwd = 1.5, ylab = "Power (%)",
     xlab = expression(italic(p)[o]), las = 1, yaxt = "n")
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
abline(h = seq(0, 100, 25), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 50, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
# abline(h = 50, col = "#333333B3", lty = 3)
lines(po, powerSignificance(zo = p2z(po, alternative = "one.sided"),
                            designPrior = "predictive")*100, lwd = 2, lty = 2)
legend("topright", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
@
\caption{Power to achieve significance of the replication study
at the one-sided 2.5\% level as a function of the (one-sided) $p$-value or
$z$-value of the original
study using the same sample size as in the original study.}
\label{fig:powerSignificance}
\end{figure}


\begin{figure}[!h]
<< "plot-sampleSizeSignificance", echo = FALSE, fig.height = 4 >>=
## plot sample size
plot(po, sampleSizeSignificance(zo = p2z(po, alternative = "one.sided"),
                                designPrior = "conditional", power = 0.8),
     type = "l", ylim = c(0.5, 8), log = "y", lwd = 1.5,
     ylab = expression(paste("Relative sample size ", n[r]/n[o])),
     xlab = expression(italic(p)[o]), las = 1, yaxt = "n")
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
axis(side = 2, las = 1, at = c(0.5, 1, 2, 4, 8, 16, 32),
     labels = c("1/2", "1", "2", "4", "8", "16", "32"))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
abline(h = c(0.5, 1, 2, 4, 8), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 1, col = "#333333B3", lty = 3)
abline(h = 1, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
lines(po, sampleSizeSignificance(zo = p2z(po, alternative = "one.sided"),
                                 designPrior = "predictive", power = 0.8),
      lwd = 2, lty = 2)
legend("topleft", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
@
\caption{Relative sample size to achieve significance of the replication study
at the one-sided 2.5\% level with 80\% power as a function of the (one-sided)
$p$-value or $z$-value of
the original study.}
\label{fig:sampleSizeSignificance}
\end{figure}




\subsection{Compatibility of effect size}
\paragraph{Analysis}
Another way for analysing a replication study is to examine the statistical compatibility
between original and replication effect estimate. A popular approach is to check
whether the replication estimate is contained within a prediction interval
based on the original estimate \citep{Patil2016, Pawel2020}.
This approach based on a $(1 - \alpha)$
prediction interval is equivalent to conducting a meta-analytic $Q$-test
with the two estimates, and rejecting compatibility when the corresponding $p$-value
$p_Q < \alpha$ \citep{Hedges2019b}. The $p$-value from the $Q$-test is usually
preferred, since it tells quantitatively how compatible the estimates are. In contrast,
a prediction interval can give a better idea about the range of plausible
replication effect estimates besides the observed one. Both approaches are
available in  \texttt{ReplicationSuccess}: The function \texttt{Qtest} returns
the $p$-value from the meta-analytic $Q$-test, whereas the function
\texttt{predictionInterval} returns a prediction interval for the replication
effect based on the original counterpart (see the documentation of the functions
for further details).
% , a prediction interval of the replication effect
% estimate can be computed under different predictive
% distributions which depend on the design prior. The default design prior
% \texttt{"predictive"} is likely the choice most people would want to use as it takes
% into account the uncertainty of the original estimate without shrinking it.
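
As a minimal sketch with made-up numbers (an original estimate of 0.4 with
standard error 0.2 and a replication estimate of 0.1 with standard error 0.15,
on some suitably transformed scale), the two functions are called as follows.

<< "Qtest-PI-example" >>=
## hypothetical study pair on a suitably transformed scale
Qtest(thetao = 0.4, thetar = 0.1, seo = 0.2, ser = 0.15)
predictionInterval(thetao = 0.4, seo = 0.2, ser = 0.15)
@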

For the four data sets, we can easily compute $p$-values from the $Q$-test, as well
as 95\% prediction intervals. For easier visual
assessment we transform the intervals and estimates back to the correlation scale.

<< "plot-predictionInterval", fig.height = 6 >>=
## compute prediction intervals for replication projects
par(mfrow = c(2, 2), las = 1, mai = rep(0.65, 4))
for (p in unique(RProjects$project)) {
    data_project <- subset(RProjects, project == p)
    pQ <- Qtest(thetao = data_project$fiso,
                thetar = data_project$fisr,
                seo = data_project$se_fiso,
                ser = data_project$se_fisr)
    PI <- predictionInterval(thetao = data_project$fiso,
                             seo = data_project$se_fiso,
                             ser = data_project$se_fisr)
    ## transforming back to correlation scale
    PI <- tanh(PI)
    incompatible <- pQ < 0.05 ## incompatible at 5% level
    color <- ifelse(incompatible == FALSE, "#00000099", "#DF536BCC")
    study <- seq(1, nrow(data_project))
    plot(data_project$rr, study, col = color, pch = 20, cex = 0.5,
         xlim = c(-0.5, 1), xlab = expression(italic(r)[r]), ylab = "Study",
         main = paste0(p, ": ", round(mean(incompatible)*100, 0), "% incompatible"))
    arrows(PI$lower, study, PI$upper, study, length = 0.02,
           angle = 90, code = 3, col = color)
    abline(v = 0, lty = 3)
}
@

While both approaches enable statements about the compatibility of original
and replication effect estimates, they carry a fundamental problem due to the
structure of the underlying hypothesis test: if $p_Q < \alpha$ (or, equivalently,
the $(1 - \alpha)$ prediction interval does not contain the replication estimate),
one has established incompatibility/non-replication. If, however, one fails to
establish incompatibility, it remains unclear whether this is due to low power
or to true compatibility of the two estimates \citep{Hedges2019b}.

% The criticism that this approach receives is that for studies which are
% underpowered, the prediction intervals will become very wide.
% This in turn can lead to very different effect estimates being compatible,
% \eg even ones that go in the opposite direction, ultimately providing
% no information about the effect itself (which actually happens in some
% cases in the economics and philosophy data sets).


\subsection{The sceptical $p\,$-value}


<<example-morewedge, echo = FALSE>>=
## data from study
morewedge <- subset(RProjects, study == "Morewedge et al. (2010), Science")

## Functions
ssv <- function(zo, so, a) {
    tau2 <- so^2/(zo^2/qnorm(p = a/2, lower.tail = FALSE)^2 - 1)
    return(tau2)
}

## Parameters and computations
optionsOld <- options(scipen = 5)
theta_o <- morewedge$fiso
theta_r <- morewedge$fisr
so <- morewedge$se_fiso
sr <- morewedge$se_fisr
c <- so^2/sr^2
zo <- theta_o/so
zr <- theta_r/sr
alpha <- 0.05
za <- qnorm(p = alpha/2, lower.tail = FALSE)
ps <- signif(pSceptical(zo = zo, zr = zr, c = c, alternative = "two.sided", type = "nominal"), 2)
ps1 <- signif(pSceptical(zo = zo, zr = zr, c = c,  alternative = "one.sided", type = "nominal"), 2)
ps1r <- signif(pSceptical(zo = zo, zr = zr, c = c,  alternative = "one.sided", type = "golden"), 2)
tau <- sqrt(ssv(zo, so, alpha)) # sufficiently sceptical prior sd at alpha = 0.05
s2_p <- 1/(1/so^2 + 1/tau^2) # posterior variance
mu_p <- s2_p*theta_o/so^2 # posterior mean
@

\paragraph{Analysis}
The \emph{sceptical $p$-value}, a quantitative measure of replication success,
was introduced in \citet{Held2020}.
Conceptually, replication success is declared if the
replication study is in conflict with a sceptical prior that would render the
original study non-significant.
The sceptical $p$-value arises from combining the analysis of credibility
\citep{Matthews2001a} with the prior-predictive check \citep{Box1980}.
Specifically, using Bayes' theorem in reverse, the prior distribution of the effect
size is determined such that, based on the original study, the $(1 - \alpha)$
credible interval of the posterior distribution of the effect just includes zero.
This prior corresponds to the objection of a sceptic who argues
that the original finding would no longer be significant if combined with a sufficiently
sceptical prior.
Replication success at level $\alpha$ is then achieved if the tail probability of
the replication estimate under its prior predictive distribution
is smaller than $\alpha$, rendering the objection of the sceptic unrealistic.
The smallest level $\alpha$ at which replication success can be declared
defines the sceptical $p$-value.
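More concretely, and as a sketch in the notation of this vignette following
\citet{Held2020}, the sufficiently sceptical prior for a two-sided level $\alpha$
is a zero-mean normal distribution with variance
\begin{equation*}
  \tau^2 = \frac{\sigma_o^2}{z_o^2/z_{\alpha/2}^2 - 1},
\end{equation*}
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal
distribution and $z_o^2 > z_{\alpha/2}^2$ is required, \ie the original study has
to be significant at the two-sided level $\alpha$.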


The method has attractive properties: the sceptical $p$-value is never smaller than
the ordinary $p$-values of the two studies, which ensures that both studies have to be
sufficiently convincing on their own for replication success to be possible.
It also takes into account the size of
the effect estimates, \ie it becomes larger if the replication estimate is
smaller than the original estimate, which guarantees that shrinkage of the
replication effect estimate is penalized.
One-sided sceptical $p$-values are preferred because they guarantee that replication
success is only possible if the directions of the original and replication effect
estimates are the same.
We will only report one-sided $p$-values in the following.
\hl{The sceptical $p$-value in its original formulation (``nominal'') can easily be
computed with the function \texttt{pSceptical} and the argument \texttt{type = "nominal"}.}

Figure~\ref{fig:examplerevBayes} shows the original study from
\citet{Morewedge2010} and its replication by the SSRP \citep{Camerer2018}.
In this example, the one-sided nominal sceptical $p$-value turns out
to be $p_\text{S} = \Sexpr{ps1}$.


% practical illustration of the procedure,
% also see \citet{Held2020} and \citet{Held_etal2022} for technical details.

\begin{figure}[!h]
<< "plot-pSceptical", echo = FALSE, fig.height = 4 >>=

## Plot
df <- data.frame(estimate = factor(c("Original Study",
                                     # "Posterior",
                                     # "Sceptical Prior",
                                     "Replication Study"),
                                   levels = c("Original Study",
                                              # "Posterior",
                                              # "Sceptical Prior",
                                              "Replication Study")),
                 x = c(1,  2),
                 col = c(1, 4),
                 theta = c(theta_o,
                           theta_r),
                 lower = c(theta_o - za*so,
                           theta_r - za*sr),
                 upper = c(theta_o + za*so,
                           theta_r + za*sr),
                 p = c(signif(2*pnorm(q = zo, lower.tail = FALSE), 1),
                       signif(2*pnorm(q = zr, lower.tail = FALSE), 2)))

ylims <- c(-0.1, 1)
xlims <- c(0.5, 2.5)
cex <- 1
plot(x = df$x, y = df$theta, pch = " ",
     xlim = xlims, ylim = ylims,
     xlab = "", ylab = "Effect Size", xaxt = "n", las = 1)
axis(side = 1, at =  df$x, labels =  df$estimate,
     cex.axis = 0.9, mgp = c(3, 2, 0))
abline(h = seq(-2, 2, 0.25), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.4))
abline(h = 0, lty = 2)

# ## one-sided CIs
# arrows(x0 = df$x[-3], x1 = df$x[-3], y0 = df$theta[-3] + 0.05, y1 = df$upper[-3],
#        col = df$col[-3], code = 2, angle = 90, length = 0, lwd = 1.5)
# points(x = df$x[-3], y = df$upper[-3], pch = 17, cex = 1.3*cex, col = df$col[-3])

## two-sided CI
arrows(x0 = df$x, x1 = df$x, y0 = df$theta + 0.05, y1 = df$upper,
       col = df$col, code = 2, angle = 90, length = 0.1, lwd = 1.5)
arrows(x0 = df$x, x1 = df$x, y0 = df$theta - 0.05, y1 = df$lower,
       col = df$col, code = 2, angle = 90, length = 0.1, lwd = 1.5)

## arrows and text
points(x = df$x, y = df$theta, pch = 20, cex = cex*1.5, col = df$col)

text(x = df$x[1] - 0.2, y = df$theta[1],
       labels = bquote(paste(hat(theta)[o], " = ", .(round(df$theta[1], 2)))))


text(x = df$x[2] + 0.2, y = df$theta[2],
       labels = bquote(paste(hat(theta)[r], " = ", .(round(df$theta[2], 2)))))

text(x = df$x[1], y = df$upper[1] + 0.1,
       labels = bquote(paste(p[o], " = ", .(round(morewedge$po1, 4)))))


text(x = df$x[2], df$upper[2] + 0.1,
       labels = bquote(paste(p[r], " = ", .(round(morewedge$pr1, 4)))))


text(x = 1.5, y = 0.3,
     labels = bquote(paste(p[S], " = ", .(ps1))))




# arrows(x0 = 2.1, x1 = 2.9, y0 = 0.9, y1 = 0.9, length = 0.05, lwd = 1.2, col = "#000000B2")
# text(x = 2.5, y = 0.98, labels = "reverse-Bayes", cex = 0.75, col = "#000000B2")
# arrows(x0 = 3.15, x1 = 4.5, y0 = 0.9, y1 = 0.9, length = 0.05, lwd = 1.2, col = "#000000B2")
# text(x = 3.8, y = 0.98, labels = "prior-data conflict assessment", cex = 0.75, col = "#000000B2")
# points(x = 2, y = 0, pch = 1, cex = 2)
# arrows(x0 = 1.85, x1 = 1.98, y0 = -0.2, y1 = -0.03, length = 0.05, col = "#000000B2")
# text(x = 1.85, y = -0.25, labels = "fixed at zero", col = "#000000B2", cex = 0.75)
box()

@
\caption{Original study from
\citet{Morewedge2010} and its replication by the SSRP \citep{Camerer2018}. Shown
are the effect estimates $\hat\theta_o$ and $\hat\theta_r$ together
with 95\% confidence intervals, the one-sided $p$-values $p_o$ and $p_r$,
and the one-sided nominal sceptical $p$-value $p_S$.
% The one-sided sceptical $p$-value is $p_\text{S} = \Sexpr{ps1}$ for this
% example.
% if a
% recalibration with the golden level is applied it becomes
% $\tilde{p}_\text{S} = \Sexpr{ps1r}$.
}
\label{fig:examplerevBayes}
\end{figure}

The example study can be isolated with the following command:
<<morewedge, echo = TRUE>>=
morewedge <- subset(RProjects, study == "Morewedge et al. (2010), Science")
@




% This method provides a sound approach to quantify
% replication success and it has attractive properties.


<<echo = FALSE>>=
ps_gold <- signif(pSceptical(zo = zo, zr = zr, c = c,  alternative = "one.sided", type = "golden"), 2)
ps_contr <- signif(pSceptical(zo = zo, zr = zr, c = c,  alternative = "one.sided", type = "controlled"), 2)
@


{\flushleft\textit{\hl{Interpretation and recalibration}}} \\
\hl{We recommend interpreting the sceptical $p$-value as a continuous
measure of replication success. However, if a dichotomous answer to
``did it replicate?'' is needed, a threshold is required.
The sceptical $p$-value can be thresholded at $\alpha$ just like an ordinary
$p$-value. Our default choice is $\alpha = 0.025$ for one-sided $p$-values.}

\hl{\citet{Held2020} calculated the nominal sceptical $p$-value for a number
of case studies and found a relatively low rate of replication success.
In addition,
\citet{Held_etal2022} further studied the properties of the nominal
sceptical $p$-value and concluded that it may be too stringent for
most realistic scenarios. Two recalibrations
were subsequently proposed: the golden and the controlled sceptical $p$-value.
They can be computed using the \texttt{pSceptical} function with
\texttt{type = "golden"} or \texttt{type = "controlled"}, respectively,
and should also be thresholded at $\alpha$.}

\hl{The two recalibration types can be summarized as follows:}
  \begin{itemize}
    \item \hl{The \emph{golden} sceptical $p$-value \citep{Held_etal2022} provides an attractive
    balance between significance testing and effect size comparison. It
    ensures that for original
    studies that are just significant at the specified significance level
    $\alpha$, replication success is only possible if the replication effect
    estimate is larger than the original one (\ie if there is no shrinkage).
    The golden sceptical
    $p$-value controls the overall Type-I error rate for $c \geq 1$, but
    does not provide exact overall Type-I error control for every $c$.
    This is our recommended default type of recalibration.
    In the \citet{Morewedge2010} replication example, we obtain
    $p_S = \Sexpr{ps_gold} < \Sexpr{alpha/2}$, so the replication is
    successful with the golden sceptical $p$-value.}
    %
    \item \hl{The \emph{controlled} sceptical $p$-value \citep{Held_etal2022-Fram}
    ensures exact overall Type-I error control
    for any value of the variance ratio $c$. The overall Type-I error rate is
    controlled at level $\alpha^2$
    for \texttt{alternative = "two.sided"} (where $\alpha$ is two-sided)
    or for \texttt{alternative = "one.sided"} (where $\alpha$ is one-sided)
    if the direction of the effect was pre-specified.
% For alternative is \texttt{"one.sided"}
% and no pre-specified direction, the
% Type-I error rate is controlled at level $2\alpha^2$
% (the conventional level for type I error control of two
% independent experiments with
% two-sided testing at level $2\alpha$ in the first and one-sided
% testing at level $\alpha$ in the second).
  In the \citet{Morewedge2010} replication example, we obtain
    $p_S = \Sexpr{ps_contr} < \Sexpr{alpha/2}$, so the replication is also
    successful with the controlled sceptical $p$-value.}
  \end{itemize}

\hl{The three types of sceptical $p$-values for the \citet{Morewedge2010} example are calculated
as follows:}

<<echo = TRUE>>=
print(pS_nominal <- pSceptical(zo = morewedge$zo, zr = morewedge$zr,
                          c = morewedge$c,  alternative = "one.sided",
                          type = "nominal"))
print(pS_golden <- pSceptical(zo = morewedge$zo, zr = morewedge$zr,
                         c = morewedge$c,  alternative = "one.sided",
                         type = "golden"))
print(pS_controlled <- pSceptical(zo = morewedge$zo, zr = morewedge$zr,
                             c = morewedge$c,  alternative = "one.sided",
                             type = "controlled"))
@



The one-sided  \hl{golden and controlled} sceptical $p$-values are computed
for the four one-to-one replication projects as follows:
<< "plot-pSceptical-projects", fig.height = 4 >>=
## computing one-sided golden and controlled sceptical p-value for replication projects
RProjects$psG <- with(RProjects,
                     pSceptical(zo = zo, zr = zr, c = c,
                                alternative = "one.sided", type = "golden"))
RProjects$psC <- with(RProjects,
                     pSceptical(zo = zo, zr = zr, c = c,
                                alternative = "one.sided", type = "controlled"))
@

The success rates in the four projects according to the statistical significance
criterion and to the golden and controlled sceptical $p$-values are as follows:

<<echo = FALSE>>=

for (p in unique(RProjects$project)) {
    data_project <- subset(RProjects, project == p)
    cat(paste0(p, ": \n"))
    success_sceptG <- (data_project$psG < 0.025)
    cat(paste0(round(mean(success_sceptG)*100, 2),
               "% smaller than 0.025 (one-sided golden sceptical p-value) \n"))
    success_sceptC <- (data_project$psC < 0.025)
    cat(paste0(round(mean(success_sceptC)*100, 2),
               "% smaller than 0.025 (one-sided controlled sceptical p-value) \n"))
    success_tradit <- (data_project$po1 < 0.025) & (data_project$pr1 < 0.025)
    cat(paste0(round(mean(success_tradit)*100, 2),
               "% smaller than 0.025 (both one-sided traditional p-values) \n"))
    if(sum(success_sceptG != success_tradit) > 0){
        discrep <- data_project[(success_sceptG != success_tradit),
                                c("ro", "rr", "c", "po1", "pr1", "psG")]
        ## print effect estimates, 1sided p-values, and c of discrepant studies
        cat("Discrepant studies (golden vs significance): \n")
        print(signif(discrep, 2), row.names = FALSE)
    }

        if(sum(success_sceptC != success_tradit) > 0){
        discrep <- data_project[(success_sceptC != success_tradit),
                                c("ro", "rr", "c", "po1", "pr1", "psC")]
        ## print effect estimates, 1sided p-values, and c of discrepant studies
        cat("Discrepant studies (controlled vs significance): \n")
        print(signif(discrep, 2), row.names = FALSE)
        }

        if(sum(success_sceptC != success_sceptG) > 0){
        discrep <- data_project[(success_sceptC != success_sceptG),
                                c("ro", "rr", "c", "po1", "pr1", "psG", "psC")]
        ## print effect estimates, 1sided p-values, and c of discrepant studies
        cat("Discrepant studies (golden vs controlled): \n")
        print(signif(discrep, 2), row.names = FALSE)
  }
  cat("\n \n")
}
@


For most studies the three approaches agree, but there is
disagreement for seven of them.
The golden sceptical $p$-value may not indicate
replication success when there is substantial shrinkage of the replication
effect estimate relative to the original one, even if both estimates are
significant (this is the case for one study in the psychology project, two
studies in the social sciences project, and one study in the philosophy project).
In contrast, provided there is not much shrinkage, it may still indicate
replication success for non-significant original or replication studies (this is
the case for two studies in the psychology project).
On the other hand, the controlled sceptical $p$-value can indicate success for a
non-significant original or replication study, even in the presence of considerable
shrinkage (this is the case for one study in the psychology project and one in the
experimental economics project).

The following plot shows golden $p_S$ versus controlled $p_S$ for all study
pairs, stratified by project.

<<golden-vs-controlled, fig.height = 6>>=
par(mfrow = c(2, 2), las = 1, pty = "s",
    mar = c(4, 2.5, 2.5, 1))
myaxis <- c(10e-5, 0.001, 0.01, 0.1, 1)
for (p in unique(RProjects$project)) {
  data_project <- subset(RProjects, project == p)
  plot(psG ~ psC, data = data_project, ylim = c(10e-5, 1),
       xlim = c(10e-5, 1), main = p, xlab = expression(paste("controlled ", p[S])),
       ylab = expression(paste("golden ", p[S])),
       pch = 19,
       col = "#00000099",
       cex = 1.3,
       axes = FALSE,
       log = "xy")
  abline(h = 0, lty = 2)
  abline(a = 0, b = 1, col = "grey")
  abline(h = 0.025, lty = 2, col = "grey")
  abline(v = 0.025, lty = 2, col = "grey")
  axis(1, at = myaxis, labels = myaxis)
  axis(2, at = myaxis, labels = myaxis)
  box()
}

@


  << "threshold-p-sceptical2", echo = FALSE >>=
## computing nominal, controlled, and golden replication success levels 
## for one-sided uncalibrated sceptical p-value
thresh_gol <- levelSceptical(level = 0.025, alternative = "one.sided",
                              type = "golden")
thresh_contr <- levelSceptical(level = 0.025, alternative = "one.sided",
                                type = "controlled", c = c(1, 2))
thresh_nom <- levelSceptical(level = 0.025, alternative = "one.sided",
                              type = "nominal")
@





{\flushleft\textit{\hl{Replication success level}}} \\
%
\hl{Instead of recalibrating the sceptical $p$-value and thresholding
it at $\alpha$, it is equivalent to use the uncalibrated
sceptical $p$-value and threshold it at the replication success
level $\gamma$.
The replication success level is computed with \texttt{levelSceptical}
and the desired recalibration \texttt{type} (\texttt{"golden"} or \texttt{"controlled"}).
The golden replication success level depends only on $\alpha$ and is
$\Sexpr{round(thresh_gol, 3)}$ for $\alpha = 0.025$.
The controlled replication success level depends on $\alpha$ and the variance ratio $c$.
It is, for example, $\Sexpr{round(thresh_contr[1], 3)}$ for $c = 1$ and
$\Sexpr{round(thresh_contr[2], 3)}$ for $c = 2$.}
\hl{Using \texttt{type = "nominal"} simply returns $\alpha$.}

<< "threshold-p-sceptical" >>=
## computing nominal, golden and controlled replication success levels
## for one-sided uncalibrated sceptical p-value

print(rs_level_nom <- levelSceptical(level = 0.025, alternative = "one.sided",
                              type = "nominal"))
print(rs_level_gol <- levelSceptical(level = 0.025, alternative = "one.sided",
                              type = "golden"))
print(rs_level_contr <- levelSceptical(level = 0.025, alternative = "one.sided",
                                type = "controlled", c = c(1, 2)))

@


\hl{The replication success level $\gamma$ is also the bound
on the partial Type-I error rate of the sceptical $p$-value
\citep[Section 3.1]{Held_etal2022-Fram}. This means that the individual
$p$-values $p_o$ and $p_r$ both need to be smaller than $\gamma$
for replication success to be possible with the sceptical $p$-value.}




\paragraph{Design}
Sample size calculations work similarly to those based
on statistical significance:
% Design works similarly as for the statistical significance analysis strategy;
using the function \texttt{sampleSizeReplicationSuccess}, we need to specify a design
prior, \hl{a threshold $\alpha$ for the sceptical $p$-value,} a recalibration
type, and the desired power
to obtain the required relative sample size $c = n_r/n_o$.
The following code shows two examples.

<< >>=
sampleSizeReplicationSuccess(zo = 2.5, power = 0.8, level = 0.025,
                             alternative = "one.sided",
                             designPrior = "conditional",
                             type = c("golden", "controlled"))

sampleSizeReplicationSuccess(zo = 2.5, power = 0.8, level = 0.025,
                             alternative = "one.sided",
                             designPrior = "predictive",
                             type = c("golden", "controlled"))
@

The function \texttt{powerReplicationSuccess} allows calculating the
power for a fixed relative sample size $c$.
Figure \ref{fig:powerReplicationSuccess} shows the power to achieve replication
success \hl{with the golden and controlled sceptical $p$-values}
as a function of the one-sided $p$-value (or
$z$-value) of the original study, assuming equal sample sizes in the original and
replication studies ($c = 1$).
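
As a minimal numerical sketch mirroring the settings used in
Figure~\ref{fig:powerReplicationSuccess} (equal sample sizes are assumed via the
function defaults), the power for a just significant original result can be
computed as follows.

<< "powerReplicationSuccess-example" >>=
## power for a just significant original result (po = 0.025)
powerReplicationSuccess(zo = p2z(p = 0.025, alternative = "one.sided"),
                        designPrior = "predictive", level = 0.025,
                        alternative = "one.sided", type = "golden")
powerReplicationSuccess(zo = p2z(p = 0.025, alternative = "one.sided"),
                        designPrior = "predictive", level = 0.025,
                        alternative = "one.sided", type = "controlled")
@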

% The probability for replication success if the original
% study showed only weak evidence ($p_o = 0.025$) is now smaller than 0.5,
% which is reached for an original $p$-value of around 0.015.

\begin{figure}[!h]
<<"plot-powerReplicationSuccess", echo = FALSE, fig.height = 4>>=
## plot power
po <- seq(0.001, 0.05, length.out = 100)/2
plot(po, powerReplicationSuccess(zo = p2z(po, alternative  = "one.sided"),
                                 designPrior = "conditional",
                                 level = 0.025,
                                 alternative = "one.sided",
                                 type = "golden")*100,
     type = "l", ylim = c(0, 100), lwd = 1.5, ylab = "Power (%)", las = 1,
     xlab = expression(italic(p)[o]), yaxt = "n")
abline(h = seq(0, 100, 25), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3))
# abline(h = 50, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
mtext(text = expression(paste( italic(z)[o])), side = 3, line = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  designPrior = "predictive",
                                  level = 0.025,
                                  alternative = "one.sided",
                                  type = "golden")*100,
      lwd = 2, lty = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative  = "one.sided"),
                                 designPrior = "conditional",
                                 level = 0.025,
                                 alternative = "one.sided",
                                 type = "controlled")*100,
      col = "red", lwd = 2)
lines(po, powerReplicationSuccess(zo = p2z(po, alternative  = "one.sided"),
                                 designPrior = "predictive",
                                 level = 0.025,
                                 alternative = "one.sided",
                                 type = "controlled")*100,
      col = "red", lty = 2, lwd = 2)

legend("topright", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")
# abline(h = 50, lty = 3)

legend("bottomleft",
       legend = c(expression(paste("golden ", p[S])),
                  expression(paste("controlled ", p[S]))),
       col = c("black", "red"),
       lty = 1, lwd = 2,
       bty = "n")
@
\caption{Power to achieve replication success (with the golden and controlled
sceptical $p$-values and
$\alpha = 0.025$) as a function of the one-sided $p$-value or $z$-value of
the original study.}
\label{fig:powerReplicationSuccess}
\end{figure}
%
Figure \ref{fig:sampleSizeReplicationSuccess} shows the required sample size to
achieve replication success \hl{with the golden and controlled sceptical $p$-values}
with 80\% power.
With the golden sceptical $p$-value, the required relative sample size
gets very large for borderline significant original studies, and replication success
is even impossible for non-significant original studies.
We thus recommend using the controlled sceptical $p$-value for
sample size calculations, as it also allows sample size calculations
for non-significant original studies.

\begin{figure}[!h]
<< "plot-sampleSizeReplicationSuccess", echo = FALSE, fig.height = 4>>=
# po <- seq(0.0001, 0.05, 0.0001)
po <- seq(0.001, 0.05, length.out = 100)/2

## plot sample size
plot(po,
     sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  power = 0.8,
                                  level = 0.025,
                                  designPrior = "conditional",
                                  alternative = "one.sided",
                                  type = "golden"),
     type = "l", log = "y", lwd = 1.5, las = 1, ylim = c(0.5, 20),
     ylab = expression(paste("Relative sample size ", n[r]/n[o])),
     xlab = expression(italic(p)[o]), yaxt = "n")
axis(side = 2, las = 1, at = c(0.5, 1, 2, 4, 8, 16, 32),
     labels = c("1/2", "1", "2", "4", "8", "16", "32"))
# abline(h = 1, lty = 3)
abline(h = c(0.5, 1, 2, 4, 8, 16, 32), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3))
abline(h = 1, lty = 1, lwd = 1.5, col = adjustcolor("lightgrey", alpha.f = 0.4))
axis(side = 3, at = seq(0.0, 0.025, by = 0.005),
     labels = c("", round(p2z(p = seq(0.005, 0.025, by = 0.005),
                              alternative = "one.sided"), 2)))
mtext(text = expression(paste(italic(z)[o])), side = 3, line = 2)
suppressWarnings({
    lines(po, sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                           power = 0.8, level = 0.025,
                                           designPrior = "predictive",
                                           alternative = "one.sided",
                                           type = "golden"),
          lwd = 2, lty = 2)

  lines(po,
     sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  power = 0.8,
                                  level = 0.025,
                                  designPrior = "conditional",
                                  alternative = "one.sided",
                                  type = "controlled"),
     col = "red", lwd = 2)

    lines(po,
     sampleSizeReplicationSuccess(zo = p2z(po, alternative = "one.sided"),
                                  power = 0.8,
                                  level = 0.025,
                                  designPrior = "predictive",
                                  alternative = "one.sided",
                                  type = "controlled"),
     col = "red", lwd = 2, lty = 2)
})
legend("topleft", legend = c("conditional", "predictive"),
       title = "Design prior", lty = c(1, 2), lwd = 1.5, bty = "n")

legend("bottomright",
       legend = c(expression(paste("golden ", p[S])),
                  expression(paste("controlled ", p[S]))),
       col = c("black", "red"),
       lty = 1, lwd = 2,
       bty = "n")
@
\caption{Relative sample size to achieve replication success (with the golden
and controlled sceptical $p$-values and
$\alpha = 0.025$) with 80\% power as a function of the (one-sided)
$p$-value or $z$-value of the original study.}
\label{fig:sampleSizeReplicationSuccess}
\end{figure}

\subsection{Relative effect size}
\paragraph{Analysis}
The requirements on the replication $p$-value ($p_r <  \alpha$)
and on the sceptical $p$-value ($p_S < \alpha$)
can both be transformed to requirements on the relative effect size
$d = \hat{\theta}_r/\hat{\theta}_o$ \citep{Held_etal2022}.
% (the
% ratio of the replication effect estimate to the original effect estimate).
In short, replication success is
declared if the relative effect size is larger than a certain bound,
the minimum relative effect size $\dmin$. The minimum relative effect size
is computed based on
the result of the original study, the relative sample size, and the
significance level (respectively the threshold for the sceptical $p$-value);
for the sceptical $p$-value, the type of recalibration is also required.
The corresponding functions are
\texttt{effectSizeSignificance} and \texttt{effectSizeReplicationSuccess}
for significance and the sceptical $p$-value, respectively.
<< echo = FALSE >>=
d <- morewedge$fisr/morewedge$fiso
dminrs <- effectSizeReplicationSuccess(zo = morewedge$zo, c = morewedge$c)
dminsign <- effectSizeSignificance(zo = morewedge$zo, c = morewedge$c)
@
For the \citet{Morewedge2010} example, the
minimum relative effect size is
$\dmin = \Sexpr{round(dminsign, 2)}$ for significance and
$\dmin = \Sexpr{round(dminrs, 2)}$ for replication success with the golden
sceptical $p$-value.
Since the observed relative
effect size is $d = \Sexpr{round(d, 2)}$,
the replication is successful with both methods.
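
These quantities can be recomputed explicitly; the following sketch relies on the
default arguments of the two functions, which, as stated above, correspond to the
golden recalibration for \texttt{effectSizeReplicationSuccess}.

<< "effectSize-example" >>=
## observed relative effect size and minimum relative effect sizes
morewedge$fisr/morewedge$fiso
effectSizeSignificance(zo = morewedge$zo, c = morewedge$c)
effectSizeReplicationSuccess(zo = morewedge$zo, c = morewedge$c)
@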


\section{Special topics}

\subsection{Interim analysis}
Adaptive designs are designs in which one or more interim analyses
are planned during the course of a study. This topic has been studied extensively
and is widely used in clinical trials, for example, where continuing a study
that should be stopped may lead to serious consequences.
However, this type of design has rarely been considered
in the framework of replication studies. An adaptive
design was adopted in the Social Sciences Replication Project, but without
a power (re)calculation at interim.

\texttt{ReplicationSuccess} allows calculating the power of the replication
study after an interim analysis has been performed, taking into account the
results from the first part of the study.
The power at interim is a useful tool to decide whether a replication
study should be stopped prematurely for futility \citep{MicheloudHeld2021}.
The function
\texttt{powerSignificanceInterim} is an extension of \texttt{powerSignificance}
and additionally requires the specification of \texttt{zi}, the $z$-value
at the interim analysis, and \texttt{f}, the fraction of the replication
study already completed. Moreover, the argument \texttt{designPrior} can
be set to \texttt{"conditional"}, \texttt{"informed predictive"}, or
\texttt{"predictive"}. Finally, the argument \texttt{analysisPrior} allows
the original result to also be taken into account in the analysis of the
replication study.
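
As a minimal sketch with made-up numbers (an original $z$-value of 2.5, an interim
$z$-value of 1, equal sample sizes, and half of the replication data collected),
the power at interim can be computed as follows.

<< "interim-example" >>=
## hypothetical interim analysis: half of the replication data collected
powerSignificanceInterim(zo = 2.5, zi = 1, c = 1, f = 1/2,
                         designPrior = "conditional")
powerSignificanceInterim(zo = 2.5, zi = 1, c = 1, f = 1/2,
                         designPrior = "predictive")
@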
% The function \texttt{sampleSizeSignificanceInterim} can
% be used to re-estimate the sample size of the replication study when
% results from an interim analysis are available.

% Figure~\ref{fig:PowerSignificanceInterim} shows the conditional and the
% predictive power of the replication studies that continued into stage 2.
% While the conditional power at interim is larger than 80\% for all the studies,
% the predictive power is close to 0\% for some studies and always smaller
% than the conditional power.

% \begin{figure}
<< echo = FALSE, fig.height = 3.5, eval = FALSE >>=
data("SSRP")
intpow_cond <- with(SSRP, powerSignificanceInterim(zo = fiso/se_fiso, zi = fisi/se_fisi, c = nr/no,
                                                   f = ni/nr, designPrior = "conditional"))
intpow_pred <- with(SSRP, powerSignificanceInterim(zo = fiso/se_fiso, zi = fisi/se_fisi, c = nr/no,
                                                   f = ni/nr, designPrior = "predictive"))
plot(intpow_cond*100, intpow_pred*100,
     xlab = "Conditional power (in %)",
     ylab = "Predictive power (in %)",
     pch = 20,
     cex = 1.2,
     xlim = c(80, 100),
     ylim = c(0, 100),
     yaxt = "n", las = 1,
     col = "#00000099")
axis(side = 2, at = seq(0, 100, 25), las = 1)
abline(a = 0, b = 1, col = "grey")
@
% \caption{Conditional vs.\ predictive power at interim of the 10 studies from the
% Social Science Replication Project that were not stopped after stage 1. The grey
% line indicates the same value for conditional and predictive power.}
% \label{fig:PowerSignificanceInterim}
% \end{figure}

\subsection{Between-study heterogeneity}
It is often more realistic to assume that the unknown effect sizes from original
and replication studies are not exactly the same but that there is between-study
heterogeneity. This can be the case, for example, if the replication study is
conducted with a different population of participants (\eg in a different
country) or in another laboratory with different equipment. For this reason, the
functions for the design of replication studies allow additional uncertainty due
to between-study heterogeneity to be incorporated. Some functions
in the package (\eg
\texttt{sampleSizeSignificance} or \texttt{predictionInterval}) allow specifying
the argument \texttt{h}, the relative between-study heterogeneity variance
$h = \tau^2/\sigma^2$, \ie the ratio of the heterogeneity variance to the
variance of the original effect estimate. By default, \texttt{h} is set to zero;
however, if between-study heterogeneity is expected, \eg because a different
population of study participants is used, this should be taken into account in
the design. See \citet{Pawel2020} for more details.
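
As a minimal sketch (the original $z$-value and the amount of heterogeneity are
made up for illustration), the impact of between-study heterogeneity on the
required relative sample size can be examined as follows.

<< "heterogeneity-example" >>=
## required relative sample size with and without between-study heterogeneity
sampleSizeSignificance(zo = 2.5, power = 0.8, designPrior = "predictive", h = 0)
sampleSizeSignificance(zo = 2.5, power = 0.8, designPrior = "predictive", h = 0.5)
@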

\subsection{Data-driven shrinkage with empirical Bayes}
The functions for the design of replication studies allow specifying the argument
\texttt{shrinkage} in order to shrink the original effect estimate towards zero
by a certain (arbitrary) amount. A more principled approach is to use a design
prior which induces shrinkage and then estimate the prior variance by empirical
Bayes. This leads to ``data-driven'' shrinkage that is larger when there was
only weak evidence for the effect, and smaller when there was strong evidence
for the effect (shown in Figure \ref{fig:ebshrinkage}). Furthermore, under this
prior, the specified between-study heterogeneity will also induce shrinkage
towards zero; for details see \citet{Pawel2020}. Empirical Bayes shrinkage can
be enabled by setting the design prior argument to \texttt{"EB"}.
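
As a minimal sketch (again with a made-up original $z$-value), a sample size
calculation with empirical Bayes shrinkage looks as follows.

<< "eb-example" >>=
## required relative sample size with an empirical Bayes design prior
sampleSizeSignificance(zo = 2.5, power = 0.8, designPrior = "EB")
@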
\begin{figure}[!htbp]
<<"shrinkage", echo = FALSE, fig.height = 3.5 >>=
zo <- seq(0, 4, 0.01)
s <- pmax(1 - 1/zo^2, 0)
shrinkage <- (1 - s)*100
plot(zo, shrinkage, type = "l", ylim = c(0, 100), las = 1,
     xlab = expression(paste(italic(z)[o])), ylab = "Shrinkage (%)",
     yaxt = "n")
axis(side = 2, at = seq(0, 100, 25), las = 1)
axis(side = 3, at = seq(0, 4, by = 1),
     labels = c(signif(z2p(seq(0, 3, by = 1), alternative = "one.sided"), 2),
                signif(z2p(4, alternative = "one.sided"), 1)))
abline(h = seq(0, 100, 25), lty = 1, col = adjustcolor("lightgrey", alpha.f = 0.3))
mtext(text = expression(italic(p)[o]), side = 3, line = 2)
@
\caption{Empirical Bayes shrinkage when there is no between-study heterogeneity.}
\label{fig:ebshrinkage}
\end{figure}


\section{Outlook}
Development of \texttt{ReplicationSuccess} will continue. We invite anyone with
ideas for new functionality, bug reports, or other contributions to the package
to get in touch with us via GitHub (\url{https://github.com/crsuzh/ReplicationSuccess/}).

\bibliography{bibliography}

<<include=FALSE>>=
options(optionsOld)
par(parOld)
@

\end{document}