\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}
%% Describe goal

The model I use is a
hierarchical logistic regression model where the
hierarchies correspond to the 22 top-level ICD-10 disease categories.
The goal is to take each snapshot and predict the probability of termination.

First, some notation:
\begin{itemize}
    \item $i$: indexes trials.
    \item $n$: indexes trial snapshots.
    \item $y_i$: whether trial $i$
        terminated (true, 1) or completed (false, 0).
    \item $d_i$: indexes the ICD-10 disease category of trial $i$.
    \item $x_{i,n}$: represents the independent
        variables associated with snapshot $n$ of trial $i$.
\end{itemize}

The actual specification of the model to measure
the direct effect of enrollment is:
\begin{align}
    y_i \sim \text{Bernoulli}(p_{i,n}) \\
    p_{i,n} = \text{logit}^{-1}(x_{i,n} \vec \beta(d_i))
\end{align}
where $\vec \beta$ is indexed by
$d \in \{1,2,\dots,21,22\}$
for each general ICD-10 category.
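
Written out, the linear predictor is mapped to a probability by the inverse of the logit function (the logistic function); this is a standard identity rather than anything specific to this model:
\begin{align*}
    p_{i,n} = \frac{1}{1 + \exp\left(-x_{i,n} \vec \beta(d_i)\right)},
    \qquad \text{equivalently} \qquad
    \ln\left(\frac{p_{i,n}}{1 - p_{i,n}}\right) = x_{i,n} \vec \beta(d_i).
\end{align*}
A linear predictor of zero therefore corresponds to $p_{i,n} = 0.5$, and each coefficient shifts the log-odds of termination.
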
The $\beta$s are distributed
\begin{align}
    \beta_k(d_i) \sim \text{Normal}(\mu_k,\sigma_k)
\end{align}
where $k$ indexes the coefficients, with hyperpriors
%Checked on 2024-11-27. Is correct.
\begin{align}
    \mu_k \sim \text{Normal}(0,0.05) \\
    \sigma_k \sim \text{Gamma}(4,20)
\end{align}
\todo{Double check actual spec}

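
To give a sense of scale for the hyperprior on $\sigma_k$, here is a short worked calculation.
It reads $\text{Gamma}(4,20)$ as shape $4$ and rate $20$, which is an assumption, since the parameterization is not stated here:
\begin{align*}
    E(\sigma_k) = \frac{4}{20} = 0.2,
    \qquad
    \text{sd}(\sigma_k) = \frac{\sqrt{4}}{20} = 0.1.
\end{align*}
Under the alternative shape-scale reading the mean would instead be $4 \times 20 = 80$, so the shape-rate reading seems the more plausible one for coefficients on the log-odds scale.
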
The independent variables include:
\begin{subequations}
    \begin{align}
        x_{i,n} \vec \beta(d_i)
        = & \bx{1}{\text{Elapsed Duration}} \\
        &+ \bx{2}{\arcsinh \left(\text{\# Generic compounds}\right)} \\
        &+ \bx{3}{\arcsinh \left(\text{\# Branded compounds}\right)} \\
        &+ \bx{4}{\text{\# DALYs in High SDI Countries}} \\
        &+ \bx{5}{\text{\# DALYs in High-Medium SDI Countries}} \\
        &+ \bx{6}{\text{\# DALYs in Medium SDI Countries}} \\
        &+ \bx{7}{\text{\# DALYs in Low-Medium SDI Countries}} \\
        &+ \bx{8}{\text{\# DALYs in Low SDI Countries}} \\
        &+ \bxi{9}{\text{Not yet Recruiting}}{\text{Trial Status}}\\
        &+ \bxi{10}{\text{Recruiting}}{\text{Trial Status}}\\
        &+ \bxi{11}{\text{Enrolling by Invitation Only}}{\text{Trial Status}}\\
        &+ \bxi{12}{\text{Active, not recruiting}}{\text{Trial Status}}
    \end{align}
\end{subequations}
The arcsinh transform is used because it is similar to a log transform but
is defined and differentiable at counts of zero, where
$\operatorname{arcsinh}(0) = \ln (0 + \sqrt{0^2 + 1}) = 0$.
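
To make the comparison with the log transform concrete, a short worked note using only standard identities:
\begin{align*}
    \operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right),
    \qquad
    \frac{d}{dx}\operatorname{arcsinh}(x) = \frac{1}{\sqrt{x^2 + 1}}.
\end{align*}
The derivative equals $1$ at $x = 0$, so the transform is approximately linear for small counts.
For large counts, $\operatorname{arcsinh}(x) \approx \ln(2x) = \ln(x) + \ln(2)$, which differs from the log transform only by a constant.
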
%%%% Not sure if space should go here. I think these work well together.
Some of the other variables are implicitly controlled for because they are used
to select the trials of interest.
These include:
\begin{itemize}
    \item The trial is Phase 3.
    \item The trial has a Data Monitoring Committee.
    \item The compounds are FDA-regulated drugs.
    \item The trial was never suspended.\footnote{
        This was because I wasn't sure how to handle suspensions in the model
        when I started scraping the data.
        Later the website changed.
        This is technically post selection.
        \todo{double check where this happened in the code.
        I may have only done it in the CBO analysis.}
    }
\end{itemize}

\subsection{Interpretation}
% Explain
% - What do we care about? Changes in the probability of
% - distribution of differences -> relate to E(\delta Y)
% - How do we obtain this distribution of differences?
% - from the model, we pay attention to P under treatment and control
% - We obtain this by fitting the model, then simulating under treatment and control, and taking the difference in the probability.
% -

The specific measure of interest is how much a delay in
closing enrollment changes the probability of terminating a trial,
$p_{i,n}$ in the model.

In standard reduced-form causal inference, the treatment effect
of interest for outcome $Z$ is measured as
\begin{align}
    E(Z(\text{Treatment}) - Z(\text{Control}))
    = E(Z(\text{Treatment})) - E(Z(\text{Control}))
\end{align}
Because $Z(\text{Treatment})$ and $Z(\text{Control})$ are random variables,
$Z(\text{Treatment}) - Z(\text{Control}) = \delta_Z$ is also a random variable.
In the Bayesian framework, this parameter has a distribution, and so
we can calculate the distribution of differences in
the probability of termination due to a given delay in
closing recruitment,
$p_{i,n}(T) - p_{i,n}(C) = \delta_{p_{i,n}}$.

I calculate the posterior distribution of $\delta_{p_{i,n}}$ by estimating the
posterior distributions of the $\beta$s and then simulating $\delta_{p_{i,n}}$.
This involves taking a draw from the posterior distribution of the $\beta$s, calculating
$p_{i,n}(C)$
for the underlying trials at the snapshot when they close enrollment,
and then calculating
$p_{i,n}(T)$
under the counterfactual where enrollment had not yet closed.
The difference
$\delta_{p_{i,n}}$
is then calculated for each trial and saved.
After repeating this for all the posterior samples and
all trials at their point of close, we have an estimate
of the posterior distribution of differences between treatment and control
for the selected trials.

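
As a sketch of this procedure in symbols (the draw index $s$, the number of posterior draws $S$, and the covariate vectors $x_i^{C}$ and $x_i^{T}$ are notation introduced here for illustration), each posterior draw $\vec\beta^{(s)}$ gives one simulated difference per trial:
\begin{align*}
    \delta_{p_i}^{(s)}
    = \text{logit}^{-1}\left(x_i^{T} \vec\beta^{(s)}(d_i)\right)
    - \text{logit}^{-1}\left(x_i^{C} \vec\beta^{(s)}(d_i)\right),
    \qquad s = 1,\dots,S,
\end{align*}
where $x_i^{C}$ holds the covariates of trial $i$ at the snapshot when it closes enrollment and $x_i^{T}$ holds the covariates under the counterfactual where enrollment had not yet closed.
The collection of these differences over all draws and all selected trials approximates the posterior distribution of differences, and its average is a Monte Carlo estimate of the expected difference across the selected trials.
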
\end{document}