\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}
%% Describe goal

The model I use is a 
hierarchical logistic regression model in which the 
hierarchies are based on disease categories.
%%NOTATION
% change notation
% i indexes trials for y and d 
% n indexes snapshots within the trial

First, some notation:
\begin{itemize}
    \item $i$: indexes trials
    \item $n$: indexes trial snapshots.
    \item $y_i$: whether each trial 
        terminated (true, 1) or completed (false, 0).
    \item $d_i$: indexes the ICD-10 disease category of the trial.
    \item $x_{i,n}$: represents the independent 
        variables associated with the snapshot.
\end{itemize} 

The goal is to take each snapshot and predict 
whether the corresponding trial terminates.
The specification of the model used to measure 
the direct effect of enrollment is:
\begin{align}
    y_i \sim \text{Bernoulli}(p_{i,n}) \\
        p_{i,n} = \text{logit}^{-1}(x_{i,n} \vec \beta(d_i))
\end{align}
where $\vec \beta$ is indexed by 
$d \in \{1,2,\dots,21,22\}$, 
one value for each general ICD-10 category.
Each coefficient $\beta_k(d_i)$, where $k$ indexes the independent variables, is distributed
\begin{align}
    \beta_k(d_i) \sim \text{Normal}(\mu_k,\sigma_k)
\end{align}
with hyperpriors
%Checked on 2024-11-27. Is corrrect. \todo{Double check that these are the priors I used.}
\begin{align}
    \mu_k \sim \text{Normal}(0,0.05) \\
    \sigma_k \sim \text{Gamma}(4,20)
\end{align}
\todo{Double check actual spec}


The independent variables include: 
\todo{Make sure data is described before this point.}
\begin{subequations}
\begin{align}
    x_{i,n}\beta(d_i) 
        = & \bx{1}{\text{Elapsed Duration}} \\
        &+ \bx{2}{\arcsinh \left(\text{\# Generic compounds}\right)} \\
        &+ \bx{3}{\arcsinh \left(\text{\# Branded compounds}\right)} \\ 
        &+ \bx{4}{\text{\# DALYs in High SDI Countries}} \\
        &+ \bx{5}{\text{\# DALYs in High-Medium SDI Countries}} \\
        &+ \bx{6}{\text{\# DALYs in Medium SDI Countries}} \\
        &+ \bx{7}{\text{\# DALYs in Low-Medium SDI Countries}} \\
        &+ \bx{8}{\text{\# DALYs in Low SDI Countries}} \\
        &+ \bxi{9}{\text{Not yet Recruiting}}{\text{Trial Status}}\\
        &+ \bxi{10}{\text{Recruiting}}{\text{Trial Status}}\\
        &+ \bxi{11}{\text{Enrolling by Invitation Only}}{\text{Trial Status}}\\
        &+ \bxi{12}{\text{Active, not recruiting}}{\text{Trial Status}}
\end{align}
\end{subequations}
The arcsinh transform is used because it behaves like a log transform 
for large values but is defined and differentiable at counts of zero, since 
$\text{arcsinh}(0) = \ln (0 + \sqrt{0^2 + 1}) =0$.
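As a quick check of the similarity to the log transform 
(an illustrative calculation, not taken from the data), 
note that for large $x$ the square root is approximately $x$, so
\begin{align}
    \text{arcsinh}(x) = \ln \left(x + \sqrt{x^2 + 1}\right) 
        \approx \ln(2x) = \ln(x) + \ln(2).
\end{align}
For example, $\text{arcsinh}(10) \approx 2.998$ while $\ln(20) \approx 2.996$,
so the two transforms differ mainly by the constant $\ln(2)$ 
once counts are moderately large.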
Note that because this is a hierarchical model, each ICD-10 disease category 
gets its own set of parameters, which is why the $\beta$s are parameterized
by $d_i$.
%%%% Not sure if space should go here. I think these work well together.
Other variables are implicitly controlled for as they are used 
to select the trials of interest.
These include:
        \todo{double check these in the code.}
\begin{itemize}
    \item The trial is Phase 3.
    \item The trial has a Data Monitoring Committee.
    \item The compounds are FDA-regulated drugs.
    \item The trial was never suspended\footnote{
        This was because I wasn't sure how to handle it in the model
        when I started scraping the data. 
        Later the website changed.
        This is technically post selection. 
        \todo{double check where this happened in the code. 
        I may have only done it in the CBO analysis.}
    }
\end{itemize}

\subsection{Interpretation}
% Explain 
% - What do we care about? Changes in the probability of 
% - distribution of differences -> relate to E(\delta Y)
% - How do we obtain this distribution of differences?
%   - from the model, we pay attention to P under treatment and control
%   - We obtain this by fitting the model, then simulating under treatment and control, and taking the difference in the probability.
%   - 

The specific measure of interest is how much a delay in 
closing enrollment changes the probability of terminating a trial
$p_{i,n}$ in the model.

In standard reduced-form causal inference, the treatment effect
of interest for outcome $Z$ is measured as 
\begin{align}
    E(Z(\text{Treatment}) - Z(\text{Control})) 
    = E(Z(\text{Treatment})) - E(Z(\text{Control}))
\end{align}
Because $Z(\text{Treatment})$ and $Z(\text{Control})$ are random variables,
their difference $\delta_Z = Z(\text{Treatment}) - Z(\text{Control})$ is also a random variable. 
In the Bayesian framework, this parameter has a distribution, and so 
we can calculate the distribution of differences in 
the probability of termination due to a given delay in 
closing recruitment,
$\delta_{p_{i,n}} = p_{i,n}(T) - p_{i,n}(C)$.

I calculate the posterior distribution of $\delta_{p_{i,n}}$ by estimating the 
posterior distributions of the $\beta$s and then simulating $\delta_{p_{i,n}}$.
This involves taking a draw from the posterior distribution of the $\beta$s, calculating
$p_{i,n}(C)$ 
for the underlying trials at the snapshot when they close enrollment
and then calculating 
$p_{i,n}(T)$ 
under the counterfactual where enrollment had not yet closed.
The difference 
$\delta_{p_{i,n}}$ 
is then calculated for each trial and saved. 
After repeating this for all the posterior samples, we have an estimate 
of the posterior distribution of differences between treatment and control.
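
As a sketch of this procedure in symbols, let $\vec \beta^{(s)}$ for 
$s = 1,\dots,S$ denote the posterior draws, and let $x_{i,n}(C)$ and 
$x_{i,n}(T)$ denote the independent variables at the snapshot where enrollment 
closes and under the counterfactual where it has not yet closed.
The superscript $(s)$, the number of draws $S$, and the labels on $x_{i,n}$ 
are notation assumed here for illustration.
Each draw then yields
\begin{align}
    \delta^{(s)}_{p_{i,n}} 
        = \text{logit}^{-1}\left(x_{i,n}(T) \vec \beta^{(s)}(d_i)\right)
        - \text{logit}^{-1}\left(x_{i,n}(C) \vec \beta^{(s)}(d_i)\right)
\end{align}
where $\text{logit}^{-1}(z) = 1/(1+e^{-z})$ is the logistic function 
that maps the linear predictor to a probability.
The collection $\{\delta^{(s)}_{p_{i,n}}\}_{s=1}^{S}$ approximates the posterior 
distribution of $\delta_{p_{i,n}}$, and its average 
$\frac{1}{S}\sum_{s=1}^{S} \delta^{(s)}_{p_{i,n}}$
approximates the corresponding posterior mean difference.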


\end{document}