\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}
\begin{document}

%% Describe goal
The model I use is a hierarchical logistic regression model in which the hierarchy is defined by disease category.

%%NOTATION
% change notation
% i indexes trials for y and d
% n indexes snapshots within the trial
First, some notation:
\begin{itemize}
    \item $i$: indexes trials.
    \item $n$: indexes trial snapshots.
    \item $y_i$: whether trial $i$ terminated (true, 1) or completed (false, 0).
    \item $d_i$: indexes the ICD-10 disease category of trial $i$.
    \item $x_{i,n}$: the independent variables associated with snapshot $n$ of trial $i$.
\end{itemize}

The goal is to take each snapshot and predict whether the trial it belongs to will terminate.

The specification of the model used to measure the direct effect of enrollment is:
\begin{align}
    y_i \sim \text{Bernoulli}(p_{i,n}) \\
    p_{i,n} = \text{logit}^{-1}(x_{i,n} \vec \beta(d_i))
\end{align}
where $\vec \beta$ is indexed by $d \in \{1,2,\dots,21,22\}$, one value for each general ICD-10 category.
The coefficients are distributed
\begin{align}
    \beta_k(d) \sim \text{Normal}(\mu_k,\sigma_k)
\end{align}
for each coefficient $k$, with hyperpriors
%Checked on 2024-11-27. Is correct.
\todo{Double check that these are the priors I used.}
\begin{align}
    \mu_k \sim \text{Normal}(0,0.05) \\
    \sigma_k \sim \text{Gamma}(4,20)
\end{align}
\todo{Double check actual spec}

The independent variables include:
\todo{Make sure data is described before this point.}
\begin{subequations}
\begin{align}
    x_{i,n}\beta(d_i) = & \bx{1}{\text{Elapsed Duration}} \\
    &+ \bx{2}{\arcsinh \left(\text{\# Generic compounds}\right)} \\
    &+ \bx{3}{\arcsinh \left(\text{\# Branded compounds}\right)} \\
    &+ \bx{4}{\text{\# DALYs in High SDI Countries}} \\
    &+ \bx{5}{\text{\# DALYs in High-Medium SDI Countries}} \\
    &+ \bx{6}{\text{\# DALYs in Medium SDI Countries}} \\
    &+ \bx{7}{\text{\# DALYs in Low-Medium SDI Countries}} \\
    &+ \bx{8}{\text{\# DALYs in Low SDI Countries}} \\
    &+ \bxi{9}{\text{Not yet Recruiting}}{\text{Trial Status}}\\
    &+ \bxi{10}{\text{Recruiting}}{\text{Trial Status}}\\
    &+ \bxi{11}{\text{Enrolling by Invitation Only}}{\text{Trial Status}}\\
    &+ \bxi{12}{\text{Active, not recruiting}}{\text{Trial Status}}
\end{align}
\end{subequations}

The arcsinh transform is used because it is similar to a log transform but differentiably handles counts of zero, since
$\text{arcsinh}(0) = \ln (0 + \sqrt{0^2 + 1}) = 0$.
Note that because this is a hierarchical model, each ICD-10 disease category gets its own set of parameters, which is why the $\beta$s are parameterized by $d_i$.

%%%% Not sure if space should go here. I think these work well together.
Other variables are implicitly controlled for because they are used to select the trials of interest.
These include:
\todo{double check these in the code.}
\begin{itemize}
    \item The trial is Phase 3.
    \item The trial has a Data Monitoring Committee.
    \item The compounds are FDA-regulated drugs.
    \item The trial was never suspended\footnote{
        I excluded suspended trials because I was not sure how to handle suspension in the model when I started scraping the data.
        Later the website changed.
        This is technically post-selection.
        \todo{double check where this happened in the code.
        I may have only done it in the CBO analysis.}
    }
\end{itemize}

\subsection{Interpretation}

% Explain
% - What do we care about? Changes in the probability of
% - distribution of differences -> relate to E(\delta Y)
% - How do we obtain this distribution of differences?
% - from the model, we pay attention to P under treatment and control
% - We obtain this by fitting the model, then simulating under treatment and control, and taking the difference in the probability.
% -
The specific measure of interest is how much a delay in closing enrollment changes the probability of terminating a trial, $p_{i,n}$ in the model.

In standard reduced-form causal inference, the treatment effect of interest for an outcome $Z$ is measured as
\begin{align}
    E(Z(\text{Treatment}) - Z(\text{Control})) = E(Z(\text{Treatment})) - E(Z(\text{Control}))
\end{align}
Because $Z(\text{Treatment})$ and $Z(\text{Control})$ are random variables, their difference, $\delta_Z = Z(\text{Treatment}) - Z(\text{Control})$, is also a random variable.
In the Bayesian framework, this quantity has a distribution, so we can calculate the distribution of differences in the probability of termination due to a given delay in closing recruitment, $\delta_{p_{i,n}} = p_{i,n}(T) - p_{i,n}(C)$.

I calculate the posterior distribution of $\delta_{p_{i,n}}$ by estimating the posterior distributions of the $\beta$s and then simulating $\delta_{p_{i,n}}$.
This involves taking a draw from the posterior distribution of the $\beta$s, calculating $p_{i,n}(C)$ for the underlying trials at the snapshot when they close enrollment, and then calculating $p_{i,n}(T)$ under the counterfactual where enrollment had not yet closed.
The difference $\delta_{p_{i,n}}$ is then calculated for each trial and saved.
After repeating this for all the posterior samples, we have an estimate of the posterior distribution of differences between treatment and control.
\end{document}