\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}
\begin{document}
%% Describe goal
The model I use is a
hierarchical logistic regression model in which the
hierarchies correspond to the 22 top-level ICD-10 disease categories.
The goal is to take each snapshot and predict the probability of termination.
First, some notation:
\begin{itemize}
\item $i$: indexes trials.
\item $n$: indexes trial snapshots.
\item $y_i$: whether trial $i$
terminated (true, 1) or completed (false, 0).
\item $d_i$: indexes the ICD-10 disease category of trial $i$.
\item $x_{i,n}$: represents the independent
variables associated with snapshot $n$ of trial $i$.
\end{itemize}
The actual specification of the model to measure
the direct effect of enrollment is:
\begin{align}
y_i \sim \text{Bernoulli}(p_{i,n}) \\
p_{i,n} = \text{logit}^{-1}(x_{i,n} \vec \beta(d_i))
\end{align}
where $\vec \beta$ is indexed by
$d \in \{1,2,\dots,21,22\}$,
one value for each general ICD-10 category.
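For reference, the link that maps the linear predictor
$x_{i,n} \vec \beta(d_i)$ to the probability $p_{i,n}$ is the logistic function,
the inverse of the logit.
The symbol $z$ below is introduced only for this illustration:
\begin{align}
\text{logit}^{-1}(z) = \frac{1}{1 + e^{-z}},
\qquad
\text{logit}(p) = \ln \left( \frac{p}{1-p} \right)
\end{align}
For example, a linear predictor of $z = 0$ corresponds to
$p = 1/(1 + e^{0}) = 0.5$, and larger values of $z$ move $p$ toward 1.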
Each coefficient $\beta_k(d)$, with $k$ indexing the independent variables,
is distributed
\begin{align}
\beta_k(d) \sim \text{Normal}(\mu_k,\sigma_k)
\end{align}
With hyperpriors
%Checked on 2024-11-27. Is correct.
\begin{align}
\mu_k \sim \text{Normal}(0,0.05) \\
\sigma_k \sim \text{Gamma}(4,20)
\end{align}
\todo{Double check actual spec}
The independent variables include:
\begin{subequations}
\begin{align}
x_{i,n}\beta(d_i)
= & \bx{1}{\text{Elapsed Duration}} \\
&+ \bx{2}{\arcsinh \left(\text{\# Generic compounds}\right)} \\
&+ \bx{3}{\arcsinh \left(\text{\# Branded compounds}\right)} \\
&+ \bx{4}{\text{\# DALYs in High SDI Countries}} \\
&+ \bx{5}{\text{\# DALYs in High-Medium SDI Countries}} \\
&+ \bx{6}{\text{\# DALYs in Medium SDI Countries}} \\
&+ \bx{7}{\text{\# DALYs in Low-Medium SDI Countries}} \\
&+ \bx{8}{\text{\# DALYs in Low SDI Countries}} \\
&+ \bxi{9}{\text{Not yet Recruiting}}{\text{Trial Status}}\\
&+ \bxi{10}{\text{Recruiting}}{\text{Trial Status}}\\
&+ \bxi{11}{\text{Enrolling by Invitation Only}}{\text{Trial Status}}\\
&+ \bxi{12}{\text{Active, not recruiting}}{\text{Trial Status}}
\end{align}
\end{subequations}
The arcsinh transform is used because it behaves like a log transform
for large values but is defined and differentiable at counts of zero, where
$\text{arcsinh}(0) = \ln (0 + \sqrt{0^2 + 1}) =0$.
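To make the comparison with the log transform concrete, the following
worked values may help; the generic argument $x$ is introduced only for
this illustration.
\begin{align}
\text{arcsinh}(x) &= \ln \left( x + \sqrt{x^2 + 1} \right),
\qquad
\frac{d}{dx}\text{arcsinh}(x) = \frac{1}{\sqrt{x^2 + 1}} \\
\text{arcsinh}(1) &= \ln (1 + \sqrt{2}) \approx 0.881 \\
\text{arcsinh}(10) &= \ln (10 + \sqrt{101}) \approx 2.998,
\qquad
\ln (2 \cdot 10) \approx 2.996
\end{align}
The derivative equals 1 at $x = 0$, so the transform is approximately linear
for small counts, while for large counts
$\text{arcsinh}(x) \approx \ln (2x)$.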
%%%% Not sure if space should go here. I think these work well together.
Some other variables are implicitly controlled for because they are used
to select the trials of interest.
The selection criteria are:
\begin{itemize}
\item The trial is Phase 3.
\item The trial has a Data Monitoring Committee.
\item The compounds are FDA regulated drugs.
\item The trial was never suspended.\footnote{
I excluded suspended trials because I was unsure how to handle
suspensions in the model when I started scraping the data.
Later the website changed.
This is technically post-selection.
\todo{Double check where this happened in the code.
I may have only done it in the CBO analysis.}
}
\end{itemize}
\subsection{Interpretation}
% Explain
% - What do we care about? Changes in the probability of
% - distribution of differences -> relate to E(\delta Y)
% - How do we obtain this distribution of differences?
% - from the model, we pay attention to P under treatment and control
% - We obtain this by fitting the model, then simulating under treatment and control, and taking the difference in the probability.
% -
The specific measure of interest is how much a delay in
closing enrollment changes the probability of terminating a trial
$p_{i,n}$ in the model.
In standard reduced-form causal inference, the treatment effect
of interest for outcome $Z$ is measured as
\begin{align}
E(Z(\text{Treatment}) - Z(\text{Control}))
= E(Z(\text{Treatment})) - E(Z(\text{Control}))
\end{align}
Because $Z(\text{Treatment})$ and $Z(\text{Control})$ are random variables,
their difference $\delta_Z = Z(\text{Treatment}) - Z(\text{Control})$ is also a random variable.
In the Bayesian framework, this quantity has a distribution, and so
we can calculate the distribution of differences in
the probability of termination due to a given delay in
closing recruitment,
$\delta_{p_{i,n}} = p_{i,n}(T) - p_{i,n}(C)$.
I calculate the posterior distribution of $\delta_{p_{i,n}}$ by estimating the
posterior distributions of the $\beta$s and then simulating $\delta_{p_{i,n}}$.
This involves taking a draw from the posterior distribution of the $\beta$s,
calculating
$p_{i,n}(C)$
for the underlying trials at the snapshot when they close enrollment,
and then calculating
$p_{i,n}(T)$
under the counterfactual where enrollment had not yet closed.
The difference
$\delta_{p_{i,n}}$
is then calculated for each trial and saved.
After repeating this for all the posterior samples and
all trials at their point of close, we have an estimate
of the posterior distribution of differences between treatment and control
for the selected trials.
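As a sketch, the simulation can be written out as follows.
The superscript $(s)$ for posterior draws, the number of draws $S$, and the
covariate vectors $x_{i,n}(T)$ and $x_{i,n}(C)$ are notation introduced
here for illustration only.
For each posterior draw $s = 1, \dots, S$ and each trial $i$ at the snapshot
$n$ where it closes enrollment,
\begin{align}
\delta^{(s)}_{p_{i,n}}
= \text{logit}^{-1} \left( x_{i,n}(T) \vec \beta^{(s)}(d_i) \right)
- \text{logit}^{-1} \left( x_{i,n}(C) \vec \beta^{(s)}(d_i) \right)
\end{align}
where $x_{i,n}(C)$ holds the covariates observed at the close of enrollment
and $x_{i,n}(T)$ holds the same covariates under the counterfactual in which
enrollment had not yet closed.
The collection of draws
$\{\delta^{(s)}_{p_{i,n}}\}_{s=1}^{S}$
approximates the posterior distribution of $\delta_{p_{i,n}}$, and any
summary can be computed from it; for example, the posterior mean is
approximated by
\begin{align}
E(\delta_{p_{i,n}}) \approx \frac{1}{S} \sum_{s=1}^{S} \delta^{(s)}_{p_{i,n}}
\end{align}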
\end{document}