
\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}

% Begin by talking about goal, what does it mean? This might need some work prior to give more background.
My goal is to separate a strategic concern
(the effect of a marginal treatment methodology)
from an operational concern
(the effect of a delay in closing enrollment).
Doing so requires understanding what confounds these effects and how they might be measured.

The primary effects one might expect to see are that
\begin{enumerate}
    \item Adding more drugs to the market will make it harder to
        finish a trial, as the trial is
        more likely to be terminated due to concerns about profitability.
    \item Adding more drugs will make it harder to recruit, slowing enrollment.
    \item Enrollment challenges increase the likelihood that a trial will
        terminate.
    % Mentioned below
    % \item A large population/market will tends to have more drugs to treat it
    %     because it is more profitable.
    % \item A large population/market will make it easier to recruit,
    %     reducing the likelihood of a termination due to enrollment failure.
\end{enumerate}

There are a few fundamental issues that arise when trying to estimate
these effects.
The first is that the severity of the disease and the size of the population
with that disease affect the ease of enrolling participants.
For example, a large population may make it easier to find enough participants
to achieve the required statistical discrimination between
control and treatment.
Second, for some diseases there exists an endogenous dynamic
between the treatments available for a disease and the
market size/population with that disease.
\authorcite{cerda_EndogenousInnovations_2007} proposes two mechanisms
that link the drugs on the market and market size.
The first is that a larger market is more profitable and therefore attracts more drugs.
The second runs in the opposite direction: for many chronic diseases with high mortality rates,
more drugs improve survival, increasing the size of those markets.
The third major confound is that the drugs on the market affect enrollment.
If there is a treatment already on the market, patients or their doctors
may be less inclined to participate in the trial, even if the current treatment
has severe downsides.

There are additional problems.
One is that the disease being treated affects the
safety and efficacy standards that the drug will be held to.
For example, if a particular cancer is very deadly and does not respond well
to current treatments, Phase I trials will enroll patients with that cancer,
as opposed to the standard practice of enrolling healthy volunteers
to establish safe dosages \cite{commissioner_DrugDevelopment_2020}.
The trial is more likely to be terminated early if the drug is unsafe or has no
discernible effect; termination therefore depends in part on a compound-disease
interaction.
Another challenge comes from the interaction between duration and termination.
If a trial terminates before closing enrollment for reasons other
than enrollment, then enrollment will still be low.
Conversely, if enrollment is low, the trial might terminate because of it.
These two cases are indistinguishable in the final
\url{ClinicalTrials.gov} dataset.

Finally, the safety and efficacy of a drug during a trial are driven by
the fundamental pharmacokinetic properties of the compound.
These are only imperfectly measured both prior to and during any given trial.
Previously measured safety and efficacy inform the decision to start the trial
in the first place, while currently observed safety and efficacy results
help the sponsor judge whether or not to continue the trial.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Data Summary}
%% Describe data here
Since September 27, 2007, those who conduct clinical trials of FDA-controlled
drugs or devices on human subjects have been required to register
their trial at \url{ClinicalTrials.gov}
(\cite{noauthor_fdaaa_nodate}).
This involves submitting information on the expected enrollment and duration of
the trial, the drugs or devices that will be used, treatment protocols and study arms,
as well as contact information for the trial sponsor and treatment sites.

When starting a new trial, the required information must be submitted
``\dots not later than 21 calendar days after enrolling the first human subject\dots''.
After the initial submission, the data is briefly reviewed for quality,
the trial record is published, and the trial is assigned a
National Clinical Trial (NCT) identifier
\cite{noauthor_fdaaa_nodate}.

Each trial's record is updated periodically, including a final update that must occur
within a year of completing the primary objective, although exceptions are
available for trials related to drug approvals or for trials with secondary
objectives that require further observation\footnote{This rule came into effect in 2017.}
\cite{noauthor_fdaaa_nodate}.
Other than the requirements for the first and last submissions, all other
updates occur at the discretion of the trial sponsor.
Because the ClinicalTrials.gov website serves as a central point of information
on which trials are active or recruiting for a given condition or drug,
most trials are updated multiple times during their progression.

There are two primary ways to access data about clinical trials.
The first is to search individual trials on ClinicalTrials.gov with a web browser.
This web portal shows the current information about the trial and provides
access to snapshots of previously submitted information.
Together, these features fulfill most of the needs of those seeking
to join a clinical trial.
For this project I scraped these historical records to build
a series of snapshots for each trial.
%include screenshots?
The second way to access the data is through a normalized database set up by
the
\href{https://aact.ctti-clinicaltrials.org/}{Clinical Trials Transformation Initiative}
called AACT. %TODO: Get CITATION
The AACT database is available as a PostgreSQL database dump or as a set of
flat files.
These dumps match a near-current version of the ClinicalTrials.gov database.
This format is amenable to large-scale analysis, but does not contain
information about the past state of trials.
I combined these two sources, using the AACT dataset to select
trials of interest and then scraping \url{ClinicalTrials.gov} to get
a timeline of each trial.

%%%%%%%%%%%%%%%%%%%%%%%% Model Outline

I use this data to predict the final status of each trial
from its snapshots, in effect asking:
``how does the probability of a termination change from the current state
of the trial if X changes?''
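
One way to write this question down, as a sketch rather than a final specification,
is as a difference in conditional probabilities.
The symbols are notation introduced here only for illustration:
$T_i = 1$ indicates that trial $i$ ends as \textit{terminated},
$X_{it}$ is the variable of interest as recorded in the snapshot of trial $i$ at time $t$,
and $W_{it}$ collects everything else recorded in that snapshot.
\[
    \Delta_{it}(x, x')
    = \Pr\left(T_i = 1 \mid X_{it} = x', W_{it}\right)
    - \Pr\left(T_i = 1 \mid X_{it} = x, W_{it}\right)
\]
Whether this difference can be given a causal reading depends on the
identification argument, not on the notation itself.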

%% Return to causal identification
\subsection{Causal Identification}

Because running experiments on companies that conduct clinical trials is
infeasible, causal identification depends on using a
structural causal model.
Because the data generating process for the clinical trial records is rather
straightforward, this is an ideal place to use the do-calculus of
\authorcite{pearl_causality_2000}.
This process involves describing the data generating process in the form of
a directed acyclic graph (DAG), where the nodes represent the variables
within the causal model and the directed edges (arrows) represent
assumptions about which variables influence which others.
A few algorithms then tell the researcher which of the
relationships will be confounded and which can be statistically estimated,
and provide some hypotheses that can be tested to ensure the model is
reasonably correct.
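
For reference, the best known of these results is the backdoor adjustment.
The symbols $X$, $Y$, and $Z$ below are generic placeholders for a treatment,
an outcome, and a set of observed covariates; they are not variables from my dataset.
If $Z$ blocks every backdoor path from $X$ to $Y$ and contains no descendant of $X$,
then the effect of intervening on $X$ can be computed from observational quantities:
\[
    P\left(Y = y \mid \mathrm{do}(X = x)\right)
    = \sum_{z} P\left(Y = y \mid X = x, Z = z\right) P\left(Z = z\right)
\]
The practical task is therefore to find a set $Z$ of observed variables that
satisfies these two conditions in the assumed graph.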


In \cref{Fig:CausalModel} I diagram the directed acyclic graph that describes
my proposed data generating process.
It revolves around the decisions made by the study sponsor,
who must decide whether to let a trial run to completion
or terminate it early.
While receiving updates regarding the status of the trial, the sponsor asks questions
such as:
\begin{itemize}
    \item Do I need to terminate the trial due to safety incidents?
    \item Does it appear that the drug is effective enough to achieve our
        goals, justifying continuing the trial?
    \item Are we recruiting enough participants to achieve the statistical
        results we need within the budget we have?
    \item Do the current market conditions and expectations about returns on
        investment justify the expenditures we are making?
\end{itemize}
When any of these concerns becomes serious enough, the study sponsor terminates the trial; otherwise
it continues to completion.

\begin{figure}[H] %use [H] to fix the figure here.
    \frame{
    \scalebox{0.65}{
             \tikzfig{../assets/tikzit/CausalGraph2}
    }
    }
    \todo{check if this is the correct graph}
    \caption{Graphical Causal Model}

    % \small{Crimson boxes are the variables of interest,
    % white boxes are unobserved, while the gray boxes will be controlled for.}
    \label{Fig:CausalModel}
\end{figure}


% Constructing the model more explicitly
% - quickly describe each node and line.
\todo{I think I need to blend the data section in before this, to give some overall information on data.}
\todo{I may need to add some information on snapshots so that this makes sense.}

A quick summary of the nodes of the DAG, the exact representation in the data, and their impact:
\begin{itemize}
    \item Main Interests (Crimson Boxes)
        \begin{enumerate}
            \item \texttt{Will Terminate?}:
                Whether the final status of the trial was \textit{terminated}
                or \textit{completed}.
                This comes from the AACT dataset.
            \item \texttt{Enrollment Status}:
                    This describes the current enrollment status of the snapshot, e.g.
                    \texttt{Recruiting},
                    \texttt{Enrolling by invitation only},
                    or
                    \texttt{Active, not recruiting}.
            \item \texttt{Market Measures}:
                Various measures of the number of alternative drugs on the market.
                These are either the number of other drugs with the same active ingredient as the trial drug
                (both generics and originators)
                or the number of drugs considered alternatives in various formularies published by the United States Pharmacopeia.
        \end{enumerate}
    \item Observed Confounders (Gray Boxes)
        \begin{enumerate}
            \item \texttt{Condition}:
                The underlying condition, classified by ICD-10 group.
                This impacts every other aspect of the model and is pulled from
                the AACT dataset.
            \item \texttt{Population (market size)}:
                Multiple measures of the impact of the disease.
                These are measured by the disability-adjusted life year (DALY) cost of the disease, and are
                separated by the impact on countries with
                High, High-Medium, Medium, Medium-Low, and Low
                development scores.
                This data comes from the Institute for Health Metrics and Evaluation's Global Burden of Disease study.
            \item \texttt{Elapsed Duration}:
                A normalized measure of the time elapsed in the trial.
                It is computed from the original estimate of the trial's primary completion date and the registered start date:
                I take the difference in days between these and compute the percentage of that time that has elapsed.
                This calculation is based on data from the snapshots and the
                AACT final results.
            \item \texttt{Decision to Proceed with Phase III}:
                Whether development of the compound has progressed to Phase III.
                This is included in the analysis by only including
                Phase III trials registered in the AACT dataset.
        \end{enumerate}
    \item Unobserved Confounders (White Boxes)
        \begin{enumerate}
            \item \texttt{Fundamental Efficacy and Safety}:
                The underlying efficacy and safety of the compound.
                These cannot be observed directly, only estimated through scientific study.
            \item \texttt{Previously observed Efficacy and Safety}:
                The information gathered in previous studies.
                This is not available in my dataset because I don't
                have links to prior studies.
            \item \texttt{Currently observed Efficacy and Safety}:
                The information gathered during this study.
                This is only partially available, and so is
                treated as unavailable.
                After a study is over, the investigators
                often publish information about adverse events, but only
                those that meet a certain threshold.
                As this information doesn't appear to be provided to
                participants, I do not consider it.
        \end{enumerate}
\end{itemize}
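
The \texttt{Elapsed Duration} calculation described above can be written out as follows.
This is a sketch under stated assumptions: the symbols are introduced here for illustration,
and I assume the snapshot date is the reference point for ``time elapsed''.
Let $t$ be the date of the snapshot,
$t^{\mathrm{start}}_i$ the registered start date of trial $i$,
and $\hat{t}^{\mathrm{pc}}_i$ the original estimate of its primary completion date,
with all differences measured in days.
\[
    \mathit{Elapsed}_{it}
    = \frac{t - t^{\mathrm{start}}_i}{\hat{t}^{\mathrm{pc}}_i - t^{\mathrm{start}}_i}
\]
For example, a trial originally planned to run 400 days from start to primary completion,
observed in a snapshot taken 100 days after its start,
has $\mathit{Elapsed}_{it} = 100 / 400 = 0.25$, or 25\%.
Under this definition a trial that runs past its original estimate takes values above one.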

%

\begin{itemize}
    \item Relationships of interest
        \begin{enumerate}
            \item \texttt{Enrollment Status} $\rightarrow$ \texttt{Will Terminate?}:
                This is the primary effect of interest.
            \item \texttt{Market Measures} $\rightarrow$ \texttt{Will Terminate?}:
                This is the secondary effect of interest.
        \end{enumerate}
    \item Confounding Pathways
        \begin{enumerate}
            \item
                \texttt{Condition}:
                Affects every other node.
                Part of the Adjustment Set.
            \item Backdoor Pathway
                between \texttt{Will Terminate?} and
                \texttt{Enrollment Status} through safety and efficacy.
                The concern is that since previously learned information
                and current information are driven by the same underlying
                physical reality, the enrollment process and
                termination decisions may be correlated.
                Controlling for the decision to proceed with the trial is the
                best adjustment available to block this confounding pathway.
                Below I describe the exact pathways.
                \begin{enumerate}
                    \item
                        \texttt{Fundamental Efficacy and Safety}
                        $\rightarrow$
                        \texttt{Currently Observed Efficacy and Safety}:
                        This relationship represents the measurements of
                        safety and efficacy in the current trial.
                    \item
                        \texttt{Currently Observed Efficacy and Safety}
                        $\rightarrow$
                        \texttt{Will Terminate?}:
                        This is how the measurements of safety and efficacy in the
                        current trial affect the probability of termination.
                        % typically, evidence of a lack safety or efficacy is
                        % enought to terminate the trial.
                    \item \texttt{Fundamental Efficacy and Safety}
                        $\rightarrow$
                        \texttt{Previously Observed Efficacy and Safety}:
                        This relationship represents the measurements of
                        safety and efficacy in work prior to the current trial.
                    \item
                        \texttt{Previously Observed Efficacy and Safety}
                        $\rightarrow$
                        \texttt{Decision to proceed with Phase III}:
                        Previously observed data is essential to the FDA's
                        decision to allow a Phase III trial.
                \end{enumerate}
            \item
                Backdoor Pathway from \texttt{Market Measures}
                to \texttt{Enrollment Status}
                through \texttt{Population}.
                The concern with this pathway is that the rate of enrollment, and
                thus the enrollment status, is affected by the population with
                the disease.
                Additionally, there is a concern that the number of competitors
                is driven by the total market size.
                Adding \texttt{Population} to the adjustment set is therefore necessary.
                \begin{enumerate}
                    \item
                        \texttt{Population}
                        $\rightarrow$
                        \texttt{Enrollment Status}:
                        This is fairly straightforward.
                        How easy it is to enroll participants depends in part
                        on how many people have the disease.
                    \item
                        \texttt{Population}
                        $\rightarrow$
                        \texttt{Market Measures}:
                        This assumes that the population effect flows in only one
                        direction, i.e., that a large population size increases
                        the likelihood of a large number of drugs.
                        %TODO: Think about this one a bit because it does mess
                        % with identification, particularly of market effects.
                        % these two are jointly determined per cerda 2007.
                        % If I can't justify separating them, then I'll need to
                        % merge population (market size) and market measures (drugs on market).
                \end{enumerate}
            \item
                \texttt{Market Measures}
                $\rightarrow$
                \texttt{Enrollment Status}:
                This confounds the estimation of the effect of
                \texttt{Enrollment Status} on \texttt{Will Terminate?}, and
                so \texttt{Market Measures} is part of the adjustment set.
            \item
                \texttt{Market Measures}
                $\rightarrow$
                \texttt{Decision to proceed with Phase III}:
                The alternative treatments on the market will affect a sponsor's
                decision to move forward with a Phase III trial.
                This is controlled for by only working with trials that
                successfully begin recruitment for a Phase III trial.
            \item
                \texttt{Elapsed Duration}
                $\rightarrow$
                \texttt{Will Terminate?}:
                The amount of time that has passed helps drive the decision to continue
                or terminate.
            \item
                \texttt{Enrollment Status}
                $\leftrightarrow$
                \texttt{Elapsed Duration}:
                % This is jointly determined. and the weakest part of the causal identification without an accurate model of enrollment.
                This is one of the weakest parts of the causal inference.
                Without a well-defined model of enrollment, I cannot separate
                the interaction between the enrollment status and the elapsed
                duration.
                For example, if enrollment is running slower than expected,
                the trial may be terminated due to concerns that it will not
                achieve the primary objectives or that costs will exceed
                the budget allocated to the project.
            \item
                \texttt{Decision to Proceed with Phase III}
                $\rightarrow$
                \texttt{Will Terminate?}:
                %obviously required. Maybe remove from listing and graph?
                This effect is fairly straightforward, in that
                there is no possibility of a termination or completion
                if the trial does not start.
                It is included to block a backdoor pathway between
                \texttt{Will Terminate?} and \texttt{Enrollment Status}
                through \texttt{Previously observed Efficacy and Safety}.
        \end{enumerate}
\end{itemize}
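
Read together, the pathways above suggest an adjustment formula for the primary effect.
The following is only a sketch.
It assumes that the set $Z$, consisting of \texttt{Condition}, \texttt{Population},
\texttt{Market Measures}, and \texttt{Elapsed Duration},
combined with restricting the sample to Phase III trials,
satisfies the backdoor criterion in \cref{Fig:CausalModel}.
The unobserved efficacy and safety nodes and the joint determination of
enrollment status and elapsed duration may well violate that assumption.
The symbols $T$ (\texttt{Will Terminate?}), $E$ (\texttt{Enrollment Status}),
and $D$ (\texttt{Decision to Proceed with Phase III}) are shorthand introduced here.
\[
    P\left(T = 1 \mid \mathrm{do}(E = e), D = 1\right)
    = \sum_{z} P\left(T = 1 \mid E = e, Z = z, D = 1\right)
      P\left(Z = z \mid D = 1\right)
\]
Because the sample contains only trials with $D = 1$, any effect estimated this way
applies to trials that reached Phase III, not to compounds in general.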
\end{document}