From 0bea917eb34d118f44ed218ca5c7ac58f662eba7 Mon Sep 17 00:00:00 2001
From: will king <youainti@protonmail.com>
Date: Mon, 25 Nov 2024 10:37:19 -0800
Subject: [PATCH] recording most recent updates

---
 Latex/Paper/outliin4.txt                |  18 ++++
 Latex/Paper/sections/10_CausalStory.tex | 104 ++++++++++++++++++++----
 2 files changed, 108 insertions(+), 14 deletions(-)
 create mode 100644 Latex/Paper/outliin4.txt

diff --git a/Latex/Paper/outliin4.txt b/Latex/Paper/outliin4.txt
new file mode 100644
index 0000000..959a441
--- /dev/null
+++ b/Latex/Paper/outliin4.txt
@@ -0,0 +1,18 @@
+NEXT STEPS IN WRITING
+
+- insert a description of the general approach I use: 
+    - predicting, based on snapshots, the likelihood of termination.
+    - this needs to go between the description of the snapshots and the 
+    causal inference introduction.
+- Then I can use what I've written about the graph, and follow up with more information about the data.
+
+Overall this would look like
+
+- [x] Introduction of the question and general issues of confoundedness.
+- [x] Clinical Trials Data Sources
+- [x] Explain basic econometric modelling approach
+- [ ] Then explain the graph, nodes, and confoundedness in more detail
+- [ ] Then go over the rest of the data.
+- [ ] Finally
+    - Discuss the number of datapoints.
+    - review major challenges to causal identification (no enrollment model, small data size).
diff --git a/Latex/Paper/sections/10_CausalStory.tex b/Latex/Paper/sections/10_CausalStory.tex
index 9ce9614..331f4f1 100644
--- a/Latex/Paper/sections/10_CausalStory.tex
+++ b/Latex/Paper/sections/10_CausalStory.tex
@@ -10,9 +10,10 @@ and an operational concern
 (the effect of a delay in closing enrollment), 
 we need to look at what confounds these effects and how we might measure them.
 
-The primary effects one expects to see are that
+The primary effects one might expect to see are that
 \begin{enumerate}
-    \item Adding more drugs will make it harder to finish a trial as it is
+    \item Adding more drugs to the market will make it harder to 
+        finish a trial as it is
         more likely to be terminated due to concerns about profitabilty.
     \item Adding more drugs will make it harder to recruit, slowing enrollment.
     \item Enrollment challenges increase the likelihood that a trial will 
@@ -45,11 +46,11 @@ has severe downsides.
 
 There are additional problems. 
 One is in that the disease being treated affects the 
-safety and efficacy profile that the drug will be held too. 
+safety and efficacy standards that the drug will be held to. 
 For example, if a particular cancer is very deadly and does not respond well
 to current treatments, Phase I trials will enroll patients with that cancer, 
 as opposed to the standard of enrolling healthy volunteers 
-\cite{commissioner_DrugDevelopment_2020}.
+\cite{commissioner_DrugDevelopment_2020} to establish safe dosages.
 The trial is more likely to be terminated early if the drug is unsafe or has no
 discernabile effect, therefore termination depends in part on a compound-disease 
 interaction.
@@ -67,9 +68,70 @@ Previously measured safety and efficacy inform the decision to start the trial
 in the first place while currently observed safety and efficiency results
 help the sponsor judge whether or not to continue the trial.
 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Clinical Trials Data Sources}
+%% Describe data here
+Since September 27, 2007, those who conduct clinical trials of FDA-controlled 
+drugs or devices on human subjects must register 
+their trial at \url{ClinicalTrials.gov}
+(\cite{noauthor_fdaaa_nodate}).
+This involves submitting information on the expected enrollment and duration of
+trials, drugs or devices that will be used, treatment protocols and study arms, 
+as well as contact information for the trial sponsor and treatment sites.
+
+When starting a new trial, the required information must be submitted 
+``\dots not later than 21 calendar days after enrolling the first human subject\dots''.
+After the initial submission, the data is briefly reviewed for quality and 
+then the trial record is published and the trial is assigned a 
+National Clinical Trial (NCT) identifier
+\cite{noauthor_fdaaa_nodate}.
+
+Each trial's record is updated periodically, including a final update that must occur 
+within a year of completing the primary objective, although exceptions are
+available for trials related to drug approvals or for trials with secondary
+objectives that require further observation\footnote{This rule came into effect in 2017.}
+\cite{noauthor_fdaaa_nodate}.
+Other than the requirements for the first and last submissions, all other
+updates occur at the discretion of the trial sponsor.
+Because the ClinicalTrials.gov website serves as a central point of information
+on which trials are active or recruiting for a given condition or drug,
+most trials are updated multiple times during their progression.
+
+There are two primary ways to access data about clinical trials.
+The first is to search individual trials on ClinicalTrials.gov with a web browser.
+This web portal shows the current information about the trial and provides 
+access to snapshots of previously submitted information.
+Together, these features fulfill most of the needs of those seeking 
+to join a clinical trial.
+For this project I've been able to scrape these historical records to establish
+snapshots of the records provided.
+%include screenshots?
+The second way to access the data is through a normalized database set up by
+the 
+\href{https://aact.ctti-clinicaltrials.org/}{Clinical Trials Transformation Initiative}
+called AACT. %TODO: Get CITATION
+The AACT database is available as a PostgreSQL database dump or set of 
+flat-files. 
+These dumps match a near-current version of the ClinicalTrials.gov database.
+This format is amenable to large-scale analysis, but does not contain 
+information about the past state of trials.
+I combined these two sources, using the AACT dataset to select 
+trials of interest and then scraping \url{ClinicalTrials.gov} to get 
+a timeline of each trial.
+
+%%%%%%%%%%%%%%%%%%%%%%%% Model Outline
+
+The way I use this data is to predict the final status of the trial 
+from the snapshots that were taken, in effect asking:
+``how does the probability of a termination change from the current state 
+of the trial if X changes?''
+
+%% Return to causal identification
+\subsection{Causal Identification}
+
 Because running experiments on companies running clinical trials is not going
-to happen anytime soon, causal identification depends on using an observational
-approach and a structural causal model.
+to happen anytime soon, causal identification depends on using a 
+structural causal model.
 Because the data generating process for the clinical trials records is rather 
 straightforward, this is an ideal place to use
 \authorcite{pearl_causality_2000}
@@ -84,7 +146,6 @@ and provides some hypotheses that can be tested to ensure the model is
 reasonably correct.
 
 
-
 In \cref{Fig:CausalModel} I diagram the directed acyclic graph that describes
 my proposed data generating process,  
 It revolves around the decisions made by the study sponsor, 
@@ -130,6 +191,7 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
         \begin{enumerate}
             \item \texttt{Will Terminate?}: 
                 If the final status of the trial was \textit{terminated} 
+                (as recorded in the AACT dataset)
                 or \textit{completed}.
             \item \texttt{Enrollment Status}: 
                     This describes the current enrollment status of the snapshot, e.g. 
@@ -147,19 +209,25 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
         \begin{enumerate}
             \item \texttt{Condition}: 
                 The underlying condition, classified by IDC-10 group. 
-                This impacts every other aspect of the model.
+                This impacts every other aspect of the model and is pulled from
+                the AACT dataset.
             \item \texttt{Population (market size)}: 
                 Multiple measures of the impact the disease.
-                These are measured by the DALY cost of the disease in countries that have a 
-                High, High-Medium, Medium, Medium-Low, and Low development scores.
+                These are measured by the DALY cost of the disease, and are 
+                separated by the impact on countries with
+                High, High-Medium, Medium, Medium-Low, and Low 
+                development scores.
                 This data comes from the Institute for Health Metrics' Global Burden of Disease study.
             \item \texttt{Elapsed Duration}: 
                 A normalized measure of the time elapsed in the trial. 
                 Comes from the original estimate of the trial's primary completion date and the registered start date. 
                 I take the difference in days between these, and get the percentage of that time that has elapsed.
+                This calculation is based on data from the snapshots and the 
+                AACT final results.
             \item \texttt{Decision to Proceed with Phase III}: 
                 If the compound development has progressed to Phase III.
-                This is included in the analysis by only including Phase III trials.
+                This is included in the analysis by only including 
+                Phase III trials registered in the AACT dataset.
         \end{enumerate}
     \item Unobserved Confounders (White Boxes)
         \begin{enumerate}
@@ -168,14 +236,22 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
                 Cannot be observed, only estimated through scientific study.
             \item \texttt{Previously observed Efficacy and Safety}: 
                 The information gathered in previous studies. 
-                This is not available in my dataset because I don't have links to prior studies.
+                This is not available in my dataset because I don't 
+                have links to prior studies.
             \item \texttt{Currently observed Efficiency and Safety}:
                 The information gathered during this study.
-                This is only partially available, and so is treated as unavailable. 
-                After a study is over, the investigators are supposed to publish information about adverse events.
+                This is only partially available, and so is 
+                treated as unavailable. 
+                After a study is over, the investigators 
+                often publish information about adverse events, but only
+                those that meet a certain threshold.
+                As this information doesn't appear to be provided to 
+                participants, we don't consider it.
         \end{enumerate}
 \end{itemize}
 
+%
+
 \begin{itemize}
     \item Relationships of interest
         \begin{enumerate}