Merge branch 'main'

1 year ago · 6f03d6ba08
parent ab98934dc6 9119a8f365
commit 6f03d6ba08
7 changed files with 4716 additions and 142 deletions
--- a/Paper/sections/02_data.tex
+++ b/Paper/sections/02_data.tex
@@ -131,11 +131,13 @@ drug compound may be packaged in multiple ways, e.g. boxes with different
 numbers of blister packs.
 These SPLs are made available for download so that they can be integrated 
 into patient health systems to improve patient safety 
-\cite{indexingsplfactsheet_}.
+\cite{usfda_splfactsheet_2023}.

 The FDA also published additional data in the NDC SPL Data Elements (NSDE) file.
 This file contains some of the data from the SPL files, as well as the dates 
 when each product was approved for sale and when it was removed from the market.
+I used this summary of the SPLs to determine which drugs were approved 
+for sale on a given date.

 %Structured Product Labels and dates of marketing
 %   Key features
@@ -186,7 +188,7 @@ In each section below I briefly describe each terminology, its contents, and use
 The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the National
 Library of Medicine.
 It is used to index subjects in various NLM publications including PubMed 
-\cite{medicalsubjectheadingshomepage_}.
+\cite{usnlm_meshhomepage_2023}.
 The AACT database contains a table that links clinical trials' clinical conditions
 and drug names to terms in the MeSH thesaurus.
 As this contains a standardized nomenclature, it simplified much of the 
@@ -216,7 +218,7 @@ The one I chose to use was a MariaDB database that backs a service called RxNav
 provided by the National Library of Medicine (NLM). 
 The NLM provides scripts to set up and host the backing databases on your
 own servers
-\cite{usnlm_rxnaviabox_2023}. 
+\cite{usnlm_rxnavinabox_2023}. 
 After setting up the local server, I wrote a Python program to export 
 the data from the RxNorm database and import it into the AACT Database.
 This was required because the former uses a MariaDB database server
--- a/Paper/sections/04_EconometricModel.tex
+++ b/Paper/sections/04_EconometricModel.tex
@@ -38,7 +38,7 @@ The betas are distributed
    \beta(d_i) \sim \text{Normal}(\mu_i,\sigma_i I)
 \end{align}
 With hyperpriors
-%Checked on 2024-11-27. Is corrrect. \todo{Double check that these are the priors I used.}
+%Checked on 2024-11-27. Is correct. 
 \begin{align}
    \mu_k \sim \text{Normal}(0,0.05) \\
    \sigma_k \sim \text{Gamma}(4,20)
--- a/Paper/sections/06_Results.tex
+++ b/Paper/sections/06_Results.tex
@@ -24,6 +24,11 @@ relationship between the number of snapshots and the duration of trials.
 We can see this in Figure \ref{fig:snapshot_duration_scatter}, where
 the correlation (measured at $0.34$) is apparent.

+\begin{figure}[H]
+    \includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
+    \caption{Histogram of the count of Snapshots}
+    \label{fig:snapshot_counts}
+\end{figure}

 \begin{figure}[H]
    \includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted}
@@ -31,23 +36,17 @@ the correlation (measured at $0.34$) is apparent.
    \label{fig:trial_durations}
 \end{figure}

-\begin{figure}[H]
-    \includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
-    \caption{Histogram of the count of Snapshots}
-    \label{fig:snapshot_counts}
-\end{figure}
-
 \begin{figure}[H]
    \includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination}
    \caption{Scatterplot comparing the Count of Snapshots and Trial Duration}
-    \label{fig:snapshot_counts}
+    \label{fig:snapshot_duration_scatter}
 \end{figure}

 % Estimation Procedure
 I fit the econometric model using Stan 
-\cite{standevelopmentteamStanModellingUsersGuide2022}
+\cite{standevelopmentteam_stanmodellingusersguide_2022}
 through the rstan 
-\cite{standevelopmentteamRStanInterfaceStan2023}
+\cite{standevelopmentteam_rstaninterfacestan_2023}
 interface using 4 chains with 
 %describe  
 2,500
@@ -58,17 +57,18 @@ sampling iterations each.
 Two of the chains experienced a low 
 Estimated Bayesian Fraction of Missing Information (E-BFMI),
 suggesting that there are some parts of the posterior distribution
-that were not explored well during the model fitting. 
+that were not explored well during the model fitting
+\cite{standevelopmentteam_runtimewarningsconvergence_2022}.
 I presume this is due to the low number of trials in some of the 
 ICD-10 categories.
-We can see in Figure \ref{fig:barchart_idc_categories} that some of these 
+We can see in Figure \ref{FIG:barchart_idc_categories} that some of these 
 disease categories had a single trial represented while others were 
 not represented at all.

 \begin{figure}[H]
    \includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts}
    \caption{Bar chart of trials by ICD-10 categories}
-    \label{fig:barchart_idc_categories}
+    \label{FIG:barchart_idc_categories}
 \end{figure}

 % Estimation Procedure
@@ -308,20 +308,20 @@ Three points lead me to believe this:
    \item The low fractions of E-BFMI suggest that the sampler is struggling 
        to explore some regions of the posterior. 
        According to 
-        \cite{standevelopmentteam_RuntimeWarnings_2022}
-        
+        \authorcite{standevelopmentteam_runtimewarningsconvergence_2022}
        this is 
        often due to thick tails of posterior distributions.
     \item When we examine the results across different ICD-10 groups 
         (Figure \ref{fig:pred_dist_dif_delay2}),
         we note this same issue.
-    \item In Figure \ref{fig:parameters_ANR_by_group}, we see that some some ICD-10 categories
-        \todo{add figure}
-        have 
+    \item In Figure \ref{fig:parameters_ANR_by_group}, we see that some 
+        ICD-10 categories have 
        \todo{note fat tails}.
    \item There are few trials available, particularly among some specific 
        ICD-10 categories.
+        \todo{refer to figure ??}
 \end{itemize}
+\todo{Reformat so this refers to the original discussion of issues better.}
 %           - take a look at beta values and then discuss if that lines up with results from dist-diff by group. 
 %       - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories.
 % - 
--- a/Paper/sections/08_PotentialImprovements.tex
+++ b/Paper/sections/08_PotentialImprovements.tex
@@ -18,19 +18,42 @@ the data may be available in a commercial dataset.

 One of the original goals of this project was to examine the impact that 
 enrollment struggles have on the probability of trial termination. 
-Unfortunately, this requires a model of clinical trial enrollment, and the
-data is just not in my dataset.
+Unfortunately, this requires a model of clinical trial enrollment, and this
+data is missing from my dataset.
 In most cases the trial sponsor reports the anticipated enrollment value
 while the trial is still recruiting and only updates the actual enrollment
 after the trial has ended.
-Some trials do publish an up to date record of their enrollment numbers, but this
-is rare. 
-If a bayesian model of multisite enrollment can be developed for the disease categories
-in question, then it will be possible to impute this missing data probabalistically,
-which will allow me to estimate the direct effect of slow enrollment
-\cite{mcelreath_statistical_2020}.
-This does not exist yet, although some work on multi-site enrollment forecasting has 
-been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS} 
+Some trials do publish an incremental record of their enrollment numbers, 
+but this is rare. 
+Because the model is Bayesian, it would be possible to 
+include a model of the missing data 
+\cite{mcelreath_statisticalrethinkingbayesian_2020},
+which would
+allow me to estimate the direct effect of slow enrollment 
+on clinical trial termination rates.
+
+There has been substantial work on forecasting
+multi-site enrollment rates and durations by
+\cite{
+    tozzi_predictingaccrualrate_1996,
+    carter_applicationstochasticprocesses_2004,
+    anisimov_modellingpredictionadaptive_2007,
+    zhang_stochasticmodelingprediction_2010,
+    zhang_jointmonitoringprediction_2012,
+    zhang_modelingpredictionsubject_2012,
+    heitjan_realtimepredictionclinical_2015,
+    jiang_modelingvalidatingbayesian_2015,
+    deng_bayesianmodelingprediction_2017,
+    lan_statisticalmodelingprediction_2019,
+    zhang_simplerobustmodel_2022,
+    urbas_interimrecruitmentprediction_2022,
+    bieganek_predictionclinicaltrial_2022,
+    avalos-pacheco_validationpredictiveanalyses_2023
+    }
+but choosing among the various single-site and multi-site models is 
+difficult without a dataset on which to validate the results. 
+
+

 \subsection{Improving Population Estimates}

@@ -39,23 +62,22 @@ population sizes that I have found so far.
 Unfortunately, for some conditions it can be relatively imprecise due to 
 its focus on providing data geared towards public health policy.
 For example, GBD contains categories for both
-drug resistant and drug suceptible tuberculosis.
+drug-resistant and drug-susceptible tuberculosis, but maps those to the same 
+ICD-10 code.
 In contrast, there is no category for non-age-related macular degeneration.
-One resulting concern is that for a given ICD-10 code, the applicable GBD population 
-estimates may act as an estimate of the upper bound of population size
-(\cite{global_burden_of_disease_collective_network_global_2020}).
-The dataset contains various measures of disease severity, so it may be 
-worth investigating how to incorporate some of those measures.
-
+Thus, not every trial can be matched to a good estimate of its population of
+interest.

 \subsection{Improving Measures of Market Conditions}

 % Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases)

 In addition to the fact that many diseases may be treated by non-pharmaceutical 
-means, off-label prescription of pharmaceuticals is legal at the federal level 
-(\cite{commissioner_understanding_2019}).
-These two facts both complicate measuring market conditions.
+means (e.g., diet, physical therapy, or medical devices), 
+off-label prescription of pharmaceuticals is legal at the federal level 
+\cite{commissioner_understandingunapproveduse_2019}.
+These two facts both complicate measuring competing treatments, 
+a key part of market conditions.
 One way to address non-pharmaceutical treatments is to concentrate on domains
 that are primarily treated by pharmaceuticals.
 Another way to address this would be to focus the analysis on just a few specific
--- a/Paper/sections/10_CausalStory.tex
+++ b/Paper/sections/10_CausalStory.tex
@@ -219,8 +219,9 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
                 These are measured by the DALY cost of the disease, and are 
                separated by the impact on countries with
                High, High-Medium, Medium, Medium-Low, and Low 
-                development scores.
-                This data comes from the Institute for Health Metrics' Global Burden of Disease study.
+                Socio-Demographic Index (SDI) scores.
+                This data comes from the Institute for Health Metrics and Evaluation's Global Burden of Disease study
+                \cite{vos_globalburden369_2020}.
            \item \texttt{Elapsed Duration}: 
                A normalized measure of the time elapsed in the trial. 
                Comes from the original estimate of the trial's primary completion date and the registered start date. 
--- a/assets/preambles/References.bib
+++ b/assets/preambles/References.bib
--- a/todo.org
+++ b/todo.org
@@ -33,3 +33,6 @@
 *** 2025-01-20 Monday
 **** TODO get a citation for the AACT project  
    [[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/10_CausalStory.tex::114]]]] 
+*** 2025-01-23 Thursday
+**** TODO Pickup citation fixes here  
+    [[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/06_Results.tex::174]]]]