Merge branch 'main'

claude_rewrite
Will King 1 year ago
commit 4302d07ef8

@ -84,7 +84,7 @@ Section \ref{SEC:Results} discusses the results of the analysis.
\subfile{sections/06_Results}
%---------------------------------------------------------------
\section{Deficiencies and Improvements}\label{SEC:Improvements}
%---------------------------------------------------------------
\subfile{sections/08_PotentialImprovements}

@ -1,5 +1,5 @@
layout {
tab name="Main and Compile" cwd="~/research/phd_deliverables/jmp/Latex/Paper" hide_floating_panes=true focus=true {
tab name="Main and Compile" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" hide_floating_panes=true focus=true {
// This tab is where I manage main from.
// It opens up Main.txt for my JMP, opens the PDF in Okular (in a floating tab), and then gets ready to build the PDF.
pane size=1 borderless=true {
@ -33,7 +33,7 @@ layout {
}
}
tab name="sections" cwd="~/research/phd_deliverables/jmp/Latex/Paper/sections" {
tab name="sections" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" {
pane size=1 borderless=true {
plugin location="tab-bar"
}
@ -56,7 +56,7 @@ layout {
}
}
tab name="git" cwd="~/research/phd_deliverables/jmp/Latex/Paper/" {
tab name="git" cwd="~/research/phd_deliverables/JobMarketPaper/Latex/Paper/" {
pane size=1 borderless=true {
plugin location="tab-bar"
}

@ -27,21 +27,18 @@ the correlation (measured at $0.34$) is apparent.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted}
\todo{Replace this graphic with the histogram of trial durations}
\caption{Histograms of Trial Durations}
\label{fig:trial_durations}
\end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
\todo{Replace this graphic with the histogram of snapshots}
\caption{Histogram of the count of Snapshots}
\label{fig:snapshot_counts}
\end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination}
\todo{Replace this graphic with the scatterplot comparing durations and snapshots}
\caption{Scatterplot comparing the Count of Snapshots and Trial Duration}
\label{fig:snapshots_vs_duration}
\end{figure}
@ -74,6 +71,34 @@ not represented at all.
% Estimation Procedure
I fit the econometric model using Stan
\cite{standevelopmentteam_StanModelling_2022}
through the RStan
\cite{standevelopmentteam_RStanInterface_2023}
interface, running four chains with
%describe
2,500 warmup iterations and
2,500 sampling iterations each.
Two of the chains exhibited a low
Estimated Bayesian Fraction of Missing Information (E-BFMI),
suggesting that some parts of the posterior distribution
were not explored well during model fitting.
I presume this is due to the low number of trials in some of the
ICD-10 categories.
We can see in Figure \ref{fig:barchart_idc_categories} that some of these
disease categories had a single trial represented while others were
not represented at all.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts}
\caption{Bar chart of trials by ICD-10 categories}
\label{fig:barchart_idc_categories}
\end{figure}
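As an aside on the diagnostic itself, E-BFMI can be computed directly from the sampler's per-iteration energy draws. A minimal sketch of the standard formula (my own illustrative implementation, not the project's code; the ratio of summed squared energy changes to summed squared energy deviations, with warnings typically triggered below roughly 0.2--0.3):

```python
import numpy as np

def e_bfmi(energy):
    """Estimated Bayesian Fraction of Missing Information for one chain.

    Ratio of the summed squared changes in energy between successive
    iterations to the summed squared deviations of energy from its mean.
    Low values (below roughly 0.2--0.3) indicate the sampler cannot
    move across the energy distribution well between iterations.
    """
    energy = np.asarray(energy, dtype=float)
    return np.sum(np.diff(energy) ** 2) / np.sum((energy - energy.mean()) ** 2)

rng = np.random.default_rng(0)
# A well-mixing chain has nearly independent energy draws ...
good_chain = rng.normal(size=5000)
# ... while a poorly-mixing chain's energy drifts like a slow random walk.
bad_chain = np.cumsum(rng.normal(scale=0.05, size=5000))
print(e_bfmi(good_chain))  # roughly 2 for independent draws
print(e_bfmi(bad_chain))   # far below the warning threshold
```

For independent draws the expected ratio is about 2, which is why values well below 1 signal that successive energies are strongly correlated.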
\subsection{Primary Results}
@ -202,5 +227,144 @@ result comes from different disease categories.
\end{figure}
% -
\subsection{Primary Results}
The primary, causally-identified value we can estimate is the change in
the probability of termination caused by (counterfactually) keeping enrollment
open instead of closing enrollment when observed.
In Figure \ref{fig:pred_dist_diff_delay} below, we see the impact of
keeping enrollment open.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/dist_diff_analysis/p_delay_intervention_distdiff_boxplot}
\small{
Values near 1 indicate a near-maximal increase in the probability
of termination.
Values near 0 indicate little change in probability,
while values near -1 represent a decrease in the probability
of termination.
The scale is in probability points, so a value near 1 represents a change
from being unlikely to terminate under the control to being highly likely
to terminate.
}
\caption{Histogram of the Distribution of Predicted Differences}
\label{fig:pred_dist_diff_delay}
\end{figure}
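To fix notation for the discussion that follows (this formalization is my reading of the quantity plotted above), for a given trial the predicted difference is
\begin{align}
\delta_p = \Pr(\text{termination} \mid \text{enrollment kept open})
- \Pr(\text{termination} \mid \text{enrollment closed as observed}),
\end{align}
so $\delta_p \in [-1, 1]$, with positive values indicating that keeping enrollment open raises the probability of termination.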
There are a few interesting things to point out here.
Let's start by getting acquainted with the details of the distribution above.
% - spike at 0
% - the boxplot
% - 63% of mass below 0 : find better way to say that
% - For a random trial, there is a 63% chance that the impact is to reduce the probability of a termination.
% - 2 pctg-point wide band centered on 0 has ~13% of the masss
% - mean represents 9.x% increase in probability of termination. A quick simulation gives about the same pctg-point increase in terminated trials.
A few interesting interpretations emerge from this distribution.
% - there are 3 regimes: low impact (near zero), medium impact (concentrated in decreased probability of termination), and high impact (concentrated in increased probability of termination).
The first is that there appear to be three distinct regimes.
The first regime consists of the low-impact results, i.e., those values of $\delta_p$
near zero.
About 13\% of trials lie within a single percentage point of zero,
suggesting that there is a reasonable chance that delaying
a close of enrollment has no impact.
The second regime consists of moderate impacts on clinical trials'
probabilities of termination, say values in the interval $[-0.5, 0.5]$
on the graph.
Most of this probability mass represents a decrease in the probability of
a termination, some of it rather large.
Finally, there is the high-impact region, almost exclusively concentrated
around increases in the probability of termination at $\delta_p > 0.75$.
These represent cases where delaying the close of enrollment shifts a trial
from being highly likely to complete its primary objectives to
being likely or almost certain to terminate early.
% - the high impact regime is strange because it consists of trials that moved from unlikely (<20% chance) of termination to a high chance (>80% chance) of termination. Something like 5% of all trials have a greater than 98 percentage point increase in termination. Not sure what this is doing.
% - Potential Explanations for high impact regime:
How could this intervention have such a wide range in the intensity
and direction of its impacts?
Two candidate explanations are that some trials are genuinely susceptible
to enrollment problems, or that this is an artifact of too little data.
% - Some trials are highly susceptible. This is the face value effect
One option is that some categories are more susceptible to
issues with participant enrollment.
If this is the case, we should be able to isolate the categories that contribute
the most to this effect.
The other is that this might be a modelling artifact, due to the relatively
low number of trials in certain ICD-10 categories.
In short, there might be high levels of uncertainty in some parameter values,
which manifest as fat tails in the distributions of the $\beta$ parameters.
Because of the logistic form of the model, these fat tails lead to
extreme values of $p$, and potentially large changes in $\delta_p$.
% - Could be uncertanty. If the model is highly uncertain, e.g. there isn't enough data, we could have a small percentage of large increases. This could be in general or just for a few categories with low amounts of data.
% -
% -
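This mechanism is easy to demonstrate with a small simulation (illustrative only: the Normal and Student-$t$ below are stand-ins for a thin- versus fat-tailed posterior on a logit-scale coefficient, not the model's actual posteriors):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def inv_logit(x):
    # Inverse logit (logistic) link; clip to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

# Thin-tailed vs fat-tailed draws for a logit-scale effect, same scale.
thin = rng.normal(0.0, 1.5, size=n)
fat = 1.5 * rng.standard_t(df=2, size=n)

for name, draws in [("normal", thin), ("student-t(2)", fat)]:
    p = inv_logit(draws)
    share = np.mean((p < 0.01) | (p > 0.99))
    print(f"{name}: share of implied p outside [0.01, 0.99] = {share:.3f}")
```

Even though both distributions have the same scale parameter, the fat-tailed draws push a much larger share of the implied probabilities against 0 and 1, which is exactly the behaviour described above.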
I believe that this second explanation -- a model artifact due to uncertainty --
is likely to be the cause.
Four points lead me to believe this:
\begin{itemize}
\item The low E-BFMI values suggest that the sampler is struggling
to explore some regions of the posterior.
According to \cite{standevelopmentteam_RuntimeWarnings_2022}, this is
often due to fat tails in the posterior distribution.
\item When we examine the results across different ICD-10 groups
(Figure \ref{fig:pred_dist_dif_delay2}),
we note this same issue.
\item In Figure \ref{fig:parameters_ANR_by_group}, we see that some ICD-10 categories
\todo{add figure}
have \todo{note fat tails}.
\item There are few trials available, particularly among some specific
ICD-10 categories.
\end{itemize}
% - take a look at beta values and then discuss if that lines up with results from dist-diff by group.
% - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories.
% -
% -
% -
We can examine the per-group distributions of differences in Figure \ref{fig:pred_dist_dif_delay2} to
ascertain that the high-impact group does exist in each of the groups.
This lends credence to the idea that this is a modelling issue, potentially
due to the low amount of data overall.
Figure \ref{fig:pred_dist_dif_delay2} shows how this overall
result arises from the different disease categories.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/dist_diff_analysis/p_delay_intervention_distdiff_by_group}
\caption{Distribution of Predicted differences by Disease Group}
\label{fig:pred_dist_dif_delay2}
\end{figure}
% Examine beta parameters
% - Little movement except where data is strong, general negative movement. Still really wide
% - Note how they all learned (partial pooling) reduction in \beta from ANR?
% - Need to discuss the 5 different states. Can't remember which one is dropped for the life of me. May need to fix parameterization.
% -
Finally, in Figure \ref{fig:parameters_ANR_by_group}, we can see the estimated distributions of the $\beta$ parameter for
the status \textbf{Active, not recruiting}.
The prior distributions were centered on zero, but we can see that the pooled learning has moved the mean
values negative, representing reductions in the probability of termination across the board.
This decrease in the probability of termination is strongest in the categories of Neoplasms ($n=$),
Musculoskeletal diseases ($n=$), and Infections and Parasites ($n=$), the three categories with the most data.
As this is a comparison against the trial status XXX, we note that
\todo{The natural comparison I want to make is against the Recruiting status. Do I want to redo this so that I can read it directly? It shouldn't affect the $\delta_p$ analysis, but this could probably use it.}
Overall, this suggests that extending a clinical trial's enrollment period will reduce the probability of termination.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/betas/parameter_across_groups/parameters_12_status_ANR}
\caption{Distribution of parameters associated with ``Active, not recruiting'' status, by ICD-10 Category}
\label{fig:parameters_ANR_by_group}
\end{figure}
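The pooled learning referred to above comes from the hierarchical structure placed on the category-level coefficients. Writing it out under the independent-normal specification sketched elsewhere in this draft (the zero-centered hyperprior here is my assumption, consistent with the zero-centered priors described above):
\begin{align}
\beta_k(d) &\sim \text{Normal}(\mu_k, \sigma_k), \\
\mu_k &\sim \text{Normal}(0, \tau_k).
\end{align}
Each disease category $d$ borrows strength through the shared mean $\mu_k$, so categories with few trials are shrunk toward the pooled estimate, which is why even sparsely observed categories display the same negative shift.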
% -
Overall, it is hard to escape the conclusion that more data is needed across
many -- if not all -- of the disease categories.
\end{document}

@ -4,64 +4,33 @@
\begin{document}
As noted above, there are various issues with the analysis as completed so far.
Below I discuss several issues, and ways to address them, that I believe will improve the analysis.
\subsection{Increasing number of observations}
The most important step is to increase the number of observations available.
%
% \subsection{Covariance Structure}
%
% As noted in the diagnostics section, many of the convergence issues seem
% to occur in the covariance structure.
% Instead of representing the parameters $\beta$ as independently normal:
% \begin{align}
% \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k)
% \end{align}
% I propose using a multivariate normal distribution:
% \begin{align}
% \beta(d) \sim \text{MvNormal}(\mu, \Sigma)
% \end{align}
% I am not familiar with typical approaches to priors on the covariance matrix,
% so this will require a further literature search as to best practices.
Currently this requires matching trials to ICD-10 codes by hand.
Improvements in large language models may make this data more accessible, or
the data may already be available in a commercial dataset.
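In the meantime, even simple fuzzy matching of free-text condition names against ICD-10 chapter titles could pre-sort trials for hand review. A toy sketch (the chapter list is an abbreviated, illustrative subset; real trial condition strings are messier than this):

```python
import difflib

# Abbreviated, illustrative subset of ICD-10 chapter titles.
ICD10_CHAPTERS = [
    "Certain infectious and parasitic diseases",
    "Neoplasms",
    "Diseases of the circulatory system",
    "Diseases of the musculoskeletal system and connective tissue",
]

def suggest_chapter(condition):
    """Return the chapter title most similar to a free-text condition."""
    scored = [
        (difflib.SequenceMatcher(None, condition.lower(), title.lower()).ratio(), title)
        for title in ICD10_CHAPTERS
    ]
    return max(scored)[1]

print(suggest_chapter("musculoskeletal pain"))
print(suggest_chapter("breast neoplasm"))
```

This would only be a pre-sorting aid; ambiguous or low-similarity matches would still need the hand review described above.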
% \subsection{Finding Reasonable Priors}
%
% In standard bayesian regression, heavy tailed priors are common.
% When working with a bayesian bernoulli-logit model, this is not appropriate as
% heavy tails cause the estimated probabilities $p_n$ to concentrate around the
% values $0$ and $1$, and away from values such as $\frac{1}{2}$ as discussed in
% \cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this.
%
% I intend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using
% prior predictive checks to evaluate the implications of different priors
% on the distribution on $p_n$.
% This would consist of taking the independent variables and predicting the values
% of $p_n$ based on a proposed set of priors.
% By plotting these predictions, I can ensure that the specific parameter priors
% used are consistent with my prior beliefs on how $p_n$ behaves.
% Currently I believe that $p_n$ should be roughly uniform or unimodal, centered
% around $p_n = \frac{1}{2}$.
%
\subsection{Enrollment Modelling}
One of the original goals of this project was to examine the impact that
enrollment struggles have on the probability of trial termination.
Unfortunately, this requires a model of clinical trial enrollment, and the
necessary data is simply not in my dataset.
In most cases the trial sponsor reports the anticipated enrollment value
while the trial is still recruiting and only updates the actual enrollment
after the trial has ended.
Some trials do publish an up-to-date record of their enrollment numbers, but this
is rare.
If a Bayesian model of multi-site enrollment can be developed for the disease categories
in question, then it will be possible to impute this missing data probabilistically,
which will allow me to estimate the direct effect of slow enrollment
\cite{mcelreath_statistical_2020}.
Such a model does not exist yet, although some work on multi-site enrollment forecasting has
been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS}.
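To make the imputation idea concrete, here is a deliberately simplified sketch (a toy conjugate Gamma--Poisson model of a single trial's enrollment rate, not the multi-site model called for above; the prior-strength constant is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

def impute_final_enrollment(anticipated, planned_months, observed, observed_months,
                            remaining_months, n_draws=10_000):
    """Posterior-predictive draws of final enrollment under a Gamma-Poisson model.

    Prior: monthly enrollment rate lam ~ Gamma(a, b), with prior mean set to
    the sponsor's anticipated enrollment spread over the planned duration.
    Likelihood: observed ~ Poisson(lam * observed_months).
    Conjugate posterior: lam ~ Gamma(a + observed, b + observed_months).
    """
    prior_strength = 2.0  # pseudo-months of prior information (assumption)
    a = prior_strength * anticipated / planned_months
    b = prior_strength
    # Draw rates from the posterior, then unobserved future counts.
    lam = rng.gamma(a + observed, 1.0 / (b + observed_months), size=n_draws)
    remaining = rng.poisson(lam * remaining_months)
    return observed + remaining

# Hypothetical trial: 200 anticipated over 24 months, 60 enrolled by month 12.
draws = impute_final_enrollment(anticipated=200, planned_months=24,
                                observed=60, observed_months=12,
                                remaining_months=12)
print(np.mean(draws), np.percentile(draws, [5, 95]))
```

The point of carrying the full set of draws, rather than a single imputed number, is that downstream calculations inherit the imputation uncertainty, as discussed in \cite{mcelreath_statistical_2020}.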
\subsection{Improving Population Estimates}
@ -74,24 +43,31 @@ drug resistant and drug suceptible tuberculosis.
In contrast, there is no category for non-age related macular degeneration.
One resulting concern is that for a given ICD-10 code, the applicable GBD population
estimates may act as an estimate of the upper bound of population size
(\cite{global_burden_of_disease_collective_network_global_2020}).
The dataset contains various measures of disease severity, so it may be
worth investigating how to incorporate some of those measures.
\subsection{Improving Measures of Market Conditions}
% Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases)
In addition to the fact that many diseases may be treated by non-pharmaceutical
means, off-label prescription of pharmaceuticals is legal at the federal level
(\cite{commissioner_understanding_2019}).
Both of these facts complicate the measurement of market conditions.
One way to address non-pharmaceutical treatments is to concentrate on domains
that are primarily treated by pharmaceuticals.
This requires domain knowledge that I don't have.
% One dataset that I have only investigated briefly is the \url{DrugCentral.org}
% database which tracks official indications and some off-label indications as
% well
% (\cite{ursu_drugcentral_2017}).
Another way to address this would be to focus the analysis on just a few specific
diseases, for which a history of treatment options can be compiled.
This second approach may also allow the researcher to distinguish the direction
of causality between population size and number of drugs on the market;
for example, drugs to treat a chronic, non-fatal disease will probably not
affect the market size much in the short to medium term.
This would allow the effect of market conditions to be isolated from
the effect of population size.
% Alternative approaches
% - diseases with constant kill rates? population effect should be relatively constant?
\end{document}

@ -6,9 +6,7 @@ Identifying commercial impediments to successfully completing
clinical trials in otherwise capable pharmaceuticals will hopefully
lead to a more robust and competitive market.
Although the current state of this research is insufficient to draw robust
conclusions, these early results suggest that delaying the close of the enrollment period
reduces the probability that a trial is terminated.
\end{document}
