From 86f9b8dfc9f621b069ac83dab9f2d8c053ab789c Mon Sep 17 00:00:00 2001
From: will king
Date: Mon, 13 Jan 2025 09:24:20 -0800
Subject: [PATCH] finished drafting results

---
 Paper/sections/06_Results.tex               | 171 +++++++++++++++-----
 Paper/sections/08_PotentialImprovements.tex |  86 +++++-----
 2 files changed, 176 insertions(+), 81 deletions(-)

diff --git a/Paper/sections/06_Results.tex b/Paper/sections/06_Results.tex
index 6eb2f02..9ae780a 100644
--- a/Paper/sections/06_Results.tex
+++ b/Paper/sections/06_Results.tex
@@ -7,24 +7,73 @@
 I describe the model fitting, the posteriors of the parameters of interest,
 and interpret the results.
 
-\subsection{Estimation Procedure}
+\subsection{Data Summaries and Estimation Procedure}
+
+% Data Summaries
+Overall, I successfully processed 162 trials, with 1,347 snapshots between them.
+Figure \ref{fig:snapshot_counts} shows the histogram of snapshots per trial.
+Most trials lasted less than 1,500 days, as can be seen in
+Figure \ref{fig:trial_durations}.
+Although a large number of snapshots are available to fit the
+model, the number of trials -- the unit of observation -- is quite low.
+These trials are also spread over multiple ICD-10 categories,
+leaving few trials per category.
+
+Next, a scatterplot gives a rough idea of the observed
+relationship between the number of snapshots and the duration of trials.
+We can see this in Figure \ref{fig:snapshot_duration_scatter}, where
+a moderate positive correlation (measured at $0.34$) is apparent.
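For reference, the reported correlation is a plain Pearson coefficient, which can be sanity-checked without any statistical libraries. The sketch below is illustrative Python using made-up (snapshot count, duration) pairs, not the actual trial data:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (snapshot count, duration in days) pairs -- not the real trials.
snapshots = [3, 5, 8, 12, 20]
durations = [400, 900, 650, 1300, 1100]
print(round(pearson_r(snapshots, durations), 2))
```

Running the same computation over the real 162 trials should reproduce the reported $0.34$.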
+
+\begin{figure}[H]
+    \includegraphics[width=\textwidth]{../assets/img/current/pred_dist_diff-delay}
+    \todo{Replace this graphic with the histogram of trial durations}
+    \caption{Histogram of Trial Durations}
+    \label{fig:trial_durations}
+\end{figure}
+
+\begin{figure}[H]
+    \includegraphics[width=\textwidth]{../assets/img/current/pred_dist_diff-delay}
+    \todo{Replace this graphic with the histogram of snapshots}
+    \caption{Histogram of Snapshot Counts per Trial}
+    \label{fig:snapshot_counts}
+\end{figure}
+
+\begin{figure}[H]
+    \includegraphics[width=\textwidth]{../assets/img/current/pred_dist_diff-delay}
+    \todo{Replace this graphic with the scatterplot comparing durations and snapshots}
+    \caption{Scatterplot Comparing the Count of Snapshots and Trial Duration}
+    \label{fig:snapshot_duration_scatter}
+\end{figure}
+
+% Estimation Procedure
 I fit the econometric model using
 Stan
 \cite{standevelopmentteam_StanModelling_2022}
 through the rstan
 \cite{standevelopmentteam_RStanInterface_2023}
-interface.
-
-I had X Trials with X snapshots in total. \todo{Fill out.}
-
+interface using 4 chains with
 %describe
-X\todo{UPDATE VALUES}
+2,500 warmup iterations and
-X\todo{UPDATE VALUES}
-sampling iterations in six chains.
+2,500
+sampling iterations each.
+
+Two of the chains exhibited a low
+estimated Bayesian fraction of missing information (E-BFMI),
+suggesting that some parts of the posterior distribution
+were not explored well during the model fitting.
+I suspect this is due to the low number of trials in some of the
+ICD-10 categories.
+We can see in Figure \ref{fig:barchart_idc_categories} that some of these
+disease categories had a single trial represented while others were
+not represented at all.
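For readers unfamiliar with the E-BFMI diagnostic: it compares how much the sampler's energy changes between iterations to the overall spread of the energy distribution, with low values (commonly, below roughly 0.2--0.3) flagging poorly explored regions. A minimal Python sketch of the statistic, using toy energy traces rather than actual sampler output:

```python
def e_bfmi(energy):
    """E-BFMI for one chain: mean squared energy jump over the energy variance."""
    n = len(energy)
    mean_e = sum(energy) / n
    jump = sum((energy[i] - energy[i - 1]) ** 2 for i in range(1, n)) / (n - 1)
    var_e = sum((e - mean_e) ** 2 for e in energy) / (n - 1)
    return jump / var_e

# Toy energy traces: `good` mixes freely; `sticky` lingers at two energy levels.
good = [10.0, 12.5, 9.0, 13.0, 10.5, 12.0, 9.5, 11.5]
sticky = [10.0, 10.1, 10.0, 10.2, 14.0, 14.1, 14.0, 13.9]
print(e_bfmi(good), e_bfmi(sticky))
```

rstan reports this diagnostic per chain after sampling; the `sticky` trace mimics a chain whose energy changes slowly relative to its overall spread, which is the pattern behind the warning.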
-% \subsection{Data Exploration}
-% \todo{fill this out later.}
-%look at trial
+\begin{figure}[H]
+    \includegraphics[width=\textwidth]{../assets/img/current/pred_dist_diff-delay}
+    \todo{Replace this graphic with the barchart of trials by categories.}
+    \caption{Bar chart of trials by ICD-10 categories}
+    \label{fig:barchart_idc_categories}
+\end{figure}
 
 \subsection{Primary Results}
@@ -38,6 +87,7 @@ keeping enrollment open.
 
 \begin{figure}[H]
     \includegraphics[width=\textwidth]{../assets/img/current/pred_dist_diff-delay}
+    \todo{Replace this graphic with the histdiff with boxplot}
     \small{
     Values near 1 indicate a near perfect increase in the
     probability of termination.
@@ -52,26 +102,84 @@ keeping enrollment open.
     \label{fig:pred_dist_diff_delay}
 \end{figure}
 
-We can see from figure
-\ref{fig:pred_dist_diff_delay}
-That there are roughly four regimes.
-The first consists of trials that experiences nearly no effect,
-i.e. have values near zero.
-Trials in the second regime experience a mild to large reduction in
-the probability of termination, with X percent of the probability mass
-between about 5 percentage points and 50 percentage point reductions.
-The third regime is those trials that experience a mild to large
-increase in the probability of termination,
-from an increase o 5 percentage points to about 75 percentage points.
-The fourth and final regime is the X\% of trials that experience a significant
-(greater than 75 percentage point) increase in the probability of
-termination.
-%Notes on interpretation
-% - increase vs decrease on graph
+There are a few interesting things to point out here.
+Let's start by getting acquainted with the details of the distribution above.
+% - spike at 0
+% - the boxplot
+% - 63% of mass below 0 : find better way to say that
+% - For a random trial, there is a 63% chance that the impact is to reduce the probability of a termination.
+% - 2 pctg-point wide band centered on 0 has ~13% of the mass
+% - mean represents 9.x% increase in probability of termination. A quick simulation gives about the same pctg-point increase in terminated trials.
+
+A few interesting points of interpretation come out of this.
+% - there are 3 regimes: low impact (near zero), medium impact (concentrated in decreased probability of termination), and high impact (concentrated in increased probability of termination).
+The first is that there appear to be three different regimes.
+The first regime consists of the low-impact results, i.e. those values of $\delta_p$
+near zero.
+About 13\% of trials lie within a single percentage point change of zero,
+suggesting that there is a reasonable chance that delaying
+a close of enrollment has no impact.
+The second regime consists of moderate impacts on clinical trials'
+probabilities of termination, say values in the interval $[-0.5, 0.5]$
+on the graph.
+Most of this probability mass represents a decrease in the probability of
+a termination, some of it rather large.
+Finally, there exists the high-impact region, concentrated almost exclusively
+on increases in the probability of termination at $\delta_p > 0.75$.
+These represent cases where delaying the close of enrollment moves a trial
+from being highly likely to complete its primary objectives to being
+likely or almost certain to terminate early.
+% - the high impact regime is strange because it consists of trials that moved from unlikely (<20% chance) of termination to a high chance (>80% chance) of termination. Something like 5% of all trials have a greater than 98 percentage point increase in termination. Not sure what this is doing.
+
+% - Potential Explanations for high impact regime:
+How could this intervention have such a wide range in the intensity
+and direction of impacts?
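The regime shares quoted above are simple functionals of the posterior draws of $\delta_p$. The following Python sketch computes them on synthetic draws whose shape merely mimics the description (a near-zero spike, a moderate-decrease lobe, and a high-increase tail); the regime weights are invented, not the fitted posterior:

```python
import random

random.seed(1)
# Synthetic stand-in for posterior draws of delta_p; the weights below are
# invented for illustration and do not come from the fitted model.
draws = ([random.gauss(0.0, 0.004) for _ in range(1300)]       # near-zero spike
         + [random.uniform(-0.5, -0.05) for _ in range(5650)]  # moderate decreases
         + [random.uniform(0.05, 0.5) for _ in range(2050)]    # moderate increases
         + [random.uniform(0.75, 1.0) for _ in range(1000)])   # high-impact tail

n = len(draws)
below_zero = sum(d < 0 for d in draws) / n          # share of mass below zero
near_zero = sum(abs(d) <= 0.01 for d in draws) / n  # within 1 pctg point of zero
mean_shift = sum(draws) / n                          # average change in P(termination)
print(f"below 0: {below_zero:.2f}, within 1pp of 0: {near_zero:.2f}, mean: {mean_shift:+.3f}")
```

Applied to the real posterior draws, these three one-liners yield the quantities discussed in the text (share of mass below zero, mass near zero, and mean shift).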
+Two explanations stand out: some trials may be particularly susceptible,
+or this may be a result of too little data.
+% - Some trials are highly susceptible. This is the face value effect
+One option is that some categories are more susceptible to
+issues with participant enrollment.
+If this is the case, we should be able to isolate categories that contribute
+the most to this effect.
+Another is that this might be a modeling artifact, due to the relatively
+low number of trials in certain ICD-10 categories.
+In short, there might be high levels of uncertainty in some parameter values,
+which manifest as fat tails in the distributions of the $\beta$ parameters.
+Because of the logistic link in the model, these fat tails lead to
+extreme values of $p$, and potentially large changes in $\delta_p$.
+% - Could be uncertainty. If the model is highly uncertain, e.g. there isn't enough data, we could have a small percentage of large increases. This could be in general or just for a few categories with low amounts of data.
+% -
+% -
+
+I believe that this second explanation -- a model artifact due to uncertainty --
+is likely to be the cause.
+Four points lead me to believe this:
+\begin{itemize}
+    \item The low E-BFMI values suggest that the sampler is struggling
+    to explore some regions of the posterior.
+    According to \cite{standevelopmentteam_RuntimeWarnings_2022}, this is
+    often due to thick tails of posterior distributions.
+    \item When we examine the results across different ICD-10 groups in
+    Figure \ref{fig:pred_dist_dif_delay2},
+    \todo{move figure from below}
+    we note this same issue.
+    \item In Figure \ref{fig:betas_delay}, we see that some ICD-10 categories
+    \todo{add figure}
+    have \todo{note fat tails}.
+    \item There are few trials available, particularly among some specific
+    ICD-10 categories.
+\end{itemize}
+% NOTE: maybe change order to be ebfmi, group hist-diff or distdiff, tail width, then data size.
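The fat-tail mechanism can be demonstrated directly: pushing a heavy-tailed linear predictor through the inverse logit piles probability mass near 0 and 1. The Python sketch below compares normal draws with Student-$t$ draws (2 degrees of freedom) at an arbitrary scale of 1.5; none of these numbers come from the fitted model:

```python
import math
import random

def inv_logit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def student_t(df):
    """One Student-t draw: a standard normal over the root of a scaled chi-squared."""
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

random.seed(0)
N = 20000
SCALE = 1.5  # arbitrary predictor scale, not a fitted value

normal_eta = [random.gauss(0.0, SCALE) for _ in range(N)]
fat_eta = [SCALE * student_t(2) for _ in range(N)]  # much fatter tails

def extreme_share(etas, cut=0.98):
    """Fraction of implied probabilities above `cut` or below 1 - cut."""
    return sum(1 for e in etas if not (1 - cut) < inv_logit(e) < cut) / len(etas)

print(extreme_share(normal_eta), extreme_share(fat_eta))
```

The heavy-tailed predictor produces many times more implied probabilities beyond 0.98 or below 0.02 than the normal one, which is exactly the pattern of extreme $\delta_p$ values described above.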
+% - take a look at beta values and then discuss if that lines up with results from dist-diff by group.
+% - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories.
 % -
 % -
 % -
 % -
+Overall, it is hard to escape the conclusion that more data are needed across
+many, if not all, of the disease categories.
+
 
 % The probability mass associated with a each 10 percentage point change are in table \ref{tab:regimes}
 % \begin{table}[H]
@@ -100,15 +208,6 @@ result comes from different disease categories.
     \label{fig:pred_dist_dif_delay2}
 \end{figure}
 
-Overall, we can see that there appear to be some trials or situations
-that are highly suceptable to enrollment difficulties, and this
-appears to hold for all disease categories for which I have data.
-This relative homogeneity of results may be due to the
-partial pooling effect from the hierarchal model
-and the fact that the sample size per disease is rather small.
-An additional explanation is that the variance of the parameter distributions
-might be high enough for each trial to have a few situation in which they have
-a high probability of terminating.
diff --git a/Paper/sections/08_PotentialImprovements.tex b/Paper/sections/08_PotentialImprovements.tex
index 2f89ab3..c85cae7 100644
--- a/Paper/sections/08_PotentialImprovements.tex
+++ b/Paper/sections/08_PotentialImprovements.tex
@@ -12,40 +12,40 @@
 The most important step is to increase the number of observations available.
 Currently this requires matching trials to ICD-10 codes by hand, but there
 are certainly some steps that can be taken to improve the speed with which
 this can be done.
-
-\subsection{Covariance Structure}
-
-As noted in the diagnostics section, many of the convergence issues seem
-to occure in the covariance structure.
-Instead of representing the parameters $\beta$ as independently normal: -\begin{align} - \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k) -\end{align} -I propose using a multivariate normal distribution: -\begin{align} - \beta(d) \sim \text{MvNormal}(\mu, \Sigma) -\end{align} -I am not familiar with typical approaches to priors on the covariance matrix, -so this will require a further literature search as to best practices. - -\subsection{Finding Reasonable Priors} - -In standard bayesian regression, heavy tailed priors are common. -When working with a bayesian bernoulli-logit model, this is not appropriate as -heavy tails cause the estimated probabilities $p_n$ to concentrate around the -values $0$ and $1$, and away from values such as $\frac{1}{2}$ as discussed in -\cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this. - -I indend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using -prior predictive checks to evaluate the implications of different priors -on the distribution on $p_n$. -This would consist of taking the independent variables and predicting the values -of $p_n$ based on a proposed set of priors. -By plotting these predictions, I can ensure that the specific parameter priors -used are consistent with my prior beliefs on how $p_n$ behaves. -Currently I believe that $p_n$ should be roughly uniform or unimodal, centered -around $p_n = \frac{1}{2}$. - +% +% \subsection{Covariance Structure} +% +% As noted in the diagnostics section, many of the convergence issues seem +% to occure in the covariance structure. 
+% Instead of representing the parameters $\beta$ as independently normal: +% \begin{align} +% \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k) +% \end{align} +% I propose using a multivariate normal distribution: +% \begin{align} +% \beta(d) \sim \text{MvNormal}(\mu, \Sigma) +% \end{align} +% I am not familiar with typical approaches to priors on the covariance matrix, +% so this will require a further literature search as to best practices. + +% \subsection{Finding Reasonable Priors} +% +% In standard bayesian regression, heavy tailed priors are common. +% When working with a bayesian bernoulli-logit model, this is not appropriate as +% heavy tails cause the estimated probabilities $p_n$ to concentrate around the +% values $0$ and $1$, and away from values such as $\frac{1}{2}$ as discussed in +% \cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this. +% +% I indend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using +% prior predictive checks to evaluate the implications of different priors +% on the distribution on $p_n$. +% This would consist of taking the independent variables and predicting the values +% of $p_n$ based on a proposed set of priors. +% By plotting these predictions, I can ensure that the specific parameter priors +% used are consistent with my prior beliefs on how $p_n$ behaves. +% Currently I believe that $p_n$ should be roughly uniform or unimodal, centered +% around $p_n = \frac{1}{2}$. +% \subsection{Imputing Enrollment} @@ -81,21 +81,17 @@ found a way to do so. \subsection{Improving Measures of Market Conditions} -Finally, the currently employed measure of market conditions -- the number of -brands using the same active ingredients -- is not a very good measure of -the options available to potential participants of a clinical trial. 
-The ideal measures would capture the alternatives available to treat a given
-disease (drug meeting the given indication) at the time of the trial snapshot,
-but this data is hard to come by.
 In addition to the fact that many diseases may be treated by non-pharmaceutical means, off-label prescription of pharmaceuticals is legal at the federal level (\cite{commissioner_understanding_2019}). These two facts both complicate measuring market conditions.
-
-One dataset that I have only investigated briefly is the \url{DrugCentral.org}
-database which tracks official indications and some off-label indications as
-well
-(\cite{ursu_drugcentral_2017}).
+One way to address non-pharmaceutical treatments is to concentrate on disease
+domains that are treated primarily with pharmaceuticals.
+This, however, requires domain knowledge that I do not currently have.
+% One dataset that I have only investigated briefly is the \url{DrugCentral.org}
+% database which tracks official indications and some off-label indications as
+% well
+% (\cite{ursu_drugcentral_2017}).
 
 \end{document}