diff --git a/Paper/Main.tex b/Paper/Main.tex index 79b47bf..b22035c 100644 --- a/Paper/Main.tex +++ b/Paper/Main.tex @@ -84,7 +84,7 @@ Section \ref{SEC:Results} discusses the results of the analysis. \subfile{sections/06_Results} %--------------------------------------------------------------- -\section{Improvements}\label{SEC:Improvements} +\section{Deficiencies and Improvements}\label{SEC:Improvements} %--------------------------------------------------------------- \subfile{sections/08_PotentialImprovements} diff --git a/Paper/jmp_layout_laptop.kdl b/Paper/jmp_layout_laptop.kdl index b2e6533..a3f056e 100644 --- a/Paper/jmp_layout_laptop.kdl +++ b/Paper/jmp_layout_laptop.kdl @@ -1,5 +1,5 @@ layout { - tab name="Main and Compile" cwd="~/research/phd_deliverables/jmp/Latex/Paper" hide_floating_panes=true focus=true { + tab name="Main and Compile" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" hide_floating_panes=true focus=true { // This tab is where I manage main from. // it opens up Main.txt for my JMP, opens the pdf in okular (in a floating tab), and then get's ready to build the pdf. pane size=1 borderless=true { @@ -33,7 +33,7 @@ layout { } } - tab name="sections" cwd="~/research/phd_deliverables/jmp/Latex/Paper/sections" { + tab name="sections" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" { pane size=1 borderless=true { plugin location="tab-bar" } @@ -56,7 +56,7 @@ layout { } } - tab name="git" cwd="~/research/phd_deliverables/jmp/Latex/Paper/" { + tab name="git" cwd="~/research/phd_deliverables/JobMarketPaper/Latex/Paper/" { pane size=1 borderless=true { plugin location="tab-bar" } diff --git a/Paper/sections/06_Results.tex b/Paper/sections/06_Results.tex index bf4a34e..7f3def9 100644 --- a/Paper/sections/06_Results.tex +++ b/Paper/sections/06_Results.tex @@ -27,21 +27,18 @@ the correlation (measured at $0.34$) is apparent. 
\begin{figure}[H] \includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted} - \todo{Replace this graphic with the histogram of trial durations} \caption{Histograms of Trial Durations} \label{fig:trial_durations} \end{figure} \begin{figure}[H] \includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots} - \todo{Replace this graphic with the histogram of snapshots} \caption{Histogram of the count of Snapshots} \label{fig:snapshot_counts} \end{figure} \begin{figure}[H] \includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination} - \todo{Replace this graphic with the scatterplot comparing durations and snapshots} \caption{Scatterplot comparing the Count of Snapshots and Trial Duration} \label{fig:snapshot_counts} \end{figure} @@ -74,6 +71,34 @@ not represented at all. \label{fig:barchart_idc_categories} \end{figure} +% Estimation Procedure +I fit the econometric model using mc-stan +\cite{standevelopmentteam_StanModelling_2022} +through the rstan +\cite{standevelopmentteam_RStanInterface_2023} +interface using 4 chains with +%describe +2,500 +warmup iterations and +2,500 +sampling iterations each. + +Two of the chains experienced a low +Estimated Bayesian Fraction of Missing Information (E-BFMI), +suggesting that there are some parts of the posterior distribution +that were not explored well during the model fitting. +I presume this is due to the low number of trials in some of the +ICD-10 categories. +We can see in Figure \ref{fig:barchart_idc_categories} that some of these +disease categories had a single trial represented while others were +not represented at all. + +\begin{figure}[H] + \includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts} + \caption{Bar chart of trials by ICD-10 categories} + \label{fig:barchart_idc_categories} +\end{figure} + \subsection{Primary Results} @@ -202,5 +227,144 @@ result comes from different disease categories. 
\end{figure} % - +\subsection{Primary Results} + +The primary, causally-identified value we can estimate is the change in +the probability of termination caused by (counterfactually) keeping enrollment +open instead of closing enrollment when observed. +In Figure \ref{fig:pred_dist_diff_delay} below, we see this impact of +keeping enrollment open. + + +\begin{figure}[H] + \includegraphics[width=\textwidth]{../assets/img/dist_diff_analysis/p_delay_intervention_distdiff_boxplot} + \small{ + Values near 1 indicate a near perfect increase in the probability + of termination. + Values near 0 indicate little change in probability, + while values near -1 represent a decrease in the probability + of termination. + The scale is in probability points, thus a value near 1 is a change + from unlikely to terminate under control, to highly likely to + terminate. + } + \caption{Histogram of the Distribution of Predicted Differences} + \label{fig:pred_dist_diff_delay} +\end{figure} + +There are a few interesting things to point out here. +Let's start by getting acquainted with the details of the distribution above. +% - spike at 0 +% - the boxplot +% - 63% of mass below 0 : find better way to say that +% - For a random trial, there is a 63% chance that the impact is to reduce the probability of a termination. +% - 2 pctg-point wide band centered on 0 has ~13% of the mass +% - mean represents 9.x% increase in probability of termination. A quick simulation gives about the same pctg-point increase in terminated trials. + +A few interesting interpretation bits come out of this. +% - there are 3 regimes: low impact (near zero), medium impact (concentrated in decreased probability of termination), and high impact (concentrated in increased probability of termination). +The first is that there appear to be three different regimes. +The first regime consists of the low impact results, i.e. those values of $\delta_p$ +near zero. 
+About 13\% of trials lie within a single percentage point change of zero, +suggesting that there is a reasonable chance that delaying +a close of enrollment has no impact. +The second regime consists of the moderate impact on clinical trials' +probabilities of termination, say values in the interval $[-0.5, 0.5]$ +on the graph. +Most of this probability mass represents a decrease in the probability of +a termination, some of it rather large. +Finally, there exists the high impact region, almost exclusively concentrated +around increases in the probability of termination at $\delta_p > 0.75$. +These represent cases where delaying the close of enrollment changes a trial +from a case where they were highly likely to complete their primary objectives to +a case where they were likely or almost certain to terminate the trial early. +% - the high impact regime is strange because it consists of trials that moved from unlikely (<20% chance) of termination to a high chance (>80% chance) of termination. Something like 5% of all trials have a greater than 98 percentage point increase in termination. Not sure what this is doing. + +% - Potential Explanations for high impact regime: +How could this intervention have such a wide range in the intensity +and direction of impacts? +A few explanations include that some trials are susceptible or that this is a +result of too little data. +% - Some trials are highly susceptible. This is the face value effect +One option is that some categories are more susceptible to +issues with participant enrollment. +If this is the case, we should be able to isolate categories that contribute +the most to this effect. +Another is that this might be a modelling artefact, due to the relatively +low number of trials in certain ICD-10 categories. +In short, there might be high levels of uncertainty in some parameter values, +which manifest as fat tails in the distributions of the $\beta$ parameters. 
+Because of the logistic format of the model, these fat tails lead to +extreme values of $p$, and potentially large changes $\delta_p$. +% - Could be uncertainty. If the model is highly uncertain, e.g. there isn't enough data, we could have a small percentage of large increases. This could be in general or just for a few categories with low amounts of data. +% - +% - + +I believe that this second explanation -- a model artifact due to uncertainty -- +is likely to be the cause. +Several points lead me to believe this: +\begin{itemize} + \item The low fractions of E-BFMI suggest that the sampler is struggling + to explore some regions of the posterior. + According to \cite{standevelopmentteam_RuntimeWarnings_2022} this is + often due to thick tails of posterior distributions. + \item When we examine the results across different ICD-10 groups + (Figure \ref{fig:pred_dist_dif_delay2}), + we note this same issue. + \item In Figure \ref{fig:parameters_ANR_by_group}, we see that some ICD-10 categories + \todo{add figure} + have \todo{note fat tails}. + \item There are few trials available, particularly among some specific + ICD-10 categories. +\end{itemize} +% - take a look at beta values and then discuss if that lines up with results from dist-diff by group. +% - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories. +% - +% - +% - + +We can examine the per-group distributions of differences in Figure \ref{fig:pred_dist_dif_delay2} to +ascertain that the high impact group does exist in each of the groups. +This lends credence to the idea that this is a modelling issue, potentially +due to the low amounts of data overall. + + +Figure \ref{fig:pred_dist_dif_delay2} shows how this overall +result comes from different disease categories. 
+\begin{figure}[H] + \includegraphics[width=\textwidth]{../assets/img/dist_diff_analysis/p_delay_intervention_distdiff_by_group} + \caption{Distribution of Predicted differences by Disease Group} + \label{fig:pred_dist_dif_delay2} +\end{figure} + + + +% Examine beta parameters +% - Little movement except where data is strong, general negative movement. Still really wide +% - Note how they all learned (partial pooling) reduction in \beta from ANR? +% - Need to discuss the 5 different states. Can't remember which one is dropped for the life of me. May need to fix parameterization. +% - +Finally, in Figure \ref{fig:parameters_ANR_by_group}, we can see the estimated distributions of the $\beta$ parameter for +the status: \textbf{Active, not recruiting}. +The prior distributions were centered on zero, but we can see that the pooled learning has moved the mean +values negative, representing reductions in the probability of termination across the board. +This decrease in the probability of termination is strongest in the categories of Neoplasms ($n=$), +Musculoskeletal diseases ($n=$), and Infections and Parasites ($n=$), the three categories with the most data. +As this is a comparison against the trial status XXX, we note that +\todo{The natural comparison I want to make is against the Recruiting status. Do I want to redo this so that I can read that directly? It shouldn't affect the $\delta_p$ analysis, but this could probably use it.} +Overall, this suggests that extending a clinical trial's enrollment period will reduce the probability of termination. + +\begin{figure}[H] + \includegraphics[width=\textwidth]{../assets/img/betas/parameter_across_groups/parameters_12_status_ANR} + \caption{Distribution of parameters associated with ``Active, not recruiting'' status, by ICD-10 Category} + \label{fig:parameters_ANR_by_group} +\end{figure} +% - + +Overall, it is hard to escape the conclusion that more data is needed across +many -- if not all -- of the disease categories. 
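+
+% Illustrative sketch only: these numbers are arithmetic on the logistic function, not estimates from the fitted model.
+As a rough illustration of the logistic argument made earlier in this subsection,
+assume the usual Bernoulli-logit form, where $\eta$ is a hypothetical name for the
+linear predictor and the superscripts mark the counterfactual (enrollment kept open)
+and observed (enrollment closed) scenarios:
+\begin{align}
+    p &= \frac{1}{1 + e^{-\eta}}, &
+    \delta_p &= \frac{1}{1 + e^{-\eta^{\text{open}}}} - \frac{1}{1 + e^{-\eta^{\text{closed}}}} .
+\end{align}
+Under this assumption, a posterior draw that moves $\eta$ from $-2$ to $2$ gives
+$\delta_p \approx 0.881 - 0.119 = 0.762$, and a move from $-4$ to $4$ gives
+$\delta_p \approx 0.982 - 0.018 = 0.964$.
+Heavy tails on the $\beta$ parameters make shifts of this size in $\eta$ more frequent,
+which is one way the high impact regime could appear even when the data say little.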
\end{document} + + diff --git a/Paper/sections/08_PotentialImprovements.tex b/Paper/sections/08_PotentialImprovements.tex index c85cae7..2be03d0 100644 --- a/Paper/sections/08_PotentialImprovements.tex +++ b/Paper/sections/08_PotentialImprovements.tex @@ -4,64 +4,33 @@ \begin{document} As noted above, there are various issues with the analysis as completed so far. -Below I discuss various steps that I believe will improve the analysis. +Below I discuss various issues and ways to address them that I believe will improve the analysis. \subsection{Increasing number of observations} The most important step is to increase the number of observations available. -Currently this requires matching trials to ICD-10 codes by hand, but -there are certainly some steps that can be taken to improve the speed with which -this can be done. -% -% \subsection{Covariance Structure} -% -% As noted in the diagnostics section, many of the convergence issues seem -% to occure in the covariance structure. -% Instead of representing the parameters $\beta$ as independently normal: -% \begin{align} -% \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k) -% \end{align} -% I propose using a multivariate normal distribution: -% \begin{align} -% \beta(d) \sim \text{MvNormal}(\mu, \Sigma) -% \end{align} -% I am not familiar with typical approaches to priors on the covariance matrix, -% so this will require a further literature search as to best practices. +Currently this requires matching trials to ICD-10 codes by hand. +Improvements in Large-Language-Models may make this data more accessible, or +the data may be available in a commercial dataset. -% \subsection{Finding Reasonable Priors} -% -% In standard bayesian regression, heavy tailed priors are common. 
-% When working with a bayesian bernoulli-logit model, this is not appropriate as -% heavy tails cause the estimated probabilities $p_n$ to concentrate around the -% values $0$ and $1$, and away from values such as $\frac{1}{2}$ as discussed in -% \cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this. -% -% I indend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using -% prior predictive checks to evaluate the implications of different priors -% on the distribution on $p_n$. -% This would consist of taking the independent variables and predicting the values -% of $p_n$ based on a proposed set of priors. -% By plotting these predictions, I can ensure that the specific parameter priors -% used are consistent with my prior beliefs on how $p_n$ behaves. -% Currently I believe that $p_n$ should be roughly uniform or unimodal, centered -% around $p_n = \frac{1}{2}$. -% -\subsection{Imputing Enrollment} +\subsection{Enrollment Modelling} -Finally, I must address the issue of how enrollment is reported. -In many cases, the trial continues to report an anticipated enrollment value -while the trial is still recruiting. -Thus using anticipated enrollment figures is inappropriate. -I am planning on using bayesian imputation to estimate actual enrollment -when it has not yet occured. -This will require building a statistical model of the enrollment process. -One advantage this dataset has is that trial sponsors provide their anticipated -enrollment numbers, allowing me to use this in the prediction model. -Additionally, each snapshot contains the elapsed duration and current status of -the trial , which may help improve the prediction. -Although predicted enrollment will be imprecise, it explicitly accounts for -uncertanty in the imputation and dependent calculations \cite{mcelreath_statistical_2020}. 
+One of the original goals of this project was to examine the impact that +enrollment struggles have on the probability of trial termination. +Unfortunately, this requires a model of clinical trial enrollment, and the +data is just not in my dataset. +In most cases the trial sponsor reports the anticipated enrollment value +while the trial is still recruiting and only updates the actual enrollment +after the trial has ended. +Some trials do publish an up-to-date record of their enrollment numbers, but this +is rare. +If a Bayesian model of multisite enrollment can be developed for the disease categories +in question, then it will be possible to impute this missing data probabilistically, +which will allow me to estimate the direct effect of slow enrollment +\cite{mcelreath_statistical_2020}. +This does not exist yet, although some work on multi-site enrollment forecasting has +been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS}. \subsection{Improving Population Estimates} @@ -74,24 +43,31 @@ drug resistant and drug suceptible tuberculosis. In contrast, there is no category for non-age related macular degeneration. One resulting concern is that for a given ICD-10 code, the applicable GBD population estimates may act as an estimate of the upper bound of population size -(\cite{global_burden_of_disease_collective_network_global_2020}). %fix citation -I would like to explicitly address this in my model, although I have not -found a way to do so. +(\cite{global_burden_of_disease_collective_network_global_2020}). +The dataset contains various measures of disease severity, so it may be +worth investigating how to incorporate some of those measures. 
\subsection{Improving Measures of Market Conditions} +% Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases) + In addition to the fact that many diseases may be treated by non-pharmaceutical means, off-label prescription of pharmaceuticals is legal at the federal level (\cite{commissioner_understanding_2019}). These two facts both complicate measuring market conditions. One way to address non-pharmaceutical treatments is to concentrate on domains that are primarily treated by pharmaceuticals. -This requires domain knowledge that I don't have. -% One dataset that I have only investigated briefly is the \url{DrugCentral.org} -% database which tracks official indications and some off-label indications as -% well -% (\cite{ursu_drugcentral_2017}). +Another way to address this would be to focus the analysis on just a few specific +diseases, for which a history of treatment options can be compiled. +This second approach may also allow the researcher to distinguish the direction +of causality between population size and number of drugs on the market; +for example, drugs to treat a chronic, non-fatal disease will probably not +affect the market size much in the short to medium term. +This allows the effect of market conditions to be isolated from +the effects of the population. +% Alternative approaches +% - diseases with constant kill rates? population effect should be relatively constant? \end{document} diff --git a/Paper/sections/09_Conclusion.tex b/Paper/sections/09_Conclusion.tex index 2ddd730..fdce845 100644 --- a/Paper/sections/09_Conclusion.tex +++ b/Paper/sections/09_Conclusion.tex @@ -6,9 +6,7 @@ Identifying commercial impediments to successfully completing clinical trials in otherwise capable pharmaceuticals will hopefully lead to a more robust and competitive market. 
Although the current state of this research is insufficient to draw robust -conclusions, early results suggest that enrollment rates have some impact -on whether or not a clinical trial terminates early or continues -to full completion. - +conclusions, these early results suggest that delaying the close of enrollment periods +reduces the probability of termination of a trial. \end{document}