diff --git a/Paper/Main.tex b/Paper/Main.tex
index 79b47bf..b22035c 100644
--- a/Paper/Main.tex
+++ b/Paper/Main.tex
@@ -84,7 +84,7 @@ Section \ref{SEC:Results} discusses the results of the analysis.
 \subfile{sections/06_Results}
 
 %---------------------------------------------------------------
-\section{Improvements}\label{SEC:Improvements}
+\section{Deficiencies and Improvements}\label{SEC:Improvements}
 %---------------------------------------------------------------
 \subfile{sections/08_PotentialImprovements}
 
diff --git a/Paper/jmp_layout_laptop.kdl b/Paper/jmp_layout_laptop.kdl
index b2e6533..a3f056e 100644
--- a/Paper/jmp_layout_laptop.kdl
+++ b/Paper/jmp_layout_laptop.kdl
@@ -1,5 +1,5 @@
 layout {
- tab name="Main and Compile" cwd="~/research/phd_deliverables/jmp/Latex/Paper" hide_floating_panes=true focus=true {
+ tab name="Main and Compile" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" hide_floating_panes=true focus=true {
 // This tab is where I manage main from.
 // it opens up Main.txt for my JMP, opens the pdf in okular (in a floating tab), and then get's ready to build the pdf.
 pane size=1 borderless=true {
@@ -33,7 +33,7 @@ layout {
 }
 }
 
- tab name="sections" cwd="~/research/phd_deliverables/jmp/Latex/Paper/sections" {
+ tab name="sections" cwd="~/research/phd_deliverables/JobMarketPaper/Paper" {
 pane size=1 borderless=true {
 plugin location="tab-bar"
 }
@@ -56,7 +56,7 @@ layout {
 }
 }
 
- tab name="git" cwd="~/research/phd_deliverables/jmp/Latex/Paper/" {
+ tab name="git" cwd="~/research/phd_deliverables/JobMarketPaper/Latex/Paper/" {
 pane size=1 borderless=true {
 plugin location="tab-bar"
 }
diff --git a/Paper/sections/06_Results.tex b/Paper/sections/06_Results.tex
index bf4a34e..f8694cd 100644
--- a/Paper/sections/06_Results.tex
+++ b/Paper/sections/06_Results.tex
@@ -27,21 +27,18 @@ the correlation (measured at $0.34$) is apparent.
 \begin{figure}[H]
 \includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted}
- \todo{Replace this graphic with the histogram of trial durations}
 \caption{Histograms of Trial Durations}
 \label{fig:trial_durations}
 \end{figure}
 
 \begin{figure}[H]
 \includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
- \todo{Replace this graphic with the histogram of snapshots}
 \caption{Histogram of the count of Snapshots}
 \label{fig:snapshot_counts}
 \end{figure}
 
 \begin{figure}[H]
 \includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination}
- \todo{Replace this graphic with the scatterplot comparing durations and snapshots}
 \caption{Scatterplot comparing the Count of Snapshots and Trial Duration}
 \label{fig:snapshot_counts}
 \end{figure}
@@ -86,7 +83,6 @@ keeping enrollment open.
 
 \begin{figure}[H]
 \includegraphics[width=\textwidth]{../assets/img/dist_diff_analysis/p_delay_intervention_distdiff_boxplot}
- \todo{Replace this graphic with the histdiff with boxplot}
 \small{
 Values near 1 indicate a near perfect increase in the probability of termination.
@@ -160,9 +156,8 @@ Three points lead me to believe this:
 often due to thick tails of posterior distributions.
 \item When we examine the results across different ICD-10 groups,
 \ref{fig:pred_dist_dif_delay2}
- \todo{move figure from below}
 we note this same issue.
- \item In Figure \ref{fig:betas_delay}, we see that some some ICD-10 categories
+ \item In Figure \ref{fig:parameters_ANR_by_group}, we see that some ICD-10 categories
 \todo{add figure}
 have \todo{note fat tails}.
 \item There are few trials available, particularly among some specific
@@ -173,9 +168,11 @@ Three points lead me to believe this:
 % -
 % -
 % -
-Overally it is hard to escape the conclusion that more data is needed across
-many -- if not all -- of the disease categories.
+We can examine the per-group distributions of differences in Figure \ref{fig:pred_dist_dif_delay2} to
+ascertain that the high-impact group does exist in each of the groups.
+This lends credence to the idea that this is a modelling issue, potentially
+due to the low amounts of data overall.
 
 Figure \ref{fig:pred_dist_dif_delay2} shows how this overall
@@ -187,13 +184,21 @@ result comes from different disease categories.
 \end{figure}
 
-\subsection{Secondary Results}
 % Examine beta parameters
 % - Little movement except where data is strong, general negative movement. Still really wide
 % - Note how they all learned (partial pooling) reduction in \beta from ANR?
 % - Need to discuss the 5 different states. Can't remember which one is dropped for the life of me. May need to fix parameterization.
 % -
+Finally, in Figure \ref{fig:parameters_ANR_by_group}, we can see the estimated distributions of the $\beta$ parameter for
+the status: \textbf{Active, not recruiting}.
+The prior distributions were centered on zero, but we can see that the pooled learning has moved the mean
+values negative, representing reductions in the probability of termination across the board.
+This decrease in the probability of termination is strongest in the categories of Neoplasms ($n=$),
+Musculoskeletal diseases ($n=$), and Infections and Parasites ($n=$), the three categories with the most data.
+As this is a comparison against the trial status XXX, we note that
+\todo{The natural comparison I want to make is against the Recruiting status. Do I want to redo this so that I can read that directly? It shouldn't affect the $\delta_p$ analysis, but this could probably use it.}
+Overall, this suggests that extending a clinical trial's enrollment period will reduce the probability of termination.
 
 \begin{figure}[H]
 \includegraphics[width=\textwidth]{../assets/img/betas/parameter_across_groups/parameters_12_status_ANR}
@@ -202,5 +207,9 @@ result comes from different disease categories.
 \end{figure}
 % -
+Overall, it is hard to escape the conclusion that more data is needed across
+many -- if not all -- of the disease categories.
 
 \end{document}
+
+
diff --git a/Paper/sections/08_PotentialImprovements.tex b/Paper/sections/08_PotentialImprovements.tex
index c85cae7..2be03d0 100644
--- a/Paper/sections/08_PotentialImprovements.tex
+++ b/Paper/sections/08_PotentialImprovements.tex
@@ -4,64 +4,33 @@
 \begin{document}
 
 As noted above, there are various issues with the analysis as completed so far.
-Below I discuss various steps that I believe will improve the analysis.
+Below I discuss these issues and ways to address them that I believe will improve the analysis.
 
 \subsection{Increasing number of observations}
 
 The most important step is to increase the number of observations available.
-Currently this requires matching trials to ICD-10 codes by hand, but
-there are certainly some steps that can be taken to improve the speed with which
-this can be done.
-%
-% \subsection{Covariance Structure}
-%
-% As noted in the diagnostics section, many of the convergence issues seem
-% to occure in the covariance structure.
-% Instead of representing the parameters $\beta$ as independently normal:
-% \begin{align}
-% \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k)
-% \end{align}
-% I propose using a multivariate normal distribution:
-% \begin{align}
-% \beta(d) \sim \text{MvNormal}(\mu, \Sigma)
-% \end{align}
-% I am not familiar with typical approaches to priors on the covariance matrix,
-% so this will require a further literature search as to best practices.
+Currently this requires matching trials to ICD-10 codes by hand.
+Improvements in large language models may make this data more accessible, or
+the data may be available in a commercial dataset.
 
-% \subsection{Finding Reasonable Priors}
-%
-% In standard bayesian regression, heavy tailed priors are common.
-% When working with a bayesian bernoulli-logit model, this is not appropriate as
-% heavy tails cause the estimated probabilities $p_n$ to concentrate around the
-% values $0$ and $1$, and away from values such as $\frac{1}{2}$ as discussed in
-% \cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this.
-%
-% I indend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using
-% prior predictive checks to evaluate the implications of different priors
-% on the distribution on $p_n$.
-% This would consist of taking the independent variables and predicting the values
-% of $p_n$ based on a proposed set of priors.
-% By plotting these predictions, I can ensure that the specific parameter priors
-% used are consistent with my prior beliefs on how $p_n$ behaves.
-% Currently I believe that $p_n$ should be roughly uniform or unimodal, centered
-% around $p_n = \frac{1}{2}$.
-%
-\subsection{Imputing Enrollment}
+\subsection{Enrollment Modelling}
 
-Finally, I must address the issue of how enrollment is reported.
-In many cases, the trial continues to report an anticipated enrollment value
-while the trial is still recruiting.
-Thus using anticipated enrollment figures is inappropriate.
-I am planning on using bayesian imputation to estimate actual enrollment
-when it has not yet occured.
-This will require building a statistical model of the enrollment process.
-One advantage this dataset has is that trial sponsors provide their anticipated
-enrollment numbers, allowing me to use this in the prediction model.
-Additionally, each snapshot contains the elapsed duration and current status of
-the trial , which may help improve the prediction.
-Although predicted enrollment will be imprecise, it explicitly accounts for
-uncertanty in the imputation and dependent calculations \cite{mcelreath_statistical_2020}.
+One of the original goals of this project was to examine the impact that
+enrollment struggles have on the probability of trial termination.
+Unfortunately, this requires a model of clinical trial enrollment, and the
+necessary data are not in my dataset.
+In most cases the trial sponsor reports the anticipated enrollment value
+while the trial is still recruiting and only updates the actual enrollment
+after the trial has ended.
+Some trials do publish an up-to-date record of their enrollment numbers, but this
+is rare.
+If a Bayesian model of multi-site enrollment can be developed for the disease categories
+in question, then it will be possible to impute this missing data probabilistically,
+which will allow me to estimate the direct effect of slow enrollment
+\cite{mcelreath_statistical_2020}.
+Such a model does not exist yet, although some work on multi-site enrollment forecasting has
+been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS}.
 
 \subsection{Improving Population Estimates}
 
@@ -74,24 +43,31 @@ drug resistant and drug suceptible tuberculosis.
 In contrast, there is no category for non-age related macular degeneration.
 One resulting concern is that for a given ICD-10 code, the applicable GBD
 population estimates may act as an estimate of the upper bound of population size
-(\cite{global_burden_of_disease_collective_network_global_2020}). %fix citation
-I would like to explicitly address this in my model, although I have not
-found a way to do so.
+(\cite{global_burden_of_disease_collective_network_global_2020}).
+The dataset contains various measures of disease severity, so it may be
+worth investigating how to incorporate some of those measures.
 \subsection{Improving Measures of Market Conditions}
 
+% Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases)
+
 In addition to the fact that many diseases may be treated by non-pharmaceutical
 means, off-label prescription of pharmaceuticals is legal at the federal level
 (\cite{commissioner_understanding_2019}).
 These two facts both complicate measuring market conditions.
 One way to address non-pharmaceutical treatments is to concentrate on domains
 that are primarily treated by pharmaceuticals.
-This requires domain knowledge that I don't have.
-% One dataset that I have only investigated briefly is the \url{DrugCentral.org}
-% database which tracks official indications and some off-label indications as
-% well
-% (\cite{ursu_drugcentral_2017}).
+Another way to address this would be to focus the analysis on just a few specific
+diseases, for which a history of treatment options can be compiled.
+This second approach may also allow the researcher to distinguish the direction
+of causality between population size and number of drugs on the market;
+for example, drugs to treat a chronic, non-fatal disease will probably not
+affect the market size much in the short to medium term.
+This allows the effect of market conditions to be isolated from
+the effects of the population.
+% Alternative approaches
+% - diseases with constant kill rates? population effect should be relatively constant?
 
 \end{document}
diff --git a/Paper/sections/09_Conclusion.tex b/Paper/sections/09_Conclusion.tex
index 2ddd730..fdce845 100644
--- a/Paper/sections/09_Conclusion.tex
+++ b/Paper/sections/09_Conclusion.tex
@@ -6,9 +6,7 @@ Identifying commercial impediments to successfully completing clinical trials
 in otherwise capable pharmaceuticals will hopefully lead to a more robust and
 competitive market.
 Although the current state of this research is insufficient to draw robust
-conclusions, early results suggest that enrollment rates have some impact
-on whether or not a clinical trial terminates early or continues
-to full completion.
-
+conclusions, these early results suggest that delaying the close of enrollment periods
+reduces the probability of termination of a trial.
 
 \end{document}