
\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}

As noted above, the analysis completed so far has several issues.
Below I discuss the steps that I believe will address them.

\subsection{Increasing number of observations}

The most important step is to increase the number of observations available.
Currently this requires matching trials to ICD-10 codes by hand, which is slow,
although there are likely ways to speed up this process.

\subsection{Covariance Structure}

As noted in the diagnostics section, many of the convergence issues seem
to occur in the covariance structure.
Instead of representing the parameters $\beta$ as independently normal:
\begin{align}
    \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k)
\end{align}
I propose using a multivariate normal distribution:
\begin{align}
    \beta(d) \sim \text{MvNormal}(\mu, \Sigma)
\end{align}
I am not familiar with typical approaches to priors on the covariance matrix,
so this will require a further literature search as to best practices.
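One commonly used option, given here only as a starting point that I have not
yet evaluated for this model, is to separate the scales from the correlations:
\begin{align}
    \Sigma &= \text{diag}(\tau) \, \Omega \, \text{diag}(\tau) \\
    \tau_k &\sim \text{Exponential}(1) \\
    \Omega &\sim \text{LKJCorr}(\eta)
\end{align}
The symbols $\tau$, $\Omega$, and $\eta$ are notation introduced for this sketch:
$\tau$ is the vector of standard deviations, $\Omega$ is a correlation matrix,
and $\eta > 0$ is a shape parameter.
Setting $\eta = 1$ is uniform over correlation matrices, while $\eta > 1$ favors
weaker correlations.
The $\text{Exponential}(1)$ prior on $\tau_k$ is a placeholder rather than a
considered choice.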

\subsection{Finding Reasonable Priors}

In standard Bayesian regression, heavy-tailed priors are common.
When working with a Bayesian Bernoulli-logit model, this is not appropriate, as
heavy tails cause the estimated probabilities $p_n$ to concentrate around the
values $0$ and $1$, and away from values such as $\frac{1}{2}$, as discussed in
\cite{mcelreath_statistical_2020}. %TODO: double check the chapter for this.
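
As a rough illustration of the mechanism, consider an intercept-only model with
$\text{logit}(p_n) = \alpha$ and $\alpha \sim \text{Normal}(0, \sigma)$, where
$\sigma$ is a standard deviation.
This simplified model and the numbers below are assumptions chosen for the
example, not part of my specification.
Since $\text{logit}^{-1}(3) \approx 0.95$,
\begin{align}
    \Pr\left(p_n < 0.05 \text{ or } p_n > 0.95\right)
    \approx \Pr\left(|\alpha| > 3\right)
    = 2\left(1 - \Phi\left(3 / \sigma\right)\right)
\end{align}
where $\Phi$ is the standard normal CDF.
With $\sigma = 10$ this is roughly $0.76$, while with $\sigma = 1.5$ it is
roughly $0.05$.
Heavy tails act in the same direction, placing more prior mass far from zero
on the logit scale.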

I intend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using
prior predictive checks to evaluate the implications of different priors
for the distribution of $p_n$.
This would consist of taking the independent variables and predicting the values
of $p_n$ based on a proposed set of priors.
By plotting these predictions, I can ensure that the specific parameter priors
used are consistent with my prior beliefs on how $p_n$ behaves.
Currently I believe that $p_n$ should be roughly uniform or unimodal, centered
around $p_n = \frac{1}{2}$.
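Written out, the check amounts to the following simulation.
The number of prior draws $S$, the covariate vector $x_n$, the candidate prior
$\pi$, and the superscript $(s)$ are notation introduced here, and the linear
predictor $x_n^\top \beta$ is a simplified stand-in for the full model:
\begin{align}
    \beta^{(s)} &\sim \pi(\beta), \quad s = 1, \dots, S \\
    p_n^{(s)} &= \text{logit}^{-1}\left(x_n^\top \beta^{(s)}\right)
               = \frac{1}{1 + \exp\left(-x_n^\top \beta^{(s)}\right)}
\end{align}
A histogram of the $p_n^{(s)}$ across all $n$ and $s$ then shows what the
candidate prior implies for $p_n$ before any outcome data are used.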


\subsection{Imputing Enrollment}

I must also address the issue of how enrollment is reported.
In many cases, the trial continues to report an anticipated enrollment value
while it is still recruiting.
Thus treating anticipated enrollment figures as actual enrollment is inappropriate.
I am planning on using Bayesian imputation to estimate actual enrollment
when it has not yet occurred.
This will require building a statistical model of the enrollment process.
One advantage this dataset has is that trial sponsors provide their anticipated
enrollment numbers, allowing me to use this in the prediction model.
Additionally, each snapshot contains the elapsed duration and current status of
the trial, which may help improve the prediction.
Although predicted enrollment will be imprecise, this approach explicitly accounts for
uncertainty in the imputation and propagates it to dependent calculations \cite{mcelreath_statistical_2020}.
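One possible form for such a model, given purely as a sketch since I have not
yet chosen a likelihood or covariates, is a count regression on the anticipated
figure:
\begin{align}
    E_i^{\text{actual}} &\sim \text{NegBinomial}(\lambda_i, \phi) \\
    \log \lambda_i &= \gamma_0 + \gamma_1 \log E_i^{\text{anticipated}}
                      + \gamma_2 t_i + \gamma_{\text{status}[i]}
\end{align}
All symbols here are hypothetical: $E_i^{\text{actual}}$ and
$E_i^{\text{anticipated}}$ are the actual and anticipated enrollment of trial
$i$, $\lambda_i$ and $\phi$ are the mean and dispersion of the negative
binomial, $t_i$ is the elapsed duration at the snapshot, and
$\gamma_{\text{status}[i]}$ is an offset for the trial's current status.
For trials without a reported actual enrollment, $E_i^{\text{actual}}$ would be
treated as missing data and drawn from its posterior predictive distribution.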

\subsection{Improving Population Estimates}

The Global Burden of Disease dataset contains the best estimates of disease
population sizes that I have found so far.
Unfortunately, for some conditions it can be relatively imprecise due to
its focus on providing data geared towards public health policy.
For example, GBD contains categories for both
drug-resistant and drug-susceptible tuberculosis.
In contrast, there is no category for non-age-related macular degeneration.
One resulting concern is that for a given ICD-10 code, the applicable GBD population
estimates may act as an estimate of the upper bound of population size
(\cite{global_burden_of_disease_collective_network_global_2020}). %fix citation
I would like to explicitly address this in my model, although I have not
found a way to do so.
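One tentative starting point, which I have not tested, would be to treat the
GBD figure as a ceiling and estimate the share of it that applies to a given
ICD-10 code:
\begin{align}
    N_c = \theta_c \, G_c, \qquad \theta_c \sim \text{Beta}(a, b)
\end{align}
Here $G_c$ is the applicable GBD estimate for code $c$, $N_c$ is the unobserved
population for that code, and $\theta_c \in (0, 1)$ is the unknown share.
These symbols and the hyperparameters $a$ and $b$ are all assumptions
introduced for this sketch.
Whether the data can identify $\theta_c$ is an open question.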


\subsection{Improving Measures of Market Conditions}

Finally, the current measure of market conditions, the number of
brands using the same active ingredients, is a poor proxy for
the options available to potential participants in a clinical trial.
The ideal measure would capture the alternatives available to treat a given
disease (drugs meeting the given indication) at the time of the trial snapshot,
but this data is hard to come by.
In addition to the fact that many diseases may be treated by non-pharmaceutical
means, off-label prescription of pharmaceuticals is legal at the federal level
(\cite{commissioner_understanding_2019}).
Both facts complicate measuring market conditions.

One dataset that I have only investigated briefly is the \url{DrugCentral.org}
database, which tracks official indications as well as some off-label indications
(\cite{ursu_drugcentral_2017}).


\end{document}