\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}

As noted above, the analysis completed so far has several issues.
Below I discuss steps that I believe will improve it.

\subsection{Increasing the Number of Observations}

The most important step is to increase the number of observations available.
Currently this requires matching trials to ICD-10 codes by hand, although there are certainly ways to speed up this process.

\subsection{Covariance Structure}

As noted in the diagnostics section, many of the convergence issues seem to occur in the covariance structure.
Instead of representing the parameters $\beta$ as independently normal:
\begin{align}
    \beta_k(d) \sim \text{Normal}(\mu_k, \sigma_k)
\end{align}
I propose using a multivariate normal distribution:
\begin{align}
    \beta(d) \sim \text{MvNormal}(\mu, \Sigma)
\end{align}
I am not familiar with typical approaches to priors on the covariance matrix, so this will require a further literature search on best practices.

\subsection{Finding Reasonable Priors}

In standard Bayesian regression, heavy-tailed priors are common.
When working with a Bayesian Bernoulli-logit model, this is not appropriate, as heavy tails cause the estimated probabilities $p_n$ to concentrate around the values $0$ and $1$ and away from values such as $\frac{1}{2}$, as discussed in \cite{mcelreath_statistical_2020}.
%TODO: double check the chapter for this.

I intend to take the general approach recommended in \cite{mcelreath_statistical_2020} of using prior predictive checks to evaluate the implications of different priors for the distribution of $p_n$.
This would consist of taking the independent variables and predicting the values of $p_n$ based on a proposed set of priors.
By plotting these predictions, I can ensure that the specific parameter priors used are consistent with my prior beliefs on how $p_n$ behaves.
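
As a minimal worked illustration of why the prior scale matters (a simplification, not the full model), suppose the linear predictor $\eta_n = \text{logit}(p_n)$ has an induced prior $\eta_n \sim \text{Normal}(0, \tau)$, with $\tau$ a standard deviation.
The symbols $\eta_n$ and $\tau$ are introduced only for this sketch; in the hierarchical model the induced prior depends on the hyperpriors and the covariates and need not be exactly normal.
A change of variables gives the implied density of $p_n$:
\begin{align}
    f(p_n) = \frac{1}{\tau \sqrt{2\pi}} \, \frac{1}{p_n (1 - p_n)} \exp\left( -\frac{\text{logit}(p_n)^2}{2 \tau^2} \right)
\end{align}
Setting the derivative of $\log f$ to zero yields the stationarity condition
\begin{align}
    \eta_n = \tau^2 (2 p_n - 1) = \tau^2 \tanh\left( \frac{\eta_n}{2} \right)
\end{align}
which has the single solution $\eta_n = 0$ (a mode at $p_n = \frac{1}{2}$) when $\tau \le \sqrt{2}$, and two additional symmetric solutions when $\tau > \sqrt{2}$.
Under these assumptions, wider priors on the logit scale therefore produce a bimodal distribution of $p_n$ whose modes move toward $0$ and $1$ as $\tau$ grows.
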
Currently I believe that $p_n$ should be roughly uniform or unimodal, centered around $p_n = \frac{1}{2}$.

\subsection{Imputing Enrollment}

I must also address the issue of how enrollment is reported.
In many cases, a trial continues to report an anticipated enrollment value while it is still recruiting.
Thus using anticipated enrollment figures in place of actual enrollment is inappropriate.

I plan to use Bayesian imputation to estimate actual enrollment when it has not yet occurred.
This will require building a statistical model of the enrollment process.
One advantage of this dataset is that trial sponsors provide their anticipated enrollment numbers, which I can use in the prediction model.
Additionally, each snapshot contains the elapsed duration and current status of the trial, which may help improve the prediction.
Although the predicted enrollment will be imprecise, this approach explicitly accounts for uncertainty in the imputation and in dependent calculations \cite{mcelreath_statistical_2020}.

\subsection{Improving Population Estimates}

The Global Burden of Disease (GBD) dataset contains the best estimates of disease population sizes that I have found so far.
Unfortunately, for some conditions it can be relatively imprecise due to its focus on providing data geared towards public health policy.
For example, GBD contains categories for both drug-resistant and drug-susceptible tuberculosis.
In contrast, there is no category for non-age-related macular degeneration.
One resulting concern is that, for a given ICD-10 code, the applicable GBD population estimate may act as an estimate of the upper bound of the population size (\cite{global_burden_of_disease_collective_network_global_2020}).
%fix citation
I would like to explicitly address this in my model, although I have not found a way to do so.
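
To state the concern concretely, using notation introduced only for this sketch, let $G_c$ denote the GBD population estimate mapped to a given ICD-10 code $c$ and let $N_c$ denote the true, unobserved population with that condition, assuming both are positive.
The upper-bound concern is then
\begin{align}
    N_c \le G_c \quad \Longleftrightarrow \quad N_c = \theta_c \, G_c, \qquad \theta_c \in (0, 1]
\end{align}
Under this framing, addressing the issue explicitly would amount to placing a prior on the unknown fraction $\theta_c$; which prior is defensible, and whether the data can inform it, remains the open question.
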
\subsection{Improving Measures of Market Conditions}

Finally, the measure of market conditions currently employed (the number of brands using the same active ingredients) is a poor proxy for the options available to potential participants in a clinical trial.
The ideal measure would capture the alternatives available to treat a given disease (drugs meeting the given indication) at the time of the trial snapshot, but this data is hard to come by.

In addition to the fact that many diseases may be treated by non-pharmaceutical means, off-label prescription of pharmaceuticals is legal at the federal level (\cite{commissioner_understanding_2019}).
Both facts complicate measuring market conditions.
One dataset that I have investigated only briefly is the \url{DrugCentral.org} database, which tracks official indications as well as some off-label indications (\cite{ursu_drugcentral_2017}).

\end{document}