Merge branch 'main'

claude_rewrite
Will King 1 year ago
commit 6f03d6ba08

@ -131,11 +131,13 @@ drug compound may be packaged in multiple ways, e.g. boxes with different
numbers of blister packs.
These SPLs are made available for download so that they can be integrated
into patient health systems to improve patient safety
\cite{indexingsplfactsheet_}.
\cite{usfda_splfactsheet_2023}.
The FDA also published additional data in the NDC SPL Data Elements (NSDE) file.
This file contains some of the data from the SPL files, as well as the dates
when each product was approved for sale and when it was removed from the market.
This summary of SPLs is what I used to find which drugs were approved
to be on the market at a given date.
%Structured Product Labels and dates of marketing
% Key features
@ -186,7 +188,7 @@ In each section below I briefly describe each terminology, its contents, and use
The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the National
Library of Medicine.
It is used to index subjects in various NLM publications including PubMed
\cite{medicalsubjectheadingshomepage_}.
\cite{usnlm_meshhomepage_2023}.
The AACT database contains a table that links clinical trials' clinical conditions
and drug names to terms in the MeSH thesaurus.
As this contains a standardized nomenclature, it simplified much of the
@ -216,7 +218,7 @@ The one I chose to use was a MariaDB database that backs a service called RxNav
provided by the National Library of Medicine (NLM).
The NLM provides scripts to set up and host the backing databases on your
own servers
\cite{usnlm_rxnaviabox_2023}.
\cite{usnlm_rxnavinabox_2023}.
After setting up the local server, I wrote a python program to export
the data from the RxNorm database and import it into the AACT Database.
This was required because the former uses a MariaDB database server

@ -38,7 +38,7 @@ The betas are distributed
\beta(d_i) \sim \text{Normal}(\mu_i,\sigma_i I)
\end{align}
With hyperpriors
%Checked on 2024-11-27. Is correct. \todo{Double check that these are the priors I used.}
%Checked on 2024-11-27. Is correct.
\begin{align}
\mu_k \sim \text{Normal}(0,0.05) \\
\sigma_k \sim \text{Gamma}(4,20)

@ -24,6 +24,11 @@ relationship between the number of snapshots and the duration of trials.
We can see this in Figure \ref{fig:snapshot_duration_scatter}, where
the correlation (measured at $0.34$) is apparent.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
\caption{Histogram of the count of Snapshots}
\label{fig:snapshot_counts}
\end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted}
@ -31,23 +36,17 @@ the correlation (measured at $0.34$) is apparent.
\label{fig:trial_durations}
\end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
\caption{Histogram of the count of Snapshots}
\label{fig:snapshot_counts}
\end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination}
\caption{Scatterplot comparing the Count of Snapshots and Trial Duration}
\label{fig:snapshot_counts}
\label{fig:snapshot_duration_scatter}
\end{figure}
% Estimation Procedure
I fit the econometric model using mc-stan
\cite{standevelopmentteamStanModellingUsersGuide2022}
\cite{standevelopmentteam_stanmodellingusersguide_2022}
through the rstan
\cite{standevelopmentteamRStanInterfaceStan2023}
\cite{standevelopmentteam_rstaninterfacestan_2023}
interface using 4 chains with
%describe
2,500
@ -58,17 +57,18 @@ sampling iterations each.
Two of the chains experienced a low
Estimated Bayesian Fraction of Missing Information (E-BFMI),
suggesting that there are some parts of the posterior distribution
that were not explored well during the model fitting.
that were not explored well during the model fitting
\cite{standevelopmentteam_runtimewarningsconvergence_2022}.
I presume this is due to the low number of trials in some of the
ICD-10 categories.
We can see in Figure \ref{fig:barchart_idc_categories} that some of these
We can see in Figure \ref{FIG:barchart_idc_categories} that some of these
disease categories had a single trial represented while others were
not represented at all.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts}
\caption{Bar chart of trials by ICD-10 categories}
\label{fig:barchart_idc_categories}
\label{FIG:barchart_idc_categories}
\end{figure}
% Estimation Procedure
@ -308,20 +308,20 @@ Three points lead me to believe this:
\item The low fractions of E-BFMI suggest that the sampler is struggling
to explore some regions of the posterior.
According to
\cite{standevelopmentteam_RuntimeWarnings_2022}
\authorcite{standevelopmentteam_runtimewarningsconvergence_2022}
this is
often due to thick tails of posterior distributions.
\item When we examine the results across different ICD-10 groups,
\ref{fig:pred_dist_dif_delay2}
we note this same issue.
\item In Figure \ref{fig:parameters_ANR_by_group}, we see that some some ICD-10 categories
\todo{add figure}
have
\item In Figure \ref{fig:parameters_ANR_by_group}, we see that some
ICD-10 categories have
\todo{note fat tails}.
\item There are few trials available, particularly among some specific
ICD-10 categories.
\todo{refer to figure ??}
\end{itemize}
\todo{Reformat so this refers to the original discussion of issues better.}
% - take a look at beta values and then discuss if that lines up with results from dist-diff by group.
% - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories.
% -

@ -18,19 +18,42 @@ the data may be available in a commercial dataset.
One of the original goals of this project was to examine the impact that
enrollment struggles have on the probability of trial termination.
Unfortunately, this requires a model of clinical trial enrollment, and the
data is just not in my dataset.
Unfortunately, this requires a model of clinical trial enrollment, and this
data is missing from my dataset.
In most cases the trial sponsor reports the anticipated enrollment value
while the trial is still recruiting and only updates the actual enrollment
after the trial has ended.
Some trials do publish an up to date record of their enrollment numbers, but this
is rare.
If a bayesian model of multisite enrollment can be developed for the disease categories
in question, then it will be possible to impute this missing data probabalistically,
which will allow me to estimate the direct effect of slow enrollment
\cite{mcelreath_statistical_2020}.
This does not exist yet, although some work on multi-site enrollment forecasting has
been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS}
Some trials do publish an incremental record of their enrollment numbers,
but this is rare.
Due to the Bayesian model used, it would be possible to
include a model of the missing data
\cite{mcelreath_statisticalrethinkingbayesian_2020},
which would
allow me to estimate the direct effect of slow enrollment
on clinical trial termination rates.
There has been substantial work on forecasting
multi-site enrollment rates and durations by
\cite{
tozzi_predictingaccrualrate_1996,
carter_applicationstochasticprocesses_2004,
anisimov_modellingpredictionadaptive_2007,
zhang_stochasticmodelingprediction_2010,
zhang_jointmonitoringprediction_2012,
zhang_modelingpredictionsubject_2012,
heitjan_realtimepredictionclinical_2015,
jiang_modelingvalidatingbayesian_2015,
deng_bayesianmodelingprediction_2017,
lan_statisticalmodelingprediction_2019,
zhang_simplerobustmodel_2022,
urbas_interimrecruitmentprediction_2022,
bieganek_predictionclinicaltrial_2022,
avalos-pacheco_validationpredictiveanalyses_2023,
}
but choosing between the various single and multi-site models presented is
difficult without a dataset to validate the results on.
\subsection{Improving Population Estimates}
@ -39,23 +62,22 @@ population sizes that I have found so far.
Unfortunately, for some conditions it can be relatively imprecise due to
its focus on providing data geared towards public health policy.
For example, GBD contains categories for both
drug resistant and drug susceptible tuberculosis.
drug resistant and drug susceptible tuberculosis, but maps those to the same
ICD-10 code.
In contrast, there is no category for non-age related macular degeneration.
One resulting concern is that for a given ICD-10 code, the applicable GBD population
estimates may act as an estimate of the upper bound of population size
(\cite{global_burden_of_disease_collective_network_global_2020}).
The dataset contains various measures of disease severity, so it may be
worth investigating how to incorporate some of those measures.
Thus not every trial has a good match with the estimate of the population of
interest.
\subsection{Improving Measures of Market Conditions}
% Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases)
In addition to the fact that many diseases may be treated by non-pharmaceutical
means, off-label prescription of pharmaceuticals is legal at the federal level
(\cite{commissioner_understanding_2019}).
These two facts both complicate measuring market conditions.
means (e.g. diet, physical therapy, medical devices, etc),
off-label prescription of pharmaceuticals is legal at the federal level
\cite{commissioner_understandingunapproveduse_2019}.
These two facts both complicate measuring competing treatments,
a key part of market conditions.
One way to address non-pharmaceutical treatments is to concentrate on domains
that are primarily treated by pharmaceuticals.
Another way to address this would be to focus the analysis on just a few specific

@ -219,8 +219,9 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
These are measured by the DALY cost of the disease, and is
separated by the impact on countries with
High, High-Medium, Medium, Medium-Low, and Low
development scores.
This data comes from the Institute for Health Metrics' Global Burden of Disease study.
Socio-Demographic Index (SDI) scores.
This data comes from the Institute for Health Metrics' Global Burden of Disease study
\cite{vos_globalburden369_2020}.
\item \texttt{Elapsed Duration}:
A normalized measure of the time elapsed in the trial.
Comes from the original estimate of the trial's primary completion date and the registered start date.

File diff suppressed because it is too large

@ -33,3 +33,6 @@
*** 2025-01-20 Monday
**** TODO get a citation for the AACT project
[[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/10_CausalStory.tex::114]]]]
*** 2025-01-23 Thursday
**** TODO Pickup citation fixes here
[[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/06_Results.tex::174]]]]
