Merge branch 'main'

claude_rewrite
Will King 1 year ago
commit 6f03d6ba08

@ -131,11 +131,13 @@ drug compound may be packaged in multiple ways, e.g. boxes with different
numbers of blister packs. numbers of blister packs.
These SPLs are made available for download so that they can be integrated These SPLs are made available for download so that they can be integrated
into patient health systems to improve patient safety into patient health systems to improve patient safety
\cite{indexingsplfactsheet_}. \cite{usfda_splfactsheet_2023}.
The FDA also published additional data in the NDC SPL Data Elements (NSDE) file. The FDA also published additional data in the NDC SPL Data Elements (NSDE) file.
This file contains some of the data from the SPL files, as well as the dates This file contains some of the data from the SPL files, as well as the dates
when each product was approved for sale and when it was removed from the market. when each product was approved for sale and when it was removed from the market.
This summary of SPLs is what I used to find which drugs were approved
to be on the market at a given date.
%Structured Product Labels and dates of marketing %Structured Product Labels and dates of marketing
% Key features % Key features
@ -186,7 +188,7 @@ In each section below I briefly describe each terminology, its contents, and use
The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the National The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the National
Library of Medicine. Library of Medicine.
It is used to index subjects in various NLM publications including PubMed It is used to index subjects in various NLM publications including PubMed
\cite{medicalsubjectheadingshomepage_}. \cite{usnlm_meshhomepage_2023}.
The AACT database contains a table that links clinical trials' clinical conditions The AACT database contains a table that links clinical trials' clinical conditions
and drug names to terms in the MeSH thesaurus. and drug names to terms in the MeSH thesaurus.
As this contains a standardized nomenclature, it simplified much of the As this contains a standardized nomenclature, it simplified much of the
@ -216,7 +218,7 @@ The one I chose to use was a MariaDB database that backs a service called RxNav
provided by the National Library of Medicine (NLM). provided by the National Library of Medicine (NLM).
The NLM provides scripts to set up and host the backing databases on your The NLM provides scripts to set up and host the backing databases on your
own servers own servers
\cite{usnlm_rxnaviabox_2023}. \cite{usnlm_rxnavinabox_2023}.
After setting up the local server, I wrote a python program to export After setting up the local server, I wrote a python program to export
the data from the RxNorm database and import it into the AACT Database. the data from the RxNorm database and import it into the AACT Database.
This was required because the former uses a MariaDB database server This was required because the former uses a MariaDB database server

@ -38,7 +38,7 @@ The betas are distributed
\beta(d_i) \sim \text{Normal}(\mu_i,\sigma_i I) \beta(d_i) \sim \text{Normal}(\mu_i,\sigma_i I)
\end{align} \end{align}
With hyperpriors With hyperpriors
%Checked on 2024-11-27. Is correct. \todo{Double check that these are the priors I used.} %Checked on 2024-11-27. Is correct.
\begin{align} \begin{align}
\mu_k \sim \text{Normal}(0,0.05) \\ \mu_k \sim \text{Normal}(0,0.05) \\
\sigma_k \sim \text{Gamma}(4,20) \sigma_k \sim \text{Gamma}(4,20)
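
The hyperpriors in this hunk can be sanity-checked numerically. The sketch below is only illustrative and rests on assumptions not stated in the diff: that Gamma(4,20) uses a shape and rate parameterization (as Stan does), that the second argument of each Normal is a standard deviation, and that a single scalar beta per draw is enough to see the implied scale. The function names are hypothetical.

```python
import numpy as np


def gamma_prior_summary(shape=4.0, rate=20.0):
    """Mean and standard deviation of a Gamma(shape, rate) prior.

    Assumes the shape/rate parameterization; under shape/scale the
    numbers would be very different.
    """
    mean = shape / rate
    sd = np.sqrt(shape) / rate
    return mean, sd


def draw_prior_betas(n_draws, seed=0):
    """Draw betas from the hierarchical prior (a prior predictive sketch)."""
    rng = np.random.default_rng(seed)
    # mu_k ~ Normal(0, 0.05), treating 0.05 as a standard deviation (assumption).
    mu = rng.normal(0.0, 0.05, size=n_draws)
    # sigma_k ~ Gamma(4, 20); NumPy takes shape and scale, so scale = 1 / rate.
    sigma = rng.gamma(4.0, 1.0 / 20.0, size=n_draws)
    # beta ~ Normal(mu_k, sigma_k), treating sigma_k as a standard deviation (assumption).
    return rng.normal(mu, sigma)


if __name__ == "__main__":
    print(gamma_prior_summary())
    print(np.quantile(draw_prior_betas(10_000), [0.05, 0.5, 0.95]))
```

Under these assumptions the Gamma(4,20) prior on sigma_k has mean 0.2 and standard deviation 0.1, so the prior keeps most of its mass on small coefficient values.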

@ -24,6 +24,11 @@ relationship between the number of snapshots and the duration of trials.
We can see this in Figure \ref{fig:snapshot_duration_scatter}, where We can see this in Figure \ref{fig:snapshot_duration_scatter}, where
the correlation (measured at $0.34$) is apparent. the correlation (measured at $0.34$) is apparent.
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
\caption{Histogram of the count of Snapshots}
\label{fig:snapshot_counts}
\end{figure}
\begin{figure}[H] \begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted} \includegraphics[width=\textwidth]{../assets/img/trials_details/HistTrialDurations_Faceted}
@ -31,23 +36,17 @@ the correlation (measured at $0.34$) is apparent.
\label{fig:trial_durations} \label{fig:trial_durations}
\end{figure} \end{figure}
\begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/HistSnapshots}
\caption{Histogram of the count of Snapshots}
\label{fig:snapshot_counts}
\end{figure}
\begin{figure}[H] \begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination} \includegraphics[width=\textwidth]{../assets/img/trials_details/SnapshotsVsDurationVsTermination}
\caption{Scatterplot comparing the Count of Snapshots and Trial Duration} \caption{Scatterplot comparing the Count of Snapshots and Trial Duration}
\label{fig:snapshot_counts} \label{fig:snapshot_duration_scatter}
\end{figure} \end{figure}
% Estimation Procedure % Estimation Procedure
I fit the econometric model using mc-stan I fit the econometric model using mc-stan
\cite{standevelopmentteamStanModellingUsersGuide2022} \cite{standevelopmentteam_stanmodellingusersguide_2022}
through the rstan through the rstan
\cite{standevelopmentteamRStanInterfaceStan2023} \cite{standevelopmentteam_rstaninterfacestan_2023}
interface using 4 chains with interface using 4 chains with
%describe %describe
2,500 2,500
@ -58,17 +57,18 @@ sampling iterations each.
Two of the chains experienced a low Two of the chains experienced a low
Estimated Bayesian Fraction of Missing Information (E-BFMI), Estimated Bayesian Fraction of Missing Information (E-BFMI),
suggesting that there are some parts of the posterior distribution suggesting that there are some parts of the posterior distribution
that were not explored well during the model fitting. that were not explored well during the model fitting
\cite{standevelopmentteam_runtimewarningsconvergence_2022}.
I presume this is due to the low number of trials in some of the I presume this is due to the low number of trials in some of the
ICD-10 categories. ICD-10 categories.
We can see in Figure \ref{fig:barchart_idc_categories} that some of these We can see in Figure \ref{FIG:barchart_idc_categories} that some of these
disease categories had a single trial represented while others were disease categories had a single trial represented while others were
not represented at all. not represented at all.
\begin{figure}[H] \begin{figure}[H]
\includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts} \includegraphics[width=\textwidth]{../assets/img/trials_details/CategoryCounts}
\caption{Bar chart of trials by ICD-10 categories} \caption{Bar chart of trials by ICD-10 categories}
\label{fig:barchart_idc_categories} \label{FIG:barchart_idc_categories}
\end{figure} \end{figure}
% Estimation Procedure % Estimation Procedure
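
For readers unfamiliar with the E-BFMI diagnostic mentioned in this hunk, here is a minimal sketch of how it is commonly computed from the per-iteration energy of a single chain. It is not the author's code, and the 0.2 cutoff is an assumption that should be checked against the Stan interface in use. The function names are hypothetical.

```python
import numpy as np


def e_bfmi(energy):
    """Estimated Bayesian Fraction of Missing Information for one chain.

    `energy` is the sequence of Hamiltonian energies recorded at each
    sampling iteration (the `energy__` sampler diagnostic in Stan output).
    """
    energy = np.asarray(energy, dtype=float)
    # Numerator: squared changes in energy between successive draws.
    numer = np.sum(np.diff(energy) ** 2)
    # Denominator: squared deviations of the energy from its chain mean.
    denom = np.sum((energy - energy.mean()) ** 2)
    return numer / denom


def low_e_bfmi_chains(energies_by_chain, threshold=0.2):
    """Return indices of chains whose E-BFMI falls below `threshold`."""
    return [i for i, e in enumerate(energies_by_chain) if e_bfmi(e) < threshold]
```

A low value means the energy changes slowly from one draw to the next relative to its overall spread, which is the behaviour the text describes as parts of the posterior not being explored well.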
@ -308,20 +308,20 @@ Three points lead me to believe this:
\item The low fractions of E-BFMI suggest that the sampler is struggling \item The low fractions of E-BFMI suggest that the sampler is struggling
to explore some regions of the posterior. to explore some regions of the posterior.
According to According to
\cite{standevelopmentteam_RuntimeWarnings_2022} \authorcite{standevelopmentteam_runtimewarningsconvergence_2022}
this is this is
often due to thick tails of posterior distributions. often due to thick tails of posterior distributions.
\item When we examine the results across different ICD-10 groups, \item When we examine the results across different ICD-10 groups,
\ref{fig:pred_dist_dif_delay2} \ref{fig:pred_dist_dif_delay2}
we note this same issue. we note this same issue.
\item In Figure \ref{fig:parameters_ANR_by_group}, we see that some some ICD-10 categories \item In Figure \ref{fig:parameters_ANR_by_group}, we see that some
\todo{add figure} ICD-10 categories have
have
\todo{note fat tails}. \todo{note fat tails}.
\item There are few trials available, particularly among some specific \item There are few trials available, particularly among some specific
ICD-10 categories. ICD-10 categories.
\todo{refer to figure ??}
\end{itemize} \end{itemize}
\todo{Reformat so this refers to the original discussion of issues better.}
% - take a look at beta values and then discuss if that lines up with results from dist-diff by group. % - take a look at beta values and then discuss if that lines up with results from dist-diff by group.
% - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories. % - My initial thought is that there is not enough data/too uncertain. I think this because it happens for most/all of the categories.
% - % -

@ -18,19 +18,42 @@ the data may be available in a commercial dataset.
One of the original goals of this project was to examine the impact that One of the original goals of this project was to examine the impact that
enrollment struggles have on the probability of trial termination. enrollment struggles have on the probability of trial termination.
Unfortunately, this requires a model of clinical trial enrollment, and the Unfortunately, this requires a model of clinical trial enrollment, and this
data is just not in my dataset. data is missing from my dataset.
In most cases the trial sponsor reports the anticipated enrollment value In most cases the trial sponsor reports the anticipated enrollment value
while the trial is still recruiting and only updates the actual enrollment while the trial is still recruiting and only updates the actual enrollment
after the trial has ended. after the trial has ended.
Some trials do publish an up to date record of their enrollment numbers, but this Some trials do publish an incremental record of their enrollment numbers,
is rare. but this is rare.
If a Bayesian model of multisite enrollment can be developed for the disease categories Due to the Bayesian model used, it would be possible to
in question, then it will be possible to impute this missing data probabilistically, include a model of the missing data
which will allow me to estimate the direct effect of slow enrollment \cite{mcelreath_statisticalrethinkingbayesian_2020},
\cite{mcelreath_statistical_2020}. which would
This does not exist yet, although some work on multi-site enrollment forecasting has allow me to estimate the direct effect of slow enrollment
been done by \cite{CHECK ZOTERO NOTES FOR CITATIONS} on clinical trial termination rates.
There has been substantial work on forecasting
multi-site enrollment rates and durations by
\cite{
tozzi_predictingaccrualrate_1996,
carter_applicationstochasticprocesses_2004,
anisimov_modellingpredictionadaptive_2007,
zhang_stochasticmodelingprediction_2010,
zhang_jointmonitoringprediction_2012,
zhang_modelingpredictionsubject_2012,
heitjan_realtimepredictionclinical_2015,
jiang_modelingvalidatingbayesian_2015,
deng_bayesianmodelingprediction_2017,
lan_statisticalmodelingprediction_2019,
zhang_simplerobustmodel_2022,
urbas_interimrecruitmentprediction_2022,
bieganek_predictionclinicaltrial_2022,
avalos-pacheco_validationpredictiveanalyses_2023,
}
but choosing between the various single and multi-site models presented is
difficult without a dataset to validate the results on.
\subsection{Improving Population Estimates} \subsection{Improving Population Estimates}
@ -39,23 +62,22 @@ population sizes that I have found so far.
Unfortunately, for some conditions it can be relatively imprecise due to Unfortunately, for some conditions it can be relatively imprecise due to
its focus on providing data geared towards public health policy. its focus on providing data geared towards public health policy.
For example, GBD contains categories for both For example, GBD contains categories for both
drug resistant and drug susceptible tuberculosis. drug resistant and drug susceptible tuberculosis, but maps those to the same
ICD-10 code.
In contrast, there is no category for non-age related macular degeneration. In contrast, there is no category for non-age related macular degeneration.
One resulting concern is that for a given ICD-10 code, the applicable GBD population Thus not every trial has a good match with the estimate of the population of
estimates may act as an estimate of the upper bound of population size interest.
(\cite{global_burden_of_disease_collective_network_global_2020}).
The dataset contains various measures of disease severity, so it may be
worth investigating how to incorporate some of those measures.
\subsection{Improving Measures of Market Conditions} \subsection{Improving Measures of Market Conditions}
% Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases) % Deficiency: cannot measure effect of market conditions because of endogeneity of population and market conditions (fatal diseases)
In addition to the fact that many diseases may be treated by non-pharmaceutical In addition to the fact that many diseases may be treated by non-pharmaceutical
means, off-label prescription of pharmaceuticals is legal at the federal level means (e.g. diet, physical therapy, medical devices, etc),
(\cite{commissioner_understanding_2019}). off-label prescription of pharmaceuticals is legal at the federal level
These two facts both complicate measuring market conditions. \cite{commissioner_understandingunapproveduse_2019}.
These two facts both complicate measuring competing treatments,
a key part of market conditions.
One way to address non-pharmaceutical treatments is to concentrate on domains One way to address non-pharmaceutical treatments is to concentrate on domains
that are primarily treated by pharmaceuticals. that are primarily treated by pharmaceuticals.
Another way to address this would be to focus the analysis on just a few specific Another way to address this would be to focus the analysis on just a few specific

@ -219,8 +219,9 @@ A quick summary of the nodes of the DAG, the exact representation in the data, a
These are measured by the DALY cost of the disease, and are These are measured by the DALY cost of the disease, and are
separated by the impact on countries with separated by the impact on countries with
High, High-Medium, Medium, Medium-Low, and Low High, High-Medium, Medium, Medium-Low, and Low
development scores. Socio-Demographic Index (SDI) scores.
This data comes from the Institute for Health Metrics and Evaluation's Global Burden of Disease study. This data comes from the Institute for Health Metrics and Evaluation's Global Burden of Disease study
\cite{vos_globalburden369_2020}.
\item \texttt{Elapsed Duration}: \item \texttt{Elapsed Duration}:
A normalized measure of the time elapsed in the trial. A normalized measure of the time elapsed in the trial.
Comes from the original estimate of the trial's primary completion date and the registered start date. Comes from the original estimate of the trial's primary completion date and the registered start date.

File diff suppressed because it is too large

@ -33,3 +33,6 @@
*** 2025-01-20 Monday *** 2025-01-20 Monday
**** TODO get a citation for the AACT project **** TODO get a citation for the AACT project
[[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/10_CausalStory.tex::114]]]] [[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/10_CausalStory.tex::114]]]]
*** 2025-01-23 Thursday
**** TODO Pickup citation fixes here
[[[[file:/home/will/research/phd_deliverables/JobMarketPaper/Paper/sections/06_Results.tex::174]]]]
