\documentclass[../Main.tex]{subfiles}

\graphicspath{{\subfix{Assets/img/}}}

\begin{document}

In the sections below, I examine each source of data and its key features,
and describe the applicable terminology (\cref{datasources}).
I then discuss how these sources were tied together (\cref{datalinks}) and
describe the specific data used in the analysis (\cref{dataintegration}).

\subsection{Data Sources}\label{datasources}

%----------------------------------------------------
\subsubsection{Clinical Trials Data}
%ClinicalTrials.gov
% Key features - brief description
% Snapshots
% Why is it being included
% What specific data is used
% Enrollment, status, duration, indication, compound
% Data Manipulations for each
% Normalized: Enrollment, duration
% Categorical-to dummies: status
% Linked: indication, compound
% counts: number of sponsor changes since start.
% Links to data

Since September 27th, 2007, those who conduct clinical trials of
FDA-regulated drugs or devices on human subjects must register
their trials at \url{ClinicalTrials.gov}
(\cite{noauthor_fdaaa_nodate}).
Registration involves submitting information on the expected enrollment and
duration of the trial, the drugs or devices that will be used, the treatment
protocols and study arms, as well as contact information for the trial
sponsor and treatment sites.

When starting a new trial, the required information must be submitted
``\dots not later than 21 calendar days after enrolling the first human subject\dots''.
After the initial submission, the data is briefly reviewed for quality;
the trial record is then published and the trial is assigned a
National Clinical Trial (NCT) identifier
(\cite{noauthor_fdaaa_nodate}).

Each trial's record is updated periodically, including a final update that must occur
within a year of completing the primary objective, although exceptions are
available for trials related to drug approvals or for trials with secondary
objectives that require further observation\footnote{This rule came into effect in 2017.}
(\cite{noauthor_fdaaa_nodate}).
Other than the requirements for the first and last submissions, all other
updates occur at the discretion of the trial sponsor.
Because the ClinicalTrials.gov website serves as a central point of information
on which trials are active or recruiting for a given condition or drug,
most trials are updated multiple times as they progress.

There are two primary ways to access data about clinical trials.
The first is to search individual trials on ClinicalTrials.gov with a web browser.
This web portal shows the current information about each trial and provides
access to snapshots of previously submitted information.
Together, these features fulfill most of the needs of those seeking
to join a clinical trial.
%include screenshots?
The second way to access the data is through a normalized database set up by
the
\href{https://aact.ctti-clinicaltrials.org/}{Clinical Trials Transformation Initiative}
called AACT. %TODO: Get CITATION
The AACT database is available as a PostgreSQL database dump or as a set of
pipe-delimited (``$\vert$'') files and matches the current version of the
ClinicalTrials.gov database.
This format is amenable to large-scale analysis, but it does not contain
information about the past states of trials.

I created a set of Python scripts to
incorporate the historical data on clinical trials available through the web
portal and merge it into a local copy of the standard AACT database.
This novel dataset can be used to easily track changes as trials progress.

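The merge logic at the heart of these scripts can be sketched as follows; the record layout and field names here are illustrative, not the actual AACT schema.

```python
from datetime import date

def merge_snapshots(existing, scraped):
    """Merge scraped historical snapshots into the locally stored set,
    keyed by (nct_id, version_date) so repeated runs are idempotent.

    Both arguments are lists of dicts; the keys used here are
    illustrative, not the actual AACT column names.
    """
    by_key = {(r["nct_id"], r["version_date"]): r for r in existing}
    for r in scraped:
        # Keep the already-stored record when the same version reappears.
        by_key.setdefault((r["nct_id"], r["version_date"]), r)
    # Order snapshots per trial, oldest first.
    return sorted(by_key.values(),
                  key=lambda r: (r["nct_id"], r["version_date"]))

existing = [{"nct_id": "NCT00000001", "version_date": date(2010, 1, 5),
             "status": "Recruiting"}]
scraped = [{"nct_id": "NCT00000001", "version_date": date(2010, 1, 5),
            "status": "Recruiting"},
           {"nct_id": "NCT00000001", "version_date": date(2011, 3, 2),
            "status": "Completed"}]
merged = merge_snapshots(existing, scraped)  # duplicate dropped, 2 records kept
```

Keying on the trial identifier plus the snapshot date means re-running the scraper never duplicates a snapshot already merged into the local database.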
%describe the data NCT, trial records, mesh_terms, etc
In this combined dataset of current and historical trial records, a few
areas are of particular interest.
\begin{itemize}
\item NCT: As the unique identifier of a trial, it is used throughout to
ensure data is linked to the appropriate trial.
\item Enrollment: This takes two forms.
At the beginning of a trial it is reported as ``Anticipated''
enrollment, while near or at the end of the trial it is reported
as ``Actual'' enrollment.
\item Overall Status: Each trial must be in one of a list of states.
While a trial is running, it can be in any of the following states:
\begin{itemize}
\item Not yet recruiting
\item Recruiting
\item Enrolling by Invitation
\item Active, not recruiting
\item Suspended %I don't explicitly deal with this case
\end{itemize}
When a trial has ended, it is in one of two states:
\begin{itemize}
\item Terminated: the trial ended prematurely.
\item Completed: the trial ended after completing its planned observations.
\end{itemize}
\item Start Date: The date that the first measurement was taken or that the
first site was authorized to take measurements.
\item Primary Completion Date: The date the last measurement for the primary
objective was taken.
Prior to the actual primary completion date, this is an anticipated value.
\item Conditions: The conditions of interest in the trial.
\item Interventions: The drug(s) used in treatment.
\end{itemize}

%----------------------------------------------------
\subsubsection{Drug Compounds and Structured Product Labels (SPLs)}

When a drug is licensed for sale in the U.S., it is not just the active
ingredients that are licensed, but also the dosage and route of administration.
Each of these compound/dosage/route combinations is assigned a unique
National Drug Code (NDC).
%mention orange book
The list of approved NDCs is released regularly in the FDA's
Orange Book (small-molecule drugs) and Purple Book (biologics) publications.
These two publications also contain information regarding which drugs are
generics or biosimilars. %TODO: REF
%which drugs are originators and which are generics (there is a better word for originator).

Before a drug or drug compound is sold on the market, the FDA requires the seller
to submit a standardized label and associated information called
a Structured Product Label (SPL).
These SPLs include information about dosage, ingredients, warnings, and printed labels.
Each NDC can have multiple SPLs associated with it because each
drug compound may be packaged in multiple ways, e.g., boxes with different
numbers of blister packs.
These SPLs are made available for download so that they can be integrated
into patient health systems to improve patient safety (\cite{noauthor_indexing_nodate}).

The FDA also publishes additional data in the NDC SPL Data Elements (NSDE) file.
This file contains some of the data from the SPL files, as well as the dates
when each product was approved for sale and when it was removed from the market.

%Structured Product Labels and dates of marketing
% Key features
% Why is it being included
% What specific data is used
% compound, spl, marketing dates
% Data Manipulations for each
% Linked: compound, spl -> indication
% standardize start/end dates by getting a view: compound, dates, manufacturer.
% Links to data

%----------------------------------------------------
\subsubsection{Global Burden of Disease Study}

The University of Washington's Institute for Health Metrics and Evaluation
published a dataset called the Global Burden of Disease Study 2019 (GBD 2019).
This dataset provides estimates of the worldwide burden of
various diseases and classes of diseases.
%\footnote{A full list of the diseases and categories can be found in \ref{Appendix1}}
The available measures of burden include Deaths, Disability-Adjusted Life Years (DALYs),
Years of Life Lost (YLL), and Years Lived with Disability (YLD), each reported
as an estimate with 95\% confidence interval bounds.
Estimates are available for national, multinational, and global
populations
(\cite{vos_global_2020}).

These classes of disease are organized in a hierarchy, with each subsuming
category having its own estimates of disease burden.
One understandable deficiency of this dataset is that it does not account for
all diseases tracked in other datasets, but focuses on those
that are most important from a public health perspective.
%not quite sure how to fill this out. What I am hoping to do is to justify my use of
% the highest level (most precise categories of data. Might be better to discuss
% the nested category outline.
The IHME also provides a link between the disease/cause hierarchy and ICD-10
codes
(\cite{global_burden_of_disease_collaborative_network_global_2020}).

%----------------------------------------------------
\subsubsection{Medical and Pharmacological Terminologies}\label{datalinks}
In order to link these disparate data sources, I used multiple standardized
terminologies.
Below I briefly describe each terminology, its contents, and its uses.

\paragraph{Medical Subject Headings (MeSH) Thesaurus}

The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the
National Library of Medicine.
It is used to index subjects in various NLM publications, including PubMed
(\cite{noauthor_medical_nodate}).
The AACT database contains a table that links clinical trials' clinical conditions
and drug names to terms in the MeSH thesaurus.
Because this table provides a standardized nomenclature, it simplified much of the
linking between clinical trials and the other data sources.

\paragraph{RxNorm}

According to \cite{noauthor_rxnorm_nodate-1}:
\begin{displayquote}
What is RxNorm? \\
RxNorm is two things: a normalized naming system for generic and branded drugs;
and a tool for supporting semantic interoperation between drug terminologies
and pharmacy knowledge base systems.\dots \\
\end{displayquote}
Both of these functions are crucial to the analysis.
The normalized naming system allowed me to convert the diverse
set of names recorded for each clinical trial into standardized identifiers.
These standardized identifiers are known as RxCUIs, and they are used in RxNorm
to identify not only individual drug components, but also brand names, licensed
drug/dosage pairs, and packages.
The links to other drug terminologies include links to SPL identifiers, which
permitted me to link each trial to the drugs on the market at any point in time.

%How did I get and incorporate this data.
The RxNorm data is provided in multiple formats.
The one I chose to use was a MariaDB database that backs a service called RxNav
provided by the National Library of Medicine (NLM).
The NLM provides scripts to set up and host the backing databases on your
own servers
(\cite{noauthor_rxnav---box_nodate}).
After setting up the local server, I wrote a Python program to export
the data from the RxNorm database and import it into the AACT database.
This was required because the former uses a MariaDB database server
and the latter uses a Postgres database server.

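The export/import step can be sketched with the DB-API interface that the Python drivers for both servers share. In this self-contained example, sqlite3 stands in for both the MariaDB source and the Postgres destination, and the table and column names are illustrative (though RXNCONSO is a real RxNorm table).

```python
import sqlite3

def copy_table(src, dst, table, columns):
    """Copy every row of `table` from one DB-API connection to another.

    In the actual pipeline the source would be the RxNorm MariaDB server
    and the destination the AACT Postgres server (where the parameter
    placeholder is '%s' rather than '?'); sqlite3 is used here only to
    keep the sketch self-contained.
    """
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    rows = src.execute(f"SELECT {cols} FROM {table}").fetchall()
    dst.executemany(f"INSERT INTO {table} ({cols}) VALUES ({marks})", rows)
    dst.commit()

# Demonstration with two in-memory databases.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE rxnconso (rxcui TEXT, name TEXT)")
src.executemany("INSERT INTO rxnconso VALUES (?, ?)",
                [("1191", "aspirin"), ("5640", "ibuprofen")])
dst.execute("CREATE TABLE rxnconso (rxcui TEXT, name TEXT)")
copy_table(src, dst, "rxnconso", ["rxcui", "name"])
```

Streaming rows through a generic copy routine like this avoids writing dialect-specific dump and restore logic for each table.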
With the data now available alongside the AACT database, I could link trials
to various key drug concepts, including normalized drug ingredient names,
the NDCs incorporating those ingredients, and the brand names associated with
those NDCs.

\paragraph{International Classification of Diseases, 10th revision (ICD-10)}

%what it is
The International Classification of Diseases, 10th revision (ICD-10) is a
worldwide standard for categorizing human disease maintained by the
World Health Organization (WHO).
Although the WHO version's last major update was in 2019 and it was officially
superseded in 2022 by the 11th revision (\cite{noauthor_international_nodate}),
the 10th revision is still in use in the United States, as the
Centers for Medicare and Medicaid Services (CMS) continues to publish
updated versions called
ICD-10-CM (Clinical Modification) (\cite{noauthor_2023_nodate}) and
ICD-10-PCS (Procedure Coding System) (\cite{noauthor_2023_nodate-1}) for use
in medical billing.

ICD-10 codes are organized in a hierarchy.
There are 22 highest-level categories, representing general areas such
as cancers, mental illness, and infectious diseases.
The second layer of the hierarchy consists of roughly 225 second-level groupings.

%how was it used
The GBD database provided a mapping between its categories and ICD-10
codes (\cite{global_burden_of_disease_collaborative_network_global_2020}).
Unfortunately, it appears to use a combination of the default WHO ICD-10 codes
and the ICD-10-CM codes from the CMS.
Additionally, many diseases classified by ICD-10 codes do not correspond to
categories in the GBD database.

%how it was obtained
As I needed a combined list of ICD-10 codes, I first obtained the 2019 version
of the ICD-10-CM codes from the CMS (\cite{noauthor_2019_nodate}).
With the arrival of the ICD-11 system, it was difficult to find an official
source from which to download the WHO versions of the ICD-10 codes.
Eventually I resorted to copying them from the navigation bar of the
\href{https://icd.who.int/browse10/2019/en}{official WHO ICD-10 (2019) website}
(\cite{noauthor_icd-10_nodate}).
After getting both sources into the same format,
I combined them and removed duplicate codes, preferring to keep the descriptions
from the WHO version.
This was done using standard Unix command-line tools.
I then imported the data into the Postgres database alongside the AACT data.

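In outline, the deduplication step behaves like this Python equivalent of the Unix pipeline; the example codes are real ICD-10 entries, but the dictionaries and the CM description wording are illustrative.

```python
def combine_code_lists(who, cm):
    """Combine the WHO and ICD-10-CM code lists into a single mapping
    from code to description, keeping the WHO description whenever a
    code appears in both sources."""
    combined = dict(cm)   # start from the CM list...
    combined.update(who)  # ...and let WHO entries overwrite duplicates
    return combined

who = {"A00": "Cholera"}
cm = {"A00": "Cholera (CM wording)",
      "A00.0": "Cholera due to Vibrio cholerae 01, biovar cholerae"}
codes = combine_code_lists(who, cm)  # 2 codes; WHO description kept for A00
```

The CM-only codes (such as the more specific A00.0) survive the merge, while any code present in both lists keeps its WHO description, mirroring the preference described above.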
\paragraph{Unified Medical Language System (UMLS) Thesaurus}

The NLM also publishes a medical terminology thesaurus
known as the Unified Medical Language System (UMLS), which links terminologies
such as RxNorm, MeSH, and ICD-10.
It is made available through an API hosted by the NLM.
One key feature is the ability to use a basic text search to find matching
terms across the various terminologies.

\subsection{Data Integration}\label{dataintegration}
%Goal - Help readers understand which data were used in the analysis
Below I describe how these data were used in the analysis.

%Describe data pulled from AACT/historical snapshots
% what are snapshots.
% enrollment
% elapsed_duration
% current_status
For clinical trials, I captured each update that occurred after the start date
and prior to the primary completion date of the trial.
For clarity, I will refer to each of these updates as a snapshot of the trial.

For each snapshot I recorded the enrollment (actual or anticipated),
the date the snapshot was submitted, the planned primary completion date,
and the trial's overall status at the time.
I also extracted the anticipated enrollment closest to the actual start date
of the trial, which I will call the planned enrollment, under the assumption
that the sponsor is recording its current plan for enrollment.
From these I constructed two normalized values.

The first is a normalized measure of enrollment,
constructed by dividing the snapshot enrollment by the planned enrollment.
The purpose of this was to place enrollment on a scale roughly around 1
instead of the widely varying counts that raw enrollment would give.
The second is a measure of how far along the trial was
in its planned duration, in other words a measure of elapsed duration.
This was calculated for each snapshot as:
\begin{align}
\text{Elapsed Duration} =
\frac{\text{Snapshot Date} - \text{Start Date}}
{\text{Primary Completion Date (anticipated)} - \text{Start Date}}
\end{align}
Note that this has a range of $[0,\infty)$, although in practice
it is only about $[0,3]$. %good to put a graph here
I also included the current status by encoding it as dummy variables.

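The two normalized values can be computed directly from the recorded snapshot fields; a minimal sketch:

```python
from datetime import date

def normalized_enrollment(snapshot_enrollment, planned_enrollment):
    """Snapshot enrollment divided by planned enrollment, so values
    cluster around 1 regardless of trial size."""
    return snapshot_enrollment / planned_enrollment

def elapsed_duration(snapshot_date, start_date, anticipated_completion):
    """Fraction of the planned duration elapsed at the snapshot date,
    matching the Elapsed Duration equation above."""
    return ((snapshot_date - start_date).days
            / (anticipated_completion - start_date).days)

# A snapshot halfway through a two-year trial:
frac = elapsed_duration(date(2011, 1, 1), date(2010, 1, 1), date(2012, 1, 1))
# frac == 0.5; values above 1 mean the trial has run past its planned
# primary completion date.
```

A value above 1 simply indicates a snapshot submitted after the anticipated primary completion date, which is why the range is unbounded above.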
%Describe linking drugs/getting number of brands
As an initial measure of market conditions, I gathered the number of brands
producing drugs that contain the compound(s) of interest in the trial.
This was done by extracting the RxCUIs that represent the drugs of interest,
then linking those to the RxCUIs of the brands containing those ingredients.

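The counting step amounts to a union over the ingredient-to-brand links; in this sketch, `ingredient_to_brands` is an illustrative stand-in for the RxNorm relationship tables linking ingredient RxCUIs to brand RxCUIs.

```python
def brand_count(ingredient_to_brands, trial_ingredients):
    """Count the distinct brand RxCUIs linked to any of the trial's
    ingredient RxCUIs.

    `ingredient_to_brands` maps ingredient RxCUI -> set of brand RxCUIs;
    the mapping here is a stand-in for the RxNorm relationship tables.
    """
    brands = set()
    for rxcui in trial_ingredients:
        brands |= ingredient_to_brands.get(rxcui, set())
    return len(brands)

# Two ingredients sharing one brand yield three distinct brands in total.
links = {"1191": {"B1", "B2"}, "5640": {"B2", "B3"}}
n_brands = brand_count(links, ["1191", "5640"])  # n_brands == 3
```

Taking the union of brand sets before counting avoids double-counting a brand that contains more than one of the trial's ingredients.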
As a secondary measure of market conditions, I linked clinical trials to the
USP Drug Classification (USP DC) list.
Once I had linked the drugs used in a trial to the applicable USP DC category
and class, I could find the number of alternative brands in that class.
This matching was performed by hand, using a custom web interface to the database.

In order to link clinical trials to standardized ICD-10 conditions, and thus
to the Global Burden of Disease data, I wrote a Python script to search the
UMLS system for ICD-10 codes that matched the MeSH descriptions for
each trial.
These searches generally produced three categories of results:
\begin{enumerate}
\item The results contained a few entries, one of which was obviously correct.
\item The results contained a large number of entries, a few of which were correct.
\item The results did not contain any matches.
\end{enumerate}
In all of these cases I needed a way to validate each match and potentially add my own
ICD-10 codes to each trial.
This matching was also performed by hand, using a separate custom web interface
to the database.

The effort to manually match ICD-10 codes and USP DC categories and classes is ongoing.

%Describe linking icd10 codes to GBD
% Not every icd10 code maps, so some trials are excluded.
%Describe categorizing icd10 codes
After manually matching each trial to an ICD-10 code, each trial is easily linked to
either one of the 22 highest-level categories or one of the roughly 225 second-level
categories in the ICD-10 hierarchy.
Linking to one of the disease categories in the GBD hierarchy is similarly straightforward.
To get the best estimate of the size of the population associated with a disease,
each trial is linked to the most specific disease category applicable.
As not every ICD-10 code is linked to a condition in the GBD, trials without any
applicable conditions are dropped from the dataset.

\end{document}