% JobMarketPaper/Latex/Paper/sections/02_data.tex

\documentclass[../Main.tex]{subfiles}
\graphicspath{{\subfix{Assets/img/}}}

\begin{document}
In the sections below, I describe each data source, its key features,
and the applicable terminology (\cref{datasources}).
I then discuss how these sources were tied together (\cref{datalinks}) and
describe the specific data used in the analysis (\cref{dataintegration}).

\subsection{Data Sources}\label{datasources}


%----------------------------------------------------
\subsubsection{Clinical Trials Data}
%ClinicalTrials.gov
%   Key features - brief description
%       Snapshots
%   Why is it being included
%   What specific data is used
%       Enrollment, status, duration, indication, compound
%   Data Manipulations for each
%       Normalized: Enrollment, duration
%       Categorical-to dummies: status
%       Linked: indication, compound
%       counts: number of sponsor changes since start.
%   Links to data

Since September 27, 2007, those who conduct clinical trials of FDA-controlled
drugs or devices on human subjects must register
their trial at \url{ClinicalTrials.gov}
(\cite{noauthor_fdaaa_nodate}).
This involves submitting information on the expected enrollment and duration of
trials, drugs or devices that will be used, treatment protocols and study arms,
as well as contact information for the trial sponsor and treatment sites.

When starting a new trial, the required information must be submitted
``\dots not later than 21 calendar days after enrolling the first human subject\dots''.
After the initial submission, the data is briefly reviewed for quality,
the trial record is published, and the trial is assigned a
National Clinical Trial (NCT) identifier
(\cite{noauthor_fdaaa_nodate}).

Each trial's record is updated periodically, including a final update that must occur
within a year of completing the primary objective, although exceptions are
available for trials related to drug approvals or for trials with secondary
objectives that require further observation\footnote{This rule came into effect in 2017.}
(\cite{noauthor_fdaaa_nodate}).
Other than the requirements for the first and last submissions, all other
updates occur at the discretion of the trial sponsor.
Because the ClinicalTrials.gov website serves as a central point of information
on which trials are active or recruiting for a given condition or drug,
most trials are updated multiple times during their progression.

There are two primary ways to access data about clinical trials.
The first is to search individual trials on ClinicalTrials.gov with a web browser.
This web portal shows the current information about the trial and provides
access to snapshots of previously submitted information.
Together, these features fulfill most of the needs of those seeking
to join a clinical trial.
%include screenshots?
The second way to access the data is through a normalized database set up by
the
\href{https://aact.ctti-clinicaltrials.org/}{Clinical Trials Transformation Initiative}
called AACT. %TODO: Get CITATION
The AACT database is available as a PostgreSQL database dump or as a set of
pipe-delimited (``$\vert$'') files and matches the current version of the ClinicalTrials.gov database.
This format is amenable to large-scale analysis, but does not contain information about
the past state of trials.

I created a set of Python scripts to
incorporate the historical data on clinical trials available through the web
portal and merge it into a local copy of the standard AACT database.
This novel dataset can be used to easily track changes as trials progress.

%describe the data NCT, trial records, mesh_terms, etc
In this combined dataset of current and historical trial records, there are a few
areas of particular interest.
\begin{itemize}
    \item NCT: As a unique identifier of a trial, it is used throughout to
        ensure data is linked to the appropriate trial.
    \item Enrollment: This takes on two forms.
        At the beginning of a trial this is presented as ``Anticipated''
        enrollment, while near or at the end of the trial it is reported
        as ``Actual'' enrollment.
    \item Overall Status: Each trial must be in one of a list of states.
        While a trial is running, it can be in any of the following states.
        \begin{itemize}
            \item Not yet recruiting
            \item Recruiting
            \item Enrolling by Invitation
            \item Active, not recruiting
            \item Suspended %I don't explicitly deal with this case
        \end{itemize}
	When a trial has ended it is in one of two states:
        \begin{itemize}
            \item Terminated: The trial has ended prematurely.
            \item Completed: The trial has ended after observing what its investigators hoped to observe.
        \end{itemize}
    \item Start Date: The date that the first measurement was taken or that the
        first site was authorized to take measurements.
    \item Primary Completion Date: The date the last measurement for the primary
        objective was taken.
        Prior to the actual primary completion date, this is an anticipated value.
    \item Conditions: The conditions of interest in the trial.
    \item Interventions: The drug(s) used in treatment.
\end{itemize}


%----------------------------------------------------
\subsubsection{Drug Compounds and Structured Product Labels (SPLs)}

When a drug is licensed for sale in the U.S., it is not just the active
ingredients that are licensed, but also the dosage and route of administration.
Each of these compound/dosage/route combinations is assigned a unique
National Drug Code (NDC).
%mention orange book
The list of approved NDCs is released regularly in the FDA's
Orange Book (small-molecule drugs) and Purple Book (biologics) publications.
These two publications also contain information regarding which drugs are generics
or biosimilars. %TODO: REF
%which drugs are originators and which are generics (there is a better word for originator).

Before a drug or drug compound is sold on the market, the FDA requires the seller
to submit a standardized label and associated information called
a Structured Product Label (SPL).
These SPLs include information about dosage, ingredients, warnings, and printed labels.
Each NDC can have multiple SPLs associated with it because each
drug compound may be packaged in multiple ways, e.g. boxes with different
numbers of blister packs.
These SPLs are made available for download so that they can be integrated
into patient health systems to improve patient safety (\cite{noauthor_indexing_nodate}).

The FDA also publishes additional data in the NDC SPL Data Elements (NSDE) file.
This file contains some of the data from the SPL files, as well as the dates
when each product was approved for sale and when it was removed from the market.

%Structured Product Labels and dates of marketing
%   Key features
%   Why is it being included
%   What specific data is used
%       compound, spl, marketing dates
%   Data Manipulations for each
%       Linked: compound, spl -> indication
%       standardize start/end dates by getting a view: compound, dates, manufacturer.
%   Links to data

%----------------------------------------------------
\subsubsection{Global Disease Burden Survey}

The University of Washington's Institute for Health Metrics and Evaluation
published a dataset called the Global Burden of Disease Study 2019 (GBD 2019).
This dataset provides estimates of worldwide incidence of
various diseases and classes of diseases.
%\footnote{A full list of the diseases and categories can be found in \ref{Appendix1}}
The available measures include deaths, disability-adjusted life years (DALYs),
years of life lost (YLL), and years lived with disability (YLD), each of which comes with
both an estimate and 95\% confidence interval bounds.
Estimates are available for national, multinational, and global
populations
(\cite{vos_global_2020}).

These classes of disease are organized in a hierarchy, with each subsuming category
having its own estimates of disease incidence.
One understandable deficiency in this dataset is that it does not account for all
diseases tracked in other datasets, but focuses on those
that are most important from a public health perspective.
%not quite sure how to fill this out. What I am hoping to do is to justify my use of
% the highest level (most precise categories of data. Might be better to discuss
% the nested category outline.
The IHME also provides a link between the disease/cause hierarchy and ICD-10
codes
(\cite{global_burden_of_disease_collaborative_network_global_2020}).


%----------------------------------------------------
\subsubsection{Medical and Pharmacological Terminologies}\label{datalinks}
In order to link these disparate data sources I used multiple standardized
terminologies.
Below, I briefly describe each terminology, its contents, and its uses.

\paragraph{Medical Subject Headings (MeSH) Thesaurus}

The Medical Subject Headings (MeSH) Thesaurus is produced and maintained by the National
Library of Medicine.
It is used to index subjects in various NLM publications including PubMed
(\cite{noauthor_medical_nodate}).
The AACT database contains a table that links clinical trials' clinical conditions
and drug names to terms in the MeSH thesaurus.
Because MeSH provides a standardized nomenclature, it simplified much of the
linking between clinical trials and other data sources.

\paragraph{RxNorm}

According to \cite{noauthor_rxnorm_nodate-1}:
\begin{displayquote}
	What is RxNorm? \\
	RxNorm is two things: a normalized naming system for generic and branded drugs;
	and a tool for supporting semantic interoperation between drug terminologies
	and pharmacy knowledge base systems.\dots \\
\end{displayquote}
Both of these functions are crucial to the analysis.
The normalized naming system allowed me to convert a diverse
set of names as recorded for each clinical trial into standardized identifiers.
These standardized identifiers are known as RxCUIs, and they are used in RxNorm
to identify not only individual drug components, but also brand names, licensed
drug/dosage pairs, and packages.
The links to other drug terminologies included links to SPL identifiers, which
permitted me to link each trial to drugs on the market at any point in time.

%How did I get and incoprorate this data.
The RxNorm data is provided in multiple formats.
The one I chose to use was a MariaDB database that backs a service called RxNav
provided by the National Library of Medicine (NLM).
The NLM provides scripts to set up and host the backing databases on your
own servers
(\cite{noauthor_rxnav---box_nodate}).
After setting up the local server, I wrote a Python program to export
the data from the RxNorm database and import it into the AACT database.
This was required because the former uses a MariaDB database server
and the latter uses a PostgreSQL database server.

With the data now available alongside the AACT database, I could link trials
to various key drug concepts, including normalized drug ingredient names,
NDCs incorporating those ingredients, and the brand names associated with the NDCs.

\paragraph{International Classification of Diseases 10th revision (ICD-10)}

%what it is
The International Classification of Diseases 10th revision (ICD-10) is a
worldwide standard for categorizing human disease maintained by the
World Health Organization.
Although the WHO version's last major update was in 2019 and it was officially
superseded in 2022 by the 11th revision (\cite{noauthor_international_nodate}),
the 10th revision is still in use in the United States, as the
Centers for Medicare and Medicaid Services (CMS) continues to publish
updated versions called
ICD-10-CM (Clinical Modification) (\cite{noauthor_2023_nodate}) and
ICD-10-PCS (Procedure Coding System) (\cite{noauthor_2023_nodate-1}) for use
in medical billing.

ICD-10 codes are organized in a hierarchy.
There are 22 highest-level categories, representing general categories such
as cancers, mental illness, and infectious diseases.
The second layer of the hierarchy consists of about 225 groupings.


%how was it used
The GBD database provided a mapping between their categories and ICD-10
codes (\cite{global_burden_of_disease_collaborative_network_global_2020}).
Unfortunately, it appears to use a combination of the default WHO ICD-10 codes
and the ICD-10-CM codes from the CMS.
Additionally, many diseases classified by ICD-10 codes do not correspond to
categories in the GBD database.

%how it was obtained
As I needed a combined list of ICD-10 codes, I first obtained the 2019 version
of the ICD-10-CM codes from the CMS (\cite{noauthor_2019_nodate}).
With the arrival of the ICD-11 system, it was difficult to find an official
source from which to download the WHO versions of ICD-10 codes.
Eventually I resorted to copying them from the navigation bar of the
\href{https://icd.who.int/browse10/2019/en}{official WHO ICD-10 (2019) website}
(\cite{noauthor_icd-10_nodate}).
After getting both sources into the same format,
I combined them and removed duplicate codes, preferring to keep the descriptions
from the WHO version.
This was done using standard Unix command-line tools.
I then imported the data into the PostgreSQL database alongside the AACT data.

\paragraph{Unified Medical Language System (UMLS) Thesaurus}

The NLM also publishes a medical terminology thesaurus
known as the Unified Medical Language System (UMLS), which links terminologies
such as RxNorm, MeSH, and ICD-10.
It is made available through an API hosted by the NLM.
One key feature is the ability to use a basic text search to find matching
terms in various terminologies.


\subsection{Data Integration}\label{dataintegration}
%Goal - Help readers understand which data were used in the analysis
Below is more information about how the data was used in the analysis.

%Describe data pulled from AACT/historical snapshots
% what are snapshots.
% enrollment
% elapsed_duration
% current_status
For clinical trials, I captured each update that occurred after the start date
and prior to the primary completion date of the trial.
For clarity, I will refer to each of these updates as a snapshot of the trial.

For each snapshot I recorded the enrollment (actual or anticipated),
the date it was submitted, the planned primary completion date,
and the trial's overall status at the time.
I also extracted the anticipated enrollment closest to the actual start date
of the trial, which I will call the planned enrollment under the assumption
that the sponsor is recording their current plan for enrollment.
From these I constructed two normalized values.

The first is a normalized measure of enrollment.
This was constructed by dividing the snapshot enrollment by the planned enrollment.
The purpose of this was to normalize enrollment to a scale roughly around 1
instead of the widely varying counts that raw enrollment would give.
The second is a measure of how far along the trial was
in its planned duration, in other words a measure of elapsed duration.
This was calculated for each snapshot as:
\begin{align}
	\text{Elapsed Duration} =
	\frac{\text{Snapshot Date} - \text{Start Date}}
	    {\text{Primary Completion Date (anticipated)} - \text{Start Date}}
\end{align}
Note that this has a range of $[0,\infty)$, although in practice
values fall within about $[0,3]$. %good to put a graph here
I also included the current status by encoding it as a set of dummy variables.
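
To make the two normalized measures concrete, the enrollment measure for snapshot $s$ of trial $i$ can be written as below.
The symbols $E_{i,s}$ (snapshot enrollment), $E_i^{\text{planned}}$ (planned enrollment), and $\tilde{E}_{i,s}$ (normalized enrollment) are notation introduced here only for illustration, and the numbers that follow are hypothetical rather than drawn from the data.
\begin{align}
	\tilde{E}_{i,s} = \frac{E_{i,s}}{E_i^{\text{planned}}}
\end{align}
For example, a trial with a planned enrollment of 200 that reports an enrollment of 150 in a later snapshot has a normalized enrollment of $150/200 = 0.75$.
Similarly, if the anticipated primary completion date falls 600 days after the start date, a snapshot submitted 300 days after the start date has an elapsed duration of $300/600 = 0.5$, while one submitted 900 days after the start date has an elapsed duration of $900/600 = 1.5$.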

%Describe linking drugs/getting number of brands
As an initial measure of market conditions I have gathered the number of brands
that are producing drugs containing the compound(s) of interest in the trial.
This was done by extracting the RxCUIs that represented the drugs of interest,
then linking those to the RxCUIs that are brands containing those ingredients.
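
One way to write this count, under the assumption that brands are pooled across all compounds in a trial and each brand is counted once, is
\begin{align}
	\text{Brands}_i = \left\vert \bigcup_{c \in C_i} B(c) \right\vert
\end{align}
where $C_i$ is the set of ingredient RxCUIs linked to trial $i$ and $B(c)$ is the set of brand RxCUIs containing ingredient $c$.
This notation is introduced here for illustration only.
For instance, a hypothetical trial of two compounds sold under brands $\{a, b\}$ and $\{b, c\}$ respectively would have a count of 3, since brand $b$ is counted only once.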


As a secondary measure of market conditions, I linked clinical trials to the
USP Drug Classification list.
Once I had linked the drugs used in a trial to the applicable USP DC category
and class, I could find the number of alternative brands in that class.
This matching was performed by hand, using a custom web interface to the database.

In order to link clinical trials to standardized ICD-10 conditions and thus
to the Global Burden of Disease data, I wrote a Python script to search the
UMLS system for ICD-10 codes that matched the MeSH descriptions for
each trial.
The searches generally produced one of three kinds of results:
\begin{enumerate}
    \item The results contained a few entries, one of which was obviously correct.
    \item The results contained a large number of entries, a few of which were correct.
    \item The results did not contain any matches.
\end{enumerate}
In all of these cases, I needed a way to validate each match and potentially add my own
ICD-10 codes to each trial.
This matching was also performed by hand, using a separate custom web interface to the database.

The effort to manually match ICD-10 codes and USP DC categories and classes is ongoing.

%Describe linking icd10 codes to GBD
% Not every icd10 code maps, so some trials are excluded.
%Describe categorizing icd10 codes
After manually matching each trial to an ICD-10 code, each trial is easily linked to
either one of the 22 highest-level categories or the 225 or so second-level
categories in the ICD-10 hierarchy.
Linking to one of the disease categories in the GBD hierarchy is similarly easy.
To get the best estimate of the size of the population associated with a disease,
each trial is linked to the most specific disease category applicable.
As not every ICD-10 code is linked to a condition in the GBD, those without any
applicable conditions are dropped from the dataset.


\end{document}