Build historical data html parser #15

Build a parser to extract data from the historical html queries.

Steps include

- [x] Decide on language (Python or Rust)
- [x] Implement basic parsing.
- [x] Benchmark
- [x] If it will take longer than 48 hours, implement parallelization (worker pool approach).
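
A rough sketch of the benchmark and worker pool steps, using only the standard library. The function names and the `parse_one` callable (whatever function parses a single downloaded file) are placeholders, not code from this repo.

```python
import time
from concurrent.futures import ProcessPoolExecutor

BUDGET_HOURS = 48  # threshold named in the steps above


def benchmark(parse_one, sample_paths):
    """Time a serial run of parse_one over a small sample; return seconds."""
    start = time.perf_counter()
    for path in sample_paths:
        parse_one(path)
    return time.perf_counter() - start


def estimate_total_hours(sample_seconds, sample_size, total_files):
    """Extrapolate a timed sample to the full set of files, in hours."""
    return sample_seconds / sample_size * total_files / 3600


def parse_all(parse_one, paths, max_workers=None,
              executor_cls=ProcessPoolExecutor):
    """Parse every path with a pool of workers, preserving input order.

    Processes are the default because HTML parsing is CPU-bound; pass
    ThreadPoolExecutor as executor_cls if the work turns out to be I/O-bound.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(parse_one, paths, chunksize=16))
```

The idea is to run `benchmark` on a few hundred files first and only switch from a plain loop to `parse_all` when `estimate_total_hours(...)` comes out above `BUDGET_HOURS`. With a process pool, `parse_one` has to be a top-level function so it can be pickled.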


For simplicity, I'll be using Beautiful Soup.

So the structure is something like

```html
<table>
    <tr>
        <form id="identifier">
            <table>
                Data of interest in rows
            </table>
        </form>
    </tr>
    <tr>
        <form id="identifier2">
            <table>
                Data of interest in rows
            </table>
        </form>
    </tr>
</table>
```

So what I think I need to do is

1. Extract the table containing the forms.
1. Match on form identifiers:
    - identify form rows of interest
    - pass those to be processed (*need a way to extract new vs. old data*)
1. Get a data structure and functions that can be used to push data to the database.
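
A minimal sketch of the first two steps with Beautiful Soup, assuming the nesting shown above. The `HANDLERS` mapping, the `rows_as_text` handler, and the sample rows are made up for illustration; the real form ids and row contents will differ.

```python
from bs4 import BeautifulSoup

# Made-up sample that follows the table > tr > form > table nesting.
SAMPLE = """
<table>
  <tr><form id="identifier"><table>
    <tr><td>Label A:</td><td>old A</td><td>new A</td></tr>
  </table></form></tr>
  <tr><form id="identifier2"><table>
    <tr><td>Label B:</td><td>old B</td><td>new B</td></tr>
  </table></form></tr>
  <tr><form id="ignored"><table><tr><td>skip</td></tr></table></form></tr>
</table>
"""


def rows_as_text(form):
    """Return each row of a form as a list of stripped cell strings."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in form.find_all("tr")
    ]


# Hypothetical mapping from form id to the function that processes it.
HANDLERS = {
    "identifier": rows_as_text,
    "identifier2": rows_as_text,
}


def extract(html):
    """Dispatch every recognised form to its handler, keyed by form id."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for form in soup.find_all("form"):
        handler = HANDLERS.get(form.get("id"))
        if handler is not None:  # forms without a handler are skipped
            results[form["id"]] = handler(form)
    return results
```

Using the built-in `html.parser` backend keeps the `<form>`-inside-`<tr>` nesting as written. Other backends such as `lxml` or `html5lib` may move the forms around when repairing the markup, so the choice of backend probably matters here.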

Would this be easier in Rust or Python?

Rust

- `match` etc. are nice and clean
- Structs, traits, and enums
- Will probably take me longer to write
- For long-term maintenance, this is probably best.

Python

- Beautiful Soup
- Familiar
- Can use a dict of functions for mapping.
- Consistent with other code.

Going forward with Python is probably the best path. I want to use Rust though.

Regardless of tool, I will need to:

- [x] Develop tables to hold the information I want.
- [x] Develop classes with methods to upload the data.
- [x] Develop functions to extract the data I want as instantiated classes.
- [x] Develop program to iterate through, matching sections to functions.


71e87a9

Started text processing tools, in a separate file for testing purposes.

Started putting together which tables will be needed.

Started designing the data extraction pipeline:

- Iterate through forms, extracting the data as needed.
- Store the returned objects (in a dictionary?).
- Iterate through objects, constructing an insert query.
- Insert data, returning the primary key.
- Using the primary key, go back through the list, inserting into other tables as needed.
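
A sketch of the last three steps (insert, get the primary key back, reuse it for the dependent tables). It uses `sqlite3` from the standard library purely as a stand-in so the example runs anywhere; the table names, columns, and the `TrialVersion` class are hypothetical and not the schema from this repo.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class TrialVersion:
    """Hypothetical parsed record: one parent row plus dependent rows."""
    nct_id: str
    overall_status: str
    conditions: list = field(default_factory=list)


def create_tables(conn):
    """Create the two illustrative tables (parent and child)."""
    conn.executescript("""
        CREATE TABLE trial_version (
            id INTEGER PRIMARY KEY,
            nct_id TEXT NOT NULL,
            overall_status TEXT
        );
        CREATE TABLE trial_condition (
            version_id INTEGER NOT NULL REFERENCES trial_version(id),
            condition TEXT NOT NULL
        );
    """)


def upload(conn, versions):
    """Insert each parent row, then reuse its primary key for child rows."""
    with conn:  # one transaction: commit on success, roll back on error
        for v in versions:
            cur = conn.execute(
                "INSERT INTO trial_version (nct_id, overall_status) "
                "VALUES (?, ?)",
                (v.nct_id, v.overall_status),
            )
            version_id = cur.lastrowid  # primary key of the row just inserted
            conn.executemany(
                "INSERT INTO trial_condition (version_id, condition) "
                "VALUES (?, ?)",
                [(version_id, c) for c in v.conditions],
            )
```

On PostgreSQL the same pattern would normally use `INSERT ... RETURNING id` and read the key from the cursor instead of `lastrowid`.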


As of b1c146d55 I have a mostly working parser.
As of 9d5a726 I have SQL for the history table and have added start_date to the tracked data.


- [x] Implement the version pulling from the DB

Database integrations working as of ee3e37e.

There are still issues with the parsing.


It appears that the HTML format has changed in the last 7 months.

```html
<tr>
    <td class="rowLabel" style="min-width: 210px;">Study Completion:</td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
</tr>
<tr>
    <td class="rowLabel" style="min-width: 210px;">Last Update Submitted that<br/>Met QC Criteria:    </td>
    <td class="SBSCell">November <span class="drop_hilite">12</span>, 2008      </td>
    <td class="SBSCell">November <span class="add_hilite">27</span>, 2008      </td>
</tr>
```

The `class="SBSCell"` cells contain the data from the two side-by-side versions.
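
A minimal sketch of pulling the label and both versions out of rows in the new format, using the snippet above as input. The function name is a placeholder, and treating the first `SBSCell` as the older version is an assumption based on it carrying the `drop_hilite` span.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<tr>
    <td class="rowLabel" style="min-width: 210px;">Study Completion:</td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
</tr>
<tr>
    <td class="rowLabel" style="min-width: 210px;">Last Update Submitted that<br/>Met QC Criteria:    </td>
    <td class="SBSCell">November <span class="drop_hilite">12</span>, 2008      </td>
    <td class="SBSCell">November <span class="add_hilite">27</span>, 2008      </td>
</tr>
"""


def side_by_side_rows(html):
    """Yield (label, old_value, new_value) for rows with two SBSCell cells."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        label = row.find("td", class_="rowLabel")
        cells = row.find_all("td", class_="SBSCell")
        if label is None or len(cells) != 2:
            continue  # not a side-by-side data row
        # get_text() flattens the hilite spans; split/join collapses padding.
        old, new = (" ".join(cell.get_text().split()) for cell in cells)
        # A space separator keeps words on either side of <br/> apart.
        name = " ".join(label.get_text(" ").split()).rstrip(":")
        yield name, old, new


for name, old, new in side_by_side_rows(SAMPLE):
    print(f"{name}: {old!r} -> {new!r} (changed: {old != new})")
```

Comparing the two flattened strings is enough to tell whether a field changed between versions, without needing to look at the `drop_hilite` and `add_hilite` spans themselves.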


Got the data parsing updated to the newest format in 3eb9a41.

Ran into a new issue with [NCT00789633](https://clinicaltrials.gov/ct2/history/NCT00789633?A=4&B=5&C=Side-by-Side), where the overall status also includes a note explaining why the trial was suspended. The overall status parsing code (line 164ish) needs to be adjusted to handle this.
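
One possible way to handle a status cell that carries extra notes, sketched under the assumption that the cell text starts with a known status label and that anything after it is the note. The exact layout of the note in the page is not shown here, so the stripping of brackets and punctuation is a guess, and the list of labels should be adjusted to what actually appears in the data.

```python
# Overall status labels as used by ClinicalTrials.gov (assumed complete
# enough for this data; extend as needed).
KNOWN_STATUSES = [
    "Not yet recruiting",
    "Recruiting",
    "Enrolling by invitation",
    "Active, not recruiting",
    "Suspended",
    "Terminated",
    "Completed",
    "Withdrawn",
    "Unknown status",
]


def split_status(raw):
    """Split raw cell text into (status, note).

    Returns (None, text) when no known status label starts the text, so
    unexpected values can be logged instead of silently dropped.
    """
    text = " ".join(raw.split())  # collapse padding and line breaks
    for status in KNOWN_STATUSES:
        if text.lower().startswith(status.lower()):
            # Whatever follows the label is treated as the note.
            note = text[len(status):].strip(" :-[]()")
            return status, note or None
    return None, text or None
```

Keeping the note as a separate value means the status column stays restricted to the known labels while the reason for the suspension is still available to store elsewhere.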

As of fc38a2e I have fixed the issue with the overall status containing notes and have written a justfile recipe that will download and parse the data of interest.

