Build historical data html parser #15

Closed
opened 4 years ago by youainti · 9 comments
youainti commented 4 years ago (Migrated from gitea.kgjk.icu)

Build a parser to extract data from the historical html queries.

Steps include

  • [x] Decide on language (Python or Rust)
  • [x] Implement basic parsing.
  • [x] Benchmark
  • [x] If it will take longer than 48 hours, implement parallelization (worker pool approach).
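
A minimal sketch of the benchmark-then-parallelize step, assuming the parser is exposed as a single function that handles one input at a time. The names `estimate_total_hours`, `parse_all`, and `run` are hypothetical and not taken from the repository; only the 48-hour threshold comes from the list above.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def estimate_total_hours(parse_one, sample, total_count):
    """Time parse_one over a small sample and extrapolate linearly to total_count items."""
    start = time.perf_counter()
    for item in sample:
        parse_one(item)
    seconds_per_item = (time.perf_counter() - start) / len(sample)
    return seconds_per_item * total_count / 3600


def parse_all(parse_one, items, workers=None, executor_cls=ProcessPoolExecutor):
    """Run parse_one over items with a pool of workers, keeping input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(parse_one, items))


def run(parse_one, items, sample_size=100, limit_hours=48):
    """Parse serially unless the extrapolated runtime exceeds limit_hours."""
    items = list(items)
    sample = items[:sample_size]
    if sample and estimate_total_hours(parse_one, sample, len(items)) > limit_hours:
        return parse_all(parse_one, items)
    return [parse_one(item) for item in items]
```

With the default `ProcessPoolExecutor`, `parse_one` has to be a picklable top-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. The linear extrapolation is only a rough guide, since file sizes may vary a lot.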
youainti commented 4 years ago (Migrated from gitea.kgjk.icu)

For simplicity, I'll be using Beautiful Soup.

youainti commented 4 years ago (Migrated from gitea.kgjk.icu)

So the structure is something like

<table>
    <tr>
        <form id="identifier">
            <table>
                Data of interest in rows
            </table>
        </form>
    </tr>
    <tr>
        <form id="identifier2">
            <table>
                Data of interest in rows
            </table>
        </form>
    </tr>
</table>

So what I think I need to do is

  1. extract the table containing the forms
  2. match on form identifiers
    • identify form rows of interest
    • pass those to be processed (need a way to extract new vs old data)
  3. Get a data structure and functions that can be used to push data to the database
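
Steps 1 and 2 could be sketched roughly as follows. This version uses the standard-library `html.parser` only so that it runs without dependencies; the thread itself settled on Beautiful Soup. The class name `FormRowsParser`, the helper `extract_forms`, and the cell contents in `SAMPLE` are placeholders; only the nesting and the form ids mirror the structure shown above.

```python
from html.parser import HTMLParser


class FormRowsParser(HTMLParser):
    """Map each <form id=...> to the text of the table rows nested inside it."""

    def __init__(self):
        super().__init__()
        self.forms = {}       # form id -> list of rows, each a list of cell strings
        self._form_id = None  # id of the form currently open, if any
        self._row = None      # cells of the row currently open, if any
        self._cell = None     # text pieces of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._form_id = dict(attrs).get("id")
            if self._form_id is not None:
                self.forms[self._form_id] = []
        elif self._form_id is None:
            return  # ignore everything outside an identified form
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "form":
            self._form_id = self._row = self._cell = None
        elif tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the page layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.forms[self._form_id].append(self._row)
            self._row = None


def extract_forms(html):
    """Return {form id: rows} for every form that carries an id."""
    parser = FormRowsParser()
    parser.feed(html)
    parser.close()
    return parser.forms


# Placeholder cell contents; only the nesting follows the structure above.
SAMPLE = """
<table>
  <tr><form id="identifier"><table>
    <tr><td>Label A</td><td>old</td><td>new</td></tr>
  </table></form></tr>
  <tr><form id="identifier2"><table>
    <tr><td>Label B</td><td>same</td><td>same</td></tr>
  </table></form></tr>
</table>
"""

forms = extract_forms(SAMPLE)
```

Each form id then maps to its rows as lists of cell strings, which is one way to keep the old and new columns side by side for later comparison.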

Would this be easier in Rust or Python?

Rust

  • Match etc are nice and clean
  • structs, traits, and enums
  • Will probably take me longer to write
  • For long term maintenance, this is probably best.

Python

  • Beautiful soup
  • Familiar
  • Can use dict with functions for mapping.
  • Consistent with other code.

Going forward with Python is probably the best path. I want to use Rust though.

Regardless of tool, I will need to:

  • [x] Develop tables to hold the information I want.
  • [x] Develop classes with methods to upload the data.
  • [x] Develop functions to extract the data I want as instantiated classes.
  • [x] Develop a program to iterate through, matching sections to functions.
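
The "dict with functions for mapping" idea and the last two items could look something like the sketch below. Everything named here (`FieldChange`, `keep_all`, `keep_changed`, `HANDLERS`, `process_forms`) is hypothetical, and the input is assumed to be a dict of form id to rows of `[label, old, new]` strings.

```python
from dataclasses import dataclass


@dataclass
class FieldChange:
    """One tracked field with its old and new value (hypothetical record type)."""
    form_id: str
    label: str
    old: str
    new: str

    def as_row(self):
        """Tuple in the order a parameterized INSERT would expect."""
        return (self.form_id, self.label, self.old, self.new)


def keep_all(form_id, rows):
    """Turn every row of a form into a record."""
    return [FieldChange(form_id, label.rstrip(":"), old, new) for label, old, new in rows]


def keep_changed(form_id, rows):
    """Turn only the rows whose old and new values differ into records."""
    return [record for record in keep_all(form_id, rows) if record.old != record.new]


# Form id -> extraction function; the ids are the placeholders used above.
HANDLERS = {
    "identifier": keep_all,
    "identifier2": keep_changed,
}


def process_forms(forms):
    """Dispatch each form to its handler; report ids that have no handler."""
    records, skipped = [], []
    for form_id, rows in forms.items():
        handler = HANDLERS.get(form_id)
        if handler is None:
            skipped.append(form_id)
            continue
        records.extend(handler(form_id, rows))
    return records, skipped
```

Returning the unmatched ids rather than silently dropping them makes it easier to notice when the page gains a section the parser does not know about.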
youainti commented 4 years ago (Migrated from gitea.kgjk.icu)

71e87a9

Started the text processing tools; they are in a separate file for testing purposes.

Started putting together which tables will be needed.

Started designing the data extraction pipeline:

  • Iterate through forms, extracting the data as needed
  • store the returned objects (in a dictionary?)
  • iterate through objects, constructing an insert query
  • insert data, returning primary key
  • using primary key, go back through list, inserting into other tables as needed.
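
The last three steps (insert, get the primary key back, then insert dependent rows) might be sketched like this. It uses the standard-library `sqlite3` with an in-memory database purely so the example is self-contained; the project's actual database, table layout, and column names are not shown in this thread, so `history`, `tracked_data`, their columns, and the placeholder trial id are all assumptions.

```python
import sqlite3

# Hypothetical schema: one parent row per trial version, child rows per field.
SCHEMA = """
CREATE TABLE history (
    id INTEGER PRIMARY KEY,
    nct_id TEXT NOT NULL,
    version INTEGER NOT NULL
);
CREATE TABLE tracked_data (
    history_id INTEGER NOT NULL REFERENCES history(id),
    field TEXT NOT NULL,
    value TEXT
);
"""


def store_version(conn, nct_id, version, fields):
    """Insert the parent row, then the child rows keyed by its primary key."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO history (nct_id, version) VALUES (?, ?)",
            (nct_id, version),
        )
        history_id = cur.lastrowid  # primary key of the row just inserted
        conn.executemany(
            "INSERT INTO tracked_data (history_id, field, value) VALUES (?, ?, ?)",
            [(history_id, name, value) for name, value in fields.items()],
        )
    return history_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder id and values, not real trial data.
first_id = store_version(conn, "NCT00000000", 1, {"start_date": "placeholder"})
```

On a database that supports it, an `INSERT ... RETURNING id` clause would play the same role as `lastrowid` here.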
youainti commented 4 years ago (Migrated from gitea.kgjk.icu)

As of b1c146d55 I have a mostly working parser.
As of 9d5a726 I have SQL for the history table and added start_date to the tracked data.

youainti commented 4 years ago (Migrated from gitea.kgjk.icu)
  • [x] Implement the version pulling from the DB
youainti commented 3 years ago (Migrated from gitea.kgjk.icu)

Database integrations working as of ee3e37e.

There are still issues with the parsing.

youainti commented 3 years ago (Migrated from gitea.kgjk.icu)

It appears that the HTML format changed in the last 7 months.

<tr>
    <td class="rowLabel" style="min-width: 210px;">Study Completion:</td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
</tr> 
<tr>
    <td class="rowLabel" style="min-width: 210px;">Last Update Submitted that<br/>Met QC Criteria:    </td>
    <td class="SBSCell">November <span class="drop_hilite">12</span>, 2008      </td>
    <td class="SBSCell">November <span class="add_hilite">27</span>, 2008      </td>
</tr>  

class="SBSCell" contains the data from the side by side versions.
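
As an illustration of reading that layout, here is a sketch that pulls `(label, old, new)` out of each row and keeps the text inside the `drop_hilite`/`add_hilite` spans. It again uses the standard-library `html.parser` so it runs without dependencies, rather than the Beautiful Soup code in the repository; `SideBySideParser` and `parse_side_by_side` are hypothetical names, and `SAMPLE` is the snippet above.

```python
from html.parser import HTMLParser


class SideBySideParser(HTMLParser):
    """Collect (label, old, new) tuples from rows of rowLabel/SBSCell cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._cells = None  # cells of the row being read
        self._text = None   # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            classes = (dict(attrs).get("class") or "").split()
            if "rowLabel" in classes or "SBSCell" in classes:
                self._text = []
        elif tag == "br" and self._text is not None:
            self._text.append(" ")  # a line break inside a label acts as a space

    def handle_data(self, data):
        # Text inside highlight spans arrives here too, so it is kept.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr":
            if self._cells:
                self.rows.append(tuple(self._cells))
            self._cells = None


def parse_side_by_side(html):
    """Return one tuple of cell texts per table row."""
    parser = SideBySideParser()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<tr>
    <td class="rowLabel" style="min-width: 210px;">Study Completion:</td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
    <td class="SBSCell">November 2011 [Anticipated]       </td>
</tr>
<tr>
    <td class="rowLabel" style="min-width: 210px;">Last Update Submitted that<br/>Met QC Criteria:    </td>
    <td class="SBSCell">November <span class="drop_hilite">12</span>, 2008      </td>
    <td class="SBSCell">November <span class="add_hilite">27</span>, 2008      </td>
</tr>
"""

rows = parse_side_by_side(SAMPLE)
changed = [row for row in rows if row[1] != row[2]]  # rows whose two versions differ
```

Comparing the two cell texts is one simple way to separate new from old data; relying on the highlight classes instead would be an alternative, if they turn out to be applied consistently.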

youainti commented 3 years ago (Migrated from gitea.kgjk.icu)

Got the data parsing updated to newest format in 3eb9a41.

Ran into a new issue with NCT00789633 (https://clinicaltrials.gov/ct2/history/NCT00789633?A=4&B=5&C=Side-by-Side), where the overall status includes information as to why it was suspended. This needs an adjustment to the overall status parsing code (line 164ish).
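
One possible shape for that adjustment is to match a known status at the start of the cell and treat whatever follows as a free-text note. This is only a sketch: the exact way the page appends the reason is not shown in this thread, so the punctuation stripped below, the status list, and the name `split_overall_status` are all assumptions rather than the repository's code.

```python
# Assumed list of overall status values; not taken from the repository.
KNOWN_STATUSES = (
    "Not yet recruiting",
    "Recruiting",
    "Enrolling by invitation",
    "Active, not recruiting",
    "Suspended",
    "Terminated",
    "Completed",
    "Withdrawn",
    "Unknown status",
)


def split_overall_status(text):
    """Split a status cell into (status, note); note is None when absent."""
    cleaned = " ".join(text.split())  # collapse layout whitespace
    for status in KNOWN_STATUSES:
        if cleaned.lower().startswith(status.lower()):
            # Anything after the status is kept as a note, minus wrapping punctuation.
            note = cleaned[len(status):].strip(" :-()[]")
            return status, note or None
    # Unrecognized status: keep the raw text as the note so nothing is lost.
    return None, cleaned or None
```

For example, a hypothetical cell such as `"Suspended (pending review)"` would come back as `("Suspended", "pending review")`, while a bare `"Completed"` keeps a note of `None`.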

youainti commented 3 years ago (Migrated from gitea.kgjk.icu)

As of fc38a2e I have fixed the issue with the overall status containing notes and have written a justfile recipe that will download and parse the data of interest.

Reference: youainti/ClinicalTrialsDataProcessing#15