Build historical data html parser
#15
Closed
opened 4 years ago by youainti · 9 comments
Build a parser to extract data from the historical html queries.
Steps include
For simplicity, I'll be using Beautiful Soup.
So the structure is something like
So what I think I need to do is
Would this be easier in Rust or Python?
Rust
Python
Going forward with Python is probably the best path. I want to use Rust though.
Regardless of tool, I will need to:
71e87a9 started the text processing tools, in a separate file for testing purposes.
Started putting together which tables will be needed.
Started designing the data extraction pipeline:
As of b1c146d55 I have a mostly working parser.
As of 9d5a726 I have SQL for the history table and added start_date to the tracked data.
Database integrations working as of ee3e37e.
There are still issues with the parsing.
It appears that the html format changed in the last 7 months.
Elements with class="SBSCell" contain the data from the side-by-side versions.
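A rough sketch of pulling the text out of every `SBSCell` element. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup; with Beautiful Soup the equivalent lookup would be `soup.find_all(class_="SBSCell")`. The sample markup, the `SBSCellCollector` class, and `extract_sbs_cells` are made up for illustration and are not the real page layout or the code in the repo.

```python
from html.parser import HTMLParser


class SBSCellCollector(HTMLParser):
    """Collect the text of every element whose class list includes SBSCell."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts, in document order
        self._depth = 0    # > 0 while inside an SBSCell element
        self._tag = None   # tag name of the currently open SBSCell element
        self._parts = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0 and "SBSCell" in classes:
            self._tag, self._depth, self._parts = tag, 1, []
        elif self._depth and tag == self._tag:
            # Same tag nested inside the cell: track depth so we close correctly.
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the cell text.
                self.cells.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_sbs_cells(html):
    """Return the text of each SBSCell element found in the HTML string."""
    parser = SBSCellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical markup, only for demonstration.
sample = """
<table>
  <tr><td class="SBSCell">Recruiting</td><td class="SBSCell">Suspended <b>(funding)</b></td></tr>
  <tr><td class="other">ignored</td></tr>
</table>
"""
print(extract_sbs_cells(sample))
```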
Got the data parsing updated to the newest format in 3eb9a41.
Ran into a new issue with NCT00789633 where the overall status includes information as to why it was suspended. This needs adjustment to the overall status parsing code (line 164ish).
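One possible way to handle an overall status that carries a note, sketched under assumptions: the status list below is my recollection of the ClinicalTrials.gov status vocabulary, and the exact form of the note in the NCT00789633 record (parentheses, brackets, a colon) is a guess. `split_overall_status` is a hypothetical helper, not the function in the repo.

```python
# Assumed status vocabulary; adjust to whatever the historical pages actually use.
KNOWN_STATUSES = (
    "Active, not recruiting",
    "Enrolling by invitation",
    "Not yet recruiting",
    "Unknown status",
    "Recruiting",
    "Suspended",
    "Terminated",
    "Completed",
    "Withdrawn",
)


def split_overall_status(raw):
    """Split a raw status string into (status, note); note is None if absent."""
    text = " ".join(raw.split())  # normalise whitespace
    # Try longer statuses first so a short one never shadows a longer match.
    for status in sorted(KNOWN_STATUSES, key=len, reverse=True):
        if text.lower().startswith(status.lower()):
            # Whatever follows the status is treated as the note; trim the
            # separators and brackets that are assumed to wrap it.
            note = text[len(status):].strip(" :-()[]")
            return status, (note or None)
    # Unrecognised status: hand it back unchanged so nothing is lost.
    return text, None


# Hypothetical inputs, only for demonstration.
print(split_overall_status("Suspended (Funding withdrawn)"))
print(split_overall_status("Recruiting"))
```

Storing the note in its own column, instead of discarding it, keeps the status field clean while preserving the reason for the suspension.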
As of fc38a2e I have fixed the issue with the overall status containing notes and have written a justfile recipe that will download and parse the data of interest.