The initial compilation of the classes takes about 30s.
The good news here is that this only happens when the siteupdate code has changed; gmake, the program datacheck.sh runs to build siteupdate, keeps track of this for us.
When it does need to recompile, only the modules that have changed are rebuilt, so the process should be a lot faster on future runs.
It'll only get near that original 30s again when files (like Waypoint.h) that are essential building blocks of the program and #included in a lot of different modules have changed.
The less-good news is that Waypoint.h gets updated fairly often; it's where all the datacheck function prototypes are. It won't make everything recompile -- just a lot of it.
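A quick way to see this in action (just a sketch, assuming the Makefile tracks header dependencies; the .cpp name below is a stand-in):

    touch somemodule.cpp   # edit one implementation file...
    gmake                  # ...and only that module's .o is recompiled before relinking
    touch Waypoint.h       # edit a header #included in many modules...
    gmake                  # ...and every module that depends on it gets recompiled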
My first data check using C++ took just 44s, the second one 40s.
The good news is, we can easily do even better than this!
datacheck.sh and localupdate.sh use siteupdateST, a single-threaded variant. That choice dates back to before I got the standard multi-threaded version to compile & run on FreeBSD.
Using the multi-threaded version will make things even faster.
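When we do make that switch, it should be a small tweak to the wrapper scripts. Here's a sketch of what the datacheck.sh invocation might look like -- the -t thread-count flag is an assumption, so check siteupdate's usage output for the real option name:

    # Sketch of a datacheck.sh tweak; OTHER_ARGS stands for whatever arguments
    # the script already passes, and -t for the thread count is an assumption.
    OTHER_ARGS="--errorcheck"          # assuming datacheck.sh runs in error-check mode
    # ./siteupdateST $OTHER_ARGS       # current single-threaded invocation
    ./siteupdate -t 4 $OTHER_ARGS      # multi-threaded, e.g. 4 threads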
How many threads to use? Things to consider:
- Where's the sweet spot for running a production site update, before efficiency decreases and total run time increases? My tests have been using the old-school magnetic HD; would results be better writing to /fast?
- Should we use a slightly smaller number, below some point of diminishing returns, and leave some cores free for noreaster to perform its other duties?
- Where's the sweet spot for running siteupdate --errorcheck? This mode skips nmp_merged writing and subgraph generation, which don't scale well, so siteupdate --errorcheck should scale more gracefully to a larger number of threads. (A rough timing sweep, sketched below this list, could help pin this down.)
- Should we use a lower number of threads anyway, to avoid hogging all noreaster's resources? How many contributors are likely to run datacheck.sh simultaneously?
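One way to zero in on these sweet spots empirically is a quick timing sweep over thread counts -- a rough sketch, with the same assumed -t flag as above and any other required arguments left to fill in:

    #!/bin/sh
    # Rough benchmark: time an --errorcheck run at several thread counts
    # and compare wall-clock times to find the point of diminishing returns.
    for n in 1 2 4 6 8 12 16; do
        echo "=== $n threads ==="
        time ./siteupdate --errorcheck -t "$n" > /dev/null
    done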
@yakra + @Jim: Will you maintain the Python version in the future?
In the short term, I'd like to keep the two versions aligned. But once I get a little more familiar with the C++ code and we've been using it for a while with no problems, we'll probably retire the Python version.
My thoughts as well. I'll continue to maintain the Python version, for a while at least. It'll be good to have it around as a fallback in case something goes wrong.
Should all (hwy data managers) switch to C++?
Yes!
- It's way way faster!
- It will be the production version of the code, with Python to be retired at some point.
- Gimme bug reports! Hopefully bugs will be few and far between, but one never knows. I test out the program when updating it, but usually with known good commits, and/or whatever's in the master HighwayData branch at the time. There's always the possibility that just the right dataset -- something that fails (or even should pass) datacheck -- could cause something unexpected that I wouldn't happen to observe on my own. The more eyes on it the better.