The initial compilation of the classes takes about 30s.
The good news here is that this only happens when the siteupdate code has changed; gmake, the program datacheck.sh runs to build siteupdate, keeps track of this for us.
When it does need to recompile, only the modules that have changed are rebuilt, so the process should be a lot faster on future runs.
It'll only get near that original 30s again when files (like Waypoint.h) that are essential building blocks of the program and #included in a lot of different modules have changed.
The less-good news is that Waypoint.h gets updated fairly often; it's where all the datacheck function prototypes are. It won't make everything recompile -- just a lot of it.
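A quick way to see this in action (just a sketch, assuming the Makefile tracks header dependencies; the .cpp name below is a stand-in):

    touch somemodule.cpp   # edit one implementation file...
    gmake                  # ...and only that module's .o is recompiled before relinking
    touch Waypoint.h       # edit a header #included in many modules...
    gmake                  # ...and every module that depends on it gets recompiled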
My first data check using C++ took just 44s, the second one 40s.
The good news is, we can easily do even better than this!
datacheck.sh and localupdate.sh use siteupdateST, a single-threaded variant. That choice dates back to before I got the standard multi-threaded version to compile & run on FreeBSD.
Using the multi-threaded version will make things even faster.
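When we do make that switch, it should be a small tweak to the wrapper scripts. Here's a sketch of what the datacheck.sh invocation might look like -- the -t thread-count flag is an assumption, so check siteupdate's usage output for the real option name:

    # Sketch of a datacheck.sh tweak; OTHER_ARGS stands for whatever arguments
    # the script already passes, and -t for the thread count is an assumption.
    OTHER_ARGS="--errorcheck"          # assuming datacheck.sh runs in error-check mode
    # ./siteupdateST $OTHER_ARGS       # current single-threaded invocation
    ./siteupdate -t 4 $OTHER_ARGS      # multi-threaded, e.g. 4 threads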
How many threads to use? Things to consider:
- Where's the sweet spot for running a production site update, before efficiency decreases and total run time increases? My tests have been using the old-school magnetic HD; would results be better writing to /fast?
- Should we use a slightly smaller number, below some point of diminishing returns, and leave some cores free for noreaster to perform its other duties?
- Where's the sweet spot for running siteupdate --errorcheck? This mode skips nmp_merged writing and subgraph generation, which don't scale well, so siteupdate --errorcheck should scale more gracefully to a larger number of threads. (A rough timing sweep, sketched below this list, could help pin this down.)
- Should we use a lower number of threads anyway, to avoid hogging all noreaster's resources? How many contributors are likely to run datacheck.sh simultaneously?
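One way to zero in on these sweet spots empirically is a quick timing sweep over thread counts -- a rough sketch, with the same assumed -t flag as above and any other required arguments left to fill in:

    #!/bin/sh
    # Rough benchmark: time an --errorcheck run at several thread counts
    # and compare wall-clock times to find the point of diminishing returns.
    for n in 1 2 4 6 8 12 16; do
        echo "=== $n threads ==="
        time ./siteupdate --errorcheck -t "$n" > /dev/null
    done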
@yakra + @Jim: Will you maintain the Python version in the future?
In the short term, I'd like to keep the two versions aligned. But once I get a little more familiar with the C++ code and we've been using it for a while with no problems, we'll probably retire the Python version.
My thoughts as well. I'll continue to maintain the Python version, for a while at least. It'll be good to have it around as a fallback in case something goes wrong.
Should all (hwy data managers) switch to C++?
Yes!
- It's way way faster!
- It will be the production version of the code, with Python to be retired at some point.
- Gimme bug reports! Hopefully bugs will be few and far between, but one never knows. I test out the program when updating it, but usually with known good commits, and/or whatever's in the master HighwayData branch at the time. There's always the possibility that just the right dataset -- something that fails (or even should pass) datacheck -- could cause something unexpected that I wouldn't happen to observe on my own. The more eyes on it the better.