Author Topic: Using grep to search highway data, waypoint labels, etc.  (Read 301 times)

Offline yakra

  • TM Collaborator
  • Hero Member
  • *****
  • Posts: 2245
  • Last Login:October 18, 2019, 03:13:53 pm
Using grep to search highway data, waypoint labels, etc.
« on: October 04, 2019, 04:41:33 pm »
grep is a powerful tool for searching text files with regular expressions. It runs on Unix-like systems such as Linux, FreeBSD (e.g. noreaster) or macOS. With a little practice, you can put together expressions to search for very specific types of text strings. In this post, I'll focus on waypoint labels.

Commands were executed from the hwy_data/ directory.
They're designed so that different regions can easily be substituted into the $rg variable, letting you run the same command several times to search different regions.
rg=PA; grep whatever $rg/*/*.wpt
rg=DE; grep whatever $rg/*/*.wpt

et cetera. You can also wrap them up inside loops to search multiple regions in one go:
for rg in PA DE; do grep whatever $rg/*/*.wpt; done
MyRegions='PA DE'
for rg in $MyRegions; do grep whatever $rg/*/*.wpt; done

et cetera.
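
If you find yourself retyping these loops, they can be folded into a small shell function. This is only a sketch: search_regions is a made-up name, and the sample directories, files and URLs exist purely so the example runs on its own. The -H flag makes grep print the file name even when a region has only one .wpt file.

```shell
# Hypothetical helper: search_regions PATTERN REGION...
search_regions() {
  pattern=$1
  shift
  for rg in "$@"; do
    grep -s -H -- "$pattern" "$rg"/*/*.wpt
  done
}

# Throwaway sample data standing in for hwy_data/ (all names made up).
tmp=$(mktemp -d)
mkdir -p "$tmp/PA/usapa" "$tmp/DE/usade"
printf '%s\n' 'US30_Lan http://www.openstreetmap.org/?lat=40.0&lon=-76.3' \
  > "$tmp/PA/usapa/pa.us030.wpt"
printf '%s\n' 'DE1_Dov http://www.openstreetmap.org/?lat=39.1&lon=-75.5' \
              'MainSt http://www.openstreetmap.org/?lat=39.2&lon=-75.6' \
  > "$tmp/DE/usade/de.de001.wpt"

# Search both regions for lines containing an underscore.
found=$(cd "$tmp" && search_regions '_' PA DE)
echo "$found"
rm -rf "$tmp"
```

From hwy_data/ the call would look like: search_regions whatever PA DE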

The sed command was included to strip out AltLabels and URLs from the .wpt file lines, to ensure that we're only searching within primary labels.
If you have questions about the various tokens, expressions or commands, feel free to ask.
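
To make the sed step less mysterious, here's a minimal sketch of what it does to a single line. The sample line (path, labels and URL) is made up, and it assumes GNU sed, since \+ is a GNU extension in this kind of expression.

```shell
# A made-up line in the shape grep prints when searching several files:
# path:PrimaryLabel AltLabel URL
line='PA/usapa/pa.pa003.wpt:US30_Lan +OldLabel http://www.openstreetmap.org/?lat=40.0&lon=-76.3'

# Step 1 turns the FIRST space into a % marker.
# Step 2 keeps only what comes before that marker
# (assuming the rest of the line contains no other %).
stripped=$(printf '%s\n' "$line" | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/')
echo "$stripped"
```

What's left is just the file name and the primary label, so later greps can anchor on the colon (e.g. ':+' for hidden labels) or on the end of the line.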

Some example search results in Pennsylvania and Delaware are posted here.

Potential no-underscore malformed city suffixes
grep '^[A-Z][A-Z][0-9]\{1,3\}[A-Z][a-z][a-z]' $rg/*/*.wpt | grep -v '[A-Z][A-Z][0-9]\{1,3\}\(Alt\|Bus\|Byp\|Trk\|His\|Spr\|Con\)'
This one can get confusing.

Intersections with visibly numbered highways
US40: US40BusWhi | US73: US40BusWhi US40BusTho | A3: A3Zur
Distinguish two different same-bannered same-numbered routes as needed with the 3-letter city abbreviations. Also use the city abbreviation for bannerless same-designation spurs or branches, such as the Zurich A3 spur intersecting the main A3.
While we do want to use city abbreviations for routes (in the HB) that have them...

Distinguishing otherwise identical waypoints (not for exit numbers)
US95: NV57_Ren | NV57_Tah
If more than two points for the same non-exit-numbered cross road are needed, there are two options which can be used in combination with or ignoring the previous options for pairs of identical labels.
1. Use alphabetical suffixes _A, _B, _C, etc.
2. Choose 3-letter suffixes for nearby towns if they are fairly close. The 3-letter suffix should be the first 3 letters of the town name. Add a suffix with an underscore and those 3 letters.
...disambiguating multiple intersections with a single route, or with roads not in the HB, is done with an underscore.
A lot of the time, people may use the former when they intended to use the latter.
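
To see what the no-underscore search flags and what it lets through, here's a minimal sketch run against throwaway files. The directory, file names, labels and URLs are all made up; the two grep expressions are the ones from the search, with the banner exclusions written as a single group that matches the same labels. It assumes GNU grep, since \| is a GNU extension here.

```shell
# Throwaway sample data (every name below is hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/PA/usapa"
cat > "$tmp/PA/usapa/pa.pa003.wpt" <<'EOF'
PA3Bus http://www.openstreetmap.org/?lat=40.1&lon=-75.1
PA3_Lan http://www.openstreetmap.org/?lat=40.2&lon=-75.2
PA3Lan http://www.openstreetmap.org/?lat=40.3&lon=-75.3
MainSt http://www.openstreetmap.org/?lat=40.4&lon=-75.4
EOF
cat > "$tmp/PA/usapa/pa.pa004.wpt" <<'EOF'
PA4Alt http://www.openstreetmap.org/?lat=40.5&lon=-75.5
US30 http://www.openstreetmap.org/?lat=40.6&lon=-75.6
EOF

# First grep: two capitals, 1-3 digits, then a capitalized 3-letter token.
# Second grep: drop the recognized banners (Alt, Bus, Byp, ...).
flagged=$(cd "$tmp" && rg=PA && grep '^[A-Z][A-Z][0-9]\{1,3\}[A-Z][a-z][a-z]' $rg/*/*.wpt \
  | grep -v '[A-Z][A-Z][0-9]\{1,3\}\(Alt\|Bus\|Byp\|Trk\|His\|Spr\|Con\)')
echo "$flagged"
rm -rf "$tmp"
```

Only PA3Lan is reported: PA3Bus and PA4Alt are filtered out as banners, and PA3_Lan never matches because of its underscore. Whether PA3Lan is really an error still needs a human look, since a route with that city abbreviation may exist in the HB.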

Underscored 4-character city suffixes not ending in N, E, W, or S
grep -s '_...[A-DF-MO-RTUVXYZa-z] ' $rg/*/*.wpt
Distinguishing otherwise identical waypoints (not for exit numbers)
NV88_PitS | NV88_PitN
If you need the same town twice, add a 4th letter that is a direction letter (_PitS and _PitN for southern and northern junctions near Pittston). If the town suffixes are not useful, are confusing, or require further elaboration (3+ junctions with same town), use the alphabetical suffixes instead.
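
A minimal sketch of the 4-character suffix search on a made-up file. The labels, path and URLs are invented for illustration, and the trailing cut (not part of the search itself) is only there to trim the output down to the label.

```shell
# Throwaway sample data (hypothetical route and labels).
tmp=$(mktemp -d)
mkdir -p "$tmp/NV/usanv"
cat > "$tmp/NV/usanv/nv.nv088.wpt" <<'EOF'
NV88_PitS http://www.openstreetmap.org/?lat=39.1&lon=-119.1
NV88_PitN http://www.openstreetmap.org/?lat=39.2&lon=-119.2
NV88_Pitt http://www.openstreetmap.org/?lat=39.3&lon=-119.3
NV88_Ren http://www.openstreetmap.org/?lat=39.4&lon=-119.4
NV88_Reno http://www.openstreetmap.org/?lat=39.5&lon=-119.5
EOF

# The character class lists every letter EXCEPT capital N, E, W and S,
# so a 4th suffix character that isn't a direction letter gets flagged.
flagged=$(cd "$tmp" && rg=NV && grep -s '_...[A-DF-MO-RTUVXYZa-z] ' $rg/*/*.wpt | cut -d' ' -f1)
echo "$flagged"
rm -rf "$tmp"
```

_PitS and _PitN pass, the 3-letter _Ren passes, and the 4-letter _Pitt and _Reno are reported.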

2-character city suffixes
grep '_.. ' $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep -v ':+' | grep '_..$'

Old highway designation labels
grep '^Old..[0-9]' $rg/*/*.wpt
Waypoints for roads that no longer have a name or no longer exist as a road
If the old highway has a posted name or number, use that name or number (don't make one up), and label the waypoint according to the usual rules.
If US 30/Main Street becomes Main Street, then the label is MainSt, not something like OldUS30.

If the old highway has a posted name that mentions the old designation, then applying the above rule will result in a label like OldUS40 for "Old Route 40" if the route was formerly US 40.
There will be a good number of false-positive, legitimate labels here.
I'm not sure how closely these rules (if it's a requirement & not an option) have been followed; there are a lot of examples where "OldRt" or "OldHwy", etc. are used if a road is signed as such in the field, or if that's the official name. (Although, Tim may have done it that way before these specific guidelines were codified...) May be something worth discussing in the forum.

Long words, >4 characters
grep . $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep -v ':+' | grep ':.*[a-z]\{4\}'
Intersections with named highways
FaiRd PapMillRd
For up to two other (specifying) words, truncate the word as follows:
1-4 letters - use whole word
5+ letters - use the first 3 letters. Don't use a made-up abbreviation. Fairchild Road becomes FaiRd, not FrchldRd or FchRd or anything else.
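
Here's a small sketch of the final grep in the long-words search, fed with made-up lines that have already been reduced to path:PrimaryLabel. Since each truncated word starts with a capital, four lowercase letters in a row mean a word of at least five characters.

```shell
# Made-up lines in the shape left behind by the sed step.
flagged=$(grep ':.*[a-z]\{4\}' <<'EOF'
ME/usame/me.me009.wpt:FaiRd
ME/usame/me.me009.wpt:PapMillRd
ME/usame/me.me009.wpt:FrchldRd
ME/usame/me.me009.wpt:FairchildRd
ME/usame/me.me009.wpt:US1
EOF
)
echo "$flagged"
```

FaiRd and PapMillRd follow the truncation rule and pass; FrchldRd and FairchildRd are reported. The leading colon in the pattern keeps lowercase runs in the path, like "usame", from triggering a match.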

Too many words
grep -s . $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep -v '[NPS]P$\|:[A-Z]*[a-z]*Mc[A-Z][A-Z][a-z]*$\|:Mc[A-Z][A-Z]\+[a-z]*[A-Z][a-z]*$\|:[A-Z]*Mc[A-Z]\+[a-z]\?$' | grep '[A-Z]\+[a-z]\+[A-Z]\+[a-z]\+[A-Z]\+[a-z]\+[A-Z]\+[a-z]*\|[A-Z]\+[a-z]\+[A-Z]\+[a-z]\+[A-Z]\+[a-z]\?[A-Z]\+[a-z]\+\|[A-Z]\+[a-z]\+[A-Z]\+[a-z]\?[A-Z]\+[a-z]\+[A-Z]\+[a-z]*\|[A-Z]\+[a-z]\?[A-Z]\+[a-z]\+[A-Z]\+[a-z]\+[A-Z]\+[a-z]*'
Intersections with named highways
MarKingBlvd MLKingBlvd
If the cross road name has more than 3 words, use one of two options:
1. Pick out the two most important words besides the road type and use only those: Martin Luther King Boulevard becomes MarKingBlvd. Three words in total are included in shortened form.
2. Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.
An older style for 3-word road names was to have two truncated words & one initial. E.g., Lisbon Falls Village Rd -> LisFalVRd.
This was later deprecated in favor of the rule quoted above, which gives LisFalRd, LFVilRd, etc.

grep . $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep 'Mc[A-Z][a-z]'
In the "Too many words" search above, the Mc[A-Z] tokens are included to filter out "Old McDonald Farm Rd" style false positives.
OTOH, McDonald is one word, so it makes sense to truncate it to McD, rather than treat it as two words and get McDon.

Tpke for Turnpike
grep -s Tpke $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep 'Tpke'
CHM standardized on "Tpk" for turnpike, standard abbreviations notwithstanding...

Directional prefixes
grep '^[NEWS]' $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep ':[NEWS]' | grep -v ":+\|:.[a-z]\|:[A-Z][A-Z]/[A-Z][A-Z]$\|$rg[0-9]\{1,3\}\|:[NEWS]End$"
Intersections with named highways
6thSt | 33rdAve | SeeLn
Ignore any non-essential direction specifier. N. 6th St becomes 6thSt. 33rd Avenue SW becomes 33rdAve. W. Seedy Lane becomes SeeLn.

NorPkwy | SouBlvd
But keep directions that are the main part of the road name, such as NorPkwy for Northern Parkway or SouBlvd for Southeast Boulevard.
What's a "non-essential direction specifier" and what are "directions that are the main part of the road name" can get wibbly-wobbly at times; use your best judgment. Many of these may be legit, but they're worth checking out.
IMO though, N/E/W/S are common enough abbreviations for their respective directions, and there are enough examples of this throughout the data, that I'm not going to call for every WPondRd to be changed to a WestPondRd.

Similarly, a search for suffixed directional specifiers:
grep '[NEWS]' $rg/*/*.wpt | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep '[NEWS]$' | grep -v ':+\|_\|[0-9]\{1,3\}[NEWS]$\|:[A-Z][A-Z]/[A-Z][A-Z]$\|:CR[A-Z]$\|:Rd[A-Z]$\|:Ave[A-Z]$\|:I-[0-9]\{1,3\}BS$'

Offline yakra

Re: Using grep to search highway data, waypoint labels, etc.
« Reply #1 on: October 06, 2019, 01:12:13 pm »
Rte(Number) style labels on US Interstates
for file in */usai/*.wpt; do
  # Collect the I-nn(nn) style labels once, then print them only if there are 3 or more.
  labels=$(grep -H . "$file" | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep 'I-[0-9]\+(.\+)')
  if [ "$(printf '%s\n' "$labels" | grep -c .)" -gt 2 ]; then printf '%s\n' "$labels"; fi
done
One point of confusion can be the two styles of waypoint labeling for concurrencies with Interstate Highways and other exit numbered routes.

Interchanges on exit-numbered highways
I-80: 56 | 87(75) | 89(75) | 63
In multiplexes where the concurrency uses exit numbers from the other highway, put the highway number in parentheses. Drop the letter prefix of the concurrent highway if it is more than one character long: I-75 becomes (75). A5 can stay as (A5).
The vast majority of Interstates (with rare exceptions, most obviously WY I-180) will be exit-numbered routes, and thus fall into this category.

Waypoint labels for multiplexes
US90: LA76 | I-49(45) | I-49(47) | I-49(52) | US40
For non-exit-numbered routes concurrent with a numbered, exit-numbered route, use the concurrent highway designation with the exit numbers in parentheses.
We'll sometimes see the second style used erroneously on exit-numbered routes; this basic (Interstate-to-Interstate only) search is designed to pick up on these cases.
It only prints results for routes with at least 3 such labels: Some routes, such as NY I-490 or ME I-295, have both ends at unnumbered interchanges with the parent route, necessitating this style label. Some routes, such as TX I-69, are not fully exit-numbered yet, and contain some crossroad-style labels.
Right now, it reports some malformed labels in ms.i055.wpt & wi.i041.wpt.
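
To illustrate the "at least 3 such labels" threshold, here's a minimal sketch on two made-up Interstate files. The XX region, route numbers, labels and URL are all hypothetical, and it assumes GNU grep and sed because of the \+ tokens.

```shell
# Throwaway sample data: one file with three I-nn(nn) labels, one with two.
tmp=$(mktemp -d)
mkdir -p "$tmp/XX/usai"
u='http://www.openstreetmap.org/?lat=0&lon=0'
printf '%s\n' "1 $u" "I-99(2) $u" "I-99(3) $u" "I-99(4) $u" > "$tmp/XX/usai/xx.i001.wpt"
printf '%s\n' "I-1(7) $u" "5 $u" "6 $u" "I-1(9) $u" > "$tmp/XX/usai/xx.i002.wpt"

report=$(cd "$tmp" && for file in */usai/*.wpt; do
  # Primary labels only, then keep the I-nn(nn) style ones.
  hits=$(grep -H . "$file" | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' | grep 'I-[0-9]\+(.\+)')
  count=$(printf '%s\n' "$hits" | grep -c 'I-')
  # Report a file only when it has more than 2 such labels.
  if [ "$count" -gt 2 ]; then printf '%s\n' "$hits"; fi
done)
echo "$report"
rm -rf "$tmp"
```

xx.i001.wpt is reported with its three labels; xx.i002.wpt has only two, the pattern you'd expect from unnumbered interchanges with the parent at both ends, so it stays quiet.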

How far down this rabbit hole can we go?
An experiment. A somewhat unwieldy one...
rm -f exit_numbers.log looks_unnumbered.log looks_numbered.log
touch exit_numbers.log looks_unnumbered.log looks_numbered.log
for rg in `ls -l | grep ^d | sed 's/.* \(.\+\)$/\1/'`; do
  for sys in `ls -l $rg | grep ^d | sed 's/.* \(.\+\)$/\1/'`; do
    for file in `ls $rg/$sys | grep '\.wpt$'`; do
      # This gets messed up by files with spaces in their names, and ignores them.
      if [ -r $rg/$sys/$file ]; then
        # 1 if at least one primary label is a bare number (route looks exit-numbered), else 0
        LooksNumbered=`sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' $rg/$sys/$file | grep -c -m 1 '^[0-9]\+$'`
        if [ $LooksNumbered == 0 ]; then
          # Count exit(route) style labels, e.g. 87(75)
          results=`grep . $rg/$sys/$file | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' \
                   | grep '^\*\?[0-9]\+[A-Z]\?([0-9A-Za-z]\+)' | wc -l`
          if (( $results != 0 && $results != `cat $rg/$sys/$file | wc -l` )); then
            echo -e "\n$rg/$sys/$file looks like a non-exit-numbered route" \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
            grep . $rg/$sys/$file | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' \
            | grep '^\*\?[0-9]\+[A-Z]\?([0-9A-Za-z]\+)' \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
          fi
        fi
        if [ $LooksNumbered == 1 ]; then
          # Count route(exit) style labels, e.g. I-49(45)
          results=`grep . $rg/$sys/$file | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' \
                   | grep '^\*\?[A-Z]\+[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' | wc -l`
          if [ $results -gt 2 ]; then
            echo -e "\n$rg/$sys/$file looks like an exit-numbered highway" \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
            grep . $rg/$sys/$file | sed -e 's/ /%/' -e 's/\(.\+\)%.\+/\1/' \
            | grep '^\*\?[A-Z]\+[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
          fi
        fi
      fi
    done
  done
done

Works on my Linux machines, but not noreaster, for reasons I've not looked into.
looks numbered:   flags   624 labels in  60 routes in  27 regions maintained by 13 contributors
looks unnumbered: flags 14143 labels in 752 routes in 105 regions maintained by  8 contributors
combined:         flags 14767 labels in 812 routes in 115 regions maintained by 14 contributors
...Back away slowly.
« Last Edit: October 06, 2019, 01:20:01 pm by yakra »