Topic: Using grep to search highway data, waypoint labels, etc.

Offline yakra

  • TM Collaborator
  • Hero Member
  • *****
  • Posts: 2373
  • Last Login:Yesterday at 11:29:13 pm
Using grep to search highway data, waypoint labels, etc.
« on: October 04, 2019, 04:41:33 pm »
grep is a powerful tool for searching text files with regular expressions. It runs on Unix-like systems such as Linux, FreeBSD (e.g. noreaster), or macOS. With a little practice, you can put together expressions to search for very specific types of text strings. In this post, I'll focus on waypoint labels.

Commands were executed from the hwy_data/ directory.
They're written so that you can substitute a different region into the $rg variable and rerun the same command to search that region instead.
rg=PA
grep whatever $rg/*/*.wpt
rg=DE
grep whatever $rg/*/*.wpt

et cetera. You can also wrap them up inside loops to search multiple regions in one go:
for rg in PA DE; do grep whatever $rg/*/*.wpt; done
or
MyRegions='PA DE'
for rg in $MyRegions; do grep whatever $rg/*/*.wpt; done

et cetera.

The cut command was included to strip out AltLabels and URLs from the .wpt file lines, to ensure that we're only searching within primary labels.
If you have questions about the various tokens, expressions or commands, feel free to ask.
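As a rough sketch of what the cut step does: the lines below are made-up examples in the .wpt layout (primary label, optional +AltLabels, then a URL). The labels and the example.invalid URLs are placeholders, not real data.

```shell
# Hypothetical .wpt lines: primary label, optional +AltLabels, then a URL.
sample='US30 +US30_Old http://example.invalid/?lat=40.0&lon=-76.0
+X123456 http://example.invalid/?lat=40.1&lon=-76.1
MainSt http://example.invalid/?lat=40.2&lon=-76.2'

# Drop hidden points (lines starting with +), then keep only the first
# space-delimited field, i.e. the primary label.
primary=$(printf '%s\n' "$sample" | grep -v '^+' | cut -f1 -d' ')
printf '%s\n' "$primary"
```

Only US30 and MainSt survive; the AltLabel and the URLs are gone before any label pattern is applied.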

Some example search results in Pennsylvania and Delaware are posted here.



Potential no-underscore malformed city suffixes
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep -v '[A-Z][A-Z][0-9]{1,3}Alt|[A-Z][A-Z][0-9]{1,3}Bus|[A-Z][A-Z][0-9]{1,3}Byp|[A-Z][A-Z][0-9]{1,3}Trk|[A-Z][A-Z][0-9]{1,3}His|[A-Z][A-Z][0-9]{1,3}Spr|[A-Z][A-Z][0-9]{1,3}Con' | egrep ':[A-Z]{2}[0-9]{1,3}[A-Z][a-z]{2}'
This one can get confusing.
Edit, 2019-11-18 01:32 EST: Only check primary labels; avoid false negatives.

Intersections with visibly numbered highways
Quote
US40: US40BusWhi | US73: US40BusWhi US40BusTho | A3: A3Zur
Distinguish two different same-bannered same-numbered routes as needed with the 3-letter city abbreviations. Also use the city abbreviation for bannerless same-designation spurs or branches, such as the Zurich A3 spur intersecting the main A3.
While we do want to use city abbreviations for routes (in the HB) that have them...

Distinguishing otherwise identical waypoints (not for exit numbers)
Quote
US95: NV57_Ren | NV57_Tah
If more than two points for the same non-exit-numbered cross road are needed, there are two options which can be used in combination with or ignoring the previous options for pairs of identical labels.
1. Use alphabetical suffixes _A, _B, _C, etc.
2. Choose 3-letter suffixes for nearby towns if they are fairly close. The 3-letter suffix should be the first 3 letters of the town name. Add a suffix with an underscore and those 3 letters.
...disambiguating multiple intersections with a single route, or with roads not in the HB, is done with an underscore.
A lot of the time, people use the former style (no underscore) when they intended the latter.
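A small sketch of how the two filters interact, run on made-up "file:label" lines rather than real data. The XX/sys paths are placeholders, and the banner alternation is condensed into one group here; that condensed form is my shorthand, not the command as posted.

```shell
# Made-up "file:label" lines, shaped like grep output over several files.
labels='XX/sys/xx.us040.wpt:US40BusWhi
XX/sys/xx.us095.wpt:NV57_Ren
XX/sys/xx.us095.wpt:NV57Ren'

# Condensed equivalent of the long banner alternation.
banners='[A-Z][A-Z][0-9]{1,3}(Alt|Bus|Byp|Trk|His|Spr|Con)'

# First drop legitimately bannered routes, then flag a 2-letter prefix,
# 1-3 digits and a capitalized 3-letter suffix with no underscore.
flagged=$(printf '%s\n' "$labels" \
  | grep -Ev "$banners" \
  | grep -E ':[A-Z]{2}[0-9]{1,3}[A-Z][a-z]{2}')
printf '%s\n' "$flagged"
```

US40BusWhi is skipped as a bannered route, NV57_Ren passes because of the underscore, and only NV57Ren is reported.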



Underscored 4-character city suffixes not ending in N, E, W, or S
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep '_...[A-DF-MO-RTUVXYZa-z]$'
Distinguishing otherwise identical waypoints (not for exit numbers)
Quote
NV88_PitS | NV88_PitN
If you need the same town twice, add a 4th letter that is a direction letter (_PitS and _PitN for southern and northern junctions near Pittston). If the town suffixes are not useful, are confusing, or require further elaboration (3+ junctions with same town), use the alphabetical suffixes instead.
Edit, 2019-11-17 20:58 EST: Only check primary labels.
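To see what the character class does, here is a sketch on a few made-up labels (the XX/sys path is a placeholder). The class lists every letter except capital N, E, W and S, so a 4-character suffix is only accepted when its last letter is a direction.

```shell
# Made-up "file:label" lines.
labels='XX/sys/xx.nv088.wpt:NV88_PitS
XX/sys/xx.nv088.wpt:NV88_Pitt
XX/sys/xx.nv088.wpt:NV88_Pit
XX/sys/xx.nv088.wpt:NV88_A'

# Underscore, any three characters, then a final character that is
# anything but N, E, W or S.
flagged=$(printf '%s\n' "$labels" | grep '_...[A-DF-MO-RTUVXYZa-z]$')
printf '%s\n' "$flagged"
```

Only NV88_Pitt is reported; the 3-letter and 1-letter suffixes are too short to match, and _PitS ends in a direction letter.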



2-character city suffixes
egrep -sv '^\+' $rg/*/*.wpt | cut -f1 -d' ' | grep -v '_U[0-9]$' | grep '_..$'
Edit, 2019-11-17 22:08 EST: Exclude false positives caused by numbered U-turn ramps.
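A quick sketch of the U-turn exclusion, again on made-up labels with a placeholder path:

```shell
# Made-up "file:label" lines.
labels='XX/sys/xx.us001.wpt:US1_U2
XX/sys/xx.us001.wpt:US1_Ab
XX/sys/xx.us001.wpt:US1_Abc
XX/sys/xx.us001.wpt:US1_A'

# Drop numbered U-turn ramps first, then flag any remaining label that
# ends in an underscore plus exactly two characters.
flagged=$(printf '%s\n' "$labels" | grep -v '_U[0-9]$' | grep '_..$')
printf '%s\n' "$flagged"
```

US1_U2 would match '_..$' on its own, which is why the exclusion has to come first; only US1_Ab is left.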



Old highway designation labels
grep -s '^Old..[0-9]' $rg/*/*.wpt
Waypoints for roads that no longer have a name or no longer exist as a road
Quote
MainSt
If the old highway has a posted name or number, use that name or number (don't make one up), and label the waypoint according to the usual rules.
If US 30/Main Street becomes Main Street, then the label is MainSt, not something like OldUS30.

OldUS40
If the old highway has a posted name that mentions the old designation, then applying the above rule will result in a label like OldUS40 for "Old Route 40" if the route was formerly US 40.
There will be a good number of false-positive, legitimate labels here.
I'm not sure how closely these rules (if it's a requirement & not an option) have been followed; there are a lot of examples where "OldRt" or "OldHwy", etc. are used if a road is signed as such in the field, or if that's the official name. (Although, Tim may have done it that way before these specific guidelines were codified...) May be something worth discussing in the forum.



Long words, >4 characters
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep ':.*[a-z]{4}'
Intersections with named highways
Quote
FaiRd PapMillRd
For up to two other (specifying) words, truncate the word as follows:
1-4 letters - use whole word
5+ letters - use the first 3 letters. Don't use a made-up abbreviation. Fairchild Road becomes FaiRd, not FrchldRd or FchRd or anything else.
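As a sketch of why four lowercase letters in a row is the test: a capital followed by four or more lowercase letters means a word of 5+ characters was left untruncated. The labels below are made up for illustration and the path is a placeholder.

```shell
# Made-up "file:label" lines.
labels='XX/sys/xx.rte1.wpt:FaiRd
XX/sys/xx.rte1.wpt:FairchildRd
XX/sys/xx.rte1.wpt:PapMillRd'

# Anchor on the colon so the lowercase file path is never searched.
flagged=$(printf '%s\n' "$labels" | grep -E ':.*[a-z]{4}')
printf '%s\n' "$flagged"
```

PapMillRd passes because Mill is a whole 4-letter word (one capital plus three lowercase), while FairchildRd is reported.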



Too many words
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep -v '[NPS]P$|:[A-Z]*[a-z]*Mc[A-Z][A-Z][a-z]*$|:Mc[A-Z]{2,}[a-z]*[A-Z][a-z]*$|:[A-Z]*Mc[A-Z]+[a-z]?$' | egrep '[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*'
Intersections with named highways
Quote
MarKingBlvd MLKingBlvd
If the cross road name has more than 3 words, use one of two options:
1. Pick out the two most important words besides the road type and use only those: Martin Luther King Boulevard becomes MarKingBlvd. Three words in total are included in shortened form.
2. Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.
An older style for 3-word road names was to use two truncated words and one initial, e.g. Lisbon Falls Village Rd -> LisFalVRd.
This was later deprecated in favor of the rule quoted above, which gives LisFalRd, LFVilRd, etc.
Edit, 2019-11-17 21:40 EST: Use egrep for BSD (e.g. noreaster) compatibility due to + token.
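The expression is easier to follow when tried on a few labels. This sketch copies the two patterns from the command above into variables (the variable names and the XX/sys path are mine) and runs them on made-up "file:label" lines.

```shell
# Made-up "file:label" lines.
labels='XX/sys/xx.rte1.wpt:MarKingBlvd
XX/sys/xx.rte1.wpt:MLKingBlvd
XX/sys/xx.rte1.wpt:LisFalVRd'

# Known-OK shapes to skip (park abbreviations, Mc names).
exclude='[NPS]P$|:[A-Z]*[a-z]*Mc[A-Z][A-Z][a-z]*$|:Mc[A-Z]{2,}[a-z]*[A-Z][a-z]*$|:[A-Z]*Mc[A-Z]+[a-z]?$'
# Four capitalized chunks, with at most one of them a bare initial.
include='[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*'

flagged=$(printf '%s\n' "$labels" | grep -Ev "$exclude" | grep -E "$include")
printf '%s\n' "$flagged"
```

Both styles from the quoted rule pass, and the older two-words-plus-initial style (LisFalVRd) is the one reported.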



McDonRd
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep 'Mc[A-Z][a-z]'
In the "Too many words" search above, the Mc[A-Z] tokens are included to filter out "Old McDonald Farm Rd" style false positives.
OTOH, McDonald is one word, so it makes sense to truncate it to McD, rather than treat it as 2 for McDon.
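A tiny sketch of what this search separates, on two made-up labels with a placeholder path:

```shell
# Made-up "file:label" lines.
labels='XX/sys/xx.rte1.wpt:McDonRd
XX/sys/xx.rte1.wpt:McDRd'

# "Mc", a capital, then a lowercase letter: the name was truncated as if
# "Mc" and the rest were two separate words.
flagged=$(printf '%s\n' "$labels" | grep 'Mc[A-Z][a-z]')
printf '%s\n' "$flagged"
```

McDRd passes because the capital after Mc is followed directly by the next word; McDonRd is reported.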



Tpke for Turnpike
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep 'Tpke'
CHM standardized on "Tpk" for turnpike, standard abbreviations notwithstanding...



Directional prefixes
egrep -sv "^\+|^.[a-z]|^$rg[0-9]+" $rg/*/*.wpt | cut -f1 -d' ' | grep ':[NEWS]' | egrep -v ":[A-Z][A-Z]/[A-Z][A-Z]$|:[NEWS]End$"
Intersections with named highways
Quote
6thSt | 33rdAve | SeeLn
Ignore any non-essential direction specifier. N. 6th St becomes 6thSt. 33rd Avenue SW becomes 33rdAve. W. Seedy Lane becomes SeeLn.

NorPkwy | SouBlvd
But keep directions that are the main part of the road name, such as NorPkwy for Northern Parkway or SouBlvd for Southeast Boulevard.
What's a "non-essential direction specifier" and what are "directions that are the main part of the road name" can get wibbly-wobbly at times; use your best judgment. Many of these may be legit, but they're worth checking out.
IMO though, N/E/W/S are common enough abbreviations for their respective directions, and there are enough examples of this throughout the data, that I'm not going to call for every WPondRd to be changed to a WestPondRd.
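Here is a self-contained sketch of the prefix search on a throwaway directory tree. The two files and their labels are invented for the example. Two files are used on purpose: grep only prefixes matches with "filename:" when it is given more than one file, and the ':[NEWS]' pattern depends on that colon.

```shell
# Build a temporary hwy_data-like tree with two made-up .wpt files.
tmp=$(mktemp -d)
mkdir -p "$tmp/PA/usapa"
printf '%s\n' 'WPondRd http://example.invalid/1' 'PA100 http://example.invalid/2' \
  '+X1 http://example.invalid/3' 'MainSt http://example.invalid/4' \
  > "$tmp/PA/usapa/pa.pa001.wpt"
printf '%s\n' 'NEnd http://example.invalid/5' 'NorPkwy http://example.invalid/6' \
  > "$tmp/PA/usapa/pa.pa002.wpt"

rg=PA
# Skip hidden points, labels whose 2nd character is lowercase, and
# in-region route numbers; then flag primary labels starting with N/E/W/S,
# minus border labels and NEnd/SEnd style endpoints.
flagged=$(cd "$tmp" && grep -E -sv "^\+|^.[a-z]|^${rg}[0-9]+" "$rg"/*/*.wpt \
  | cut -f1 -d' ' | grep ':[NEWS]' \
  | grep -E -v ':[A-Z][A-Z]/[A-Z][A-Z]$|:[NEWS]End$')
printf '%s\n' "$flagged"
rm -rf "$tmp"
```

NorPkwy never reaches the direction test because its second character is lowercase, so only WPondRd is reported for a human to judge.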



Similarly, a search for suffixed directional specifiers:
egrep -sv '^\+' $rg/*/*.wpt | cut -f1 -d' ' | grep '[NEWS]$' | egrep -v '_|[0-9]{1,3}[NEWS]$|:[A-Z][A-Z]/[A-Z][A-Z]$|:CR[A-Z]$|:Rd[A-Z]$|:Ave[A-Z]$|:I\-[0-9]{1,3}BS$'

Edit, 2019-11-17 20:11 EST: Use cut instead of sed to isolate primary labels. More concise, clear, fast.
Edit, 2019-11-17 21:08 EST: Use grep -s to suppress error messages for regions with no WPTs.
Edit, 2019-11-18 01:07 EST: Use egrep for more human-readable expression strings, particularly with | and {}
Edit, 2019-11-18 01:53 EST: Exclude hidden points from all searches.
« Last Edit: November 18, 2019, 01:54:17 am by yakra »

Offline yakra
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #1 on: October 06, 2019, 01:12:13 pm »
Rte(Number) style labels on US Interstates
for file in */usai/*.wpt; do results=`grep -H . $file | cut -f1 -d' ' | grep 'I-[0-9][0-9]*(..*)' | wc -l`; if [ $results -gt 2 ]; then grep -H . $file | cut -f1 -d' ' | grep 'I-[0-9][0-9]*(..*)'; fi; done
One point of confusion can be the two styles of waypoint labeling for concurrencies with Interstate Highways and other exit numbered routes.

Interchanges on exit-numbered highways
Quote
I-80: 56 | 87(75) | 89(75) | 63
In multiplexes where the concurrency uses exit numbers from the other highway, put the highway number in parentheses. Drop the letter prefix of the concurrent highway if it is more than one character long: I-75 becomes (75). A5 can stay as (A5).
The vast majority of Interstates (with rare exceptions, most obviously WY I-180) will be exit-numbered routes, and thus fall into this category.

Waypoint labels for multiplexes
Quote
US90: LA76 | I-49(45) | I-49(47) | I-49(52) | US40
For non-exit-numbered routes concurrent with a numbered, exit-numbered route, use the concurrent highway designation with the exit numbers in parentheses.
We'll sometimes see this second style used erroneously on exit-numbered routes; this basic (Interstate-to-Interstate only) search is designed to pick up on these cases.
It only prints results for routes with at least 3 such labels, because a few of them can be legitimate: some routes, such as NY I-490 or ME I-295, have both ends at unnumbered interchanges with the parent route, necessitating this style of label. Other routes, such as TX I-69, are not fully exit-numbered yet, and contain some crossroad-style labels.
Right now, it reports some malformed labels in ms.i055.wpt & wi.i041.wpt.
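A sketch of the threshold logic on one hypothetical file's worth of labels. It uses grep -c in place of the wc -l count and leaves out the filename prefix, so it is a simplification of the loop above rather than the same command.

```shell
# Made-up primary labels for one hypothetical Interstate file.
labels='56
87(75)
I-49(45)
I-49(47)
I-49(52)'

# In basic regex syntax, unescaped parentheses are literal characters.
count=$(printf '%s\n' "$labels" | grep -c 'I-[0-9][0-9]*(..*)')

# Only report when there are at least 3 Rte(Number) style labels.
report=''
if [ "$count" -gt 2 ]; then
  report=$(printf '%s\n' "$labels" | grep 'I-[0-9][0-9]*(..*)')
  printf '%s\n' "$report"
fi
```

The exit-number style label 87(75) is not counted; the three I-49(...) labels push the file over the threshold.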



How far down this rabbit hole can we go?
An experiment. A somewhat unwieldy one...
rm -f exit_numbers.log looks_unnumbered.log looks_numbered.log
touch exit_numbers.log looks_unnumbered.log looks_numbered.log
for rg in `ls -F | grep '/$' | cut -f1 -d/`; do
  for sys in `ls -F $rg | grep '/$' | cut -f1 -d/`; do
    for file in `ls $rg/$sys | grep '\.wpt$'`; do
      # This gets messed up by files with spaces in their names, and ignores them.
      if [ -r $rg/$sys/$file ]; then
        LooksNumbered=`cut -f1 -d' ' $rg/$sys/$file | grep -c -m 1 '^[0-9][0-9]*$'`
        if [ $LooksNumbered == 0 ]; then
          results=`cut -f1 -d' ' $rg/$sys/$file | grep '^\*\?[0-9][0-9]*[A-Z]\?([0-9A-Za-z][0-9A-Za-z]*)' | wc -l`
          if (( $results != 0 && $results != `cat $rg/$sys/$file | wc -l` )); then
            echo -e "\n$rg/$sys/$file looks like a non-exit-numbered route" \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
            cut -f1 -d' ' $rg/$sys/$file \
            | grep '^\*\?[0-9][0-9]*[A-Z]\?([0-9A-Za-z][0-9A-Za-z]*)' \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
          fi
        fi
        if [ $LooksNumbered == 1 ]; then
          results=`cut -f1 -d' ' $rg/$sys/$file | grep '^\*\?[A-Z][A-Z]*[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' | wc -l`
          if [ $results -gt 2 ]; then
            echo -e "\n$rg/$sys/$file looks like an exit-numbered highway" \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
            cut -f1 -d' ' $rg/$sys/$file \
            | grep '^\*\?[A-Z][A-Z]*[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
          fi
        fi
      fi
    done
  done
done
lbc=`cut -f1 -d' ' exit_numbers.log | grep -v '\.wpt$' | wc -w`
rtc=`cut -f1 -d' ' exit_numbers.log | grep '\.wpt$' | wc -l`
rgc=`cut -f1 -d' ' exit_numbers.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
lbw=`echo $lbc | wc -c`
rtw=`echo $rtc | wc -c`
rgw=`echo $rgc | wc -c`
printf   'looks   numbered: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' \
   `cut -f1 -d' ' looks_numbered.log | grep -v '\.wpt$' | wc -w` \
   `cut -f1 -d' ' looks_numbered.log | grep '\.wpt$' | wc -l` \
   `cut -f1 -d' ' looks_numbered.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
printf   'looks unnumbered: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' \
   `cut -f1 -d' ' looks_unnumbered.log | grep -v '\.wpt$' | wc -w` \
   `cut -f1 -d' ' looks_unnumbered.log | grep '\.wpt$' | wc -l` \
   `cut -f1 -d' ' looks_unnumbered.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
printf   '        combined: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' $lbc $rtc $rgc

This originally worked on my Linux machines but not on noreaster; that's fixed now.
looks   numbered:    600 labels in   58 routes in   26 regions
looks unnumbered:  12425 labels in  703 routes in  102 regions
        combined:  13025 labels in  761 routes in  112 regions
...Back away slowly.
« Last Edit: November 17, 2019, 04:36:11 pm by yakra »

Offline yakra
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #2 on: November 18, 2019, 03:16:35 am »
Label lacks generic highway type
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep 'Old[0-9]+_|Old[0-9]+$'
Label should be OldUS42, OldHwy42, OldRte42, OldME42, etc., not "Old42". This was a datacheck on CHM. A comment describing the siteupdate program's DatacheckEntry class mentions a LACKS_GENERIC datacheck code, but nothing actually implements it yet.

There are only 5 results in 4 files project-wide. Not bad.
FL/usaus/fl.us027.wpt: CROld50 (should be WasSt?)
NOR/eure/nor.e18.wpt: *Old70
NOR/eure/nor.e18.wpt: *Old67
NOR/eursf/nor.oslokrimotkri.wpt: *Old70
ON/canon/on.on127.wpt: Old127
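For reference, a sketch of how the expression treats a few shapes. The first two lines are taken from the results above; the OldUS42 and Old42_N lines are hypothetical, with a placeholder path.

```shell
# "file:label" lines: two from the results above, two hypothetical.
labels='ON/canon/on.on127.wpt:Old127
FL/usaus/fl.us027.wpt:CROld50
XX/sys/xx.rte1.wpt:OldUS42
XX/sys/xx.rte1.wpt:Old42_N'

# "Old" followed directly by digits, either at the end of the label or
# just before an underscore suffix.
flagged=$(printf '%s\n' "$labels" | grep -E 'Old[0-9]+_|Old[0-9]+$')
printf '%s\n' "$flagged"
```

OldUS42 passes because a route type sits between "Old" and the number. The pattern is not anchored to the start of the label, which is why CROld50 shows up too.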

Offline neroute2

  • Sr. Member
  • ****
  • Posts: 355
  • Last Login:Yesterday at 10:44:31 pm
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #3 on: November 18, 2019, 08:45:36 am »
Quote
FL/usaus/fl.us027.wpt: CROld50 (should be WasSt?)
It's a numbered county road with Old 50 or Old Hwy 50 in the shield. False positive.

Offline michih

  • TM Collaborator
  • Hero Member
  • *****
  • Posts: 1941
  • Last Login:Yesterday at 09:14:02 am
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #4 on: November 18, 2019, 11:50:36 am »
Quote
NOR/eure/nor.e18.wpt: *Old70
NOR/eure/nor.e18.wpt: *Old67
NOR/eursf/nor.oslokrimotkri.wpt: *Old70

Former location of exit 70 and exit 67. Fine to me.

Offline yakra
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #5 on: November 18, 2019, 01:17:40 pm »
Quote
It's a numbered county road with Old 50 or Old Hwy 50 in the shield. False positive.
Oh wows. I did look for something like that, just not hard enough. :D

Offline rickmastfan67

  • TM Collaborator (A)
  • Hero Member
  • *****
  • Posts: 778
  • Gender: Male
  • Last Login:December 11, 2019, 08:33:13 pm
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #6 on: November 21, 2019, 04:28:56 am »
Quote
It's a numbered county road with Old 50 or Old Hwy 50 in the shield. False positive.
Oh wows. I did look for something like that, just not hard enough. :D

Welcome to the world of insanity in FL.  :P

Offline yakra
Re: Using grep to search highway data, waypoint labels, etc.
« Reply #7 on: November 22, 2019, 01:19:22 am »
Could that be considered a form of bannered route?  ;D