Using grep and shell scripts to search highway data, waypoint labels, etc.

grep is a powerful tool for searching text files with regular expressions. It runs on Unix-like systems such as Linux, FreeBSD (e.g. noreaster) or macOS. With a little practice, you can put together expressions to search for very specific types of text strings. In this post, I'll focus on waypoint labels.

Commands were executed from the hwy_data/ directory.
They're designed so that you can easily substitute different regions into the $rg variable and run the same command multiple times to search different regions:
rg=PA
grep whatever $rg/*/*.wpt
rg=DE
grep whatever $rg/*/*.wpt
et cetera. You can also wrap them up inside loops to search multiple regions in one go:
for rg in PA DE; do grep whatever $rg/*/*.wpt; done
MyRegions='PA DE'
for rg in $MyRegions; do grep whatever $rg/*/*.wpt; done
et cetera.

The cut command was included to strip out AltLabels and URLs from the .wpt file lines, to ensure that we're only searching within primary labels.
If you have questions about the various tokens, expressions or commands, feel free to ask.
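
To see what the grep -v and cut stages do before any label-specific pattern is applied, here is a minimal sketch run against a throwaway data tree. The XX region, the file names, the labels and the coordinates are all invented for illustration; only the pipeline itself comes from the commands in this post.

```shell
# Build a throwaway data tree; every path, label and URL here is invented.
tmp=$(mktemp -d)
mkdir -p "$tmp/XX/sys"
cat > "$tmp/XX/sys/xx.a.wpt" <<'EOF'
US1 http://www.openstreetmap.org/?lat=40.000000&lon=-75.000000
+X123456 http://www.openstreetmap.org/?lat=40.010000&lon=-75.010000
MainSt +OldUS1 http://www.openstreetmap.org/?lat=40.020000&lon=-75.020000
EOF
cat > "$tmp/XX/sys/xx.b.wpt" <<'EOF'
PA3 http://www.openstreetmap.org/?lat=40.030000&lon=-75.030000
EOF

rg=XX
# grep -v '^+' drops hidden points; cut keeps the text before the first space.
# With more than one file, grep prefixes each output line with "path:".
labels=$(cd "$tmp" && grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ')
printf '%s\n' "$labels"
rm -rf "$tmp"
```

The "path:" prefix is why several of the patterns below anchor on a colon to find the start of a label.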

Some example search results in Pennsylvania and Delaware are posted here.

Potential no-underscore malformed city suffixes
grep -v '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep -v '[A-Z][A-Z][0-9]{1,3}Alt|[A-Z][A-Z][0-9]{1,3}Bus|[A-Z][A-Z][0-9]{1,3}Byp|[A-Z][A-Z][0-9]{1,3}Trk|[A-Z][A-Z][0-9]{1,3}His|[A-Z][A-Z][0-9]{1,3}Spr|[A-Z][A-Z][0-9]{1,3}Con' | egrep ':[A-Z]{2}[0-9]{1,3}[A-Z][a-z]{2}'
This one can get confusing.
Edit, 2019-11-18 01:32 EST: Only check primary labels; avoid false negatives.

Intersections with visibly numbered highways

--- Quote ---US40: US40BusWhi | US73: US40BusWhi US40BusTho | A3: A3Zur
Distinguish two different same-bannered same-numbered routes as needed with the 3-letter city abbreviations. Also use the city abbreviation for bannerless same-designation spurs or branches, such as the Zurich A3 spur intersecting the main A3.

--- End quote ---
While we do want to use city abbreviations for routes (in the HB) that have them...

Distinguishing otherwise identical waypoints (not for exit numbers)

--- Quote ---US95: NV57_Ren | NV57_Tah
If more than two points for the same non-exit-numbered cross road are needed, there are two options which can be used in combination with or ignoring the previous options for pairs of identical labels.
1. Use alphabetical suffixes _A, _B, _C, etc.
2. Choose 3-letter suffixes for a nearby towns if they are fairly close. The 3-letter suffix should be the first 3 letters of the town name. Add a suffix with an underscore and those 3 letters.

--- End quote ---
...disambiguating multiple intersections with a single route, or with roads not in the HB, is done with an underscore.
A lot of the time, people may use the former when they intended to use the latter.
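
As a rough illustration of the two-stage filter for no-underscore city suffixes, the sketch below feeds a few invented "file:label" lines through it. The banner list is written with a group, which should be equivalent to the longer alternation in the command above; grep -E is the POSIX spelling of egrep. LC_ALL=C is an addition of mine so that [A-Z] means plain ASCII capitals.

```shell
LC_ALL=C; export LC_ALL
# Invented "file:label" lines, shaped like the output of grep | cut.
sample='PA/usaus/pa.us030.wpt:PA462Yor
PA/usaus/pa.us030.wpt:PA462_Yor
PA/usaus/pa.us030.wpt:US30BusYor
PA/usapa/pa.pa023.wpt:PA23Trk
PA/usapa/pa.pa023.wpt:MainSt'

# Stage 1: drop labels that carry a known banner (Alt, Bus, Byp, ...).
# Stage 2: keep labels where 3 letters directly follow the route number.
flagged=$(printf '%s\n' "$sample" \
  | grep -Ev '[A-Z][A-Z][0-9]{1,3}(Alt|Bus|Byp|Trk|His|Spr|Con)' \
  | grep -E ':[A-Z]{2}[0-9]{1,3}[A-Z][a-z]{2}')
printf '%s\n' "$flagged"
```

Only PA462Yor should come out: the underscored form and the bannered labels are skipped.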

Underscored 4-character city suffixes not ending in N, E, W, or S
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep '_...[A-DF-MO-RTUVXYZa-z]$'
Distinguishing otherwise identical waypoints (not for exit numbers)

--- Quote ---NV88_PitS | NV88_PitN
If you need the same town twice, add a 4th letter that is a direction letter (_PitS and _PitN for southern and northern junctions near Pittston). If the town suffixes are not useful, are confusing, or require further elaboration (3+ junctions with same town), use the alphabetical suffixes instead.

--- End quote ---
Edit, 2019-11-17 20:58 EST: Only check primary labels.
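
A small sketch of how the 4-character suffix pattern behaves, using invented labels. The bracket expression lists every letter except capital N, E, W and S, so a direction letter in the 4th position passes while anything else is reported.

```shell
LC_ALL=C; export LC_ALL
# Invented "file:label" lines for illustration.
sample='NV/usaus/nv.us095.wpt:NV88_PitS
NV/usaus/nv.us095.wpt:NV88_Pitt
NV/usaus/nv.us095.wpt:NV88_Pit
NV/usaus/nv.us095.wpt:NV88_A'

# Underscore, any 3 characters, then a 4th that is not N, E, W or S.
flagged=$(printf '%s\n' "$sample" | grep '_...[A-DF-MO-RTUVXYZa-z]$')
printf '%s\n' "$flagged"
```

Here only NV88_Pitt is reported; _PitS, the 3-letter _Pit and the alphabetical _A are left alone.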

2-character city suffixes
egrep -sv '^\+' $rg/*/*.wpt | cut -f1 -d' ' | grep -v '_U[0-9]$' | grep '_..$'
Edit, 2019-11-17 22:08 EST: Exclude false positives caused by numbered U-turn ramps.

Old highway designation labels
grep -s '^Old..[0-9]' $rg/*/*.wpt
Waypoints for roads that no longer have a name or no longer exist as a road

--- Quote ---MainSt
If the old highway has a posted name or number, use that name or number (don't make one up), and label the waypoint according to the usual rules.
If US 30/Main Street becomes Main Street, then the label is MainSt, not something like OldUS30.
If the old highway has a posted name that mentions the old designation, then applying the above rule will result in a label like OldUS40 for "Old Route 40" if the route was formerly US 40.

--- End quote ---
There will be a good number of false-positive, legitimate labels here.
I'm not sure how closely these rules (if it's a requirement & not an option) have been followed; there are a lot of examples where "OldRt" or "OldHwy", etc. are used if a road is signed as such in the field, or if that's the official name. (Although, Tim may have done it that way before these specific guidelines were codified...) May be something worth discussing in the forum.

Long words, >4 characters
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep ':.*[a-z]{4}'
Intersections with named highways

--- Quote ---FaiRd PapMillRd
For up to two other (specifying) words, truncate the word as follows:
1-4 letters - use whole word
5+ letters - use the first 3 letters. Don't use a made-up abbreviation. Fairchild Road becomes FaiRd, not FrchldRd or FchRd or anything else.

--- End quote ---
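
For the long-word search, a quick sketch with invented labels shows where the cutoff falls: a capital followed by 4 or more lowercase letters is a word of at least 5 characters, which the rule quoted above says should have been cut to 3.

```shell
LC_ALL=C; export LC_ALL
# Invented "file:label" lines for illustration.
sample='PA/usapa/pa.pa100.wpt:FaiRd
PA/usapa/pa.pa100.wpt:FairchildRd
PA/usapa/pa.pa100.wpt:PapMillRd
PA/usapa/pa.pa100.wpt:US30'

# After the colon, look for 4 consecutive lowercase letters anywhere.
flagged=$(printf '%s\n' "$sample" | grep -E ':.*[a-z]{4}')
printf '%s\n' "$flagged"
```

PapMillRd passes because Mill is a whole 4-letter word (one capital plus 3 lowercase letters); only FairchildRd is reported.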

Too many words
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep -v ':[A-Z]*[a-z]*Mc[A-Z][A-Z][a-z]*$|:Mc[A-Z]{2,}[a-z]*[A-Z][a-z]*$|:[A-Z]*Mc[A-Z]+[a-z]?$' | egrep '[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]*|[A-Z]+[a-z]?[A-Z]+[a-z]+[A-Z]+[a-z]+[A-Z]+[a-z]*'
Intersections with named highways

--- Quote ---MarKingBlvd MLKingBlvd
If the cross road name has more than 3 words, use one of two options:
1. Pick out the two most important words besides the road type and use only those: Martin Luther King Boulevard becomes MarKingBlvd. Three words in total are included in shortened form.
2. Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.

--- End quote ---
An older style for 3-word road names was to have two truncated words & one initial, e.g. Lisbon Falls Village Rd -> LisFalVRd.
This was later deprecated in favor of the rule quoted above, which gives LisFalRd, LFVilRd, etc.
Edit, 2019-11-17 21:40 EST: Use egrep for BSD (e.g. noreaster) compatibility due to + token.

grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep 'Mc[A-Z][a-z]'
In the "Too many words" search above, the Mc[A-Z] tokens are included to filter out "Old McDonald Farm Rd" style false positives.
OTOH, McDonald is one word, so it makes sense to truncate it to McD, rather than treat it as 2 for McDon.

Tpke for Turnpike
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | grep 'Tpke'
CHM standardized on "Tpk" for turnpike, standard abbreviations notwithstanding...

Directional prefixes
egrep -sv "^\+|^.[a-z]|^$rg[0-9]+" $rg/*/*.wpt | cut -f1 -d' ' | grep ':[NEWS]' | egrep -v ":[A-Z][A-Z]/[A-Z][A-Z]$|:[NEWS]End$"
Intersections with named highways

--- Quote ---6thSt | 33rdAve | SeeLn
Ignore any non-essential direction specifier. N. 6th St becomes 6thSt. 33rd Avenue SW becomes 33rdAve. W. Seedy Lane becomes SeeLn.
NorPkwy | SouBlvd
But keep directions that are the main part of the road name, such as NorPkwy for Northern Parkway or SouBlvd for Southeast Boulevard.

--- End quote ---
What's a "non-essential direction specifier" and what are "directions that are the main part of the road name" can get wibbly-wobbly at times; use your best judgment. Many of these may be legit, but they're worth checking out.
IMO though, N/E/W/S are common enough abbreviations for their respective directions, and there are enough examples of this throughout the data, that I'm not going to call for every WPondRd to be changed to a WestPondRd.
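
The first stage of the directional-prefix search is the least obvious part, so here is a minimal sketch against an invented NE region (file names, labels and URLs are made up). One detail worth knowing: grep tests its patterns against the line as stored in the file, so the ^ anchors in the first stage see the label itself, while the later stages see the "path:label" text and anchor on the colon instead.

```shell
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
mkdir -p "$tmp/NE/sys"
# Invented sample data for illustration only.
printf '%s\n' 'NE2 http://example.invalid/a' 'NorPkwy http://example.invalid/b' \
  'WPondRd http://example.invalid/c' > "$tmp/NE/sys/ne.a.wpt"
printf '%s\n' 'N6thSt http://example.invalid/d' 'NE/IA http://example.invalid/e' \
  'SEnd http://example.invalid/f' '+X1 http://example.invalid/g' > "$tmp/NE/sys/ne.b.wpt"

rg=NE
# Stage 1 drops hidden points, labels whose 2nd character is lowercase
# (NorPkwy), and the region's own numbered routes (NE2).
# Stage 3 keeps labels starting with N, E, W or S.
# Stage 4 drops border labels (NE/IA) and route ends (SEnd).
flagged=$(cd "$tmp" && grep -Esv "^\+|^.[a-z]|^${rg}[0-9]+" $rg/*/*.wpt \
  | cut -f1 -d' ' | grep ':[NEWS]' \
  | grep -Ev ':[A-Z][A-Z]/[A-Z][A-Z]$|:[NEWS]End$')
printf '%s\n' "$flagged"
rm -rf "$tmp"
```

WPondRd and N6thSt are what remain, which is the kind of label the search is meant to surface for a human to judge.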

Similarly, a search for suffixed directional specifiers:
egrep -sv '^\+' $rg/*/*.wpt | cut -f1 -d' ' | grep '[NEWS]$' | egrep -v '_|[0-9]{1,3}[NEWS]$|:[A-Z][A-Z]/[A-Z][A-Z]$|:CR[A-Z]$|:Rd[A-Z]$|:Ave[A-Z]$|:I\-[0-9]{1,3}BS$'

Edit, 2019-11-17 20:11 EST: Use cut instead of sed to isolate primary labels. More concise, clear, fast.
Edit, 2019-11-17 21:08 EST: Use grep -s to suppress error messages for regions with no WPTs.
Edit, 2019-11-18 01:07 EST: Use egrep for more human-readable expression strings, particularly with | and {}
Edit, 2019-11-18 01:53 EST: Exclude hidden points from all searches.

Rte(Number) style labels on US Interstates
for file in */usai/*.wpt; do results=`grep -H . $file | cut -f1 -d' ' | grep 'I-[0-9][0-9]*(..*)' | wc -l`; if [ $results -gt 2 ]; then grep -H . $file | cut -f1 -d' ' | grep 'I-[0-9][0-9]*(..*)'; fi; done
One point of confusion can be the two styles of waypoint labeling for concurrencies with Interstate Highways and other exit numbered routes.

Interchanges on exit-numbered highways

--- Quote ---I-80: 56 | 87(75) | 89(75) | 63
In multiplexes where the concurrency uses exit numbers from the other highway, put the highway number in parentheses. Drop the letter prefix of the concurrent highway if it is more than one character long: I-75 becomes (75). A5 can stay as (A5).

--- End quote ---
The vast majority of Interstates (with rare exceptions, most obviously WY I-180) will be exit-numbered routes, and thus fall into this category.

Waypoint labels for multiplexes

--- Quote ---US90: LA76 | I-49(45) | I-49(47) | I-49(52) | US40
For non-exit-numbered routes concurrent with a numbered, exit-numbered route, use the concurrent highway designation with the exit numbers in parentheses.

--- End quote ---
We'll sometimes see this erroneously; this basic (Interstate-to-Interstate only) search is designed to pick up on these cases.
It only prints results for routes with at least 3 such labels: Some routes, such as NY I-490 or ME I-295, have both ends at unnumbered interchanges with the parent route, necessitating this style label. Some routes, such as TX I-69, are not fully exit-numbered yet, and contain some crossroad-style labels.
Right now, it reports some malformed labels in ms.i055.wpt & wi.i041.wpt.
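
The "at least 3 such labels" threshold can be sketched as below. This is a restructured variant of the loop above, not a replacement for it: it collects the matches once and counts them, instead of running the pipeline twice. The two sample files and their labels are invented.

```shell
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
# Invented sample files: a.wpt has three I-nn(exit) labels, b.wpt only two.
printf '%s\n' 'I-90(1) u1' 'I-90(2) u2' 'I-90(3) u3' '5 u4' > "$tmp/a.wpt"
printf '%s\n' 'I-90(1) u1' 'I-90(2) u2' '7 u3' > "$tmp/b.wpt"

report=$(cd "$tmp" && for file in *.wpt; do
  # Collect the matching labels once, then count them.
  matches=$(grep -H . "$file" | cut -f1 -d' ' | grep 'I-[0-9][0-9]*(..*)')
  count=$(printf '%s\n' "$matches" | grep -c .)
  # Print only when there are at least 3 such labels.
  if [ "$count" -gt 2 ]; then printf '%s\n' "$matches"; fi
done)
printf '%s\n' "$report"
rm -rf "$tmp"
```

Only a.wpt is reported; b.wpt stays quiet because two such labels can be legitimate, as with the routes that end at unnumbered interchanges.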

How far down this rabbit hole can we go?
An experiment. A somewhat unwieldy one...
rm -f exit_numbers.log looks_unnumbered.log looks_numbered.log
touch exit_numbers.log looks_unnumbered.log looks_numbered.log
for rg in `ls -F | grep '/$' | cut -f1 -d/`; do
  for sys in `ls -F $rg | grep '/$' | cut -f1 -d/`; do
    for file in `ls $rg/$sys | grep '\.wpt$'`; do
      # This gets messed up by files with spaces in their names, and ignores them.
      if [ -r $rg/$sys/$file ]; then
        LooksNumbered=`cut -f1 -d' ' $rg/$sys/$file | grep -c -m 1 '^[0-9][0-9]*$'`
        if [ $LooksNumbered == 0 ]; then
          results=`cut -f1 -d' ' $rg/$sys/$file | grep '^\*\?[0-9][0-9]*[A-Z]\?([0-9A-Za-z][0-9A-Za-z]*)' | wc -l`
          if (( $results != 0 && $results != `cat $rg/$sys/$file | wc -l` )); then
            echo -e "\n$rg/$sys/$file looks like a non-exit-numbered route" \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
            cut -f1 -d' ' $rg/$sys/$file \
            | grep '^\*\?[0-9][0-9]*[A-Z]\?([0-9A-Za-z][0-9A-Za-z]*)' \
            | tee -a exit_numbers.log | tee -a looks_unnumbered.log
          fi
        fi
        if [ $LooksNumbered == 1 ]; then
          results=`cut -f1 -d' ' $rg/$sys/$file | grep '^\*\?[A-Z][A-Z]*[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' | wc -l`
          if [ $results -gt 2 ]; then
            echo -e "\n$rg/$sys/$file looks like an exit-numbered highway" \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
            cut -f1 -d' ' $rg/$sys/$file \
            | grep '^\*\?[A-Z][A-Z]*[A-Za-z]*-\?[0-9]*[A-Za-z]*([A-Z]*[0-9]*[A-Z]*)$' \
            | tee -a exit_numbers.log | tee -a looks_numbered.log
          fi
        fi
      fi
    done
  done
done
lbc=`cut -f1 -d' ' exit_numbers.log | grep -v '\.wpt$' | wc -w`
rtc=`cut -f1 -d' ' exit_numbers.log | grep '\.wpt$' | wc -l`
rgc=`cut -f1 -d' ' exit_numbers.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
lbw=`echo $lbc | wc -c`
rtw=`echo $rtc | wc -c`
rgw=`echo $rgc | wc -c`
printf   'looks   numbered: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' \
   `cut -f1 -d' ' looks_numbered.log | grep -v '\.wpt$' | wc -w` \
   `cut -f1 -d' ' looks_numbered.log | grep '\.wpt$' | wc -l` \
   `cut -f1 -d' ' looks_numbered.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
printf   'looks unnumbered: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' \
   `cut -f1 -d' ' looks_unnumbered.log | grep -v '\.wpt$' | wc -w` \
   `cut -f1 -d' ' looks_unnumbered.log | grep '\.wpt$' | wc -l` \
   `cut -f1 -d' ' looks_unnumbered.log | grep '\.wpt$' | cut -f1 -d/ | uniq | wc -l`
printf   '        combined: %'"$lbw"'d labels in %'"$rtw"'d routes in %'"$rgw"'d regions\n' $lbc $rtc $rgc
This originally worked on my Linux machines but not on noreaster, for reasons I hadn't looked into. That's fixed now.
looks   numbered:    600 labels in   58 routes in   26 regions
looks unnumbered:  12425 labels in  703 routes in  102 regions
        combined:  13025 labels in  761 routes in  112 regions
...Back away slowly.
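
On the comment about file names with spaces: one possible workaround, which is not what the script above does, is to let the shell's own glob produce the file list and to quote every expansion. The sketch below only reproduces the "looks numbered" test (does any label consist solely of digits?) on two invented files, one of which has a space in its name.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/XX/sys"
# Invented sample files; one name contains a space on purpose.
printf '%s\n' '1 u1' '2 u2' > "$tmp/XX/sys/with space.wpt"
printf '%s\n' 'US1 u1' 'MainSt u2' > "$tmp/XX/sys/plain.wpt"

numbered=$(cd "$tmp" && for file in */*/*.wpt; do
  # Skip anything unreadable, including an unmatched literal pattern.
  [ -r "$file" ] || continue
  # A label made only of digits suggests an exit-numbered route.
  if cut -f1 -d' ' "$file" | grep -q '^[0-9][0-9]*$'; then
    printf '%s\n' "$file"
  fi
done)
printf '%s\n' "$numbered"
rm -rf "$tmp"
```

Because the glob hands each path to the loop as a single word, the file with the space is handled like any other.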

Label lacks generic highway type
grep -sv '^+' $rg/*/*.wpt | cut -f1 -d' ' | egrep 'Old[0-9]+_|Old[0-9]+$'
Label should be OldUS42, OldHwy42, OldRte42, OldME42, etc., not "Old42". This was a datacheck on CHM, but although a comment describing the siteupdate program's DatacheckEntry class mentions a LACKS_GENERIC datacheck code, there's nothing actually implementing it yet.

There are only 4 results project-wide. Not bad.
FL/usaus/fl.us027.wpt: CROld50 (should be WasSt?)
NOR/eure/nor.e18.wpt: *Old70
NOR/eure/nor.e18.wpt: *Old67
NOR/eursf/nor.oslokrimotkri.wpt: *Old70
ON/canon/on.on127.wpt: Old127
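
For reference, a small sketch of what the Old[0-9] pattern does and does not catch, using invented "file:label" lines. Note that it matches anywhere in the label, so prefixed or starred labels are reported as well.

```shell
LC_ALL=C; export LC_ALL
# Invented "file:label" lines for illustration.
sample='XX/sys/xx.a.wpt:OldUS42
XX/sys/xx.a.wpt:Old42
XX/sys/xx.a.wpt:Old42_N
XX/sys/xx.a.wpt:CROld50
XX/sys/xx.a.wpt:OldHwy42
XX/sys/xx.a.wpt:*Old70'

# "Old" directly followed by digits, then either an underscore or end of label.
flagged=$(printf '%s\n' "$sample" | grep -E 'Old[0-9]+_|Old[0-9]+$')
printf '%s\n' "$flagged"
```

OldUS42 and OldHwy42 pass because a route type sits between "Old" and the number.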


--- Quote from: yakra on November 18, 2019, 03:16:35 am ---FL/usaus/fl.us027.wpt: CROld50 (should be WasSt?)

--- End quote ---
It's a numbered county road with Old 50 or Old Hwy 50 in the shield. False positive.


--- Quote from: yakra on November 18, 2019, 03:16:35 am ---NOR/eure/nor.e18.wpt: *Old70
NOR/eure/nor.e18.wpt: *Old67
NOR/eursf/nor.oslokrimotkri.wpt: *Old70
--- End quote ---

Former location of exit 70 and exit 67. Fine to me.

