Topic: Using grep to search highway data, waypoint labels, etc.

cvoight
« Reply #15 on: January 16, 2020, 01:52:09 pm »
Quote
RayWinSP, GreSmoNP
Highway ends at park: For ends at a park or other non-commercial endpoint, abbreviate the park name as if it were a named highway (see rules above). Use NP for national park, PP for provincial park, SP for state park.

First let's establish some baseline numbers. Park suffixes should be preceded by at least one lowercase letter and followed by a space and at some point an OSM URL. This screens out borders with Spain (ESP) and any wpt files that aren't true wpt files (some stuff in ITA).
Code: [Select]
PS ..\hwy_data> (Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[a-z]+NP\s.*http").length
75
PS ..\hwy_data> (Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[a-z]+PP\s.*http").length
163
PS ..\hwy_data> (Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[a-z]+SP\s.*http").length
225
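For anyone without PowerShell, the same kind of count can be sketched in Python. This is only a minimal illustration, not anything from the TM codebase: `count_suffix` is a made-up helper, the sample lines are hypothetical (the ESP coordinates are invented), and the pattern is the one used in the commands above.

```python
import re

def count_suffix(lines, suffix):
    """Count lines where lowercase letter(s) + suffix are followed by whitespace and a URL."""
    pattern = re.compile(r"[a-z]+" + re.escape(suffix) + r"\s.*http")
    return sum(1 for line in lines if pattern.search(line))

# Hypothetical sample lines in .wpt layout: label, then an OSM URL.
sample = [
    "BigSurSP http://www.openstreetmap.org/?lat=36.252768&lon=-121.787332",
    "ESP http://www.openstreetmap.org/?lat=42.0&lon=-1.0",
    "WhiPtPP http://www.openstreetmap.org/?lat=54.908111&lon=-122.933021",
]

# The border label ESP is screened out because no lowercase letter precedes SP.
print(count_suffix(sample, "SP"))
```

To run it over the real data you would feed it the lines of every *.wpt file under hwy_data, for example by walking the tree with `pathlib.Path.rglob`.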

In theory the US has no provincial parks (PP), so let's see whether any waypoint labels in US regions use the PP suffix:
Code: [Select]
PS ..\hwy_data> (Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[a-z]+PP\s.*http").length
11
A surprising number, but on investigation they are all MOsPP: waypoints for Missouri supplemental routes, which are interesting. [brief, old forums discussion]

We can screen these out by requiring two or more lowercase letters [a-z]{2,} before a NP/PP/SP suffix and see what happens. NP stays at 75, SP decreases to 219 (-6), and PP decreases to 151 (-12). The 11 MOsPP labels account for all but one of the PP results that dropped out, so let's see if we can find the odd one out:
Code: [Select]
PS ..\hwy_data> Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:AB|BC|MB|N[BLST]|ON|PE|QC|SK|YT)$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[A-Z][a-z]{1}PP\s.*http"
BC\canbc\bc.bc097.wpt:284:WhiPtPP http://www.openstreetmap.org/?lat=54.908111&lon=-122.933021
It turns out that BC97 has many waypoints with the suffix PP, which is not something the style guidelines mention.

To see what's going on, we could attempt to construct a search query that finds all the NP/PP/SP waypoints that aren't in the first or last line of a file. But it's a lot easier to conduct a cursory perusal of the results for SP and PP and look for files with more than two hits since you can't have more than two ends in a highway wpt file. What we find is that this nomenclature is widespread, and may merit a specific mention in the style guidelines.
Code: [Select]
PS ..\hwy_data> Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "[a-z]{2,}SP\s.*http"
[...]
CA\usaca\ca.ca001.wpt:7:CryCoveSP http://www.openstreetmap.org/?lat=33.564211&lon=-117.826000
CA\usaca\ca.ca001.wpt:183:HarHeaSP http://www.openstreetmap.org/?lat=35.478241&lon=-120.991826
CA\usaca\ca.ca001.wpt:187:SanSimSP http://www.openstreetmap.org/?lat=35.593590&lon=-121.124847
CA\usaca\ca.ca001.wpt:201:LimBeaSP http://www.openstreetmap.org/?lat=36.008557&lon=-121.518209
CA\usaca\ca.ca001.wpt:205:JulPfeSP http://www.openstreetmap.org/?lat=36.158883&lon=-121.670644
CA\usaca\ca.ca001.wpt:208:BigSurSP http://www.openstreetmap.org/?lat=36.252768&lon=-121.787332
CA\usaca\ca.ca001.wpt:209:AndMolSP http://www.openstreetmap.org/?lat=36.288382&lon=-121.844172
CA\usaca\ca.ca001.wpt:353:RusGulSP http://www.openstreetmap.org/?lat=39.329785&lon=-123.804808
[...]
OR\usaus\or.us026.wpt:2:KloCreSP http://www.openstreetmap.org/?lat=45.921155&lon=-123.895140
OR\usaus\or.us026.wpt:15:HumCreRd +SunHwySP http://www.openstreetmap.org/?lat=45.889471&lon=-123.625449
OR\usaus\or.us026.wpt:177:OchLakeSP http://www.openstreetmap.org/?lat=44.307679&lon=-120.698118
OR\usaus\or.us026.wpt:228:ClyHolSP http://www.openstreetmap.org/?lat=44.417344&lon=-119.090466
[...]
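The harder query mentioned above (park-suffixed labels that are not on the first or last line of a file) is awkward as a one-line regex but short in a script. A rough sketch, assuming one waypoint per non-blank line with the primary label as the first field; `interior_park_labels` and the sample file contents are hypothetical, and line numbers here count non-blank lines only.

```python
import re

# Two or more lowercase letters before the NP/PP/SP suffix, as in the searches above.
PARK = re.compile(r"^\S*[a-z]{2,}(?:NP|PP|SP)$")

def interior_park_labels(wpt_lines):
    """Return (line_number, label) for park-suffixed primary labels that are not route ends."""
    lines = [ln for ln in wpt_lines if ln.strip()]
    hits = []
    for number, line in enumerate(lines, start=1):
        if number in (1, len(lines)):
            continue  # first and last waypoints may legitimately be "ends at park"
        label = line.split()[0]
        if PARK.match(label):
            hits.append((number, label))
    return hits

# Hypothetical file contents; URLs shortened to placeholders.
sample = [
    "KloCreSP http://www.openstreetmap.org/?lat=0&lon=0",
    "US101 http://www.openstreetmap.org/?lat=0&lon=0",
    "OchLakeSP http://www.openstreetmap.org/?lat=0&lon=0",
    "ClyHolSP http://www.openstreetmap.org/?lat=0&lon=0",
    "I-84 http://www.openstreetmap.org/?lat=0&lon=0",
]
print(interior_park_labels(sample))
```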
« Last Edit: January 16, 2020, 01:55:02 pm by cvoight »

michih
« Reply #16 on: January 16, 2020, 02:42:47 pm »
@cvoight, thanks for your effort.

We have automated checks which are reported and updated with every site update: http://travelmapping.net/devel/datacheck.php

I just wonder why some are not detected, e.g.:

Code: [Select]
NLD\nlda\nld.a325.wpt:2:Elden http://www.openstreetmap.org/?lat=51.958363&lon=5.888801
NLD\nlda\nld.a325.wpt:3:BurgMatsl http://www.openstreetmap.org/?lat=51.950772&lon=5.878973

@yakra? Any idea? Should I open a GitHub issue?

Quote
not yakra, but I did peruse siteupdate.py, specifically the results for datacheckerrors.append which I believe is a comprehensive accounting of all the data checks. Generally, the waypoint label checks currently implemented don't test for consistency with the labeling style guidelines.

True, we just check long underscores.

cvoight
« Reply #17 on: January 20, 2020, 03:42:40 pm »
A variation on potential no-underscore malformed city suffixes from the OP.

Quote
US40: US40BusWhi; US73: US40BusWhi, US40BusTho; A3: A3Zur
Intersections with visibly numbered highways: Distinguish two different same-bannered same-numbered routes as needed with the 3-letter city abbreviations. Also use the city abbreviation for bannerless same-designation spurs or branches, such as the Zurich A3 spur intersecting the main A3.

Start by identifying some candidates for inclusion: find all the US primary waypoint labels that have a non-underscored suffix where 1 uppercase character is followed by 3 or more lowercase characters (38 results). (Similar idea holds across the project, it's just a lot easier to focus on the US for posting results.)

Code: [Select]
PS ..\hwy_data> Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "^[^\s]+[0-9]+[A-Z][a-z]{3,}[A-Z]*\s.*http"

--AL\usaal\al.al025.wpt:13:CR16Hale http://www.openstreetmap.org/?lat=32.600940&lon=-87.592294
--AL\usaal\al.al025.wpt:27:CR49Hale http://www.openstreetmap.org/?lat=32.888960&lon=-87.426410
--AL\usaal\al.al025.wpt:30:CR16Bibb http://www.openstreetmap.org/?lat=32.903288&lon=-87.313542
--AL\usaal\al.al051.wpt:13:CR11Dale http://www.openstreetmap.org/?lat=31.577896&lon=-85.735438
--AL\usaal\al.al053.wpt:13:TN7Truck http://www.openstreetmap.org/?lat=34.991991&lon=-86.844633
AL\usaus\al.us011.wpt:6:CR2York http://www.openstreetmap.org/?lat=32.482895&lon=-88.312300
AL\usaus\al.us011.wpt:7:CR19York http://www.openstreetmap.org/?lat=32.487959&lon=-88.299555
AL\usaus\al.us011.wpt:16:CR20Epes http://www.openstreetmap.org/?lat=32.689060&lon=-88.128548
AL\usaus\al.us011.wpt:17:CR21Epes http://www.openstreetmap.org/?lat=32.689755&lon=-88.122915
AL\usaus\al.us011.wpt:99:AL75Conn http://www.openstreetmap.org/?lat=33.587713&lon=-86.699638
AL\usaus\al.us011.wpt:135:AL35Conn http://www.openstreetmap.org/?lat=34.430120&lon=-85.732729
AL\usaus\al.us029.wpt:21:CR25Rome http://www.openstreetmap.org/?lat=31.141818&lon=-86.668611
AL\usaus\al.us084.wpt:109:CR31Dale http://www.openstreetmap.org/?lat=31.268918&lon=-85.673044
AL\usaus\al.us098.wpt:12:I-65Ramps http://www.openstreetmap.org/?lat=30.707093&lon=-88.122593
--AR\usaar\ar.ar125.wpt:18:Hwy125Park http://www.openstreetmap.org/?lat=36.490224&lon=-92.778170
--AR\usaar\ar.ar300.wpt:8:Hwy300Spur http://www.openstreetmap.org/?lat=34.937503&lon=-92.586571
KS\usaus\ks.us183.wpt:53:17Terr http://www.openstreetmap.org/?lat=39.390206&lon=-99.296091
MN\usamn\mn.mn371.wpt:48:CR38Cass http://www.openstreetmap.org/?lat=47.121790&lon=-94.613078
MS\usaus\ms.us051.wpt:87:I55Front http://www.openstreetmap.org/?lat=32.405110&lon=-90.145912
MS\usaus\ms.us084.wpt:26:MS184Mead http://www.openstreetmap.org/?lat=31.465422&lon=-90.906458
MS\usaus\ms.us084.wpt:30:MS184Bude http://www.openstreetmap.org/?lat=31.469521&lon=-90.841870
MS\usaus\ms.us084.wpt:119:MS184Cleo http://www.openstreetmap.org/?lat=31.700203&lon=-89.037237
NC\usanc\nc.nc159.wpt:4:NC159Spur http://www.openstreetmap.org/?lat=35.633211&lon=-79.782693
NC\usaus\nc.us001.wpt:80:US158/1Conn http://www.openstreetmap.org/?lat=36.364915&lon=-78.361954
NC\usaus\nc.us023.wpt:16:US23Conn http://www.openstreetmap.org/?lat=35.381729&lon=-83.222433
NC\usaus\nc.us070.wpt:161:US70Conn http://www.openstreetmap.org/?lat=36.083329&lon=-79.148244
NC\usaus\nc.us074.wpt:36:US23Conn http://www.openstreetmap.org/?lat=35.381729&lon=-83.222433
NC\usaus\nc.us301.wpt:30:US301Conn http://www.openstreetmap.org/?lat=35.121901&lon=-78.763411
NC\usausb\nc.us019trkway.wpt:11:US23Conn http://www.openstreetmap.org/?lat=35.381729&lon=-83.222433
NC\usausb\nc.us023bussyl.wpt:2:US23Conn http://www.openstreetmap.org/?lat=35.373978&lon=-83.225909
NC\usausb\nc.us064trkhen.wpt:10:US23Conn http://www.openstreetmap.org/?lat=35.381729&lon=-83.222433
NC\usausb\nc.us220busash.wpt:8:US220Conn http://www.openstreetmap.org/?lat=35.736951&lon=-79.808335
NE\usane\ne.ne008.wpt:52:645Blvd http://www.openstreetmap.org/?lat=40.058998&lon=-95.720047
NE\usaus\ne.us030.wpt:137:15Blvd http://www.openstreetmap.org/?lat=41.452731&lon=-96.626269
OH\usaus\oh.us052.wpt:93:US23Spur http://www.openstreetmap.org/?lat=38.486542&lon=-82.638752
SC\usasc\sc.sc133.wpt:2:12MileRA http://www.openstreetmap.org/?lat=34.704488&lon=-82.833384
VA\usava\va.va365.wpt:3:VA365East http://www.openstreetmap.org/?lat=36.957932&lon=-81.072166
WA\usawa\wa.wa100.wpt:5:WA100Spur +BeaHolSP http://www.openstreetmap.org/?lat=46.289990&lon=-124.056281

Some of these are in preview (prefixed with --). Some are mystifying, and I simply have no clue whether there is an issue, e.g. the AL US routes. For others, a quick analysis shows whether they are consistently formatted across the project or are outliers, e.g. Conn and Spur. (Turns out they are outliers.)
Code: [Select]
PS ..\hwy_data> (Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "^[^\s]+[0-9]+Con\s.*http").length
300 (Con suffix)
PS ..\hwy_data> (Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "^[^\s]+[0-9]+Conn\s.*http").length
11 (Conn suffix)

PS ..\hwy_data> (Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "^[^\s]+[0-9]+Spr\s.*http").length
342 (Spr suffix)
PS ..\hwy_data> (Get-ChildItem -Directory | Where-Object {$_.name -Match "^(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOPST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"} | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -CaseSensitive -Pattern "^[^\s]+[0-9]+Spur\s.*http").length
4 (Spur suffix)
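The four separate counts can also be collected in one pass. A small sketch in Python, assuming the labels have already been extracted as the first field of each line; `tally_suffixes` is a made-up name, and US70Con and TX6Spr are hypothetical labels added only to exercise each branch.

```python
import re
from collections import Counter

# Route number followed by one of the four suffix spellings compared above.
SUFFIX = re.compile(r"^\S+[0-9]+(Con|Conn|Spr|Spur)$")

def tally_suffixes(labels):
    """Count how often each spelling of the connector/spur suffix appears."""
    counts = Counter()
    for label in labels:
        match = SUFFIX.match(label)
        if match:
            counts[match.group(1)] += 1
    return counts

labels = ["US23Conn", "US70Con", "NC159Spur", "TX6Spr", "US301Conn", "CR16Hale"]
print(tally_suffixes(labels))
```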

yakra
« Reply #18 on: March 15, 2020, 11:29:56 pm »
Quote
Slightly more complicated: find all the waypoints in the US that start with I followed directly by a number. [Note: this is not quite restricted to the US as there are some non-US regions with highway directories that contain the string "usa", but it eliminates most regions and (currently) all false positives. See the code section in this post for a regex that matches only US states and territories.]

I have a git stash on my desktop implementing an INTERSTATE_NO_HYPHEN datacheck in the C++ siteupdate program. Haven't tackled Python yet.
Code: [Select]
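inline void Waypoint::interstate_no_hyphen(DatacheckEntryList *datacheckerrors)
{   // Only US routes are checked; the size test keeps c[1] within the label.
    if (route->system->country->first == "USA" && label.size() >= 2)
    {   const char *c = label.data();
        // Skip a leading "To" so that ToI95 is treated like I95.
        if (c[0] == 'T' && c[1] == 'o') c += 2;
        // Flag an 'I' followed directly by a digit, i.e. no hyphen.
        if (c[0] == 'I' && isdigit(c[1]))
            datacheckerrors->add(route, label, "", "", "INTERSTATE_NO_HYPHEN", "");
    }
}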
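For comparison, the same test might look something like this in Python. This is only a sketch of the logic, not the siteupdate.py implementation: the function name and signature (a country code and a label passed in directly, a boolean returned) are assumptions for illustration, whereas the C++ version is a Waypoint method that appends to the datacheck list.

```python
def interstate_no_hyphen(country, label):
    """Return True if a US label looks like an Interstate written without a hyphen."""
    if country != "USA" or len(label) < 2:
        return False
    # Skip a leading "To" so that ToI95 is treated like I95.
    rest = label[2:] if label.startswith("To") else label
    # 'I' followed directly by an ASCII digit, mirroring isdigit() in the C++ code.
    return len(rest) >= 2 and rest[0] == "I" and rest[1] in "0123456789"

for example in ("I95", "I-95", "ToI95"):
    print(example, interstate_no_hyphen("USA", example))
```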

Quote
First, find all waypoints with a capital letter followed by 4 or more lowercase letters
We have automated checks which are reported and updated with every site update: http://travelmapping.net/devel/datacheck.php
I just wonder why some are not detected, e.g.:
Code: [Select]
NLD\nlda\nld.a325.wpt:2:Elden http://www.openstreetmap.org/?lat=51.958363&lon=5.888801
NLD\nlda\nld.a325.wpt:3:BurgMatsl http://www.openstreetmap.org/?lat=51.950772&lon=5.878973
@yakra? Any idea? Should I open a GitHub issue?

I implemented a LABEL_LONG_WORD datacheck on my own TM mirror (not currently online). Didn't yet merge the changes back into the TravelMapping/DataProcessing:master -- as cvoight records, there are still plenty of results. (More below...)
Probably should though  ;), if we're now expecting to see this datacheck, and not seeing it causes confusion.

Quote
executed from the hwy_data directory: Get-ChildItem -Include "*.wpt" -Recurse
...
Scanning this file, I filter the results: delete many odd results from ITA\ita.nsa (2).wpt; delete Minnesota subway lines (MN\usamnmt\BlueLine.wpt & MN\usamnmt\GreenLine.wpt);

How about Get-ChildItem -Include "*/*/*.wpt" ?

Quote
Select-String -CaseSensitive -Pattern "[A-Z][a-z]{4,}"

Or even just "[a-z]{4,}"

Quote
all lines that contain a + [presumably waypoint alias] -- using Sublime Text, execute a regex search for ^.*\+.*$, select find all, and press delete twice;

We still want to search before the + sign though. Hence the | cut -f1 -d' ' on Unix-like systems.
Sounds like you know this per later posts though. :)
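To make the "search before the + sign" point concrete, here is a small Python sketch that keeps only the primary label (the first space-separated field, like cut -f1 -d' ') and then applies the long-word pattern. The helper names are made up, and the "US1 +Springfield" line is a hypothetical example; the two NLD labels are the ones quoted above.

```python
import re

LONG_WORD = re.compile(r"[A-Z][a-z]{4,}")

def primary_label(wpt_line):
    """First space-separated field, the same thing cut -f1 -d' ' would keep."""
    return wpt_line.split(" ", 1)[0]

def has_long_word(wpt_line):
    """True if the primary label contains a capital followed by 4+ lowercase letters."""
    return bool(LONG_WORD.search(primary_label(wpt_line)))

lines = [
    "Elden http://www.openstreetmap.org/?lat=51.958363&lon=5.888801",
    "BurgMatsl http://www.openstreetmap.org/?lat=51.950772&lon=5.878973",
    "US1 +Springfield http://www.openstreetmap.org/?lat=0&lon=0",
]
for line in lines:
    print(primary_label(line), has_long_word(line))
```

The third line is not flagged, because the long word only appears in the alternate label after the + sign.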

Quote
Parenthesizing your Select-String query and appending .length allows you to find the total number of results. For example, to find the number of waypoint labels with a US route as the primary highway (31,265):

Code: [Select]
PS ..\hwy_data> (Get-ChildItem -Directory "*usa*" -Recurse | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -Pattern "^US").length
31265

In Unix, we'd pipe the results to wc and do a line count:
grep '^US' */usa*/*.wpt | wc -l

I like the discussion on searching for extraneous highway prefixes. Did I experiment with a regex for this at some point? OK, yes. But I hadn't hit the mark yet. Now working with
grep -v '^+' */*/*.wpt | cut -f1 -d' ' | egrep -iv ':\*?[A-Z]+/[A-Z]+$' | egrep ':.*/[A-Za-z\-]{2,}[0-9]'
we get 123 results. I don't see anything that should be a false positive, other than arguably the Texas entries; I think the original idea behind those was to consider the "Lp" or "Spr" part of the designation & not part of the prefix, i.e. TX95 + TXLp363 = drop the 2nd "TX" = TX95/Lp363. Whatever the case though, I don't like those and plan on fixing them.
We can expand on this a little bit and drop the requirement for numeral(s) after the slash:
grep -v '^+' */*/*.wpt | cut -f1 -d' ' | grep ':.*[0-9]' | egrep -i ':.*/[A-Z\-]{2,}' | egrep -v '/[A-Z]{2}$'
This brings some local road names, named roads, and lettered county routes with extraneous "CR" back into the mix. Looks like these wouldn't be FPs either:
http://travelmapping.net/devel/manual/wayptlabels.php#slash
http://travelmapping.net/devel/manual/wayptlabels.php#dropnamed
Code: [Select]
AL/usaal/al.al013.wpt:US72/ALT72
AL/usaal/al.al017.wpt:US72/ALT72
AL/usaus/al.us043.wpt:US72/ALT72
AL/usaus/al.us072.wpt:US43/ALT72
AL/usaus/al.us090.wpt:I10/US98_W
AL/usaus/al.us090.wpt:I10/US98_E
AL/usaus/al.us098.wpt:I10/US90_W
AL/usaus/al.us098.wpt:I10/US90_E
BC/canbc/bc.bc003.wpt:BC93/BC95
BC/canbc/bc.bc093.wpt:BC3/BC95
BC/canbc/bc.bc095.wpt:BC3/BC93
CA/usaush/ca.us099hisind.wpt:1stSt/Hef
CHE/eurtr/che.gts.wpt:H10/AutGalsIns
CO/usaco/co.co030.wpt:6th/HavSt
CO/usaus/co.us085.wpt:I-25/I-70
CZE/eure/cze.e442.wpt:I30/II613
CZE/eure/cze.e461.wpt:I43/II640
CZE/eure/cze.e461.wpt:I42/II640
CZE/eure/cze.e551.wpt:I34/II634
CZE/eure/cze.e65.wpt:MO/II243
DEU-BY/deub/deuby.b013.wpt:St2256/St2419
DEU-BY/deub/deuby.b022bam.wpt:St2271/St2450
DEU-BY/eurtr/deuby.romstr.wpt:B466/St2212
DEU-BY/eurtr/deuby.romstr.wpt:B300/St2051
DEU-BY/eurtr/deuby.romstr.wpt:B17/St2014
DEU-BY/eurtr/deuby.romstr.wpt:B23/St2059
DEU-BY/eurtr/deuby.romstr.wpt:B17/St2059
DEU-BY/eurtr/deuby.romstrwur.wpt:B8/St2300
DNK/eurtr/dnk.mraal.wpt:PR55/SR180
DNK/eurtr/dnk.mresb.wpt:PR11/SR175
DNK/eurtr/dnk.mresb.wpt:E20/PR24
DNK/eurtr/dnk.mresb.wpt:PR15/SR181
DNK/eurtr/dnk.mrnyk.wpt:PR9/SR297
DNK/eurtr/dnk.mrnyk.wpt:PR9/SR289
DNK/eurtr/dnk.mrode.wpt:PR16/SR563
DNK/eurtr/dnk.mrode.wpt:PR21/SR563
DNK/eurtr/dnk.mrode.wpt:PR52/SR451
DNK/eurtr/dnk.mrode.wpt:PR8/SR167
DNK/eurtr/dnk.mrsja.wpt:SR231/PR57
DNK/eurtr/dnk.mrsja.wpt:SR207/SR211
DNK/eurtr/dnk.mrsja.wpt:SR211/PR16
DNK/eurtr/dnk.mrsja.wpt:PR16/SR205
DNK/eurtr/dnk.mrsja.wpt:PR14/SR155
DNK/eurtr/dnk.mrsve.wpt:SR305/PR9
DNK/eurtr/dnk.mrvib.wpt:PR26/SR186
DNK/eurtr/dnk.mr.wpt:PR26/SR545
DNK/eurtr/dnk.mr.wpt:PR26/SR181
DNK/eurtr/dnk.mr.wpt:E39/SR597
DNK/eurtr/dnk.mr.wpt:E45/SR585
ESP-AN/espa/espan.ca035.wpt:AP4/CA32
ESP-CM/espa/espcm.ap041.wpt:A40/TO22
ESP-CM/espn/espcm.n403.wpt:TO21/CM40
ESP-MC/espmc/espmc.rm015.wpt:A7/MU30
FRA-GES/fragesd55/frages.d069455.wpt:N135/NVS
FRA-IDF/eure/fraidf.e15.wpt:A3/BlvdPer
FRA-IDF/eure/fraidf.e15.wpt:A6b/BlvdPer
FRA-IDF/eure/fraidf.e50.wpt:A6b/BlvdPer
FRA-IDF/eure/fraidf.e50.wpt:A4/BlvdPer
FRA-IDF/eure/fraidf.e5.wpt:A13/BlvdPer
FRA-IDF/eure/fraidf.e5.wpt:A6a/BlvdPer
IDN/asiahr/idn.ah152.wpt:N3/JTP
IND-PB/index/indpb.acexpy.wpt:NH5/NH7
IN/usain/in.in001.wpt:I-69/I-469
ISL/islth/isl.th041.wpt:TH45/TH429
ITA/eure/ita.e78.wpt:SS73/SS715
ITA/eure/ita.e80.wpt:A91/AGRA
ITA/eure/ita.e80.wpt:A24/AGRA
MA/usama/ma.ma002acam.wpt:US3/MA3
MEX-JAL/mexdn/mexjal.mex080d.wpt:MEX90D/GUA10D
MEX-JAL/mexdn/mexjal.mex090d.wpt:MEX80D/GUA10D
NC/usanc/nc.nc087.wpt:US311/NC770
NC/usanc/nc.nc308trkwin.wpt:US13/US17_S
NC/usaus/nc.us017.wpt:US13/US17BypWin_S
NC/usaus/nc.us158.wpt:US29Bus/NC87
NC/usaus/nc.us301.wpt:NC43/NC48_S
OH/usaoh/oh.oh012.wpt:OH115/OH189
OH/usaoh/oh.oh115.wpt:OH12/OH189
OH/usaoh/oh.oh189.wpt:OH12/OH115
OH/usaus/oh.us023.wpt:OH32/OH124
ON/canon/on.on058tho.wpt:ON20/RR20
OR/usaor/or.or010.wpt:US26/OR99W
OR/usaor/or.or019.wpt:I-84/US30
OR/usaor/or.or035.wpt:I-84/US30
OR/usaor/or.or038.wpt:I-5/OR99
OR/usaor/or.or039.wpt:US97/OR39Bus
OR/usaor/or.or074.wpt:I-84/US30
OR/usaor/or.or099e.wpt:I-84/US30
OR/usaor/or.or099.wpt:US199/OR238
OR/usaor/or.or099.wpt:OR99W/OR99E
OR/usaor/or.or131trktil.wpt:US101/OR6
OR/usaor/or.or131.wpt:US101/OR6
OR/usaor/or.or140.wpt:I-5/OR99
OR/usaor/or.or207.wpt:I-84/US30
OR/usaor/or.or211.wpt:OR99E/OR214
OR/usaor/or.or214.wpt:OR99E/OR211
OR/usaor/or.or320.wpt:I-84/US30/395
OR/usaus/or.us097.wpt:US97Bus/OR126
POL/eure/pol.e67.wpt:DW382/DW385
POL/poldk/pol.dk008klo.wpt:DW382/DW385
PRT/eure/prt.e801.wpt:A24/IP3
ROU/eure/rou.e60.wpt:DN1A/DNVO1K
ROU/eure/rou.e60.wpt:DJ100B/DJ101E
ROU/eure/rou.e81.wpt:A2/DN22C
ROU/roudj/rou.dj222.wpt:DN22/DN22D
ROU/roudj/rou.dj601e.wpt:DJ401A/DJ601
ROU/roudj/rou.dj793a.wpt:DJ792A/DJ793
ROU/roudn/rou.dn001.wpt:DJ100B/DJ101E
ROU/roudn/rou.dn001.wpt:DN1A/DNVO1K
ROU/roudn/rou.dn007.wpt:A1/DN73
RUS/asiah/rus.ah008.wpt:A181/ZSD
RUS/asiah/rus.ah008.wpt:A118/ZSD
RUS/eure/rus.e18.wpt:A181/ZSD
RUS/eure/rus.e18.wpt:A118/ZSD
TX/usai/tx.i369.wpt:US59/Lp151
TX/usatxl/tx.lp0012.wpt:TXLp354/Spr482
TX/usatxl/tx.lp0020.wpt:TX359/Spr260
TX/usatxl/tx.lp0340.wpt:TX6/Lp484
TX/usatxl/tx.lp0354.wpt:TXLp12/Spr482
TX/usatxl/tx.lp0484.wpt:TX6/Lp340
TX/usatxs/tx.sp0339.wpt:TX103/Lp287
TX/usatxs/tx.sp0450.wpt:TX302/Lp338
TX/usatx/tx.tx019.wpt:TX19Bus/Lp7
TX/usatx/tx.tx019.wpt:TX154/Lp301
TX/usatx/tx.tx021.wpt:TX95/Lp150
TX/usatx/tx.tx036.wpt:TX95/Lp363
TX/usatx/tx.tx036.wpt:TX53/Lp363
TX/usatx/tx.tx046.wpt:TX46Bus/Lp337
TX/usatx/tx.tx053.wpt:TX36/Lp363
TX/usatx/tx.tx095.wpt:TX21/Lp150
TX/usatx/tx.tx154.wpt:TX19/Lp301
TX/usatx/tx.tx158.wpt:TX158Bus/Lp250
TX/usaus/tx.us059.wpt:US59Bus/Lp20
TX/usaus/tx.us059.wpt:US59Bus/Lp224
TX/usaus/tx.us059.wpt:TX93/Lp151
TX/usaus/tx.us083.wpt:US83Bus/Lp322
TX/usaus/tx.us084.wpt:US83Bus/Lp322
TX/usaus/tx.us190.wpt:TX95/Lp363
TX/usaus/tx.us380.wpt:TX207/Lp46
UT/usaut/ut.ut186.wpt:300N/ColSt
UT/usaut/ut.ut186.wpt:300N/StaSt
UT/usaut/ut.ut227.wpt:200W/StaSt
WI/usausb/wi.us012busbar.wpt:US12/CRW
WI/usawi/wi.wi035.wpt:WI64/CRE
WI/usawi/wi.wi064.wpt:WI35/CRE
WY/usaus/wy.us026.wpt:1stSt/RamSt
WY/usaus/wy.us085.wpt:Rd128/CadRd
WY/usaus/wy.us287.wpt:1stSt/RamSt
WY/usawy/wy.wy159.wpt:TeaRd/Rd94
WY/usawy/wy.wy530.wpt:FR7/FR157
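For anyone who wants to try the first of those two pipelines (the one requiring a numeral after the prefix) without a Unix shell, the filters translate fairly directly to Python. This is a sketch that works on a bare label rather than on grep's path:label output; `extraneous_prefix` is a made-up name, and "A/B" and "I-5/99" are hypothetical labels used only to show the exclusions.

```python
import re

# Labels that are just letters/letters (optionally starred) are excluded, as with egrep -iv.
ALL_LETTERS = re.compile(r"\*?[A-Za-z]+/[A-Za-z]+")
# After a slash: two or more letters or hyphens, then a digit.
PREFIX_AFTER_SLASH = re.compile(r"/[A-Za-z\-]{2,}[0-9]")

def extraneous_prefix(label):
    """True if the label would survive the first pipeline's filters."""
    if label.startswith("+"):
        return False  # grep -v '^+'
    if ALL_LETTERS.fullmatch(label):
        return False
    return bool(PREFIX_AFTER_SLASH.search(label))

for label in ("US72/ALT72", "TX95/Lp363", "US12/CRW", "A/B", "I-5/99"):
    print(label, extraneous_prefix(label))
```

US12/CRW is not caught here because nothing numeric follows the slash; it only shows up once that requirement is dropped, as in the second pipeline.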

Quote
but I did peruse siteupdate.py ... for datacheckerrors.append which I believe is a comprehensive accounting of all the data checks.

Correct, datacheckerrors.append will show all the datachecks, even a couple that are commented out. The live ones are compiled all in one place for convenience here.

Quote
Generally, the waypoint label checks currently implemented don't test for consistency with the labeling style guidelines, but rather what I would consider more fundamental errors*,
...
[* for what it's worth, I may be alone in drawing a distinction here, and you may consider these checks to be the same kind of thing.]
...
such as more than one slash, unbalanced parentheses, and invalid characters.

I'd categorize more than one slash (LABEL_SLASHES) as more style guideline, but I getcha.

Quote
There are a few checks for consistency, e.g. using "Bus" instead of "BL" or "BS" for business interstates, or the LONG_UNDERSCORE checks (flags _Abcd, _AbcdN type stuff), but nothing comprehensive.

Other than what you mentioned, for style guide stuff we have LABEL_UNDERSCORES (too many) and NONTERMINAL_UNDERSCORE (slash after an underscore).
I may have a go at adding US_BANNER US_LETTER back in; that's easy. Edit: Done!
I've brainstormed some other potential datachecks, still mostly in regex form recorded here on the forum for now, but have implemented some of these as actual datachecks, either partially or fully, and left saved as a git stash. I just worry about dropping too many bombs on the datacheck page too fast, and generating fatigue and ill will... even if we should be labeling stuff right! ;)

Quote
It would probably be useful to produce public documentation on the exact rules that throw flags (or maybe this exists and I couldn't find it at first glance).

https://github.com/yakra/Web/issues/13 wasn't too helpful/visible; replaced with https://github.com/TravelMapping/Web/issues/403 .

Quote
conduct a cursory perusal of the results for SP and PP and look for files with more than two hits since you can't have more than two ends in a highway wpt file. What we find is that this nomenclature is widespread, and may merit a specific mention in the style guidelines.

"Road (not driveway/parking lot) to a national or state-level park, major airport, or popular tourist attraction." The (not driveway/parking lot) directive has more recently been relaxed in practice.
If a localized national park or tourist attraction is immediately served by the cross road, a truncated version of its name could also suffice.
Clarification on NP/PP/SP abbreviations may be more useful here than in the section on highway ends.
https://github.com/TravelMapping/Web/issues/404

Quote
Start by identifying some candidates for inclusion: find all the US primary waypoint labels that have a non-underscored suffix where 1 uppercase character is followed by 3 or more lowercase characters (38 results).

This will only look for 4-character strings; in the OP I was also interested in finding cases where a more standard 3-character string should have an underscore but lacks it; these will be much more common.
Beware false positives! Do the search in the command for all regions (*/*/*.wpt), and you'll see a FP right away, on AB748. Since we can't have two "AB748"s to use in list files, one of them becomes AB748Eds, hence we'll see labels like this sometimes.
• No underscore when the city abbrev is part of the highway's .list name.
• Underscore when appending a city suffix to disambiguate 2+ intersections with the same route, that would otherwise have the same label.

Weeding out false positives would be better handled as a datacheck, where we have access to info on other routes occupying that same waypoint's coordinates.
My search in the OP is pretty North America centric; it returns a lot of FPs for Italy, which has a number of frequently used banners that I didn't include in that big regex.

A number of the searches posted in the OP, I have eyes toward eventually implementing as datachecks where practicable.
« Last Edit: November 18, 2020, 06:21:57 pm by yakra »

yakra
« Reply #19 on: November 18, 2020, 09:18:40 pm »
Who gets hurt by exit renumberings?

MA I-195 has been renumbered, sequential -> mileage-based.
What if I want to tag people who maintain their .lists via GitHub, so GitHub can send them a notification?
I've got a list of the exit numbers in use that got moved to a new location... Where do I go from there?

cd $HOME/TravelMapping/UserData/list_files/

First, I set up my search terms. Change these around, and we can generalize the scripts below to other similar applications.
rg=MA
rte=I-195
root=ma.i195
labels='5 10 12 13 15 22'


A few more variables to simplify some of the shell commands:
r=$(echo -en "\r")
t=$(echo -en "\t")
d="[ $t]+"



Build a list of travelers using any of the specified labels:
AllBroken=\
$(for label in $labels; do
    grep -v '^#' * \
    | sed -r -e 's~#.*~~' -e "s~:[ $t]+~:~" -e "s~[ $t$r]+$~~" \
    | egrep -i ":$rg$d$rte$d$label$d.*|:$rg$d$rte$d.*$d$label$|:$rg$d$rte$d$label$d.*$d.*$d.*|:.*$d.*$d.*$d$rg$d$rte$d$label$"
  done | cut -f1 -d. | sort | uniq)

I can use the echo command to print out this list and slap it in the forum.
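The egrep alternatives above are easier to follow when written out field by field. A rough Python sketch, assuming (as the patterns suggest) that a .list line has either 4 fields (region, route, two labels) or 6 fields (region, route, label, twice); `uses_label` is a hypothetical helper, and it is a stricter reading than the regex, which tolerates extra text.

```python
def uses_label(list_line, region, route, labels):
    """True if a .list line references region/route at any of the given labels."""
    fields = [f.lower() for f in list_line.split("#", 1)[0].split()]
    target = [region.lower(), route.lower()]
    wanted = {label.lower() for label in labels}
    if len(fields) == 4:
        # REGION ROUTE LABEL1 LABEL2
        return fields[:2] == target and bool(wanted & set(fields[2:]))
    if len(fields) == 6:
        # REGION1 ROUTE1 LABEL1 REGION2 ROUTE2 LABEL2
        first = fields[0:2] == target and fields[2] in wanted
        second = fields[3:5] == target and fields[5] in wanted
        return first or second
    return False

moved = "5 10 12 13 15 22".split()
print(uses_label("MA I-195 5 10", "MA", "I-195", moved))
print(uses_label("RI I-195 1 MA I-195 13  # multi-region", "MA", "I-195", moved))
print(uses_label("MA I-195 1 2", "MA", "I-195", moved))
```

Like egrep -i, the comparison ignores case, and anything after a # is dropped before splitting.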


Who updates using GitHub? Which files have commits by someone other than Jim?
GitHub=\
$(for u in $AllBroken; do
    echo -en "$u\t"
    echo \
    $(git log $u.list \
      | grep -v Teresco \
      | grep -B 1 -m 1 'Author:' \
      | head -n 1 | cut -f2 -d' ')
  done | egrep "$t[0-9a-f]{40}$" | cut -f1)

This is a list of TM usernames. We still need to figure out GitHub usernames, so we'll set it aside for later.
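The "last commit not by Jim" step can also be done by walking the git log text directly instead of chaining grep and head. A sketch under stated assumptions: `last_commit_not_by` is a made-up helper, the log below is fabricated sample text (fake hashes, names, and addresses) in the default git log layout, and unlike the pipeline it steps over a Merge: line instead of relying on the 40-character hash filter.

```python
def last_commit_not_by(log_text, excluded):
    """Return the hash of the newest commit whose Author line lacks `excluded`."""
    commit = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("Author:") and excluded not in line:
            return commit
    return None

# Fabricated log text for illustration only.
sample_log = "\n".join([
    "commit " + "a" * 40,
    "Author: Jim Teresco <jim@example.org>",
    "Date:   Wed Nov 18 12:00:00 2020 -0500",
    "",
    "    site update",
    "",
    "commit " + "b" * 40,
    "Merge: 1234567 89abcde",
    "Author: Some Traveler <traveler@example.org>",
    "Date:   Tue Nov 17 12:00:00 2020 -0500",
    "",
    "    update my list",
])
print(last_commit_not_by(sample_log, "Teresco"))
```

In practice the log text would come from running git log on the user's .list file, e.g. through `subprocess.run(..., capture_output=True, text=True)`.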


Back up a step, convert the names to URLs, and visit their GitHub commit history at the last commit not by Jim:
for u in $AllBroken; do
  echo -en "$u\t"
  echo \
  $(git log $u.list \
    | grep -v Teresco \
    | grep -B 1 -m 1 'Author:' \
    | head -n 1 | cut -f2 -d' ')
done \
| egrep "$t[0-9a-f]{40}$" \
| sed -r "s~(.*)$t(.*)~https://github.com/TravelMapping/UserData/commits/\2/list_files/\1.list~" \
| xargs firefox

This opens up 24 browser tabs. From there I just clicked on the users' profiles & copied their names the old-fashioned way, but this could possibly be automated too...


Who's left over who updates by email, and could use a heads-up from Jim?
diff <(echo $AllBroken | tr ' ' '\n') <(echo $GitHub | tr ' ' '\n') | grep '<' | sed 's~< ~~'
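The diff | grep '<' trick is effectively a set difference: names in the first list that are missing from the second. In Python that is a one-liner; the helper name and the usernames below are hypothetical.

```python
def only_in_first(first, second):
    """Names present in `first` but not in `second`, sorted."""
    return sorted(set(first) - set(second))

all_broken = ["alice", "bob", "carol"]   # hypothetical TM usernames
on_github = ["bob"]
print(only_in_first(all_broken, on_github))
```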


Who's fine?
What if, for giggles, I want to look at the travels of people whose .list lines didn't get broken?
First, find everyone who marked off any segment of MA I-195...
AllListed=\
$(grep -v '^#' * \
  | sed -r -e 's~#.*~~' -e "s~:[ $t]+~:~" -e "s~[ $t$r]+$~~" \
  | egrep -i ":$rg$d$rte$d.*$d.*|:$rg$d$rte$d.*$d.*$d.*$d.*|:.*$d.*$d.*$d$rg$d$rte$d.*" \
  | cut -f1 -d. | sort | uniq)



Then filter out the broken .lists, and open up what's left in the HB.
diff <(echo $AllListed | tr ' ' '\n') <(echo $AllBroken | tr ' ' '\n') | grep '<' \
| sed -e 's~< ~~' -e "s~.*~https://travelmapping.net/hb/showroute.php?r=$root\&u=&~" \
| xargs firefox
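Building the HB links is mostly string handling, so here is an equivalent sketch in Python that leaves the query-string escaping to the standard library. `showroute_urls` and the username are hypothetical; the base URL and the r and u parameters are the ones used in the sed command above.

```python
from urllib.parse import urlencode

def showroute_urls(users, root):
    """One showroute.php link per user for the given route root."""
    base = "https://travelmapping.net/hb/showroute.php"
    return [base + "?" + urlencode({"r": root, "u": user}) for user in users]

for url in showroute_urls(["alice"], "ma.i195"):
    print(url)
```

The resulting list could then be handed to a browser with the standard `webbrowser.open`, one URL at a time.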



Was there anyone with only concurrencies, not explicitly .listing MA I-195 itself?
We can search user logs for anyone with MA I-195 mileage, and filter out our previous list.
cd $HOME/TravelMapping/DataProcessing/siteupdate/python-teresco/logs/users
diff <(grep -m 1 "$rg $rte" * | cut -f1 -d. | sort) <(echo $AllListed | tr ' ' '\n') | grep '<' \
| sed -r "s~< (.*)~https://travelmapping.net/user/mapview.php?rg=$rg\&u=\1~" \
| xargs firefox
« Last Edit: November 19, 2020, 01:06:42 am by yakra »