Slightly more complicated: find all the waypoints in the US that start with I followed directly by a number. [Note: this is not quite restricted to the US as there are some non-US regions with highway directories that contain the string "usa", but it eliminates most regions and (currently) all false positives. See the code section in this post for a regex that matches only US states and territories.]
I have a git stash on my desktop implementing an INTERSTATE_NO_HYPHEN datacheck in the C++ siteupdate program. Haven't tackled Python yet.
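// Flag US labels such as "I95" or "ToI95" that lack the hyphen after the "I".
inline void Waypoint::interstate_no_hyphen(DatacheckEntryList *datacheckerrors)
{ if (route->system->country->first == "USA" && label.size() >= 2)
  { const char *c = label.data();
    // skip a leading "To"; data() is null-terminated (C++11), so c[0] stays readable
    if (c[0] == 'T' && c[1] == 'o') c += 2;
    // the cast keeps isdigit() well-defined for bytes outside the ASCII range
    if (c[0] == 'I' && isdigit((unsigned char)c[1]))
      datacheckerrors->add(route, label, "", "", "INTERSTATE_NO_HYPHEN", "");
  }
}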
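For the Python side, the same test might look roughly like the sketch below. The function name and the plain `label` / `country` string arguments are made up for illustration; how siteupdate.py actually stores routes and records datacheck entries is not shown here.

```python
def interstate_no_hyphen(label, country):
    """Return True for US labels like 'I95' or 'ToI95' (no hyphen after the I).

    Hypothetical helper mirroring the C++ check above; not siteupdate.py code.
    """
    if country != "USA" or len(label) < 2:
        return False
    # Skip a leading "To", as the C++ version does.
    rest = label[2:] if label.startswith("To") else label
    # ASCII digits only, to match C's isdigit() in the default locale.
    return len(rest) >= 2 and rest[0] == "I" and rest[1] in "0123456789"
```

With this, `I95` and `ToI95` are flagged, while `I-95`, `IndAve`, and any label outside the USA are not.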
First, find all waypoints with a capital letter followed by 4 or more lowercase letters
We have automated checks which are reported and updated with every site update: http://travelmapping.net/devel/datacheck.php
I just wonder, why some are not detected, e.g.:
NLD\nlda\nld.a325.wpt:2:Elden http://www.openstreetmap.org/?lat=51.958363&lon=5.888801
NLD\nlda\nld.a325.wpt:3:BurgMatsl http://www.openstreetmap.org/?lat=51.950772&lon=5.878973
@yakra? Any idea, should I open a Github issue?
I implemented a LABEL_LONG_WORD datacheck on my own TM mirror (not currently online).
Didn't yet merge the changes back into the TravelMapping/DataProcessing:master -- as cvoight records, there are still plenty of results. (More below...)
Probably should though, if we're expecting to see this datacheck now, and not seeing it causes confusion.
executed from the hwy_data directory: Get-ChildItem -Include "*.wpt" -Recurse
...
Scanning this file, I filter the results: delete many odd results from ITA\ita.nsa (2).wpt; delete Minnesota subway lines (MN\usamnmt\BlueLine.wpt & MN\usamnmt\GreenLine.wpt);
How about
Get-ChildItem -Include "*/*/*.wpt" ?
Select-String -CaseSensitive -Pattern "[A-Z][a-z]{4,}"
Or even just
"[a-z]{4,}"
all lines that contain a + [presumably waypoint alias] -- using Sublime Text, execute a regex search for ^.*\+.*$, select find all, and press delete twice;
We still want to search before the + sign though. Hence the
| cut -f1 -d' ' on Unix-like systems.
Sounds like you know this per later posts though.
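To make the "search before the + sign" point concrete, here is a small Python sketch. It assumes a .wpt line is space-delimited with the primary label first, as the `cut -f1 -d' '` step implies; the helper names are invented for this example.

```python
import re

# A capital letter followed by 4 or more lowercase letters.
LONG_WORD = re.compile(r"[A-Z][a-z]{4,}")

def primary_label(line):
    """First space-delimited field of a .wpt line, roughly `cut -f1 -d' '`."""
    return line.strip().split(" ", 1)[0]

def has_long_word(line):
    """Test only the primary label, ignoring alternate labels and the URL."""
    return LONG_WORD.search(primary_label(line)) is not None
```

Deleting every line that contains a + would lose lines whose primary label is still worth checking. Taking the first field keeps them and only ignores the alternate labels.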
Parenthesizing your Select-String query and appending .length allows you to find the total number of results. For example, to find the number of waypoint labels with a US route as the primary highway (31,265):
PS ..\hwy_data> (Get-ChildItem -Directory "*usa*" -Recurse | Get-ChildItem -Include "*.wpt" -Recurse | Select-String -Pattern "^US").length
31265
In Unix, we'd pipe the results to wc and do a line count:
grep '^US' */usa*/*.wpt | wc -l
I like the discussion on searching for extraneous highway prefixes.
Did I experiment with a regex for this at some point?
OK, yes. But I hadn't hit the mark yet. Now working with
grep -v '^+' */*/*.wpt | cut -f1 -d' ' | egrep -iv ':\*?[A-Z]+/[A-Z]+$' | egrep ':.*/[A-Za-z\-]{2,}[0-9]'
we get 123 results. I don't see anything that should be a false positive, other than arguably the Texas entries; I think the original idea behind those was to consider the "Lp" or "Spr" part of the designation & not part of the prefix, I.E. TX95 + TXLp363 = drop the 2nd "TX" = TX95/Lp363.
Whatever the case though, I don't like those and plan on fixing them.
We can expand on this a little bit and drop the requirement for numeral(s) after the slash:
grep -v '^+' */*/*.wpt | cut -f1 -d' ' | grep ':.*[0-9]' | egrep -i ':.*/[A-Z\-]{2,}' | egrep -v '/[A-Z]{2}$'
This brings some local road names, named roads, and lettered county routes with extraneous "CR" back into the mix. Looks like these wouldn't be FPs either:
http://travelmapping.net/devel/manual/wayptlabels.php#slash
http://travelmapping.net/devel/manual/wayptlabels.php#dropnamed
AL/usaal/al.al013.wpt:US72/ALT72
AL/usaal/al.al017.wpt:US72/ALT72
AL/usaus/al.us043.wpt:US72/ALT72
AL/usaus/al.us072.wpt:US43/ALT72
AL/usaus/al.us090.wpt:I10/US98_W
AL/usaus/al.us090.wpt:I10/US98_E
AL/usaus/al.us098.wpt:I10/US90_W
AL/usaus/al.us098.wpt:I10/US90_E
BC/canbc/bc.bc003.wpt:BC93/BC95
BC/canbc/bc.bc093.wpt:BC3/BC95
BC/canbc/bc.bc095.wpt:BC3/BC93
CA/usaush/ca.us099hisind.wpt:1stSt/Hef
CHE/eurtr/che.gts.wpt:H10/AutGalsIns
CO/usaco/co.co030.wpt:6th/HavSt
CO/usaus/co.us085.wpt:I-25/I-70
CZE/eure/cze.e442.wpt:I30/II613
CZE/eure/cze.e461.wpt:I43/II640
CZE/eure/cze.e461.wpt:I42/II640
CZE/eure/cze.e551.wpt:I34/II634
CZE/eure/cze.e65.wpt:MO/II243
DEU-BY/deub/deuby.b013.wpt:St2256/St2419
DEU-BY/deub/deuby.b022bam.wpt:St2271/St2450
DEU-BY/eurtr/deuby.romstr.wpt:B466/St2212
DEU-BY/eurtr/deuby.romstr.wpt:B300/St2051
DEU-BY/eurtr/deuby.romstr.wpt:B17/St2014
DEU-BY/eurtr/deuby.romstr.wpt:B23/St2059
DEU-BY/eurtr/deuby.romstr.wpt:B17/St2059
DEU-BY/eurtr/deuby.romstrwur.wpt:B8/St2300
DNK/eurtr/dnk.mraal.wpt:PR55/SR180
DNK/eurtr/dnk.mresb.wpt:PR11/SR175
DNK/eurtr/dnk.mresb.wpt:E20/PR24
DNK/eurtr/dnk.mresb.wpt:PR15/SR181
DNK/eurtr/dnk.mrnyk.wpt:PR9/SR297
DNK/eurtr/dnk.mrnyk.wpt:PR9/SR289
DNK/eurtr/dnk.mrode.wpt:PR16/SR563
DNK/eurtr/dnk.mrode.wpt:PR21/SR563
DNK/eurtr/dnk.mrode.wpt:PR52/SR451
DNK/eurtr/dnk.mrode.wpt:PR8/SR167
DNK/eurtr/dnk.mrsja.wpt:SR231/PR57
DNK/eurtr/dnk.mrsja.wpt:SR207/SR211
DNK/eurtr/dnk.mrsja.wpt:SR211/PR16
DNK/eurtr/dnk.mrsja.wpt:PR16/SR205
DNK/eurtr/dnk.mrsja.wpt:PR14/SR155
DNK/eurtr/dnk.mrsve.wpt:SR305/PR9
DNK/eurtr/dnk.mrvib.wpt:PR26/SR186
DNK/eurtr/dnk.mr.wpt:PR26/SR545
DNK/eurtr/dnk.mr.wpt:PR26/SR181
DNK/eurtr/dnk.mr.wpt:E39/SR597
DNK/eurtr/dnk.mr.wpt:E45/SR585
ESP-AN/espa/espan.ca035.wpt:AP4/CA32
ESP-CM/espa/espcm.ap041.wpt:A40/TO22
ESP-CM/espn/espcm.n403.wpt:TO21/CM40
ESP-MC/espmc/espmc.rm015.wpt:A7/MU30
FRA-GES/fragesd55/frages.d069455.wpt:N135/NVS
FRA-IDF/eure/fraidf.e15.wpt:A3/BlvdPer
FRA-IDF/eure/fraidf.e15.wpt:A6b/BlvdPer
FRA-IDF/eure/fraidf.e50.wpt:A6b/BlvdPer
FRA-IDF/eure/fraidf.e50.wpt:A4/BlvdPer
FRA-IDF/eure/fraidf.e5.wpt:A13/BlvdPer
FRA-IDF/eure/fraidf.e5.wpt:A6a/BlvdPer
IDN/asiahr/idn.ah152.wpt:N3/JTP
IND-PB/index/indpb.acexpy.wpt:NH5/NH7
IN/usain/in.in001.wpt:I-69/I-469
ISL/islth/isl.th041.wpt:TH45/TH429
ITA/eure/ita.e78.wpt:SS73/SS715
ITA/eure/ita.e80.wpt:A91/AGRA
ITA/eure/ita.e80.wpt:A24/AGRA
MA/usama/ma.ma002acam.wpt:US3/MA3
MEX-JAL/mexdn/mexjal.mex080d.wpt:MEX90D/GUA10D
MEX-JAL/mexdn/mexjal.mex090d.wpt:MEX80D/GUA10D
NC/usanc/nc.nc087.wpt:US311/NC770
NC/usanc/nc.nc308trkwin.wpt:US13/US17_S
NC/usaus/nc.us017.wpt:US13/US17BypWin_S
NC/usaus/nc.us158.wpt:US29Bus/NC87
NC/usaus/nc.us301.wpt:NC43/NC48_S
OH/usaoh/oh.oh012.wpt:OH115/OH189
OH/usaoh/oh.oh115.wpt:OH12/OH189
OH/usaoh/oh.oh189.wpt:OH12/OH115
OH/usaus/oh.us023.wpt:OH32/OH124
ON/canon/on.on058tho.wpt:ON20/RR20
OR/usaor/or.or010.wpt:US26/OR99W
OR/usaor/or.or019.wpt:I-84/US30
OR/usaor/or.or035.wpt:I-84/US30
OR/usaor/or.or038.wpt:I-5/OR99
OR/usaor/or.or039.wpt:US97/OR39Bus
OR/usaor/or.or074.wpt:I-84/US30
OR/usaor/or.or099e.wpt:I-84/US30
OR/usaor/or.or099.wpt:US199/OR238
OR/usaor/or.or099.wpt:OR99W/OR99E
OR/usaor/or.or131trktil.wpt:US101/OR6
OR/usaor/or.or131.wpt:US101/OR6
OR/usaor/or.or140.wpt:I-5/OR99
OR/usaor/or.or207.wpt:I-84/US30
OR/usaor/or.or211.wpt:OR99E/OR214
OR/usaor/or.or214.wpt:OR99E/OR211
OR/usaor/or.or320.wpt:I-84/US30/395
OR/usaus/or.us097.wpt:US97Bus/OR126
POL/eure/pol.e67.wpt:DW382/DW385
POL/poldk/pol.dk008klo.wpt:DW382/DW385
PRT/eure/prt.e801.wpt:A24/IP3
ROU/eure/rou.e60.wpt:DN1A/DNVO1K
ROU/eure/rou.e60.wpt:DJ100B/DJ101E
ROU/eure/rou.e81.wpt:A2/DN22C
ROU/roudj/rou.dj222.wpt:DN22/DN22D
ROU/roudj/rou.dj601e.wpt:DJ401A/DJ601
ROU/roudj/rou.dj793a.wpt:DJ792A/DJ793
ROU/roudn/rou.dn001.wpt:DJ100B/DJ101E
ROU/roudn/rou.dn001.wpt:DN1A/DNVO1K
ROU/roudn/rou.dn007.wpt:A1/DN73
RUS/asiah/rus.ah008.wpt:A181/ZSD
RUS/asiah/rus.ah008.wpt:A118/ZSD
RUS/eure/rus.e18.wpt:A181/ZSD
RUS/eure/rus.e18.wpt:A118/ZSD
TX/usai/tx.i369.wpt:US59/Lp151
TX/usatxl/tx.lp0012.wpt:TXLp354/Spr482
TX/usatxl/tx.lp0020.wpt:TX359/Spr260
TX/usatxl/tx.lp0340.wpt:TX6/Lp484
TX/usatxl/tx.lp0354.wpt:TXLp12/Spr482
TX/usatxl/tx.lp0484.wpt:TX6/Lp340
TX/usatxs/tx.sp0339.wpt:TX103/Lp287
TX/usatxs/tx.sp0450.wpt:TX302/Lp338
TX/usatx/tx.tx019.wpt:TX19Bus/Lp7
TX/usatx/tx.tx019.wpt:TX154/Lp301
TX/usatx/tx.tx021.wpt:TX95/Lp150
TX/usatx/tx.tx036.wpt:TX95/Lp363
TX/usatx/tx.tx036.wpt:TX53/Lp363
TX/usatx/tx.tx046.wpt:TX46Bus/Lp337
TX/usatx/tx.tx053.wpt:TX36/Lp363
TX/usatx/tx.tx095.wpt:TX21/Lp150
TX/usatx/tx.tx154.wpt:TX19/Lp301
TX/usatx/tx.tx158.wpt:TX158Bus/Lp250
TX/usaus/tx.us059.wpt:US59Bus/Lp20
TX/usaus/tx.us059.wpt:US59Bus/Lp224
TX/usaus/tx.us059.wpt:TX93/Lp151
TX/usaus/tx.us083.wpt:US83Bus/Lp322
TX/usaus/tx.us084.wpt:US83Bus/Lp322
TX/usaus/tx.us190.wpt:TX95/Lp363
TX/usaus/tx.us380.wpt:TX207/Lp46
UT/usaut/ut.ut186.wpt:300N/ColSt
UT/usaut/ut.ut186.wpt:300N/StaSt
UT/usaut/ut.ut227.wpt:200W/StaSt
WI/usausb/wi.us012busbar.wpt:US12/CRW
WI/usawi/wi.wi035.wpt:WI64/CRE
WI/usawi/wi.wi064.wpt:WI35/CRE
WY/usaus/wy.us026.wpt:1stSt/RamSt
WY/usaus/wy.us085.wpt:Rd128/CadRd
WY/usaus/wy.us287.wpt:1stSt/RamSt
WY/usawy/wy.wy159.wpt:TeaRd/Rd94
WY/usawy/wy.wy530.wpt:FR7/FR157
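For anyone not on a Unix-like system, the two grep pipelines above can be approximated in Python. This is a sketch working on a bare primary label rather than grep's `path:label` output, so it should be close to, but is not guaranteed identical to, what the shell commands return. The function names are invented here.

```python
import re

ALL_LETTERS = re.compile(r"\*?[A-Z]+/[A-Z]+", re.IGNORECASE)
PREFIX_THEN_DIGIT = re.compile(r"/[A-Za-z\-]{2,}[0-9]")
SLASH_LETTERS = re.compile(r"/[A-Z\-]{2,}", re.IGNORECASE)
TWO_LETTER_TAIL = re.compile(r"/[A-Z]{2}$")

def strict_hit(label):
    """First pipeline: 2+ letters/hyphens then a digit after the slash."""
    if label.startswith("+"):           # grep -v '^+'
        return False
    if ALL_LETTERS.fullmatch(label):    # egrep -iv ':\*?[A-Z]+/[A-Z]+$'
        return False
    return PREFIX_THEN_DIGIT.search(label) is not None

def relaxed_hit(label):
    """Second pipeline: no digit required after the slash."""
    if label.startswith("+"):
        return False
    return (re.search(r"[0-9]", label) is not None          # a digit somewhere
            and SLASH_LETTERS.search(label) is not None     # 2+ letters after /
            and TWO_LETTER_TAIL.search(label) is None)      # not just /XX at end
```

Labels from the list such as `WI64/CRE` and `1stSt/RamSt` only show up with `relaxed_hit`, while `TX95/Lp363` and `US72/ALT72` are caught by both.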
but I did peruse siteupdate.py ... for datacheckerrors.append which I believe is a comprehensive accounting of all the data checks.
Correct, datacheckerrors.append will show all the datachecks, even a couple that are commented out. The live ones are compiled all in one place for convenience here.
Generally, the waypoint label checks currently implemented don't test for consistency with the labeling style guidelines, but rather what I would consider more fundamental errors*,
...
[* for what it's worth, I may be alone in drawing a distinction here, and you may consider these checks to be the same kind of thing.]
...
such as more than one slash, unbalanced parentheses, and invalid characters.
I'd categorize more than one slash (LABEL_SLASHES) as more style guideline, but I getcha.
There are a few checks for consistency, e.g. using "Bus" instead of "BL" or "BS" for business interstates, or the LONG_UNDERSCORE checks (flags _Abcd, _AbcdN type stuff), but nothing comprehensive.
Other than what you mentioned, for style guide stuff we have LABEL_UNDERSCORES (too many) and NONTERMINAL_UNDERSCORE (slash after an underscore).
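As a rough illustration of the slash and underscore checks named above, here is a Python sketch. It is not the siteupdate code: the function name is invented, and the threshold for "too many" underscores (more than one) is an assumption made for this example.

```python
def label_style_flags(label, max_underscores=1):
    """Return the names of the style checks a label would trip.

    Sketch only; max_underscores=1 is an assumed threshold.
    """
    flags = []
    if label.count("/") > 1:                  # more than one slash
        flags.append("LABEL_SLASHES")
    if label.count("_") > max_underscores:    # too many underscores
        flags.append("LABEL_UNDERSCORES")
    first = label.find("_")
    if first != -1 and "/" in label[first:]:  # slash after an underscore
        flags.append("NONTERMINAL_UNDERSCORE")
    return flags
```

For example, `I-84/US30/395` from the list above comes back with LABEL_SLASHES, while `I10/US98_W` comes back clean.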
I may have a go at adding US_BANNER US_LETTER back in; that's easy. Edit: Done!
I've brainstormed some other potential datachecks, still mostly in regex form recorded here on the forum for now, but have implemented some of these as actual datachecks, either partially or fully, and left saved as a git stash. I just worry about dropping too many bombs on the datacheck page too fast, and generating fatigue and ill will... even if we should be labeling stuff right!
It would probably be useful to produce public documentation on the exact rules that throw flags (or maybe this exists and I couldn't find it at first glance).
https://github.com/yakra/Web/issues/13 wasn't too helpful/visible; replaced with https://github.com/TravelMapping/Web/issues/403 .
conduct a cursory perusal of the results for SP and PP and look for files with more than two hits since you can't have more than two ends in a highway wpt file. What we find is that this nomenclature is widespread, and may merit a specific mention in the style guidelines.
"Road (not driveway/parking lot) to a national or state-level park, major airport, or popular tourist attraction." The (not driveway/parking lot) directive has more recently been relaxed in practice.
If a localized national park or tourist attraction is immediately served by the cross road, a truncated version of its name could also suffice.
Clarification on NP/PP/SP abbreviations may be more useful here than in the section on highway ends.
https://github.com/TravelMapping/Web/issues/404
Start by identifying some candidates for inclusion: find all the US primary waypoint labels that have a non-underscored suffix where 1 uppercase character is followed by 3 or more lowercase characters (38 results).
This will only look for 4-character strings; in the OP I was also interested in finding cases where a more standard 3-character string should have an underscore but lacks it; these will be much more common.
Beware false positives! Do the search in the command for all regions (*/*/*.wpt), and you'll see a FP right away, on AB748. Since we can't have two "AB748"s to use in list files, one of them becomes AB748Eds, hence we'll see labels like this sometimes.
• No underscore when the city abbrev is part of the highway's .list name.
• Underscore when appending a city suffix to disambiguate 2+ intersections with the same route, that would otherwise have the same label.
Weeding out false positives would be better handled as a datacheck, where we have access to info on other routes occupying that same waypoint's coordinates.
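A label-only search for these candidates might be sketched as follows. The regex is a guess at the idea, not the one from the OP: it assumes the suffix directly follows a digit at the end of the label. The helper name is invented and `US20Albn` is a made-up label.

```python
import re

def unsuffixed_city_candidates(labels, min_lower=3):
    """Labels ending in digit + capital + at least min_lower lowercase letters.

    Assumed pattern for illustration; min_lower=3 finds 4-character suffixes,
    min_lower=2 would also find 3-character ones.
    """
    pattern = re.compile(r"[0-9][A-Z][a-z]{%d,}$" % min_lower)
    return [lab for lab in labels if pattern.search(lab)]
```

A search like this cannot tell a missing underscore from a city abbreviation that is legitimately part of the route's .list name, as with AB748Eds, and lowering `min_lower` to 2 would also pick up banners such as the `Bus` in `US29Bus`.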
My search in the OP is pretty North America centric; it returns a lot of FPs for Italy, which has a number of frequently used banners that I didn't include in that big regex.
A number of the searches posted in the OP, I have eyes toward eventually implementing as datachecks where practicable.