In our reference library it is important that we have a hierarchical catalogue of regions, where articles can be stored.
However places need arranging according to various (sometimes conflicting) classifications.
But a basic Country/Region/Place structure is what we are aiming for. But where to obtain such a structure?
Possible sources of geographical structure:
( https://stackoverflow.com/questions/9627654/where-can-i-get-a-list-of-countries-states-and-cities )
Wikipedia
Has lists of states, countries, cities.
But no readily extractable hierarchy. The Wikidata entities (Q items and P Properties) allow for relationships to be defined, but they are basically tags, with no enforced hierarchical structure of categories. Thus ad-hoc overlapping list pages develop over time as contributors attempt to categorise articles. The topic of generating ontologies from Wikidata has formed the basis of many academic dissertations, but none has entirely succeeded, since the starting data is incomplete or inconsistent.
wdtaxonomy is a tool that can query wikidata, eg.
wdtaxonomy Q11344 -P P31 -c
returns 162 elements (item Q11344=chemical_element, property P31=instance_of).
But if a property doesn’t exist, or isn’t consistently used when tagging items, this is of little help.
Or use WikiData SPARCQL endpoint:
# curl -X POST https://query.wikidata.org/sparql -H "Accept: text/csv" --data-urlencode query@elems.sparql
SELECT ?elementLabel ?symbol (xsd:integer(?atomicnumber) as ?an)
WHERE {
?element wdt:P31 wd:Q11344 ; # ?element is_an_instance_of chemical_element
wdt:P246 ?symbol ; # ?element has_element_symbol ?symbol
wdt:P1086 ?atomicnumber . # ?element has_atomic number ?atomicnumber
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?atomicnumber
Another nice example retrieves a list of 5,179 minerals:
# curl -X POST https://query.wikidata.org/sparql -H "Accept: text/csv" --data-urlencode query@minerals.sparql
SELECT ?mineralLabel (SAMPLE(?subclassLabel) AS ?subclassLabel)
WHERE {
?mineral wdt:P31 wd:Q12089225 ; # ?mineral is_an_instance_of mineral_species
wdt:P279 ?subclass . # ?mineral is_a_member_of ?subclass
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
?mineral rdfs:label ?mineralLabel .
?subclass rdfs:label ?subclassLabel .
}
}
GROUP BY ?mineralLabel
ORDER BY ?mineralLabel
Drupal Taxonomy Place
https://www.drupal.org/project/taxonomy_place
Interesting – populates as needed, optionally fetches Wikipedia descriptions. But Drupal specific.
United Nations Code for Trade and Transport Locations (UN/LOCODE)
http://www.unece.org/cefact/locode/welcome.html
111,487 places
2018-2 SubdivisionCodes.csv has carriage returns in some region fields.
But for many places, the UN/LOCODE list does not show the parent region.
GeoNames geographical database
http://download.geonames.org/export/dump/
Does not have UK historic counties, e,g,
UK(GB)/England/Tyne&Wear/Tynemouth is listed instead by the council borough it lies within:
UK(GB)/England/NorthTyneside/Tynemouth
But, each place does usually have country/region/sub-region/place (country/admin1/admin2/place).
By cross-referencing against countryInfo.txt, admin1Codes.txt, admin2Codes.txt, all features of type Place can be extracted from allCountries.txt.
There are additional administration sub-divisions, e.g.Tynemouth is within the admin3 parish of Seaton Valley. However this is represented by the old ONS Geography code, which maps to the newer 9 character geography code – neither are decodable using separate downloadable Geonames adminCodes text files, but the full country dumps do contain the AdminRegions as well as Places so cross-referencing can be achieved.
Nomenclature of Territorial Units for Statistics
https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics (NUTS regions)
Europe only, and for the UK is based on council boroughs, not counties.
What they all have in common (to greater or lesser extents), is being riddled with errors.
Some other UK lists:
https://www.townslist.co.uk/ £25-£50
http://gazetteer.org.uk/purchase.php £15
www.paulstenning.com/uk-towns-and-counties-list
https://en.wikipedia.org/wiki/GeoPlace should be better, but is only free to public sector and Ordnance Survey partners.
Making something that might work
As it appears the most complete freely available source, we settled on the GeoNames data.
An awk script can quickly be crafted to use associative arrays to build a location hierarchy from the component datasets. To some extent this can omit incorrect admin region references (e.g. Winlaton in Gateshead is not in admin3 Stanley; Stanley is in County Durham, not part of Gateshead – the abolishing of many civil parishes over time has presumably been the cause of many funnies in the UK admin3 regions).
When used to populate a Linux ext4 filesystem directory structure, this gives us over 4 million directories and consumes over 16GB of disk space (each directory, even if empty, reserves a minimum of 4096 bytes).
Which does slow down backups… And even deletions (N.B.
mkdir empty; rsync -Prd --delete empty/ regions/
is faster than
rm -rf regions
).
A Better Method, using data from OpenStreetMap
At least for Britain and Ireland (December 2021)
See https://osm.org and https://github.com/osmnames/osmnames
Our steps were:
- Download a 1.6GB current OSM dump using
wget https://download.geofabrik.de/europe/britain-and-ireland-latest.osm.pbf
osmnames/import_osm/mapping.yml (which controls the imposm stage) needs amending to remove mappings from housenumber (leaving polygon, point, relation, linestring)
We also have to add state_district and county as a mapping place type.And osmnames/prepare_data/set_place_ranks.sql needs amending to keep all the admin levels:
WHEN type IN ('continent', 'sea', 'ocean') THEN 2
WHEN type IN ('country') THEN 4
WHEN type IN ('country_region') THEN 2 * COALESCE(admin_level,3)
WHEN type IN ('state') THEN 2 * COALESCE(admin_level,4)
WHEN type IN ('state_district') THEN 2 * COALESCE(admin_level,5)
WHEN type IN ('county') THEN 2 * COALESCE(admin_level,6)
WHEN type IN ('non_metropolitan_district') THEN 2 * COALESCE(admin_level,7)
WHEN type IN ('city') THEN 15
WHEN type IN ('water', 'desert') THEN 16
WHEN type IN ('boundary') THEN 17
WHEN type IN ('island', 'bay', 'river') THEN 2 * COALESCE(admin_level,8.5)
WHEN type IN ('region', 'peak', 'volcano') THEN 2 * COALESCE(admin_level,9)
WHEN type IN ('town','civil_parish') THEN 2 * COALESCE(admin_level,9)
WHEN type IN ('administrative') THEN 2* COALESCE(admin_level,15)
WHEN type IN ('village','hamlet','municipality','district','city_district','unincorporated_area','borough', 'aerodrome') THEN 2 * COALESCE(admin_level,10.5)
WHEN type IN ('suburb','subdivision','isolated_dwelling','farm','locality','islet','mountain_pass','hill') THEN 2 * COALESCE(admin_level,11)
WHEN type IN ('neighbourhood', 'residential','reservoir','stream') THEN 2 * COALESCE(admin_level,11.5)
WHEN type IN ('motorway','trunk','primary','secondary','tertiary','unclassified','residential','road','living_street','raceway','construction','track','crossing','riverbank','canal', 'station') THEN 2 * COALESCE(admin_level,13)
WHEN type IN ('motorway_link','trunk_link','primary_link','secondary_link','tertiary_link','service','path','cycleway','steps','bridleway','footway','corridor','pedestrian') THEN 2 * COALESCE(admin_level,13.5)
WHEN type IN ('houses') THEN 2 * COALESCE(admin_level,14)
WHEN type IN (NULL) THEN 2 * COALESCE(admin_level,9.5)
ELSE 30
Use a docker container (see https://osmnames.readthedocs.io/en/latest/getting_started.html) to generate a 62MB TSV file of place names with hierarchies.
We sped things up with some Postgres tuning and by holding the database in RAM. The run took about 20 minutes on a Ubuntu 20.04 system with a 6-core Intel I5-10400 and used 48GB of RAM.Also osmnames/export_osmnames/functions.sql needs to recognise more types as a place:
'country_region','state_district','county','non_metropolitan_district',
'region','civil_parish','municipality','district', 'city_district',
'unincorporated_area','subdivision','locality','islet','boundary',
And we need to separate lines of the displayname (full address field) with something other than commas (we used the field-mark “pipe” symbol) so as to avoid issues with address lines with embedded commas:
retval.displayName := retval.displayName || '| ' || current_name;
Comment out the lines within osmnames/prepare_data/set_polygon_types.sql (which applies libpostal rules, unfortunately turning some village and town administrative boundaries into cities, breaking the place hierarchies).
(https://nominatim.openstreetmap.org/ui/search.html is very helpful explaining how OSMNames is calculating the full address for any given node or region, although it uses Nominatim so the rules are not identical)
- Filter down the TSV to remove unwanted items (farms etc.) – we used an awk script:
BEGIN {FS=OFS="\t"}
NR == 1 || ( ($16 == "gb" || $16 == "ie" || $16 == "im" || $16 == "gg" || $16 == "je") && ( ( $3 == "node" || $3 == "relation") && ($5 == "place" || ($5 == "multiple" && $6 != "isolated_dwelling" && $6 != "farm" )))) {
print $1, $3,$4,$5,$6,$7,$8, $12,$13,$14, $15,$16,$17,$18,$19,$20,$21,$22,$23
}
- Install a local copy of dbpedia (a virtuoso sparql endpoint)
https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git
Load dbpedia data using COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/latest-core
(there’s a fair bit of detail omitted from the above, including installing the Docker engine https://docs.docker.com/engine/install/ubuntu/ https://docs.docker.com/compose/install/ ) - Install a local copy of wikipedia (using kiwix)
sudo add-apt-repository ppa:kiwixteam/release
sudo apt install kiwix-tools
wget https://www.mirrorservice.org/sites/download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2021-03.zim
kiwix-serve --port=8088 --daemon /kiwix/wikipedia_en_all_maxi_2021-03.zim
Access via a URL such as http://ourserver:8088/wikipedia_en_all_maxi_2021-03/A/Newcastle_upon_Tyne - Build a directory structure to reflect the place hierarchy in the TSV (we used python code which extracts the displayname field, driven from gnu parallel to utilise all CPU cores – it queries the local dbpedia to get place attributes and the local wikipedia to grab a PDF page)
- Relocate 100 or so places and move misplaced top-level entries to holding areas using a hand-crafted shellscript.