A Taxonomy of Places

In our reference library it is important that we have a hierarchical catalogue of regions, where articles can be stored.

However places need arranging according to various (sometimes conflicting) classifications.
But a basic Country/Region/Place structure is what we are aiming for. But where to obtain such a structure?

Possible sources of geographical structure:
( https://stackoverflow.com/questions/9627654/where-can-i-get-a-list-of-countries-states-and-cities )

Wikipedia

 

Has lists of states, countries, cities. 
But no readily extractable hierarchy. The Wikidata entities (Q items and P Properties) allow for relationships to be defined, but they are basically tags, with no enforced hierarchical structure of categories. Thus ad-hoc overlapping list pages develop over time as contributors attempt to categorise articles. The topic of generating ontologies from Wikidata has formed the basis of many academic dissertations, but none has entirely succeeded, since the starting data is incomplete or inconsistent.

wdtaxonomy is a tool that can query wikidata, eg.
wdtaxonomy Q11344 -P P31 -c
returns 162 elements  (item Q11344=chemical_element, property P31=instance_of).
But if a property doesn’t exist, or isn’t consistently used when tagging items, this is of little help.

Or use WikiData SPARCQL endpoint:

# curl -X POST https://query.wikidata.org/sparql -H "Accept: text/csv" --data-urlencode query@elems.sparql

SELECT ?elementLabel ?symbol (xsd:integer(?atomicnumber) as ?an)
WHERE {
  ?element wdt:P31 wd:Q11344 ; # ?element is_an_instance_of chemical_element
  wdt:P246 ?symbol ; # ?element has_element_symbol ?symbol
  wdt:P1086 ?atomicnumber . # ?element has_atomic number ?atomicnumber
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?atomicnumber

Another nice example retrieves a list of 5,179 minerals:

# curl -X POST https://query.wikidata.org/sparql -H "Accept: text/csv" --data-urlencode query@minerals.sparql

SELECT ?mineralLabel (SAMPLE(?subclassLabel) AS ?subclassLabel)
WHERE {
  ?mineral wdt:P31 wd:Q12089225 ; # ?mineral is_an_instance_of mineral_species
           wdt:P279 ?subclass . # ?mineral is_a_member_of ?subclass
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?mineral rdfs:label ?mineralLabel .
    ?subclass rdfs:label ?subclassLabel .
  }
}
GROUP BY ?mineralLabel
ORDER BY ?mineralLabel

Drupal Taxonomy Place

https://www.drupal.org/project/taxonomy_place
Interesting – populates as needed, optionally fetches Wikipedia descriptions. But Drupal specific.

United Nations Code for Trade and Transport Locations (UN/LOCODE)

http://www.unece.org/cefact/locode/welcome.html
111,487 places
2018-2 SubdivisionCodes.csv has carriage returns in some region fields.
But for many places, the UN/LOCODE list does not show the parent region.

GeoNames geographical database

http://download.geonames.org/export/dump/
Does not have UK historic counties, e,g, 
UK(GB)/England/Tyne&Wear/Tynemouth is listed instead by the council borough it lies within:
UK(GB)/England/NorthTyneside/Tynemouth
But, each place does usually have country/region/sub-region/place  (country/admin1/admin2/place).
By cross-referencing against countryInfo.txt, admin1Codes.txt, admin2Codes.txt, all features of type Place can be extracted from allCountries.txt. 
There are additional administration sub-divisions, e.g.Tynemouth is within the admin3 parish of Seaton Valley. However this is represented by the old ONS Geography code, which maps to the newer 9 character geography code – neither are decodable using separate downloadable Geonames adminCodes text files, but the full country dumps do contain the AdminRegions as well as Places so cross-referencing can be achieved.

Nomenclature of Territorial Units for Statistics

https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics (NUTS regions)
Europe only, and for the UK is based on council boroughs, not counties.

 

 

 

 

What they all have in common (to greater or lesser extents), is being riddled with errors.

Some other UK lists:
https://www.townslist.co.uk/ £25-£50
http://gazetteer.org.uk/purchase.php £15
www.paulstenning.com/uk-towns-and-counties-list
https://en.wikipedia.org/wiki/GeoPlace should be better, but is only free to public sector and Ordnance Survey partners.

Making something that might work

As it appears the most complete freely available source, we settled on the GeoNames data.
An awk script can quickly be crafted to use associative arrays to build a location hierarchy from the component datasets. To some extent this can omit incorrect admin region references (e.g. Winlaton in Gateshead is not in admin3 Stanley;  Stanley is in County Durham, not part of Gateshead – the abolishing of many civil parishes over time has presumably been the cause of many funnies in the UK admin3 regions).
When used to populate a Linux ext4 filesystem directory structure, this gives us over 4 million directories and consumes over 16GB of disk space (each directory, even if empty, reserves a minimum of 4096 bytes).
Which does slow down backups… And even deletions (N.B.
mkdir empty; rsync -Prd --delete empty/ regions/
is faster than
rm -rf regions ).

A Better Method, using data from OpenStreetMap

At least for Britain and Ireland (December 2021)

See https://osm.org and https://github.com/osmnames/osmnames

Our steps were:

  1. Download a 1.6GB current OSM dump using wget https://download.geofabrik.de/europe/britain-and-ireland-latest.osm.pbf
  2. osmnames/import_osm/mapping.yml (which controls the imposm stage) needs amending to remove mappings from housenumber (leaving polygon, point, relation, linestring)
    We also have to add state_district and county as a mapping place type.

    And osmnames/prepare_data/set_place_ranks.sql needs amending to keep all the admin levels:

    WHEN type IN ('continent', 'sea', 'ocean') THEN 2
    WHEN type IN ('country') THEN 4
    WHEN type IN ('country_region') THEN 2 * COALESCE(admin_level,3)
    WHEN type IN ('state') THEN 2 * COALESCE(admin_level,4)
    WHEN type IN ('state_district') THEN 2 * COALESCE(admin_level,5)
    WHEN type IN ('county') THEN 2 * COALESCE(admin_level,6)
    WHEN type IN ('non_metropolitan_district') THEN 2 * COALESCE(admin_level,7)
    WHEN type IN ('city') THEN 15
    WHEN type IN ('water', 'desert') THEN 16
    WHEN type IN ('boundary') THEN 17
    WHEN type IN ('island', 'bay', 'river') THEN 2 * COALESCE(admin_level,8.5)
    WHEN type IN ('region', 'peak', 'volcano') THEN 2 * COALESCE(admin_level,9)
    WHEN type IN ('town','civil_parish') THEN 2 * COALESCE(admin_level,9)
    WHEN type IN ('administrative') THEN 2* COALESCE(admin_level,15)
    WHEN type IN ('village','hamlet','municipality','district','city_district','unincorporated_area','borough', 'aerodrome') THEN 2 * COALESCE(admin_level,10.5)
    WHEN type IN ('suburb','subdivision','isolated_dwelling','farm','locality','islet','mountain_pass','hill') THEN 2 * COALESCE(admin_level,11)
    WHEN type IN ('neighbourhood', 'residential','reservoir','stream') THEN 2 * COALESCE(admin_level,11.5)
    WHEN type IN ('motorway','trunk','primary','secondary','tertiary','unclassified','residential','road','living_street','raceway','construction','track','crossing','riverbank','canal', 'station') THEN 2 * COALESCE(admin_level,13)
    WHEN type IN ('motorway_link','trunk_link','primary_link','secondary_link','tertiary_link','service','path','cycleway','steps','bridleway','footway','corridor','pedestrian') THEN 2 * COALESCE(admin_level,13.5)
    WHEN type IN ('houses') THEN 2 * COALESCE(admin_level,14)
    WHEN type IN (NULL) THEN 2 * COALESCE(admin_level,9.5)
    ELSE 30

    Use a docker container (see https://osmnames.readthedocs.io/en/latest/getting_started.html) to generate a 62MB TSV file of place names with hierarchies. 
    We sped things up with some Postgres tuning and by  holding the database in RAM. The run took about 20 minutes on a Ubuntu 20.04 system with a 6-core Intel  I5-10400 and used 48GB of RAM. 

    Also osmnames/export_osmnames/functions.sql needs to recognise more types as a place:
    'country_region','state_district','county','non_metropolitan_district',
    'region','civil_parish','municipality','district', 'city_district',
    'unincorporated_area','subdivision','locality','islet','boundary',
    And we need to separate lines of the displayname (full address field) with something other than commas (we used the field-mark “pipe” symbol) so as to avoid issues with address lines with embedded commas:
    retval.displayName := retval.displayName || '| ' || current_name;

    Comment out the lines within osmnames/prepare_data/set_polygon_types.sql (which applies libpostal rules, unfortunately turning some village and town administrative boundaries into cities, breaking the place hierarchies).

    (https://nominatim.openstreetmap.org/ui/search.html is very helpful explaining how OSMNames is calculating the full address for any given node or region, although it uses Nominatim so the rules are not identical)

  3. Filter down the TSV  to remove unwanted items (farms etc.) – we used an awk script:
    BEGIN {FS=OFS="\t"}
    NR == 1 || ( ($16 == "gb" || $16 == "ie" || $16 == "im" || $16 == "gg" || $16 == "je") && ( ( $3 == "node" || $3 == "relation") && ($5 == "place" || ($5 == "multiple" && $6 != "isolated_dwelling" && $6 != "farm" )))) {
    print $1, $3,$4,$5,$6,$7,$8, $12,$13,$14, $15,$16,$17,$18,$19,$20,$21,$22,$23
    }
  4. Install a local copy of dbpedia (a virtuoso sparql endpoint)
    https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git
    Load dbpedia data using COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/latest-core
    (there’s a fair bit of detail omitted from the above, including installing the Docker engine https://docs.docker.com/engine/install/ubuntu/ https://docs.docker.com/compose/install/ )
  5. Install a local copy of wikipedia (using kiwix)
    sudo add-apt-repository ppa:kiwixteam/release
    sudo apt install kiwix-tools
    wget https://www.mirrorservice.org/sites/download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2021-03.zim
    kiwix-serve --port=8088 --daemon /kiwix/wikipedia_en_all_maxi_2021-03.zim
    Access via a URL such as http://ourserver:8088/wikipedia_en_all_maxi_2021-03/A/Newcastle_upon_Tyne
  6. Build a directory structure to reflect the place hierarchy in the TSV (we used python code which extracts the displayname field, driven from gnu parallel to utilise all CPU cores – it queries the local dbpedia to get place attributes and the local wikipedia to grab a PDF page)
  7. Relocate 100 or so places and move misplaced top-level entries to holding areas using a hand-crafted shellscript.