Creating a Local Copy of Wikipedia

Whilst working on a project that required us to extract information from Wikipedia, we recognised the need to host our own local copy so as to avoid stressing the main Wikipedia servers with API calls.

Options

  • Kiwix – offline reader – no API
  • XOWA – dynamic HTML generation from a local XML database dump – no API
  • DBpedia – SPARQL endpoint – some of the data from Wikipedia, but not all
  • Dump loaded into a MediaWiki instance running on a pair of Docker containers (MySQL/MariaDB and Apache2/PHP) – the approach described below
  • Using pywikibot, pagefromfile.py and Nokogiri – not explored

MediaWiki instance running on a pair of Docker containers

Note: this was undertaken in July 2021, using Ubuntu 20.04.2, MediaWiki 1.36, Apache 2.4.38, PHP 7.4.21 and MariaDB 10.5.11.
Much trial and error showed that many of the examples found on various websites (including Wikipedia and MediaWiki themselves) rely on code and methods (e.g. MWDumper) which no longer work, due amongst other things to changes in the MediaWiki database schema.

Install Docker Engine:

(Ref: https://docs.docker.com/engine/install/ubuntu/ )
sudo apt install apt-transport-https ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker simon
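# log out and back in (or run: newgrp docker) so the group change takes effect before running docker without sudo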
docker run hello-world
sudo systemctl enable docker.service
sudo systemctl enable containerd.service

Install Docker Compose:

( Ref: https://docs.docker.com/compose/install/ )
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version

Create Apache and MariaDB containers for MediaWiki

mkdir /nn/wiki ; cd /nn/wiki

vi docker-compose.yml :

# Based on https://www.mediawiki.org/wiki/Docker/Hub
version: '3.2'
services:
  web:
    image: mediawiki
    ports:
      - 8080:80
    links:
      - database
    volumes:
      - images:/var/www/html/images
      #- ./LocalSettings.php:/var/www/html/LocalSettings.php
      - ./data/dumps:/datadumps:ro
  database:
    image: mariadb
    command: --max-allowed-packet=256M
    environment:
      MYSQL_DATABASE: 'wikipedia'
      MYSQL_USER: 'wikipedia'
      MYSQL_PASSWORD: 'wikipedia'
      MYSQL_ROOT_PASSWORD: 'wikipedia'
    volumes:
      - database:/var/lib/mysql
volumes:
  database:
  images:

docker-compose up
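
To check that both containers came up, from a second terminal in /nn/wiki:

docker-compose ps   # the web and database services should show as Up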

Browse to http://serveraddress:8080/ (which is mapped to port 80 of the MediaWiki Apache container) and run the MediaWiki setup wizard, selecting:

Database host: database (note: do not accept the default value of localhost)
Database name: wikipedia
Database username: wikipedia
Database password: wikipedia
Administrator Account username: wikipedia
Administrator Account password: wikipediaw

Ensure “Ask me more questions” is selected, then on the next page:

Defaults will be OK, but under Extensions ensure the following are enabled:

Parser hooks – all
API – PageImages
Other – Gadgets, TextExtracts

Do not Enable Instant Commons (which would slow down imports to a crawl).

Copy the LocalSettings.php generated by the installer (downloaded via the browser) to /nn/wiki/
docker-compose down
Uncomment the LocalSettings.php volume line in docker-compose.yml
docker-compose up

Add the TemplateStyles extension into the MediaWiki container:
Download it from https://www.mediawiki.org/wiki/Special:ExtensionDistributor/TemplateStyles and place it in /nn/wiki/data/dumps/
Start a command-line session within the container and unpack the extension into place:
docker exec -it wiki_web_1 bash
cd /var/www/html/extensions
tar -xzf /datadumps/TemplateStyles-REL1_36-e548bf1.tar.gz

Enable the extension by adding wfLoadExtension( 'TemplateStyles' ); to /nn/wiki/LocalSettings.php, then run docker-compose restart (not docker-compose down/up, which would recreate the containers and lose the extension you just added).
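
For example, from /nn/wiki on the host:

echo "wfLoadExtension( 'TemplateStyles' );" >> LocalSettings.php
docker-compose restart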

At any point, if you need to clear out all containers and restart from scratch:
docker-compose rm -sv  # stop and remove containers
docker volume rm wiki_database wiki_images  # remove associated data volumes

Import Simple Wikipedia data into MediaWiki

The full Simple English Wikipedia has 191,402 content pages, but a total of 627,956 pages (incl. Talk, redirects…). We will work with an XML dump extract which omits the non-essential pages – simplewiki-latest-pages-articles.xml. Each template, infobox etc. is stored within MediaWiki as a page in its own right, which is why the ‘content only’ extract still contains more pages than the 191,402 reported.

This XML consists of a <mediawiki …> tag encapsulating a <siteinfo></siteinfo> block followed by thousands of <page>…</page> sections. We can therefore easily chop it into smaller chunks (each with identical <mediawiki><siteinfo> blocks at the top) for more manageable data loading, as sketched below.
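
As an illustration (not a step we ran for this load), a splitting sketch along these lines could be used; the 20,000-page chunk size and the chunk-NNNN.xml file names written to the current directory are arbitrary choices:

awk -v max=20000 '
  BEGIN           { inheader = 1 }
  /<page>/        { inheader = 0
                    if (pages == max) { print "</mediawiki>" > out; close(out); pages = 0 }
                    if (pages == 0)   { out = sprintf("chunk-%04d.xml", ++chunk); printf "%s", header > out }
                    pages++ }
  inheader        { header = header $0 "\n"; next }   # everything before the first <page> is the shared header
  /<\/mediawiki>/ { next }                            # drop the original closing tag
                  { print > out }
  END             { if (out) print "</mediawiki>" > out }
' simplewiki-latest-pages-articles.xml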

cd /nn/wiki/data/dumps/
wget https://dumps.wikimedia.your.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
pbzip2 -d simplewiki-latest-pages-articles.xml.bz2
grep -c "<page>" simplewiki-latest-pages-articles.xml 
      # returns no. of pages to be loaded (approx. 352,000)

docker exec -it wiki_web_1 bash
cd maintenance
php importDump.php --conf ../LocalSettings.php /datadumps/simplewiki-latest-pages-articles.xml --username-prefix=""
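
When it finishes, importDump.php suggests regenerating the recent-changes data; still inside the container's maintenance directory:

php rebuildrecentchanges.php --conf ../LocalSettings.php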

We saw the load start at about 14 pages per second, increasing to 25 per second, then ultimately (24 hours later) degrading to less than 3 per second.

This should result in a loaded Simple Wikipedia (albeit with some page content errors due to various missing functionality).

We can allow subsequent on-the-fly loading of images (from wikimedia.org) by editing /nn/wiki/LocalSettings.php and changing $wgUseInstantCommons = false; to true, then running docker-compose restart.
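
For example (assuming the installer wrote the setting exactly as shown), from /nn/wiki on the host:

sed -i 's/$wgUseInstantCommons = false;/$wgUseInstantCommons = true;/' LocalSettings.php
docker-compose restart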

To set the Main Page correctly, log in to http://serveraddress:8080/ as user “wikipedia” (password:”wikipediaw”), view the history (tab at top right), click on the date of the revision you wish to select, then click on change (tab at top right) and save changes.

Issues

(Ref: https://en.wikipedia.org/wiki/Wikipedia:Database_download )

Disk Space

(Even without images, revision history, talk pages etc…)

https://dumps.wikimedia.your.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
Simple wiki – 191,000 content pages, 191MB compressed dump, expands to 890MB – 352,000 pages to load into MediaWiki

https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
Full wiki – 6,338,000 content pages, 18.4GB compressed dump, expands to 80GB – 21,344,000 pages to load into MediaWiki

Time to load

(Even without images)

We used a 6-core (12-thread) Intel Core i5-10400 with 64GB DDR4 RAM, running Ubuntu 20.04.2 on an NVMe SSD.
Even so, the Simple Wikipedia took 24 hours to load. Given the continual slow-down experienced during the load, we can expect a full Wikipedia dump to take more than 6 months to load via this method.

The dump can be chopped up, but importDump.php is CPU-bound (the PHP process only uses a single core).
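
In principle the chopped-up chunks (e.g. the chunk-NNNN.xml files from the earlier sketch, placed in /nn/wiki/data/dumps/) could be fed to one importer per core. A hypothetical sketch only – we did not try this, and have not verified that concurrent importDump.php runs against the same database are safe:

docker exec -it wiki_web_1 bash
cd maintenance
ls /datadumps/chunk-*.xml | xargs -P 6 -I{} php importDump.php --conf ../LocalSettings.php {} --username-prefix=""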