Reading Wikipedia on N900

The N900 is a device with plenty of connectivity options and a very capable browser, which makes it a good Wikipedia reader out of the box. That stops being true when your connectivity is limited: you’re on top of a mountain, roaming, or have no data plan at all and there are no open Wi-Fi hotspots.

Since I got a great Christmas gift from Collabora I’ve been poking at a small toy application that could store and read Wikipedia articles offline. Since Wikipedia hosts a huge number of articles, and device capabilities are limited compared to a desktop PC, this posed an interesting (but not unsolvable) challenge for the weekend hack sessions.

The result is Mawire (Maemo Wikipedia Reader):

Mawire 0.1 - home screen

Mawire 0.1 - search results

Mawire 0.1 - Article view

The application

Having worked on bits and pieces in Maemo 5, I knew my way around the Maemo 5 SDK and some of the APIs, but the Developer’s Guide was still a great help. I’ve also perused examples, code (such as the browser launcher) and packaging from marnanel’s raeddit application.

The application is a lightweight reader, so the aim is not to display complete articles, but rather to show enough information for a quick check and to give the user a convenient way to find more on Wikipedia itself.

install mawire
At the moment you can download the application from here, or browse or download the source code from GitHub. I hope to upload it to maemo-extras soon. If you’re reading this from your N900, you can install mawire automatically by clicking on the install icon. Warning: this is an early development (alpha) version, only tested on my device so far, so proceed with caution and only if you know what you’re doing.

Since writing this blog post (but before I hit the Publish button), I’ve played around with portrait mode functionality, and released version 0.2 which has portrait mode support. If your keyboard is closed, mawire will switch to portrait mode. When you slide it open at any time (e.g. for typing search queries, or when you want to copy/paste part of the article), mawire will switch to landscape mode.

Data handling

Since Wikipedia (especially the English edition) has a huge number of articles, the amount of text in a complete copy is simply too large to use conveniently on the device. A recent enwiki dump contains more than 3 million articles and is almost 6GB of bzip2’d XML.

To minimise database size, and since I wasn’t trying to replicate complete Wikipedia functionality, I decided to strip the articles as much as possible. Almost all markup is removed (only bold and emphasis are preserved), and only the first few paragraphs are kept, up to the first heading. The idea is that the topic overview is usually given first, and each later section then expands on it (often with a complete article of its own).
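As a rough illustration of that stripping step, here is a minimal Python sketch. It is not the code shipped with mawire: the function names `lead_section` and `strip_markup`, the regular expressions, and the choice of `<b>`/`<i>` as the output form for bold and emphasis are all assumptions made for this example, and real wiki markup (templates, references, tables) needs far more handling than this.

```python
import re

# Hypothetical helpers for illustration only; not taken from the mawire source.

# A wiki heading is a line such as "== History ==".
HEADING = re.compile(r"^==+.*?==+[ \t]*$", re.MULTILINE)


def lead_section(wikitext):
    """Return everything before the first heading (the article overview)."""
    match = HEADING.search(wikitext)
    lead = wikitext[:match.start()] if match else wikitext
    return lead.strip()


def strip_markup(text):
    """Very rough markup stripping that keeps only bold and emphasis."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # '''bold''' first, then ''emphasis'' (assumed output form: <b> and <i>)
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text
```

Calling `strip_markup(lead_section(raw_wikitext))` on an article would then give the short overview text that gets stored.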

So the reader only includes the overview and provides a “Read more…” button that opens the article on Wikipedia proper. This means the app can be used not only as an offline reader, but also as a quick Wikipedia launcher.

The other problem is the number of articles and search performance. If the database contains up to a few million articles, searching through it sequentially is extremely slow. Unfortunately, the version of SQLite3 shipped on the N900 (and indeed in many Linux distros at the moment) doesn’t support fast full-text search (FTS3), which would be ideal for mawire.

So the application currently ships with its own copy of the SQLite library with the FTS3 module enabled. It’s installed in a private lib directory so it doesn’t clash with the OS version (similar to what Firefox does on Linux distros), and is only ever used by mawire.

The data itself was prepared by a Python program that:

  1. parses the XML dump (using expat),
  2. extracts, parses and strips Wikipedia markup (using a bunch of hacked regexps, as I haven’t found a suitable MediaWiki markup parser – this is the weakest part of the program and I hope to improve it in the future),
  3. filters articles we want to exclude (special pages, lists of things, too short articles),
  4. compresses the article text (using zlib) and finally stores it in an SQLite3 database (using SQLAlchemy).
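
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the actual program: it uses the standard-library `sqlite3` module instead of SQLAlchemy to stay self-contained, it skips the markup-stripping step, and the `import_dump` function, the `articles` table layout, the excluded title prefixes and the `min_length` threshold are all assumptions made for the example.

```python
import sqlite3
import xml.parsers.expat
import zlib

# Illustrative filter only; the real exclusion rules are not shown in this post.
EXCLUDED_PREFIXES = ("Wikipedia:", "Template:", "Category:", "File:", "List of")


def import_dump(xml_bytes, conn, min_length=20):
    """Parse a MediaWiki XML dump and store zlib-compressed article text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, body BLOB)")
    state = {"tag": None, "title": [], "text": []}

    def start(name, attrs):
        if name == "page":
            state["title"], state["text"] = [], []
        state["tag"] = name

    def chars(data):
        # expat may deliver the text of one element in several chunks
        if state["tag"] in ("title", "text"):
            state[state["tag"]].append(data)

    def end(name):
        state["tag"] = None
        if name != "page":
            return
        title = "".join(state["title"]).strip()
        text = "".join(state["text"]).strip()
        # Step 3: drop special pages, lists and very short articles
        if title.startswith(EXCLUDED_PREFIXES) or len(text) < min_length:
            return
        # Step 4: compress and store
        conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?)",
                     (title, zlib.compress(text.encode("utf-8"))))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)
    conn.commit()
```

For a real multi-gigabyte dump you would not load the whole file into memory; feeding a `bz2.open(...)` file object to `parser.ParseFile` is one way to stream it instead.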

The final step, building the FTS3 index, is manual and consists of two SQL statements. It is kept separate so that it can be run on a different machine, one whose SQLite has FTS3.
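
To give an idea of what building and using such an index can look like, here is a hedged sketch using Python’s `sqlite3` module. The table names `articles` and `article_index`, the decision to index titles only, and the helper names are assumptions for this example, not the statements mawire actually uses. It also needs an SQLite build with FTS3 enabled.

```python
import sqlite3


def build_fts_index(conn):
    """Build a full-text index over article titles with two SQL statements."""
    # 1. Create the FTS3 virtual table (hypothetical name: article_index).
    conn.execute("CREATE VIRTUAL TABLE article_index USING fts3(title)")
    # 2. Fill it from the existing table, reusing rowid as the FTS docid.
    conn.execute(
        "INSERT INTO article_index (docid, title) "
        "SELECT rowid, title FROM articles")
    conn.commit()


def search(conn, query):
    """Return the titles that match an FTS3 MATCH query, sorted by title."""
    rows = conn.execute(
        "SELECT title FROM article_index WHERE title MATCH ? ORDER BY title",
        (query,))
    return [row[0] for row in rows]
```

With the index in place, a query such as `search(conn, "ljub*")` is answered from the index rather than by scanning every row, which is the point of shipping an FTS3-enabled SQLite.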

The program is included in the mawire source code, so you can use it to create custom Wikimedia databases (or, indeed, a database for any MediaWiki-powered wiki).

install mawire-enwiki-small
A database of selected articles from the English Wikipedia (the 3000 most visited + featured + good + vital; about 14 thousand articles in total, 13.6MB). The database is installed in /opt so it doesn’t fill up your rootfs.

The complete English edition and several other major language editions are also available, but I haven’t created Maemo packages for them. You can download them directly, put them on your device or MMC card, and select them from the application menu.

2011-04-02 update: Moved the software repository and the databases to S3, and updated the links in this post accordingly.

2 comments

  1. There is also aarddict, which has the full text of all articles, but bigger files.
    http://aarddict.org/