I remember times when Wikipedia seemed a new thing (and the times with
no Wikipedia at all). Despite all the criticism, some of it reasonable,
Wikipedia has grown into a fascinating source of “common sense” data
about almost everything.
Being a programmer and an open data enthusiast, I’m also kinda fascinated
by how hard it is to access this data in an automated way.
Yes, Wikipedia has an API,
and a rather impressive one. Yet at some point the API leaves you with just
a flow of Wikitext in your hands: text that definitely HAS repeating and
structured parts and contains lots of data. Only it’s really hard
to extract it in a reasonable way (who said “regexps”?..)
DBPedia tries to address this problem by converting (part of) Wikipedia’s
information into structured RDF data, yet interacting with it is kinda
complicated, and the selection of data is very limited.
Several months ago I started a Wikipedia client and parser project
which should turn Wikipedia (and, by the way, any other MediaWiki wiki,
like Wikiquote or the Doctor Who wiki)
into a rich data source, available for information extraction.
On that early May day, the amount of work the project needed
seemed like something between a weekend and two… and here we are, on August 18.
Now we’re talking!
Meet Infoboxer
As I’ve already said, Infoboxer
is a MediaWiki client and parser, and it’s as simple as this:
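A rough sketch of the basic call (the exact output depends on the article’s current revision; the values shown reflect August 2015):

```ruby
require 'infoboxer'

# Fetch the "Argentina" article from Wikipedia and read two infobox parameters
Infoboxer.wp.get('Argentina').
  infobox.fetch('leader_title1', 'leader_name1').
  map(&:text)
# => ["President", "Cristina Fernández de Kirchner"]
```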
What’s going on here?
Infoboxer fetches the page “Argentina” via the Wikipedia API;
the parameter names we use here (leader_name1 and leader_title1)
can be seen in the page source
or inspected on the fly:
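Something along these lines (a sketch; variables here is the template node’s list of parameters, and the exact names returned depend on the article):

```ruby
require 'infoboxer'

# List the names of all parameters the Argentina infobox defines,
# including "leader_title1" and "leader_name1"
Infoboxer.wp.get('Argentina').infobox.variables.map(&:name)
```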
Infoboxer, despite the name, is not only about infoboxes! You get the
full page tree, parsed and easily navigable:
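A quick sketch of what navigation looks like (the section title and node types here are just examples, assuming the Argentina article has a “History” section):

```ruby
require 'infoboxer'

page = Infoboxer.wp.get('Argentina')

# Address sections by their headings and drill down into their contents
page.sections('History').paragraphs.count

# Look up any node type anywhere in the tree, filtering by attributes
page.lookup(:Wikilink, text: /Buenos Aires/)
```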
There are plenty more features, all of them (hopefully) well-structured
and thoroughly documented in the project’s wiki
and API docs.
Using all of it requires some preparation and study (for example,
in most cases you need to understand what MediaWiki
templates are), yet I
hope it can be a useful and usable tool.
What next?
For now, Infoboxer can extract a single page (or several of them)
from any MediaWiki-powered site, parse them, and provide you with an
easily navigable parse tree. You can also follow links between pages with
great ease, like:
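For example, something like this (a sketch: follow fetches the page a wikilink points to, and population_total is just my guess at the relevant parameter name in the Buenos Aires infobox):

```ruby
require 'infoboxer'

# Jump from the Argentina article to the Buenos Aires article via a wikilink,
# then read a parameter from the target page's infobox
Infoboxer.wp.get('Argentina').
  lookup(:Wikilink, text: /Buenos Aires/).
  first.follow.
  infobox.fetch('population_total').text # parameter name assumed for illustration
```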
Yet many of the data-extraction capabilities the API provides are still
lacking. In the nearest versions, Infoboxer will move towards things like
“pages from some category”, “pages matching a search request”, and other
list-of-pages functionality. Ideally, things like “a list of capitals
of all world countries, with their mayors and populations” should take
as few Infoboxer statements as imaginable.
There’s plenty of room for further enhancement, cleanup, and experiments.
That’s exciting.