Parse-MediaWikiDump

Parse::MediaWikiDump is a collection of classes for processing various
MediaWiki dump files, such as those at
http://download.wikimedia.org/wikipedia/en/. The package requires
XML::Parser. With this software, getting at the information in supported
dump files is nearly trivial.

Currently the following dump files are supported:

    * Current page dumps for all languages
    * Current links dumps for all languages

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

LIMITATIONS

Parse::MediaWikiDump currently cannot handle full page dumps (dumps in
which a page has more than one revision). When it encounters one,
Parse::MediaWikiDump aborts processing of the archive.

Parse::MediaWikiDump is not as fast as it could be, but it is faster than
using most other XML parsing frameworks. The parser could stand to be
rewritten to be faster and to handle full page dumps.

EXAMPLE

Extract the text for a given article from the given dump file:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    my $file  = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
    my $title = shift(@ARGV) or die "must specify an article title";
    my $dump  = Parse::MediaWikiDump::Pages->new($file);

    binmode(STDOUT, ':utf8');
    binmode(STDERR, ':utf8');

    # this is the only currently known value, but there could be more in
    # the future
    if ($dump->case ne 'first-letter') {
        die "unable to handle any case setting besides 'first-letter'";
    }

    # enforce the MediaWiki case rules
    $title = case_fixer($title);

    # iterate over the entire dump file, article by article
    while (my $page = $dump->next) {
        if ($page->title eq $title) {
            print STDERR "Located text for $title\n";
            # text returns a reference to the article text
            my $text = $page->text;
            print $$text;
            exit 0;
        }
    }

    print STDERR "Unable to find article text for $title\n";
    exit 1;

    # removes any case sensitivity from the very first letter of the title
    # but not from the optional namespace name
    sub case_fixer {
        my $title = shift;

        # check for namespace
        if ($title =~ /^(.+?):(.+)/) {
            $title = $1 . ':' . ucfirst($2);
        } else {
            $title = ucfirst($title);
        }

        return $title;
    }

COPYRIGHT & LICENSE

Copyright 2005 Tyler Riddle, all rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
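
A note on the case_fixer routine in the example: it treats everything before the first colon as a namespace, so a title such as "star Wars: a new hope" keeps its lowercase first letter. The following is a minimal sketch of one way around that, not part of Parse::MediaWikiDump. The names case_fixer_ns and %KNOWN_NAMESPACES are hypothetical, and the namespace list is an illustrative assumption; a real wiki defines its own set. As in the original routine, the namespace prefix itself is left exactly as typed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, illustrative namespace list (lowercased for lookup).
# This is an assumption; the real set depends on the wiki being processed.
my %KNOWN_NAMESPACES = map { lc($_) => 1 } qw(Talk User Category Template Help);

# Uppercase the first letter of the page name. Text before the first colon
# counts as a namespace only if it appears in %KNOWN_NAMESPACES; otherwise
# the colon is treated as part of an ordinary title.
sub case_fixer_ns {
    my $title = shift;

    if ($title =~ /^([^:]+):(.+)/s) {
        my ($prefix, $rest) = ($1, $2);
        if ($KNOWN_NAMESPACES{lc $prefix}) {
            # namespace prefix is kept as typed, like the original case_fixer
            return $prefix . ':' . ucfirst($rest);
        }
    }

    return ucfirst($title);
}

print case_fixer_ns($_), "\n"
    for ('perl', 'talk:perl', 'star Wars: a new hope');
```

Run as written, this prints "Perl", "talk:Perl" and "Star Wars: a new hope", one per line. Because the prefix is not rewritten, a lookup for "talk:perl" will still miss a page stored as "Talk:Perl"; canonicalizing the prefix would be a further step beyond what the original example does.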