NAME
    WWW::Leech::Walker - small web content grabbing framework

SYNOPSIS
    use WWW::Leech::Walker;

    my $walker = new WWW::Leech::Walker({
        ua     => new LWP::UserAgent(),
        url    => 'http://example.tld',
        parser => $www_leech_parser_params,
        state  => {},
        logger => sub { print shift() },
        filter => sub {
            my $urls = shift;
            my $walker_obj = shift;

            # ... filter urls

            return $urls;
        },
        processor => sub {
            my $data = shift;
            my $walker_obj = shift;

            # ... process grabbed data
        }
    });

    $walker->leech();

DESCRIPTION
    WWW::Leech::Walker walks through a given website, parsing content and
    generating structured data. Its declarative interface makes Walker a
    small framework of sorts.

    This module is designed to extract data from sites with a particular
    structure: an index page (or any other page provided as the root)
    contains links to individual pages representing the items to be
    grabbed. The index page may also contain 'paging' links (e.g.
    http://example.tld/?page=2) which lead to pages with a similar
    structure. The closest example is a product category page with links
    to individual products and links to 'sub-pages'.

    All required parameters are set as constructor arguments. The other
    methods are used to start or stop the grabbing process and to run the
    logger (see below).

DETAILS
  new($params)
    $params must be a hashref providing all required data.

    ua
        An LWP-compatible user-agent object.

    url
        The starting url.

    parser
        Parameters for WWW::Leech::Parser.

    state
        Optional user-filled value. Walker does not use it directly; it is
        passed to the user callbacks instead. Defaults to an empty
        hashref.

    logger
        Optional logging callback. Whenever something happens, the walker
        runs this subroutine, passing it a message.

    filter
        Optional url-filtering callback. When the walker gets a list of
        item-page urls, it passes that list to this subroutine and expects
        a filtered list in return. An empty list is okay.

    processor
        This callback is launched after an individual item has been parsed
        and converted to a hashref. The hashref is passed to the processor
        to be saved or processed in some other way.

    next_page_link_post_process
        This optional callback allows the user to alter the next-page url.
        Usually these urls look like 'http://example.tld/list?page=2' and
        no changes are needed. Sometimes, however, such links are
        javascript calls like 'javascript:gotoPageNumber(2)'. The source
        url is passed as-is, before the walker absolutizes it. The walker
        also passes the current page url as a third argument; this may be
        useful for links like 'javascript:gotoNextPage()'. The walker
        expects this callback to return a fixed url.
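        For illustration, here is a minimal sketch of such a callback (not
        part of the module; the 'gotoPageNumber'/'gotoNextPage' patterns
        and the '?page=N' url scheme are assumptions about a hypothetical
        target site):

            next_page_link_post_process => sub {
                my ($link, $walker, $current_url) = @_;

                # 'javascript:gotoPageNumber(2)' carries the page number:
                # extract it and build a plain paging url
                if ($link =~ /gotoPageNumber\((\d+)\)/) {
                    return "http://example.tld/list?page=$1";
                }

                # 'javascript:gotoNextPage()' carries no number, so
                # derive it from the current page url
                if ($link =~ /gotoNextPage\(\)/
                    and $current_url =~ /[?&]page=(\d+)/) {
                    return 'http://example.tld/list?page=' . ($1 + 1);
                }

                # ordinary urls are returned unchanged
                return $link;
            },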
  leech()
    Starts the process.

  stop()
    Stops the process completely. By default the walker keeps working
    until it runs out of links. Some sites may contain zillions of pages
    while only the first million is required; this method lets you stop at
    some point. See the "CALLBACKS" section below.

    If the walker is restarted with the leech() method it runs as if it
    were newly created (the 'state' is preserved, though).

  log($message)
    Runs the 'logger' callback with $message as its argument.

CALLBACKS
    Walker passes callback-specific data as the first argument, itself as
    the second, and some additional data as the third, if any.

    When grabbing large sites the process should be stopped at some point
    (if you don't need all the data, of course). This example shows how to
    do it using the 'state' property and the stop() method:

        #....
        state  => { total_links_amount => 0 },
        filter => sub {
            my $links  = shift;
            my $walker = shift;

            if ($walker->{'state'}->{'total_links_amount'} > 1_000_000) {
                $walker->log("A million items grabbed. Enough.");
                $walker->stop();
                return [];
            }

            $walker->{'state'}->{'total_links_amount'} += scalar(@$links);
            return $links;
        }
        #....

AUTHOR
    Dmitry Selverstov
    CPAN ID: JAREDSPB
    jaredspb@cpan.org

COPYRIGHT
    This program is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.