NAME
    AnyEvent::Net::Curl::Queued - Moose wrapper for queued downloads via
    Net::Curl & AnyEvent

VERSION
    version 0.006

SYNOPSIS
        #!/usr/bin/env perl

        package CrawlApache;
        use common::sense;

        use HTML::LinkExtor;
        use Moose;

        extends 'AnyEvent::Net::Curl::Queued::Easy';

        after finish => sub {
            my ($self, $result) = @_;

            say $result . "\t" . $self->final_url;

            if (
                not $self->has_error
                and $self->getinfo('content_type') =~ m{^text/html}
            ) {
                my @links;

                HTML::LinkExtor->new(sub {
                    my ($tag, %links) = @_;
                    push @links,
                        grep { $_->scheme eq 'http' and $_->host eq 'localhost' }
                        values %links;
                }, $self->final_url)->parse(${$self->data});

                for my $link (@links) {
                    $self->queue->prepend(sub {
                        CrawlApache->new({ initial_url => $link });
                    });
                }
            }
        };

        no Moose;
        __PACKAGE__->meta->make_immutable;

        1;

        package main;
        use common::sense;

        use AnyEvent::Net::Curl::Queued;

        my $q = AnyEvent::Net::Curl::Queued->new;
        $q->append(sub {
            CrawlApache->new({ initial_url => 'http://localhost/manual/' })
        });
        $q->wait;

DESCRIPTION
    Efficient and flexible batch downloader with a straightforward
    interface:

    *   create a queue;

    *   append/prepend URLs;

    *   wait for downloads to end (retry on errors).

    Download init/finish/error handling is defined through Moose's method
    modifiers.

MOTIVATION
    I am very unhappy with the performance of LWP. It's almost perfect for
    properly handling HTTP headers, cookies & stuff, but it comes at the
    cost of *speed*. While this doesn't matter for single downloads, batch
    downloading becomes a real pain.

    When I download a large batch of documents, I don't care about cookies
    or headers; only content and proper redirection matter. And, as it is
    clearly an I/O-bound operation, I want to make as many parallel
    requests as possible.

    So, this is what CPAN offers to fulfill my needs:

    *   Net::Curl: Perl interface to the all-mighty libcurl, and
        well-documented (unlike WWW::Curl);

    *   AnyEvent: the DBI of event loops. Net::Curl also provides a nice
        and well-documented example of AnyEvent usage (03-multi-event.pl);

    *   MooseX::NonMoose: Net::Curl uses a pure-Perl object
        implementation, which is lightweight, but a bit messy for my
        Moose-based projects. MooseX::NonMoose patches this gap.

    AnyEvent::Net::Curl::Queued is a glue module to wrap it all together.
    It offers no callbacks and (almost) no default handlers. It's up to
    you to extend the base class AnyEvent::Net::Curl::Queued::Easy so it
    will actually download something and store it somewhere.
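    For instance, a minimal subclass that merely saves each response body
    to a local file could look roughly like the sketch below. This is an
    illustration only: "SaveToDisk" is a made-up package name, the file
    naming scheme is arbitrary, and passing "max"/"timeout" to the queue
    constructor is assumed to work the usual Moose way (both are
    documented attributes; see "ATTRIBUTES" below).

        #!/usr/bin/env perl

        package SaveToDisk;    # hypothetical example subclass
        use common::sense;

        use Moose;
        extends 'AnyEvent::Net::Curl::Queued::Easy';

        after finish => sub {
            my ($self, $result) = @_;

            # nothing to store if the transfer failed
            return if $self->has_error;

            # derive a crude local file name from the final URL
            my $file = '' . $self->final_url;
            $file =~ s{[^\w.-]+}{_}g;

            open my $fh, '>', $file or die "can't write $file: $!";
            print {$fh} ${$self->data};    # data() holds a reference to the body
            close $fh;
        };

        no Moose;
        __PACKAGE__->meta->make_immutable;

        1;

        package main;
        use common::sense;

        use AnyEvent::Net::Curl::Queued;

        # max/timeout are the attributes documented below; setting them via
        # new() is assumed to follow the usual Moose constructor convention
        my $q = AnyEvent::Net::Curl::Queued->new({ max => 8, timeout => 30 });

        for my $url (qw(
            http://localhost/manual/index.html
            http://localhost/manual/install.html
        )) {
            # wrap the worker in a sub for lazy initialization (see append() below)
            $q->append(sub { SaveToDisk->new({ initial_url => $url }) });
        }

        $q->wait;

    Apart from writing the file, this is the same pattern as the crawler
    in the SYNOPSIS, minus the re-queuing of extracted links.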
OVERHEAD
    Obviously, the bottleneck of any kind of download agent is the
    connection itself. However, socket handling and header parsing add a
    lot of overhead.

    The script eg/benchmark.pl compares AnyEvent::Net::Curl::Queued
    against several other download agents. Only
    AnyEvent::Net::Curl::Queued itself, AnyEvent::Curl::Multi and lftp
    support parallel connections; thus, forks are used to reproduce the
    same behaviour for the remaining agents. The download target is a
    local copy of the Apache documentation.

                                    URL/second  LWP::UserAgent  HTTP::Tiny  HTTP::Lite  AnyEvent::Net::Curl::Queued  AnyEvent::Curl::Multi   lftp   wget
        LWP::UserAgent                    23.2              --        -93%        -95%                         -99%                   -99%   -99%  -100%
        HTTP::Tiny                       348.6           1404%          --        -19%                         -79%                   -80%   -86%   -97%
        HTTP::Lite                       429.3           1753%         23%          --                         -74%                   -76%   -83%   -97%
        AnyEvent::Net::Curl::Queued     1648.1           7023%        373%        284%                           --                    -7%   -34%   -88%
        AnyEvent::Curl::Multi           1769.8           7544%        408%        312%                           7%                     --   -29%   -87%
        lftp                            2500.9          10711%        619%        483%                          52%                    41%     --   -81%
        wget                           13253.5          57145%       3705%       2989%                         704%                   649%   429%     --

    AnyEvent::Curl::Multi does have less overhead, but at the cost of a
    very primitive queue manager: there are no retries, and large queues
    waste too much RAM due to the lack of lazy initialization.

ATTRIBUTES
  cv
    AnyEvent condition variable. Initialized automatically, unless you
    specify your own.

  max
    Maximum number of parallel connections (default: 4; minimum value: 1).

  multi
    Net::Curl::Multi instance.

  queue
    "ArrayRef" to the queue. Has the following helper methods:

    *   queue_push: append an item at the end of the queue;

    *   queue_unshift: prepend an item at the top of the queue;

    *   dequeue: shift an item from the top of the queue;

    *   count: number of items in the queue.

  share
    Net::Curl::Share instance.

  stats
    AnyEvent::Net::Curl::Queued::Stats instance.

  timeout
    Timeout (default: 10 seconds).

  unique
    "HashRef" used to store unique request identifiers, preventing
    repeated accesses.

METHODS
  start()
    Populate empty request slots with workers from the queue.

  empty()
    Check if there are active requests or requests in the queue.

  add($worker)
    Activate a worker.

  append($worker)
    Put the worker (an instance of AnyEvent::Net::Curl::Queued::Easy) at
    the end of the queue. For lazy initialization, wrap the worker in a
    "sub { ... }", the same way you do with the Moose
    "default => sub { ... }":

        $queue->append(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });

  prepend($worker)
    Put the worker (an instance of AnyEvent::Net::Curl::Queued::Easy) at
    the beginning of the queue. For lazy initialization, wrap the worker
    in a "sub { ... }", the same way you do with the Moose
    "default => sub { ... }":

        $queue->prepend(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });

  wait()
    Shortcut to "$queue->cv->recv".

SEE ALSO
    *   AnyEvent

    *   Moose

    *   Net::Curl

    *   WWW::Curl

    *   AnyEvent::Curl::Multi

AUTHOR
    Stanislaw Pusep

COPYRIGHT AND LICENSE
    This software is copyright (c) 2011 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.