NAME
    WWW::Sitemapper - Create text, html and xml sitemap by scanning a web
    site.

VERSION
    version 1.110340

SYNOPSIS
    WWW::Sitemapper is meant to be subclassed by user:

        package MyWebSite::Map;
        use Moose;

        use base qw( WWW::Sitemapper );

        # define attributes for your class
        has 'restricted_pages' => (
            is => 'ro',
            isa => 'ArrayRef[RegexpRef]',
            default => sub {
                [
                    qr{^/cat/login},
                    qr{^/cat/events},
                    qr{\?_search_string=},
                ]
            },
        );

        # configuration options for WWW::Robot
        sub _build_robot_config {
            my $self = shift;

            return {
                NAME => 'MyRobot',
                EMAIL => 'me@domain.tld',
            };
        }

        # you need to provide a follow-url-test hook in your subclass
        sub url_test : Hook('follow-url-test') {
            my $self = shift;
            my ($robot, $hook_name, $uri) = @_;

            my $url = $uri->path_query;

            if ( $self->site->host eq $uri->host ) {
                for my $re ( @{ $self->restricted_pages } ) {
                    if ( $url =~ /$re/ ) {
                        return 0;
                    }
                }

                return 1;
            }

            return 0;
        }

        # you can add your own hooks as well
        sub run_till_first_auto_save : Hook('continue-test') {
            my $self = shift;
            my ($robot) = @_;

            if ( $self->run_started_time + $self->auto_save < DateTime->now ) {
                return 0;
            }
            return 1;
        }


        # as this is your class feel free to define your own methods
        sub ping_google {
            my $self    = shift;

            my $ua = LWP::UserAgent;
            return $ua->get( 'http://www.google.com/webmasters/sitemaps/ping',
                sitemap => $self->site .'google-sitemap.xml.gz'
            );
        }

    and then

        package main;

        my $mapper = MyWebSite::Map->new(
            site => 'http://mywebsite.com/',
            status_storage => 'sitemap.data',
            auto_save => 10,
        );

        $mapper->run;


        open(HTML, ">sitemap.html") or die ("Cannot create sitemap.html: $!");
        print HTML $mapper->html_sitemap;
        close(HTML);

        my $xml_sitemap = $mapper->xml_sitemap(
            priority => '0.7',
            changefreq => 'weekly;
        );

        $xml_sitemap->write('google-sitemap.xml.gz');

        # call your own method
        $mapper->ping_google();

    and while mapper is still running take a peek what has been mapped so
    far

        my $mapper = MyWebSite::Map->new(
            site => 'http://mywebsite.com/',
            status_storage => 'sitemap.data',
        );

        $mapper->restore_state();

        print $mapper->txt_sitemap();

ATTRIBUTES
  site
    Home page of the website to be mapped.

    isa: "tURI" in WWW::Sitemapper::Types.

  tree
    Tree structure of the web site.

    isa: WWW::Sitemapper::Tree.

    Note: each page is mapped only once, so if multiple pages are linking to
    the same page only the first will be counted as parent.

    Note: beware of pages serving same content under different URLs (eg.
    using different query string parameters) as it may lead to circular
    references. Besides this search engines will punish you for so called
    "duplicate content". Use your subroutine with "Hook('follow-url-test')"
    to restrict access to those pages.

  robot_config
    WWW::Robot configuration options.

    isa: "HashRef".

    You need to define in your subclass builder method *_build_robot_config*
    which needs to return a hashref. Most important options are:

    *   EMAIL

        Your e-mail address - in case someone wishes to complain about the
        behaviour of your robot.

        mandatory.

    *   DELAY

        Delay between each request in minutes.

        Default: *1*

    For more details and other options please see "ROBOT_ATTRIBUTES" in
    WWW::Robot.

  status_storage
    Path of status storage file to be used for saving the result of web
    crawl. If defined Storable will be used to store the current state.

    isa: "Str".

  auto_save
    Auto save current status every N minutes (defaults to 0 - do not auto
    save).

    isa: "tDateTimeDuration" in WWW::Sitemapper::Types.

    Note: "status_storage" has to be defined.

  run_started_time
    Time when "run" method was called.

    isa: "tDateTime" in WWW::Sitemapper::Types.

  html_sitemap_template
    Template-Toolkit html sitemap template to be used by helper method
    "html_sitemap".

    isa: "Str".

    Can be overriden by definining "_build_html_sitemap_template" in your
    subclass.

    Parameter passed to the template is the main object (*$self*) named as
    *mapper*.

    Default value:

        <html>
        <head>
        <title>Sitemap for [% mapper.site.host %]</title>
        </head>
        <body>
        <ul>
        [%- INCLUDE branch node = mapper.tree -%]
        </ul>
        </body>
        </html>

        [%- BLOCK branch -%]
        <li><a href="[% node.loc %]">[% node.title || node.loc %]</a>
        [%     IF node.children.size -%]
        <ul>
        [%-
                    FOREACH child IN node.children;
                        INCLUDE branch node = child;
                    END;
        -%]
        </ul>
        [%     END -%]
        </li>
        [% END -%]

METHODS
  run
        print $mapper->run();

    Creates a WWW::Robot object and starts to map the website specified by
    "site".

    Scans your subclass for methods with ":Hook('name-of-the-hook')"
    attributes to be added to robot object.

    You need to define at least one subroutine with *follow-url-test* hook
    which will be used to decide if the page should be followed and added to
    sitemap.

        sub url_test : Hook('follow-url-test') {
            my $self = shift;
            my ($robot, $hook_name, $uri) = @_;

            my $should_follow = ...

            return $should_follow;
        }

    Please see "SUPPORTED_HOOKS" in WWW::Robot for full list of supported
    hooks.

    Note: you can name your subroutines however you want and add other
    attributes as well - WWW::Sitemapper looks only for "Hook(...)" ones.

  txt_sitemap
        print $mapper->txt_sitemap();

    Create plain text sitemap. Example output:

        * http://mywebsite.com/
          * http://mywebsite.com/page1.html
            * http://mywebsite.com/page11.html
            * http://mywebsite.com/page12.html
          * http://mywebsite.com/page2.html

    Accepts following parameters:

    with_id => 0|1
            print $mapper->txt_sitemap( with_id => 1 );

        Use id of each node instead of ***.

        Defaults to 0.

    with_title => 0|1
            print $mapper->txt_sitemap( with_title => 1 );

        Add node title after node location.

        Defaults to 0.

  html_sitemap
        print $mapper->html_sitemap(%TT_CONF);

    Create HTML sitemap using template defined in "html_sitemap_template".

    Allows to specify Template-Toolkit configuration options, see
    "CONFIGURATION_SUMMARY" in Template.

  xml_sitemap
        my $sitemap = $mapper->xml_sitemap();

        # print xml
        print $sitemap->as_xml->sprint;

        # write to file

        $sitemap->write('sitemap.xml');

    Create XML sitemap <http://www.sitemaps.org>. Returns WWW::Sitemap::XML
    object.

    Accepts following parameters:

    *   split_by

            my @sitemaps = $mapper->xml_sitemap(
                split_by => [
                    '^/doc',
                    '^/cat',
                    '^/ila',
                ],
            );

        Arrayref of regular expressions used to split the final sitemap
        based on the page location - "loc" in WWW::Sitemapper::Tree. If this
        option is supplied the "xml_sitemap" will return an array of
        WWW::Sitemap::XML objects plus additional one for any urls not
        matched by conditions provided.

        Note: the first matching condition is used.

        Note: schema and hostname are remove from node uri for condition
        matching.

        Note: keys could be regexp or strings.

    *   priority

            my $sitemap = $mapper->xml_sitemap(
                priority => 0.6,
            );

        or

            my $sitemap = $mapper->xml_sitemap(
                priority => {
                    '^/doc/' => '+0.2', # same as 0.7
                    '^/ila/' => 0.4,
                    '^/cat/' => 0.9,
                    '^/$' => 1,
                },
            );

        or

            my $sitemap = $mapper->xml_sitemap(
                priority => [
                    { '^/doc/' => '+0.2' },
                    { '^/ila/' => 0.3    },
                    { '^/cat/' => 0.9    },
                    { '\.pdf$' => 0.8    }, # all pdfs 0.8 and in /doc/ 1.0
                ],
            );

        If priority is a scalar value it will be used as a default for all
        pages.

        Supports *relative* values which will be added/subtracted to/from
        final priority.

        If it is a hashref or arrayref all conditions are checked. In case
        of *relative* values all matching ones are combined and in case of
        *absolute* ones the last one is used - use arrayref to *chain* your
        conditions.

        Final priority will be set to 0.0 if the calculated one is negative.

        Final priority will be set to 1.0 if the calculated one is higher
        then 1.

        Default priority is 0.5.

        Note: schema and hostname are remove from node uri for condition
        matching.

        Note: keys could be regexp or string objects.

    *   changefreq

            my $sitemap = $mapper->xml_sitemap(
                changefreq => 'daily',
            );

        or

            my $sitemap = $mapper->xml_sitemap(
                changefreq => {
                    '^/doc/' => 'weekly',
                    '^/ila/' => 'yearly'
                    '^/cat/' => 'daily',
                    '^/$' => 'always',
                },
            );

        or

            my $sitemap = $mapper->xml_sitemap(
                changefreq => [
                    { '^/doc/' => 'weekly' },
                    { '^/ila/' => 'yearly' },
                    { '^/cat/' => 'daily'  },
                    { '^/$' => 'always'    },
                    { '\.pdf$' => 'never'  }, # pdfs will never change
                ],
            );

        If changefreq is a scalar value it will be used as a default for all
        pages.

        If it is a hashref or arrayref all conditions are checked and the
        last matching one is used - use arrayref to *chain* your conditions.

        Valid values are:

        *   always

        *   hourly

        *   daily

        *   weekly

        *   monthly

        *   yearly

        *   never

        Default changefreq is 'weekly'.

        Note: schema and hostname are remove from node uri for condition
        matching.

        Note: keys could be regexp or string objects.

HOOKED METHODS
  restore_state
        $mapper->restore_state();

    Restore state from "status_storage" using "retrieve" in Storable.

    Loads into current object "tree" and internal state of web robot.

    Uses hook "restore-state" in WWW::Robot.

  save_state
        $mapper->save_state();

    Save into "status_storage" using "store" in Storable current content of
    "tree" and internal state of web robot.

    Uses hook "save-state" in WWW::Robot.

AUTHOR
    Alex J. G. Burzyński <ajgb@cpan.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2010 by Alex J. G. Burzyński
    <ajgb@cpan.org>.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.