Sitemapper Version 1.008
========================

Description
-----------

sitemapper.pl is a simple Perl script which generates an HTML site map from a
given URL. It does this by traversing the site: getting the home page,
extracting links from it, getting all the pages linked, and so on.

The default sitemap generated is an HTML bulleted list. The first level of the
list is the home page; the next level contains all the pages linked from the
home page; the level after that contains all the pages linked from each of
these pages, and so on. If a page is linked from more than one page, it is
shown in the "highest" place in the tree it is linked from.

Alternative sitemap formats are:

* a dynamic HTML version (see below), which generates a collapsible folding
  tree
* a text version, which generates a simple formatted text file
* an XML graph version, which prints out all the URLs and links in the site
  in an XML format

sitemapper.pl should correctly deal with framesets, client side image maps,
and tags. It ignores all "off site" links - i.e. all absolute URLs that do not
start with the original "base" URL of the home page.

Modules
-------

sitemapper.pl includes two modules that it requires in its distribution:

    WWW::Sitemap
    LWP::AuthenAgent

WWW::Sitemap is the module that is used to generate the sitemap structure from
which the various output formats are generated. The interface provides access
to the list of URLs for a site, and to the links from each of these URLs. It
also supports a traverse method, which allows the caller to specify a
callback, so that other formats of sitemap can be generated, or other sitemap
related functionality implemented. See the documentation of this module for
more details.

LWP::AuthenAgent is a simple subclass of the LWP::UserAgent module, which
allows requests to be made for URLs that require authentication, by prompting
the user to type the username / password information for the relevant realm.
This information is stored in the LWP::AuthenAgent object, so that repeated
requests to the same realm can be made without re-typing the authentication
details (a bit like a web browser, in fact). Terminal (tty) echo is switched
off while the password is typed.

Installation
------------

Just the basic Makefile.PL stuff; i.e.:

    > perl Makefile.PL
    > make
    > make test
    > make install

Usage
-----

To use sitemapper.pl, just type:

    ./sitemapper.pl -url http://www.mysite.com/

to get output to stdout, or

    ./sitemapper.pl -url http://www.mysite.com/ -output mysitemap.html

to output to a file. Type

    ./sitemapper.pl -help

to get full usage instructions, or

    ./sitemapper.pl -doc

to output the pod documentation.

Examples
--------

example.html contains an example of sitemapper.pl output, for the Canon
Research Europe Ltd Perl Pages (http://www.cre.canon.co.uk/perl/); i.e. by
running:

    ./sitemapper.pl -o example.html -url http://www.cre.canon.co.uk/

example.js.html contains an example of a dynamic HTML version of the site map
for the CRE site. This is generated using Jef Pearlman's (jef@mit.edu)
JavaScript Tree class:

    http://developer.netscape.com/docs/examples/dynhtml/tree.html

Many thanks to Jef for allowing this to be distributed with sitemapper.pl!
example.js.html is generated by running:

    ./sitemapper.pl -o example.js.html -url http://www.cre.canon.co.uk/ -format js

example.xml contains the output from:

    ./sitemapper.pl -o example.xml -url http://www.cre.canon.co.uk/ -format xml

The XML format for this file is pretty ad hoc - probably not of interest to
anyone apart from me!

Finally, a plain text version can be generated using the -format text option;
for example:

    ./sitemapper.pl -o example.txt -url http://www.cre.canon.co.uk/ -format text

CPAN Modules
------------

sitemapper.pl uses the following CPAN modules, which need to be installed
before it will work:

    WWW::Robot
    HTML::Summary
    Digest::MD5
    Date::Format
    Getopt::Long
    HTML::Entities
    IO::File
    LWP::UserAgent
    URI::URL
    Term::ReadKey

See http://www.perl.com/CPAN/ for details of how to download / install these
modules.

Bugs
----

Please send any bugs / comments / suggestions to wrigley@cre.canon.co.uk
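
As a supplement to the CPAN Modules section: before running "perl Makefile.PL"
it can be useful to see which of the listed modules Perl can already load. The
following is only a minimal sketch, assuming a POSIX shell with perl on the
PATH. The function name check_modules is hypothetical and is not part of the
sitemapper.pl distribution.

```shell
#!/bin/sh
# check_modules (hypothetical helper, not shipped with sitemapper.pl):
# try to load each required CPAN module and report whether it is available.
check_modules() {
    missing=0
    for module in WWW::Robot HTML::Summary Digest::MD5 Date::Format \
                  Getopt::Long HTML::Entities IO::File LWP::UserAgent \
                  URI::URL Term::ReadKey
    do
        # "perl -MModule -e 1" exits non-zero if the module cannot be loaded.
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
            missing=$((missing + 1))
        fi
    done
    echo "$missing module(s) missing"
}

check_modules
```

Any module reported as MISSING would then need to be installed from CPAN
before sitemapper.pl will work.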