                   =================================
                     Script "gen_tree" Version 1.0
                   =================================

                     -----------------------------
                         Script for Perl 5.002
                     -----------------------------
           (should work with later versions of Perl as well)


Legal stuff:
------------

Copyright (c) 1996 by Steffen Beyer. All rights reserved.
This package is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.


Requirements:
-------------

Perl version 5.002 or higher.

Compatibility of your web pages with the Apache HTTP server
(concerning the syntax of server side includes and server side image maps).


What does it do:
----------------

This script scans the tree (better: the directed graph) of HTML pages of a
web site. (It's not always a tree because cycles are possible!)

It starts at the home page of that site (called the "root page" here) and
follows all hyperlinks in a recursive descent (breadth first).

(You can also scan just a subtree of your web site if you want.)

Since it scans files in the file system of the host bearing the web site,
it is confined to pages lying physically on one host (!).

The web server (HTTP daemon) of the web site is NOT used at all (!).

Cycles are recognized through unique identification of each page by the
device and inode numbers of its corresponding file.

Therefore, this script is confined to UNIX hosts or hosts where the device
and inode numbers returned by "stat" serve the same purpose as with UNIX.

One could lift this latter restriction by using checksums for identification
instead. Checksums are not 100% reliable, however.

When scanning of the web site is complete, an HTML page is generated which
lists all the pages found, in the form of one hyperlink to each of them.

The tree structure of the web site is reflected in this page by the
indentation of these hyperlinks.

The text which is displayed in these hyperlinks is extracted from the
... tags inside the corresponding page.
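
The cycle detection described above can be illustrated with a small
stand-alone sketch. It is NOT the code of "gen_tree" itself: the subroutine
name "scan_site", the callback that returns the links of a page, and the
file names are all assumptions made for this example. It only shows the
idea of visiting pages breadth first and recognizing a page that has been
seen before by the device and inode numbers returned by "stat".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);

# Hypothetical helper (not part of gen_tree): visit all files reachable
# from $root, breadth first, and return a list of [file, depth] pairs.
# $links_of is a code ref that returns the files a given file links to.
sub scan_site {
    my ($root, $links_of) = @_;
    my %seen;                       # "dev:inode" of every page visited so far
    my @order;                      # pages in the order they were first reached
    my @queue = ([$root, 0]);       # FIFO queue gives breadth-first order

    while (my $item = shift @queue) {
        my ($file, $depth) = @$item;
        my ($dev, $ino) = stat($file) or next;   # skip links to missing files
        next if $seen{"$dev:$ino"}++;            # same file again: cycle or alias
        push @order, [$file, $depth];
        push @queue, map { [$_, $depth + 1] } $links_of->($file);
    }
    return @order;
}

# Demonstration on three empty files in a temporary directory. The link
# structure is faked with a hash so that no HTML parsing is needed here.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(index.html a.html b.html)) {
    open(my $fh, '>', "$dir/$name") or die "cannot create $name: $!";
    close($fh);
}
my %links = (
    "$dir/index.html" => ["$dir/a.html", "$dir/b.html"],
    "$dir/a.html"     => ["$dir/./index.html"],   # the root under another name
    "$dir/b.html"     => ["$dir/a.html", "$dir/missing.html"],
);
my @pages = scan_site("$dir/index.html", sub { @{ $links{$_[0]} || [] } });

# Print one line per page, indented by its depth below the root page.
print '  ' x $_->[1], basename($_->[0]), "\n" for @pages;
```

Note that "a.html" refers back to the root page as "./index.html": the
path differs, but device and inode numbers are the same, so the page is
not visited a second time. Comparing path names alone would miss this.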


Supported features:
-------------------

This script is capable of executing server side includes and of analyzing
server side image maps (client side image maps wouldn't be very hard to add).

Their syntax must be compatible with the Apache HTTP server.

This way, no important hyperlinks are missed. (Many home pages consist of
an image map and nothing else!)

It is also able to analyze CGI scripts simply by calling them and analyzing
their output. (Therefore, no HTTP server is needed!)

Passing variable parameters to CGI scripts is not supported, however.
Passing constants to all CGI scripts via environment variables is possible.

(Passing variable parameters (like query strings) is problematic
conceptually: imagine you get back a list (a possibly quite individual list
at that) of hyperlinks from a full text search CGI script on your web site!)

While the web site is being scanned, a detailed log file is written.

Most of the time, it's a very good idea to read it because it lets you
discover flaws in your web site that often go unnoticed otherwise!

The files generated by this script (log file and output file) are never
overwritten: instead, older versions are archived by appending an ever
increasing number to their file names.

This way, you can always go back to a previous state if anything bad should
ever happen.


How to use it:
--------------

Simply install this script wherever you like.

Although the script is quite fast (about 7 seconds on a web site with about
70 pages on a 486 66 MHz PC with FreeBSD), it's probably best to run this
script once a night (as a "cron" job) or manually whenever you add or remove
pages on your web site.

Why make the users of your web site wait by using this script as a CGI
script when they are in need of quick help and orientation?!

This is also the reason why the page which is generated by this script
doesn't use any graphics: it's intended to give your users assistance
when they need it, in the fastest way possible!
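
The archiving of older log and output files mentioned under "Supported
features" could work roughly as in the following sketch. This is an
illustration only, not code taken from "gen_tree": the subroutine name
"archive_old_version", the file name "gen_tree.log" and the ".1", ".2", ...
suffix scheme are assumptions, since this document does not state the
exact naming the script uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of gen_tree): if $file already exists,
# rename it to "$file.N", where N is the first number not yet in use, so
# that a new $file can be written without overwriting anything.
# Returns the archive name, or nothing if there was no file to archive.
sub archive_old_version {
    my ($file) = @_;
    return unless -e $file;                 # nothing to archive yet
    my $n = 1;
    $n++ while -e "$file.$n";               # find the first unused number
    rename($file, "$file.$n")
        or die "cannot rename $file to $file.$n: $!";
    return "$file.$n";
}

# Demonstration: three consecutive "runs" writing the same log file
# in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my $log = "$dir/gen_tree.log";              # assumed file name
my @archived;
for my $run (1 .. 3) {
    my $old = archive_old_version($log);
    push @archived, $old if defined $old;
    open(my $fh, '>', $log) or die "cannot write $log: $!";
    print {$fh} "run $run\n";
    close($fh);
}
print "$_\n" for @archived;                 # names of the archived versions
```

After the three runs, the log of the first run is kept under the suffix
".1", that of the second run under ".2", and the plain file name holds the
most recent log. Two runs starting at the same moment could still pick the
same number; for a script that runs once a night this hardly matters.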

The configuration of the script is quite simple, just follow the directions
in the script itself!

You'll probably need to change the two subroutines "url_to_file" and
"file_to_url" to reflect the file path conventions at your web site.

If you are not using the "mod_rewrite" module of the Apache HTTP server,
then the only thing these subroutines need to do is collapse "/xxxx/../"
into "/".

In that case, remove everything but the last four lines of code from these
two subroutines! I.e., leave the following intact:

    while (${$thispage} =~ m!/[^\./]+/\.\./!)
    {
        ${$thispage} =~ s!/[^\./]+/\.\./!/!g;
    }

You'll probably also want a different layout of the final page. Change the
two subroutines "html_header" and "html_footer" accordingly!

If your CGI scripts need more environment variables, add them in the
subroutine "setup_for_cgi"!

If you want to see a working example, direct your web browser to the
following site:

    http://www.sdm.de/

and click on "[hilfe]", or enter the corresponding URL directly:

    http://www.sdm.de/e/www/hilfe/

You can also download this script from there.


Version history:
----------------

This is version 1.0, the first public release.


Thanks:
-------

None yet. :-)


Final note:
-----------

Please report any comments, problems, suggestions, findings, complaints,
questions, insights, compliments or donations ;-) and so on to:

    sb@sdm.de (Steffen Beyer)

With kind regards,
--
    Steffen Beyer ________________________ C:\ONGRATLN.W95 _______________________
    mailto:sb@sdm.de         |s  |d &|m  |  software design & management GmbH&Co.KG
    phone: +49 89 63812-244  |   |   |   |                    Thomas-Dehler-Str. 27
    fax:   +49 89 63812-150  |   |   |   |                   81737 Munich, Germany.