PROPAGANDA:

See http://www.math.uio.no/~janl/w3mir/ for propaganda.

--------------------------------------------------------------------------
FAQS:

Q: W3mir takes a long time between each document it fetches.
A: Yes, it's being nice to the server, sleeping 30 seconds between each
   connection.  See the -p switch.

Q: Where can I get w3mir?
A: http://www.math.uio.no/~janl/w3mir/

Q: Are there any mailing lists?
A: Yes, see below.

Q: Should I subscribe to any of the mailing lists?
A: Yes.  If you use w3mir at all you should subscribe to
   w3mir-info@usit.uio.no; send e-mail to janl@math.uio.no to be
   subscribed.

Q: I found a bug!
A: See below.

Q: Does it handle CGIs?
A: In a manner of speaking.  The default is to just execute the CGI and
   save the result.  Alternatively you can let all references to CGIs
   be ignored, so that they point back to the original site.  This
   requires use of a config file and directives like:

        Ignore: *.cgi
        Ignore: *-cgi

   (See the example config sketch after this FAQ.)

Q: Does it handle imagemaps?
A: It does handle client-side imagemaps, but not server-side imagemaps.
   Server-side imagemaps can be handled like CGIs, so that they point
   back to the original site:

        Ignore: *.map
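To make the two answers above concrete, here is a minimal sketch of
what a config file using these rules might look like.  Only the Ignore:
lines come from this FAQ; the URL: line (start URL and local directory)
is an assumption about the surrounding syntax, so check the w3mir
documentation for the exact directives your version supports.

        # Hypothetical config file sketch.  Only the Ignore: lines are
        # documented in the FAQ above; the URL: line is assumed.
        URL:    http://www.example.com/ example-mirror
        Ignore: *.cgi
        Ignore: *-cgi
        Ignore: *.map

With rules like these the CGI output and the server-side imagemaps are
not fetched; references to them are left pointing at the original site.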
--------------------------------------------------------------------------
BUGS:

- None :-)

Please see below for how to report bugs.

--------------------------------------------------------------------------
FEATURES (NOT bugs):

- URLs with two slashes ('//') in the path component do not work as
  some might expect.  According to my reading of the HTTP and URL specs
  it is an illegal construct, which is a Good Thing, because I don't
  know how to handle it if it's legal.

- If you start at http://foo/bar/, index.html might be fetched twice.
  We don't know if the server serves Welcome.html or index.html when a
  directory is requested; w3mir assumes index.html though.

--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:

Please send bug reports to w3mir-core@usit.uio.no, and include the URL
and command line that triggered the bug.

Ideas (but please see the todo lists further down first), questions
about usage, general discussion and other related talk go to
w3mir-info@usit.uio.no.

To subscribe to these lists, e-mail janl@math.uio.no.  The w3mir-core
list is intended for w3mir hackers only.

--------------------------------------------------------------------------
COPYRIGHTS:

w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free, but they are
copyrighted by the various hackers involved.  If you want to copy, hack
or distribute w3mir you can do so provided you comply with the
'Artistic License' enclosed in the w3mir distribution in the file named
Artistic.

--------------------------------------------------------------------------
CREDITS:

- Oscar Nierstrasz: Wrote htget.
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
  contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
  everything.
- Chris Szurgot: Adapted w3mir to win32, good ideas and code
  contributions, debugging, and criticism.
- Ed Jordan: Patches and debugging.
- Rik Faith: Uses w3mir extensively, not shy about complaining,
  commenting and suggesting.
- The libwww-perl author(s), who made adding some new features
  ridiculously easy.

--------------------------------------------------------------------------
TODO LIST:

Currently I'm preparing for version 1 of w3mir.

* TODO for version 1:

- Documentation.
- A max depth switch; implement with an ignore rule?
- A -B switch to batch-get a series of URLs (given on the command
  line).  The files will _not_ be processed.
- A -I switch to combine with -B to read URLs from standard input.
- -B combined with -r should suppress any processing other than URL
  listing.
- Re-introduce -abs.
- A way, in the config file, to queue more than one 'root' document,
  say with 'Quene:' and 'Also-Quene:', the latter working like 'Also:'
  combined with 'Quene:'.
- Fix bugs discovered.
- Ability to dump the referers list to a file, needed for the program
  below.
- Ability to dump the redirects encountered to a file, also needed for
  the program below.

This will be implemented in a separate program, I think:

- Fix up _all_ links to documents that are redirected somewhere else.
  This will make a mirror of a document scope that contains redirects
  browsable.
- Add a switch to specify the name of the file served when a directory
  is requested.  Two widely used names are 'index.html' (originated in
  NCSA HTTPD I think) and 'Welcome.html' (the CERN HTTPD way).  Should
  be accompanied by code to change foo/ URLs to foo/index.html in the
  HTML docs, so that they are browsable in filesystems (i.e. on
  CD-ROM).  (A rough sketch of such a rewrite follows after the todo
  lists.)

* TODO, after version 1:

Some of these are speculative, some others are very useful.

- Integrate with CVS or RCS (or another version control system) to
  make the retriever able to reproduce the mirrored site for any given
  date.
- Ability to check for the existence of documents outside the scope,
  for link checking (through a 'second-order' list).
- Some text processing: adding and removing text/SGML comments when
  suitable options and tags are found.  Suggested by Ed Jordan.
- Feature to add the retrieval date in an HTML comment or in the
  document text, for documentation or other purposes.
- Implement a way to get only one version of documents provided in
  multiple versions, something like a multi-axis preference list to
  get only the most attractive version of the doc.  Example: if you're
  mirroring a site primarily to get at the papers, but the site has n
  versions of each paper (foo.ps.gz, foo.ps.Z, foo.dvi.gz, foo.dvi.Z,
  foo.tar.gz, foo.zip), you only need one of them.
- A reverse hash so we can find out what URL a file came from, and
  detect namespace collisions if desirable (could issue a warning and
  not get the doc...).  (See the second sketch after the todo lists.)
- Your suggestion here.

* TODO, HTTP related:

- Use keep-alive.  Then we should probably stop using 30-second pauses
  between document retrievals.
- HTTP/1.1?  HTTP/1.1 servers should do keep-alive even with 1.0
  requests.
- Separate queues for each server, interleave requests.
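The directory index todo item above could be backed by a small HTML
rewriting pass.  The following is only a rough sketch of the idea under
assumed names (dirify() is made up and is not part of w3mir or
htmlop.pm); it ignores most HTML corner cases:

        # Sketch: append an index file name to href/src values ending
        # in '/', so the mirrored tree is browsable straight from a
        # filesystem (e.g. on a CD-ROM).
        sub dirify {
            my ($html, $indexname) = @_;
            $indexname = 'index.html' unless defined $indexname;
            $html =~ s{((?:href|src)\s*=\s*")([^"]*/)(")}{$1$2$indexname$3}gi;
            return $html;
        }

        # Example:
        print dirify('<a href="papers/">papers</a>'), "\n";
        # prints: <a href="papers/index.html">papers</a>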
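Similarly, here is a minimal sketch of the 'reverse hash' idea from the
after-version-1 list: keep both a url-to-file and a file-to-url mapping
so we can tell which URL a local file came from, and warn (and possibly
skip the fetch) when two URLs would collide on the same local file
name.  All names are made up for illustration; this is not w3mir code.

        my (%file_of_url, %url_of_file);

        # Record a url/file pair; return false on a namespace collision
        # so the caller can warn and/or skip fetching the document.
        sub remember {
            my ($url, $file) = @_;
            if (defined $url_of_file{$file} && $url_of_file{$file} ne $url) {
                warn "collision: $url and $url_of_file{$file} ",
                     "both map to $file\n";
                return 0;
            }
            $file_of_url{$url} = $file;
            $url_of_file{$file} = $url;
            return 1;
        }

        # Example: on a case-insensitive filesystem (e.g. win32) these
        # two URLs could end up in the same local file, assuming the
        # caller lowercases local file names.
        remember('http://www.example.com/A.html', 'example.com/a.html');
        remember('http://www.example.com/a.html', 'example.com/a.html');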