NAME

Glynx - a download manager.

Download from http://www.ipct.pucrs.br/flavio/glynx/glynx-latest.pl


DESCRIPTION

Glynx makes a local image of a selected part of the internet.

It can be used to make download lists for use with other download managers, enabling a distributed download process.

It currently supports resume, retry, referer, user-agent, java, frames, distributed download (see --slave, --stop, --restart).

It partially supports redirect, javascript, multimedia, and authentication.

It does not support mirroring (checking file dates) or forms.

It has not been tested with ``https'' yet.

It should be better tested with ``ftp''.

Tested on Linux and NT.


SYNOPSIS

Do-everything at once:

 $progname.pl [options] <URL>

Save work to finish later:

 $progname.pl [options] --dump="dump-file" <URL>

Finish saved download:

 $progname.pl [options] "download-list-file"

Network mode (client/slave):

- Clients:

 $progname.pl [options] --dump="dump-file" <URL>

- Slaves (will wait until there is something to do):

 $progname.pl [options] --slave
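
For example, a two-machine setup might look like this (the host roles and
the dump-file name "job-1" are illustrative):

 # on the slave machine - wait for download-list files to appear:
 $progname.pl --slave

 # on the client machine - create a download-list file for the slave:
 $progname.pl --dump="job-1" http://www.site.com/index.htm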


HINTS

How to make a default configuration:

        Start the program with all command-line configurations, plus --cfg-save
        or:
        1 - start the program with --cfg-save
        2 - edit the glynx.ini file

--subst, --exclude and --loop use regular expressions.

   http://www.site.com/old.htm --subst=s/old/new/
   downloads: http://www.site.com/new.htm

   - Note: the substitution string MUST be made of "valid URL" characters

   --exclude=/\.gif/
   will not download ".gif" files

   - Note: Multiple --exclude are allowed:

   --exclude=/gif/  --exclude=/jpeg/
   will not download ".gif" or ".jpeg" files

   It can also be written as:
   --exclude=/\.gif|\.jp.?g/i
   matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG

   --exclude=/www\.site\.com/
   will not download links containing the site name

   http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/
   won't download outside of directory "/bin". The prefix must end with a slash "/".

   http://www.site.com/index%%%.htm --loop=%%%:0..3
   will download:
     http://www.site.com/index0.htm
     http://www.site.com/index1.htm
     http://www.site.com/index2.htm
     http://www.site.com/index3.htm

   - Note: the substitution string MUST be made of "valid URL" characters
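
   A minimal sketch of how these three filters can be applied to a URL, in
   plain Perl (the variable names and the eval-based approach are
   illustrative assumptions, not Glynx's actual internals):

      use strict;

      my $url     = "http://www.site.com/old.htm";
      my $subst   = 's/old/new/';               # from --subst
      my @exclude = ( qr/\.gif|\.jp.?g/i );     # from --exclude

      # --subst: apply the substitution expression to the URL
      eval "\$url =~ $subst";                   # $url is now ".../new.htm"

      # --exclude: skip the URL if any pattern matches
      my $skip = grep { $url =~ $_ } @exclude;

      # --loop: expand one URL pattern into several, e.g. %%%:0..3
      my @urls = map { (my $u = "http://www.site.com/index%%%.htm")
                       =~ s/%%%/$_/; $u } 0 .. 3;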

- For multiple exclusion: use ``|''.

- Don't read directory-index:

        ?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A  =>  \?[DSMN]=[AD]

        To change the default "exclude" pattern, put it in the configuration file.
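
        On the command line this pattern would be written as (following the
        --exclude syntax shown above):

        --exclude=/\?[DSMN]=[AD]/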

Note: the ``File:'' item in a dump file is ignored.

You can filter the processing of a dump file using --prefix, --exclude, and --subst.

If, after downloading finishes, you still have ``.PART._BUSY_'' files in the base directory, rename them to ``.PART'' (the program should do this by itself); see the one-liner below.
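
A quick way to do the rename from the base directory (an illustrative
one-liner, assuming the default ".PART" suffix and a shell where perl is
available):

 perl -e 'for (glob "*.PART._BUSY_") { (my $n = $_) =~ s/\._BUSY_$//; rename $_, $n }'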

Don't do this: --depth=1 --out-depth=3, because ``out-depth'' is an upper limit and is tested after depth is generated. The right way is: --depth=4 --out-depth=3.

This will do nothing:

 --dump=x graphic.gif

because the dump file gets all binary files.

Errors using https:

 [ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or
 [ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]

This means you need to install at least ``openssl'' (http://www.openssl.org), Net::SSLeay, and IO::Socket::SSL.
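
To check whether the Perl SSL modules are installed (a quick illustrative
test, assuming perl is on your path):

 perl -MNet::SSLeay -e 'print "Net::SSLeay OK\n"'
 perl -MIO::Socket::SSL -e 'print "IO::Socket::SSL OK\n"'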


COMMAND-LINE OPTIONS

Very basic:

  --version         Print version number ($VERSION) and quit
  --verbose         More output
  --quiet           No output
  --help            Help page
  --cfg-save        Save configuration to file "$CFG_FILE"
  --base-dir=DIR    Place to load/save files (default is "$BASE_DIR")

Download options are:

  --sleep=SECS      Sleep between gets, i.e. go slowly (default is $SLEEP)
  --prefix=PREFIX   Limit URLs to those which begin with PREFIX (default is URL base)
                    Multiple "--prefix" are allowed.
  --depth=N         Maximum depth to traverse (default is $DEPTH)
  --out-depth=N     Maximum depth to traverse outside of PREFIX (default is $OUT_DEPTH)
  --referer=URI     Set initial referer header (default is "$REFERER")
  --limit=N         A limit on the number of documents to get (default is $MAX_DOCS)
  --retry=N         Maximum number of retries (default is $RETRY_MAX)
  --timeout=SECS    Timeout value - increases on retries (default is $TIMEOUT)
  --agent=AGENT     User agent name (default is "$AGENT")

Multi-process control:

  --slave           Wait until a download-list file is created (be a slave)
  --stop            Stop slave
  --restart         Stop and restart slave

Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):

  --auth=USER:PASS  Set authentication credentials for web site
  --hier            Download into hierarchy (not all files into cwd)
  --iis             Workaround IIS 2.0 bug by sending "Accept: */*" MIME
                    header; translates backslashes (\) to forward slashes (/)
  --keepext=type    Keep file extension for MIME types (comma-separated list)
  --nospace         Translate spaces in URLs (not #fragments) to underscores (_)
  --tolower         Translate all URLs to lowercase (useful with IIS servers)

Other options (to be explained in more detail):

  --indexfile=FILE  Index file in a directory (default is "$INDEXFILE")
  --part-suffix=.SUFFIX (default is "$PART_SUFFIX") (eg: ".Getright" ".PART")
  --dump=FILE       (default is "$DUMP") make download-list file, 
                    to be used later
  --dump-max=N      (default is $DUMP_MAX) number of links per download-list file 
  --invalid-char=C  (default is "$INVALID_CHAR")
  --exclude=/REGEXP/i (default is "@EXCLUDE") Don't download matching URLs
                    Multiple --exclude are allowed
  --loop=REGEXP:INITIAL..FINAL (default is "$LOOP") (eg: xx:a,b,c  xx:'01'..'10')
  --subst=s/REGEXP/VALUE/i (default is "$show_subst") (note: "\" must be written as "\\")
  --404-retry       will retry on error 404 Not Found (default). 
  --no404-retry     creates an empty file on error 404 Not Found.
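
Putting some of these together (an illustrative command line; the site and
the patterns are made up):

  $progname.pl --depth=4 --out-depth=3 --exclude=/\.zip/i --sleep=2 --dump="job-1" http://www.site.com/index.htm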


TO-DO

More command-line compatibility with lwp-rget

Graphical user interface


README

Glynx - a download manager.

Installation:

    Windows:
        - unzip to a directory, such as c:\glynx or even c:\temp
        - this is a DOS (console) script; it will not work properly if you double-click it.
        However, you can put it in the Startup directory in "slave mode"
        by making a shortcut with the --slave parameter, then open another DOS window
        to operate it as a client.
        - the latest ActivePerl has all the modules needed, except for https.

    Unix/Linux:

        make a subdirectory and cd to it
        tar -xzf Glynx-[version].tar.gz
        chmod +x glynx.pl                 (if necessary)
        pod2html glynx.pl -o=glynx.htm    (this is optional)

        - under RedHat 6.2 I had to upgrade or install these modules:
        HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 libnet libwww-perl

        - to use https you will need:
        openssl (www.openssl.org) Net::SSLeay IO::Socket::SSL
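
        A quick way to check that the required modules are installed (an
        illustrative one-liner, not part of the package):

        perl -MHTML::Parser -MMIME::Base64 -MURI -MDigest::MD5 -MLWP -e 'print "modules OK\n"'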

    Please note that the software will create many files in 
    its work directory, so it is advisable to have a dedicated 
    sub-directory for it.

Goals:

        generalize 
                option to use (external) java and other script languages to extract links
                configurable file names and suffixes
                configurable dump file format
                plugins
                more protocols; download streams
                language support
        adhere to perl standards 
                pod documentation
                distribution
                difficult to understand, fun to write
        parallelize things and multiple computer support
        cpu and memory optimizations
        accept hardware/internet failures
                restartable
        reduce internet traffic
                minimize requests
                cache everything
        other (from perlhack.pod)
                1. Keep it fast, simple, and useful.
                2. Keep features/concepts as orthogonal as possible.
                3. No arbitrary limits (platforms, data sizes, cultures).
                4. Keep it open and exciting to use/patch/advocate Perl everywhere.
                5. Either assimilate new technologies, or build bridges to them.

Problems (not bugs):

        - It takes some time to start the program; not practical for small single-file downloads.
        - Command-line only. It should have a graphical front-end; a web front-end exists.
        - Hard to install if you don't have Perl or have outdated Perl modules. It works fine
          with Perl 5.6 modules.
        - Slave mode uses "dump files" and doesn't delete them.

To-do (long list):

        Bugs/debug/testing:
                - put // around exclude patterns, etc., if they don't have them
                - arrays for $LOOP, $SUBST; accept multiple URLs
                - Doesn't recreate unix links on "ftp". 
                Should do that instead of duplicating files (same on http redirects).
                - uses Accept:text/html to ask for an html listing of the directory when 
                in "ftp" mode. This will have to be changed to "text/ftp-dir-listing" if
                we implement unix links.
                - install and test "https"
                - accept --url=http://...
                - accept --batch=...grx
                - ignore/accept comments: <! a href="..."> - nested comments???
                - http server to make distributed downloads across the internet
                - use eval to avoid fatal errors; test for valid protocols
                - rename "old" .grx._BUSY_ files to .grx (timeout = 1 day?)
                  option: touch busy file to show activity
                - don't ignore "File:"; 
                - unknown protocol is a fatal error
                - test: counting MAX_DOCS with retry
                - test: base-dir, out-depth, site leakage
                - test: authentication
                - test: redirect 3xx
                        use: www.ig.com.br ?
                - change the retry loop to a "while"
                - timeout changes after "slave"
                - configuration reading order:
                  (1) read command-line options (may change the .ini file),
                  (2) read the .ini configuration,
                  (3) read command-line options again (may override the .ini),
                  (4) read the download-list-file,
                  (5) read command-line options again (may override the download-list-file)
                - execute/override the download-list-file "File:";
                  option: use --subst=/k:\\temp/c:\\download/
        Generalization, user-interface:
                - no-download option to reprocess the cache
                - optional log file to store the headers.
                  Option: filename._HEADER_; --log-headers
                - make it a Perl module (crawler, robot?), generic, re-usable 
                - option to understand robot-rules
                - make .glynx the default suffix for everything
                - try to support <form> through download-list-file
                - support mirroring (checking file dates)
                - internal small javascript interpreter
                - perl/tk front-end; finish web front end
                - config comment-string in download-list-file
                - config comment/uncomment for directives
                - default file name for dump without parameters - "dump-[n]-1"?
                - more configuration parameters
                - Portuguese/English option?
                - plugins: for each chunk, page, link, new site, level change, dump file change, 
                  max files, on errors, retry level change. Option: use callbacks.
                - dump suffix option
                - javascript interpreter option
                - scripting option (execute sequentially instead of parallel)
                - use environment
                - accept configuration --nofollow="shtml" and --follow="xxx"
                - time-of-day control, bytes-per-second control
                - pnm: protocol - realvideo, .rpm files
                - streams
                - gnutella
                - 401 Authentication Required, generalize abort-on-error list, 
                  support --auth= (see rget)
                - option to rewrite html pages with relative links
        Standards/perl:
                - packaging for distribution, include rfcs, etc?
                - include a default ini file in package
                - include web front-end in package?
                - installation hints, package version problems (abs_url)
                - more english writing
                - include all lwp-rget options, or ignore without exiting
                - create an object for the link lists - choose and specialize an existing one.
                - check: 19.4.5 HTTP Header Fields in Multipart Body-Parts
                        Content-Encoding
                        Persistent connections: Connection-header
                        Accept: */*, *.*
                - better document the use of "\" in exclude and subst
                - read, send, and configure cookies
        Network/parallel support:               
                - timed downloads - start/stop hours
                - write a "to-do" file during processing, 
                so work can be resumed after an interruption
                (e.g. every 10 minutes)
                - integrate with "k:\download"
                - receive / send restart / stop commands.
        Speed optimizations:
                - use an optional database connection
                - Persistent connections;
                - take a look at LWP::ParallelUserAgent
                - take a look at LWPng for simultaneous file transfers
                - take a look at LWP::Sitemapper
                - use eval around things to speed up program loading
                - option: different stacks depending on file type or site, to speed up searching
        Other:
                - forms / PUT
                - Rename the extension according to the mime-type (or copy to the other name).
                configuration:  --on-redirect=rename 
                                --on-redirect=copy
                                --on-mime=rename
                                --on-mime=copy
                - configurable maximum URL length
                - configurable maximum subdirectory depth
                - maximum received-file size
                - disk full / alternate dir
                - "--proxy=http:";1.1.1.1",ftp:";1.1.1.1"
                  "--proxy="1.1.1.1"
                    acessar proxy: $ua->proxy(...) Set/retrieve proxy URL for a scheme: 
                    $ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/');
                    $ua->proxy('gopher', 'http://proxy.sn.no:8001/');
                - enable "--no-[option]"
                - accept empty "--dump" or "--no-dump" / "--nodump"
                --max-mb=100
                        limits the total download size
                --auth=USER:PASS
                        not really necessary, it can be embedded in the URL;
                        exists in lwp-rget
                --nospace
                        allows links with spaces in the name (see lwp-rget)
                --relative-links
                        option to rewrite links as relative
                --include=".exe" --nofollow=".shtml" --follow=".htm"
                        file-inclusion options (whether to look for links inside)
                --full or --depth=full
                        whole-site option
                --chunk=128000
                --dump-all
                        writes all links, including already-existing ones and processed pages

Version history:

 1.022:
        - multiple --prefix and --exclude seems to be working
        - uses Accept:text/html to ask for an html listing of the directory when in "ftp" mode.
        - corrected errors creating directory and copying file on linux

 1.021:
        - uses URI::Heuristic on command-line URL
        - shows error response headers (if verbose)
        - look at the 3rd parameter on 206 (when available -- otherwise it gives 500),
                        Content-Length: 637055          --> if "206" this is "chunk" size
                        Content-Range: bytes 1449076-2086130/2086131 --> THIS is file size
        - prefix of: http://rd.yahoo.com/footer/?http://travel.yahoo.com/
          should be: http://rd.yahoo.com/footer/
        - included: "wav"
        - sleep had 1 extra second
        - the sleep routine runs its tests even when sleep==0

 1.020: oct-02-2000
        - optimization: accepts 200, when expecting 206
        - don't keep retrying when there is nothing to do
        - 404 Not Found error sometimes means "can't connect" - uses "--404-retry"
        - file read = binmode

 1.019: - restart if program was modified (-M $0)
        - include "mov"
        - stop, restart

 1.018: - better copy, rename and unlink
        - corrected binary dump when slave
        - fixed file-size comparison
        - "span" is a css command that works like "a" (a href == span href);
          span class is not java

 1.017: - sleep prints dots if verbose.
        - daemon mode (--slave)
        - url and input file are optional

 1.016: sept-27-2000
        - new name "glynx.pl"
        - verbose/quiet
        - exponential timeout on retry
        - storage control is a bit more efficient
        - you can filter the processing of a dump file using prefix, exclude, subst
        - more things in english, lots of new "to-do"; "goals" section
        - rename config file to glynx.ini

 1.015: - first published version, under name "get.pl"
        - single push/shift routine without repetitions
        - partially translated to English; messages revised

 1.014: - checks "inside" before including the link
        - fixes the numbering of the dump files
        - header "Location", "Content-Base"
        - revised "Content-Location"

 1.013: - optimization: remove repetitions within the page
        - included "png"
        - creates/tests the "not-found" file
        - processes Content-Location - TO TEST - find a site that uses it
        - included types "swf", "dcr" (shockwave) and "css" (style sheet)
        - fixes http://host/../file saved as ./host/../file => ./file
        - strips stray characters coming from javascript: ' ;
        - pending retries are written only at the end.
        - (1) read options, (2) read configuration, (3) read options again

 1.012: - segments the dump file during processing, allowing a parallel
        download to start from another process/computer before the task is
        completely finished
        - uses an index to write the dump; does not destroy the in-memory list.
        - saves the complete configuration together with the dump; 
        - saves/reads get.ini

 1.011: fixes authentication (prefix)
        fixes dump
        reads dump
        saves/reads $OUT_DEPTH, depth (individual), prefix in the dump file

 1.010: resume
        if the site doesn't support resume, tries again and keeps the best result (Silvio's idea)

 1.009: 404 not found is no longer sent to the dump
        processes the file if the mime type is text/html (doesn't work for the cache)
        changes the referer of the links depending on the response base (redirect)
        treats zero-size files as "not in the cache"
        generates the name _INDEX_.HTM when the URL ends with "/". 

 1.008: works internally with absolute URLs
        fixes leakage when out-depth=0

 1.007: segments the dump file
        speeds up the search in @processed
        fixes the directory name in the dump file

Other problems - design decisions to make

 - if '' is used in eval, is \\ not needed?
 - should redirected html pages get a <BASE> tag in the text?
 - build links using java?
 - does the perl library handle 3xx Redirection by itself?
 - use File::Path to create directories?
 - do applets always have .class at the end?
 - excessively long file names - what to do?
 - use $ua->max_size([$bytes]) - doesn't work with a callback
 - change the filename if the response base is different?
 - create a zero-size PART file on error 408 - timeout
 - what is go!zilla's dump format?


COPYRIGHT

Copyright (c) 2000 Flavio Glock <fglock@pucrs.br> All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.

If you use it/like it, send a postcard to the author.