Glynx - a download manager.
Download from http://www.ipct.pucrs.br/flavio/glynx/glynx-latest.pl
Glynx makes a local image of a selected part of the internet.
It can be used to make download lists for use with other download managers, enabling a distributed download process.
It currently supports resume, retry, referer, user-agent, java, frames, and distributed download (see --slave, --stop, --restart).
It partially supports redirects, javascript, multimedia, and authentication.
It does not support mirroring (checking file dates) or forms.
It has not been tested with ``https'' yet, and it should be better tested with ``ftp''.
Tested on Linux and NT.
$progname.pl [options] <URL>
$progname.pl [options] --dump="dump-file" <URL>
$progname.pl [options] "download-list-file"
$progname.pl [options] --dump="dump-file" "download-list-file"
$progname.pl [options] --slave
How to make a default configuration:
Start the program with all the desired command-line options, plus --cfg-save. Alternatively:

1 - start the program with --cfg-save
2 - edit the glynx.ini file
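For example (the option values here are only illustrative):

    perl glynx.pl --sleep=2 --depth=3 --cfg-save

This should write those settings to glynx.ini, so that later runs pick them up as defaults.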
--subst, --exclude and --loop use regular expressions.
http://www.site.com/old.htm --subst=s/old/new/ downloads: http://www.site.com/new.htm
- Note: the substitution string MUST be made of "valid URL" characters
--exclude=/\.gif/ will not download ".gif" files
- Note: Multiple --exclude are allowed:
--exclude=/gif/ --exclude=/jpeg/ will not download ".gif" or ".jpeg" files
It can also be written as: --exclude=/\.gif|\.jp.?g/i matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG
--exclude=/www\.site\.com/ will not download links containing the site name
http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/ won't download outside of directory "/bin". Prefix must end with a slash "/".
http://www.site.com/index%%%.htm --loop=%%%:0..3 will download:

    http://www.site.com/index0.htm
    http://www.site.com/index1.htm
    http://www.site.com/index2.htm
    http://www.site.com/index3.htm
- Note: the substitution string MUST be made of "valid URL" characters
- For multiple exclusion: use ``|''.
- Don't read directory-index sort links; these query strings are all matched by the pattern on the right:

    ?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A  =>  \?[DSMN]=[AD]
To change the default "exclude" pattern, put it in the configuration file.
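As an illustration, the three filter options correspond roughly to ordinary Perl regexp operations. This is only a sketch, not the program's actual code, and the variable names are hypothetical:

    # Rough sketch of what the filter options do (names are hypothetical).
    my $url = "http://www.site.com/old.htm";

    # --subst=s/old/new/ rewrites each URL:
    $url =~ s/old/new/;                 # now http://www.site.com/new.htm

    # --exclude=/\.gif|\.jp.?g/i skips matching URLs
    # (imagine this inside the download loop):
    next if $url =~ /\.gif|\.jp.?g/i;

    # --loop=%%%:0..3 expands a placeholder into several URLs:
    my @urls = map { (my $u = "http://www.site.com/index%%%.htm") =~ s/%%%/$_/; $u } 0..3;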
Note: the ``File:'' item in a dump file is ignored.
You can filter the processing of a dump file using --prefix, --exclude, and --subst.
If, after downloading finishes, you still have ``.PART._BUSY_'' files in the base directory, rename them to ``.PART'' (the program should do this by itself).
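If you need to do the rename by hand, a Perl one-liner like this should work (run it from the base directory; on DOS, swap the quote types):

    perl -e 'for (glob "*.PART._BUSY_") { my $n = $_; $n =~ s/\._BUSY_$//; rename $_, $n }'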
Don't do this: --depth=1 --out-depth=3 because ``out-depth'' is an upper limit; it is tested after depth is generated. The right way is: --depth=4 --out-depth=3
This will do nothing:

    --dump=x graphic.gif

because binary files are all sent to the dump file instead of being downloaded.
Errors using https:
[ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or [ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]
This means you need to install at least ``openssl'' (http://www.openssl.org), Net::SSLeay, and IO::Socket::SSL.
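A quick way to check whether your LWP installation can handle https is a diagnostic one-liner like this (not part of Glynx):

    perl -MLWP::Protocol -e 'print LWP::Protocol::implementor("https") ? "https ok\n" : "https missing\n"'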
Very basic:
    --version         Print version number ($VERSION) and quit
    --verbose         More output
    --quiet           No output
    --help            Help page
    --cfg-save        Save configuration to file "$CFG_FILE"
    --base-dir=DIR    Place to load/save files (default is "$BASE_DIR")
Download options are:
    --sleep=SECS      Sleep between gets, i.e. go slowly (default is $SLEEP)
    --prefix=PREFIX   Limit URLs to those which begin with PREFIX (default is
                      the URL base); multiple "--prefix" are allowed
    --depth=N         Maximum depth to traverse (default is $DEPTH)
    --out-depth=N     Maximum depth to traverse outside of PREFIX (default is
                      $OUT_DEPTH)
    --referer=URI     Set initial referer header (default is "$REFERER")
    --limit=N         A limit on the number of documents to get (default is
                      $MAX_DOCS)
    --retry=N         Maximum number of retries (default is $RETRY_MAX)
    --timeout=SECS    Timeout value - increases on retries (default is $TIMEOUT)
    --agent=AGENT     User agent name (default is "$AGENT")
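For example, a polite two-level crawl might look like this (all values are illustrative):

    perl glynx.pl --sleep=5 --depth=2 --limit=100 --prefix=http://www.site.com/ http://www.site.com/index.htm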
Multi-process control:
    --slave           Wait until a download-list file is created (be a slave)
    --stop            Stop the slave
    --restart         Stop and restart the slave
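A rough sketch of the slave idea, assuming the slave simply polls its work directory for new download-list files (the file suffix, the polling interval, and the process_list routine are all hypothetical, not Glynx's actual code):

    # Sketch only: poll for download-list files and process each one.
    use strict;
    while (1) {
        my @lists = glob "*.grx";       # ".grx" suffix is an assumption
        process_list($_) for @lists;    # process_list() is hypothetical
        sleep 10;                       # polling interval is a guess
    }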
Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):
    --auth=USER:PASS  Set authentication credentials for the web site
    --hier            Download into a hierarchy (not all files into cwd)
    --iis             Work around an IIS 2.0 bug by sending the "Accept: */*"
                      MIME header; translates backslashes (\) to forward
                      slashes (/)
    --keepext=type    Keep the file extension for MIME types (comma-separated
                      list)
    --nospace         Translate spaces in URLs (not #fragments) to
                      underscores (_)
    --tolower         Translate all URLs to lowercase (useful with IIS servers)
Other options (to be better explained):
    --indexfile=FILE  Index file in a directory (default is "$INDEXFILE")
    --part-suffix=.SUFFIX
                      (default is "$PART_SUFFIX") (eg: ".Getright" ".PART")
    --dump=FILE       (default is "$DUMP") Make a download-list file, to be
                      used later
    --dump-max=N      (default is $DUMP_MAX) Number of links per
                      download-list file
    --invalid-char=C  (default is "$INVALID_CHAR")
    --exclude=/REGEXP/i
                      (default is "@EXCLUDE") Don't download matching URLs;
                      multiple --exclude are allowed
    --loop=REGEXP:INITIAL..FINAL
                      (default is "$LOOP") (eg: xx:a,b,c xx:'01'..'10')
    --subst=s/REGEXP/VALUE/i
                      (default is "$show_subst") (note: "\" must be written
                      as "\\")
    --404-retry       Retry on error 404 Not Found (default)
    --no404-retry     Create an empty file on error 404 Not Found
More command-line compatibility with lwp-rget
Graphical user interface
Glynx - a download manager.
Installation:
Windows:

- unzip to a directory, such as c:\glynx or even c:\temp
- this is a DOS (command-line) script; it will not work properly if you double-click it. However, you can put it in the startup directory in "slave mode" by making a link with the --slave parameter. Then open another DOS window to operate it as a client.
- the latest ActivePerl has all the modules needed, except for https.
Unix/Linux:
make a subdirectory and cd to it
tar -xzf Glynx-[version].tar.gz
chmod +x glynx.pl (if necessary)
pod2html glynx.pl -o=glynx.htm (this is optional)
- under RedHat 6.2 I had to upgrade or install these modules: HTML::Tagset, MIME::Base64, URI, HTML::Parser, Digest::MD5, libnet, libwww-perl
- to use https you will need: openssl (www.openssl.org), Net::SSLeay, IO::Socket::SSL
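If any of the Perl modules above are missing or outdated, the CPAN shell can install them, e.g.:

    perl -MCPAN -e 'install Net::SSLeay'
    perl -MCPAN -e 'install IO::Socket::SSL'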
Please note that the software will create many files in its work directory, so it is advisable to have a dedicated sub-directory for it.
Goals:
- generalize the option to use (external) java and other script languages to extract links
- configurable file names and suffixes
- configurable dump file format
- plugins
- more protocols; download streams
- language support
- adhere to perl standards: pod documentation, distribution
- difficult to understand, fun to write
- parallelize things; multiple-computer support
- cpu and memory optimizations
- accept hardware/internet failures; restartable
- reduce internet traffic: minimize requests, cache everything
- other (from perlhack.pod):
    1. Keep it fast, simple, and useful.
    2. Keep features/concepts as orthogonal as possible.
    3. No arbitrary limits (platforms, data sizes, cultures).
    4. Keep it open and exciting to use/patch/advocate Perl everywhere.
    5. Either assimilate new technologies, or build bridges to them.
Problems (not bugs):
- It takes some time to start the program; not practical for small single-file downloads.
- Command line only. It should have a graphical front-end; there exists a web front-end.
- Hard to install if you don't have Perl or have outdated Perl modules. It works fine with Perl 5.6 modules.
- Slave mode uses "dump files" and doesn't delete them.
To-do (long list):
Bugs/debug/testing:

- put // on exclude, etc. if they don't have it
- arrays for $LOOP, $SUBST; accept multiple URLs
- doesn't recreate unix links on "ftp"; should do that instead of duplicating files (same on http redirects)
- uses Accept: text/html to ask for an html listing of the directory when in "ftp" mode; this will have to be changed to "text/ftp-dir-listing" if we implement unix links
- install and test "https"
- accept --url=http://...
- accept --batch=...grx
- ignore/accept comments: <! a href="..."> - nested comments???
- http server to make distributed downloads across the internet
- use eval to avoid fatal errors; test for valid protocols
- rename "old" .grx._BUSY_ files to .grx (timeout = 1 day?); option: touch the busy file to show activity
- don't ignore "File:"
- unknown protocol is a fatal error
- test: counting MAX_DOCS with retry
- test: base-dir, out-depth, site leakage
- test: authentication
- test: redirect 3xx; use www.ig.com.br?
- change the retry loop to a "while"
- timeout changes after "slave"
- configuration reading order: (1) read command-line options (may change the .ini file), (2) read the .ini configuration, (3) read command-line options again (may override the .ini), (4) read the download-list-file, (5) read command-line options again (may override the download-list-file)
- execute/override the download-list-file "File:"; option: use --subst=/k:\\temp/c:\\download/

Generalization, user-interface:

- a no-download option to reprocess the cache
- an optional log file to store the headers; option: filename._HEADER_; --log-headers
- make it a Perl module (crawler, robot?), generic, re-usable
- option to understand robot-rules
- make .glynx the default suffix for everything
- try to support <form> through the download-list-file
- support mirroring (checking file dates)
- internal small javascript interpreter
- perl/tk front-end; finish the web front-end
- configurable comment-string in the download-list-file
- config comment/uncomment for directives
- a default file name for dump without parameters - "dump-[n]-1"?
- more configuration parameters
- a Portuguese/English option?
- plugins: for each chunk, page, link, new site, level change, dump file change, max files, on errors, retry level change; option: use callbacks
- dump suffix option
- javascript interpreter option
- scripting option (execute sequentially instead of in parallel)
- use the environment
- accept configuration --nofollow="shtml" and --follow="xxx"
- time-of-day control, bytes per second
- pnm: protocol - realvideo, .rpm files
- streams
- gnutella
- 401 Authentication Required: generalize the abort-on-error list, support --auth= (see rget)
- option to rewrite html pages with relative links

Standards/perl:

- packaging for distribution, include rfcs, etc.?
- include a default ini file in the package
- include the web front-end in the package?
- installation hints, package version problems (abs_url)
- more English writing
- include all lwp-rget options, or ignore them without exiting
- create an object for the link lists - or choose and specialize an existing one
- check: 19.4.5 HTTP Header Fields in Multipart Body-Parts; Content-Encoding; persistent connections: Connection header; Accept: */*, *.*
- better document the use of "\" in exclude and subst
- read, send, and configure cookies

Network/parallel support:

- timed downloads - start/stop hours
- write a "to-do" file during processing, to allow resuming after an interruption (e.g. every 10 minutes)
- integrate with "k:\download"
- receive/send restart/stop commands
Speed optimizations:

- use an optional database connection
- persistent connections
- take a look at LWP::ParallelUserAgent
- take a look at LWPng for simultaneous file transfers
- take a look at LWP::Sitemapper
- use eval around things to speed up program loading
- option: different stacks depending on file type or site, to speed up searching

Other:

- forms / PUT
- rename the extension according to the mime-type (or copy to the other name); configuration: --on-redirect=rename --on-redirect=copy --on-mime=rename --on-mime=copy
- configurable maximum URL length
- configurable maximum subdirectory depth
- maximum size of a received file
- full disk / alternate dir
- "--proxy=http:"1.1.1.1",ftp:"1.1.1.1"" or "--proxy="1.1.1.1"" to access a proxy; $ua->proxy(...) sets/retrieves the proxy URL for a scheme: $ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/'); $ua->proxy('gopher', 'http://proxy.sn.no:8001/');
- enable "--no-[option]"
- accept empty "--dump" or "--no-dump" / "--nodump"
- --max-mb=100 limits the total download size
- --auth=USER:PASS is not really necessary, since it can go inside the URL; exists in lwp-rget
- --nospace allows links with spaces in the name (see lwp-rget)
- --relative-links - option to rewrite links as relative
- --include=".exe" --nofollow=".shtml" --follow=".htm" - file inclusion options (search for links inside)
- --full or --depth=full - whole-site option
- --chunk=128000
- --dump-all - writes all links, including already-existing files and processed pages
Version history:
1.022:
- multiple --prefix and --exclude seem to be working
- uses Accept: text/html to ask for an html listing of the directory when in "ftp" mode
- corrected errors creating directories and copying files on Linux

1.021:
- uses URI::Heuristic on the command-line URL
- shows error response headers (if verbose)
- looks at the 3rd parameter on 206 (when available -- otherwise it gives 500):
    Content-Length: 637055                       --> if "206", this is the "chunk" size
    Content-Range: bytes 1449076-2086130/2086131 --> THIS is the file size
- the prefix of http://rd.yahoo.com/footer/?http://travel.yahoo.com/ should be http://rd.yahoo.com/footer/
- included: "wav"
- sleep had 1 extra second
- sleep makes tests even when sleep==0

1.020: oct-02-2000
- optimization: accepts 200 when expecting 206
- don't keep retrying when there is nothing to do
- a 404 Not Found error sometimes means "can't connect" - uses "--404-retry"
- file read = binmode

1.019:
- restart if the program was modified (-M $0)
- include "mov"
- stop, restart

1.018:
- better copy, rename and unlink
- corrected binary dump when slave
- corrected file-size comparison
- span is a css command that works like "a" (a href == span href); span class is not java

1.017:
- sleep prints dots if verbose
- daemon mode (--slave)
- url and input file are optional

1.016: sept-27-2000
- new name "glynx.pl"
- verbose/quiet
- exponential timeout on retry
- storage control is a bit more efficient
- you can filter the processing of a dump file using prefix, exclude, subst
- more things in English, lots of new "to-do"; "goals" section
- renamed config file to glynx.ini

1.015:
- first published version, under the name "get.pl"
- single push/shift routine without repetition
- partially translated to English; messages revised

1.014:
- checks "inside" before including a link
- fixes dump file numbering
- headers "Location", "Content-Base"; "Content-Location" revised

1.013:
- optimization: remove repetitions within a page
- included "png"
- creates/tests the "not-found" file
- processes Content-Location - TO TEST - find a site that uses it
- included types "swf", "dcr" (shockwave) and "css" (style sheet)
- fixes http://host/../file being saved as ./host/../file => now ./file
- removes stray characters coming from javascript: ' ;
- pending retries are saved only at the end
- (1) read options, (2) read configuration, (3) read options again

1.012:
- segments the dump file during processing, allowing a parallel download to be started from another process/computer before the task is completely finished
- uses an index to write the dump; does not destroy the in-memory list
- saves the full configuration along with the dump
- saves/reads get.ini

1.011:
- fixes authentication (prefix)
- fixes dump; reads dump
- saves/reads $OUT_DEPTH, depth (individual) and prefix in the dump file

1.010:
- resume: if the site doesn't support resume, tries again and picks the best result (Silvio's idea)

1.009:
- 404 Not Found is not sent to the dump
- processes a file if its mime type is text/html (doesn't work for the cache)
- changes the referer of links depending on the response base (redirect)
- treats zero-size files as "not in the cache"
- generates the name _INDEX_.HTM when the URL ends in "/"

1.008:
- works internally with absolute URLs
- fixes leakage when out-depth=0

1.007:
- segments the dump file
- speeds up the search in @processed
- fixes the directory name in the dump file
Other problems - design decisions to make
- if '' is used in the eval, is "\\" not needed?
- should redirected html pages receive a <BASE> tag in the text?
- build links using java?
- does the perl library handle 3xx redirection by itself?
- use File::Path to create directories?
- do applets always have .class at the end?
- excessively long file names - what to do?
- use $ua->max_size([$bytes]) - doesn't work with callbacks
- change the filename if the response base is different?
- create a zero-size PART file on error 408 - timeout
- what is the format of Go!Zilla's dump file?
Copyright (c) 2000 Flavio Glock <fglock@pucrs.br> All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.
If you use it/like it, send a postcard to the author.