

Squidalyser 0.2.53

Simon Burns

Abstract:

A squid logfile analyser, designed to allow per-user scrutiny and analysis of squid logfiles. The program allows a non-technical user to extract information about web usage patterns, the type of information downloaded, the sites visited by users, the graphics downloaded, and the amount of information (per-byte or per-file) accessed.

The intention is to augment or replace the use of so-called `censorware' in certain environments, given that such systems of censorship are inevitably unreliable and imprecise[1].


Contents

1 Squidalyser, installation and use

1.1 Introduction

Squidalyser is an interactive, web-based tool to help with the scrutiny and analysis of squid[2] access logs. Scrutiny because you may be interested in what your users are looking at on the web, and squidalyser makes this very easy. Analysis because you may be interested in patterns of usage for squid, or which Internet sites are being accessed the most often -- squidalyser makes this easy too. The program is designed primarily for use in schools, although it should be flexible enough to be used in many other environments.

1.2 Disclaimer

This document and squidalyser itself are works in progress, and may be inaccurate, incomplete or harmful -- or all three. As with information of any kind, they come with no warranty and you use them at your own risk. Feedback and bug fixes are welcome -- please send them to `squidalyser@ababa.org'.

1.3 Licence

As with all good software (which squidalyser aspires to be) this program and its associated documentation are released under the GPL:

http://www.gnu.org/licenses/gpl.txt
This means you can modify, redistribute and even sell the software, provided you adhere to the provisions of the licence -- not least that you grant the same rights to any recipients of this software.

1.4 How it works

The program consists of a number of perl scripts. The first, squidparse.pl, takes the information in your squid logfile and inserts it into a MySQL database. The second script, squidalyser.pl, runs through a web browser interface, and allows you to perform very specific or general queries on the database, to track web usage. A couple of auxiliary scripts make it easier to search for occurrences of lists of words in your logfile, and to track groups of users rather than individuals.
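
To make this concrete, here is a minimal sketch of the parse-and-insert step -- an illustration only, not the real squidparse.pl. It assumes a logfile in squid's httpd-emulation (common log) format, the example database credentials from section 2.3.2, and that the `time' column holds epoch seconds:

#!/usr/bin/perl -w
# Sketch of a parse-and-insert loop (illustrative; not the shipped script).
use strict;
use DBI;
use Time::ParseDate;

my $dbh = DBI->connect('DBI:mysql:database=squid;host=localhost',
                       'squidalyser', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare(
    'insert into logfile (remotehost, rfc931, authuser, request, status, bytes, time)
     values (?, ?, ?, ?, ?, ?, ?)');

open my $log, '<', '/var/log/squid/access.log' or die "open: $!";
while (<$log>) {
    # common log format: host ident authuser [date] "request" status bytes
    next unless m{^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)};
    my ($host, $ident, $user, $date, $req, $status, $bytes)
        = ($1, $2, $3, $4, $5, $6, $7);
    $bytes = 0 if $bytes eq '-';
    # '25/May/2002:03:00:00 +0100' -> '25/May/2002 03:00:00 +0100'
    (my $when = $date) =~ s/:/ /;
    $sth->execute($host, $ident, $user, $req, $status, $bytes, parsedate($when));
}
$dbh->disconnect;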

For example, you can: list the pages or sites visited by a user or group of users; view all the pictures they have downloaded[3]; find blocked or failed accesses; or chart each user's activity per byte or per item downloaded (see section 3.2.5 for the full list of output formats).

All this is achieved through an economical and easy-to-use interface -- little or no technical knowledge is required to use squidalyser, although you will need to know some Unix commands to set it up.

2 Installation

2.1 Requirements

To run squidalyser you will need: the squid web proxy (www.squid-cache.org) with logging enabled; a MySQL server (www.mysql.com); Perl, together with the DBI, DBD::mysql and Time::ParseDate modules (see section 2.4); and a web-server capable of running CGI scripts, such as Apache.

2.2 Preparing the sources

You should download the latest version of the squidalyser program, currently

http://ababa.org/dist/squidalyser-0.2.53.tar.gz
Enter the following command to unpack the files:

tar -zxvf squidalyser-0.2.53.tar.gz
This will create a `squidalyser-0.2.53' directory, into which the files will be unpacked.

2.3 Setting up MySQL

This document boldly assumes MySQL is running on your system: if not, download the appropriate MySQL distribution from www.mysql.com and follow the installation instructions. To set up the database required by squidalyser, you need to: create the database; create the squidalyser user and grant access rights; create the database structure; and test the database. Each step is described in the following subsections.

2.3.1 Creating the database

Access mysql using this command:

mysql -u root -p
`root' is the name of the user who has sufficient privileges to set up users and databases, and may be different on your system (ask your systems administrator if you are unsure). The `-p' option specifies that you need a password to access mysql: if you haven't set a root password, consult the MySQL documentation to find out how to do so.

Create the database by typing:

create database squid;
Note the semi-colon at the end of the line!

2.3.2 Creating the squidalyser user and granting access rights

Then grant permissions to the squidalyser user:

grant all privileges on squid.* to squidalyser@localhost identified by 'password';
You will need to devise your own password -- make sure you wrap it in quotation marks when you type the command. Type `exit' or Ctrl-D to quit mysql.

2.3.3 Creating the database structure

The database consists of two tables, which are created from the `squidalyser.sql' file in the `sql' subdirectory of the `squidalyser-0.2.53' directory:

mysql squid -u squidalyser -p < squidalyser.sql
This also inserts a few rows of data in the tables, so you can test that everything works.

2.3.4 Testing the database

Use these commands to test the squidalyser database:

mysql squid -u squidalyser -p

To gain access to the database.

show tables;

This should tell you that there are tables called `groups', `logfile', `members' and `wordlist'.

desc logfile;

There are eight fields in the table: id, remotehost, rfc931, authuser, request, status, bytes and time.

select count(*) from logfile;

There are 102 rows in the database, each representing an access to one web URL (ie a page, graphic, etc).

select sum(bytes) from logfile;

The resources downloaded total 918,531 bytes.

select max(bytes) from logfile;

The largest single item downloaded was 143,854 bytes.

select rfc931, max(bytes) from logfile group by rfc931;

This should show you the maximum file-size downloaded for each user in the database.

If you saw any error messages, you probably didn't follow the instructions to the letter, so go back and try again. If you need to go right back to the start, you can erase the database by typing

drop database squid;

(Note: you only need to do this if you need or want to start again!)
If everything worked, you should clear the test data from the database:

delete from logfile;
Then type `exit' or press Ctrl-D to quit mysql.

2.4 Installing the Perl modules

Perl modules extend the functionality of perl. The squidalyser scripts require the modules listed in section 2.1 above. You will need to be the `root' user on your system to install them[5] -- if you are not, contact your systems administrator and ask for them to be installed.

This is not a tutorial on installing these modules. However, you can download them from www.cpan.org, or install them using

perl -MCPAN -e shell
If you are unsure about any of this, consult the CPAN FAQ at

http://www.cpan.org/misc/cpan-faq.html
paying particular attention to the sections entitled `How do I install Perl modules?' and `Where do I find Perl DBI/DBD/database documentation?'
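
Once the CPAN shell is running, the modules can be installed from its prompt. For example (this list is an assumption based on what the scripts use -- DBI and DBD::mysql for the MySQL connection, and Time::ParseDate for the date handling described in section 3.2.3):

install DBI
install DBD::mysql
install Time::ParseDate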

2.5 Installing `squidparse.pl'

The squidparse.pl script takes data from the squid logfile and inserts it into the MySQL database. It probably needs to be run as `root', since it must read the logfile, which ordinary users normally cannot do. Copy the script and its configuration file to the appropriate location on your computer:

mkdir /usr/local/squidparse

cd squidalyser-0.2.53

cp squidparse/* /usr/local/squidparse
Next create a crontab entry to run the squidparse.pl script each morning:

crontab -e
This will invoke your editor. Type this line at the end of that file:

00 03 * * * /usr/local/squidparse/squidparse.pl
Then save the file, and the squidparse.pl script will be run each morning at 3am. You may decide you want to run it more frequently for a busy site -- consult the cron documentation for information about how to do this.
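
For example, this crontab line would run the script every six hours instead, using the step syntax understood by Vixie cron (the cron found on most Linux systems):

00 */6 * * * /usr/local/squidparse/squidparse.pl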

Since the script needs access to the database you set up, you need to edit squidalyser.conf to tell it the database username and password, etc. Use an editor to amend the information in the configuration file -- there are comments in the file to explain the usage of each item. Blank lines and those starting with # are ignored.
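
By way of illustration, a configuration might look something like the following -- note that, apart from `timeformat' (see section 3.2.3), these option names are invented for this example, so rely on the comments in your own copy of the file:

# illustrative squidalyser.conf entries -- option names may differ
dbname     = squid
dbuser     = squidalyser
dbpass     = password
access_log = /var/log/squid/access.log
timeformat = UK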

2.6 Installing `squidalyser.pl' and other CGI scripts

squidalyser.pl is the web-based program which does all the work for you, allowing you to retrieve meaningful information from the database. Copy it to a CGI directory on your web-server. On Linux, this could be located at `/home/httpd/cgi-bin' or `/var/www/cgi-bin' -- check with your systems administrator if unsure. To copy the files to the appropriate location:

cp cgi-bin/* /var/www/cgi-bin/
Then set the permissions and ownership:

chown apache: /var/www/cgi-bin/*.pl

chmod 755 /var/www/cgi-bin/*.pl
Your web-server may run under a different username, with `web', `httpd' and `nobody' being likely alternatives on a Linux system. Look in your httpd.conf for the `User' directive if you are unsure.
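
For example, assuming the Apache configuration file lives in one of its usual locations (the path varies between distributions):

grep -i '^user' /etc/httpd/conf/httpd.conf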

Finally, copy the icons from the `icons' subdirectory to your webserver's `icons' directory:

cp icons/* /var/www/icons

3 Running squidalyser

3.1 Priming the database

There is little point running squidalyser if your database contains no data! You can run the squidparse.pl script `by hand' if you wish, although it can take a while if your logfile is large. To do this, type:

cd /usr/local/squidparse

./squidparse.pl
Then wait, possibly for a few minutes, for the information to be inserted into MySQL.

3.2 Using squidalyser

You should find that using the program is easy. Access it from:

http://localhost/cgi-bin/squidalyser.pl
(or alter the hostname to suit your setup).

3.2.1 Username

When it is invoked, the script extracts all usernames from the database and places them in a list. You can select multiple items from this list, or `All' to see information relating to all users. However, the `All' option can cause browser overload[6], and so is not recommended. If you select any other item, it will take priority even if you have the `All' option highlighted (on the assumption that you selected it, or failed to deselect it, by accident).

3.2.2 Sub-string match

This should be a useful feature: it will return only those items which contain the sub-string in the URL. For example, you might enter a file extension such as `.mp3' or `.jpg' to pick out downloads of a particular type, or a word such as `porn' to catch search-engine queries for it[7].

3.2.3 Start time and End time

To speed up searches, and reduce the quantity of information returned, it is recommended that you enter start and end times for the searches. Using Perl's excellent Time::ParseDate module means you can enter dates and times in free-form. For example: `21 dec 2001 17:05', `last Tuesday' or `11/3/00'.

UK-style dates are preferred. This means that an ambiguous date such as 11/3/00 will be interpreted as 11th March rather than 3rd November. If this bothers you, edit squidalyser.conf (in /usr/local/squidparse) and alter the `timeformat' option to `US'.
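
If you want to check how a particular date string will be read, you can test Time::ParseDate directly with a throwaway one-liner (not part of squidalyser; UK => 1 is the module option which selects UK-style parsing):

perl -MTime::ParseDate -MPOSIX -e 'print strftime("%d %B %Y\n", localtime parsedate("11/3/00", UK => 1))'

This should print `11 March 2000'.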

3.2.4 User's system

This allows you to specify an IP address for the system used to access the web resources -- ie the one the user was sitting at, not your proxy server.

3.2.5 Output format

A list of pages visited
Returns a list, with URLs highlighted so you can easily visit a particular web-page or graphic yourself.
A list of sites visited
This strips out the resource part of the URL to return only the hostname. This will usually dramatically shorten the list returned, and give an overview of which sites your users are most interested in.
All the pictures viewed/downloaded
You will be amazed at just how much fleshtones stand out from all those corporate logos *8) To concentrate on one type of graphics file, use the `Sub-string match' option above[8]. Otherwise, the program will search for all files of type png, gif and jpg. This `pictures only' option is meaningless if you specify another extension (such as .mp3) in the `Sub-string match' field, and will return no results.
Blocked or failed accesses
This will return all accesses which failed or were blocked, indicated by an HTTP status code in the 400 or 500 range. Such a code could mean a problem with the remote server, or that someone is repeatedly trying to access hosts which are blocked by your censorware program[4]. (A roughly equivalent raw query is sketched after this list.)
A graph of activity (per byte)
This option draws a bar-chart illustrating how many bytes have been downloaded by each user. Peaks and troughs might indicate too much or too little use of the web; or that someone is over-using or misusing the web by downloading too many bulky pictures or music files, etc. The chart provides a convenient way to track abnormal behaviour.
A graph of activity (per item accessed)
This does much the same thing as the `per byte' chart, but counts the number of items accessed rather than the bytes.
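
For the curious, the `Blocked or failed accesses' option corresponds roughly to a query like the following, assuming the status column holds the numeric HTTP code (a sketch only -- the real script also applies your username, time and sub-string selections):

select request, status from logfile where status between 400 and 599;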

3.2.6 Submit/reset options

Submit
starts the database query, and returns any results found to the browser window (at the foot of the page -- you may need to scroll down!)
Start again
clears all fields.
Reset
returns the form to its state immediately after the last set of results was returned. This is not quite the same as `Start again', since the form remembers your previous selections and keeps them when it shows you the results.

3.2.7 Using word lists

The word list feature allows you to search against a list of words, rather than entering them one at a time in the `Sub-string match' field. Click on the word list tab at the top of the screen, and enter the words in the first field -- either one word, or a list separated by commas or commas and spaces. Click on `Add' to add them to the word list, which is stored in the database so it will still be there next time you use the program. To remove words from the list, select them from the list-box and click on `Remove'.

When using squidalyser to query the database, do not enter words in the `Sub-string match' field; instead, click on the `Check against word list' option and submit the search. All URLs matching any word in the word list will be returned in the results.

3.2.8 Group manager

To save you selecting and deselecting usernames from the list on the squidalyser main-screen, you can define lists of users using the `Group manager'. When you select a group name from the `Groups' drop-down menu, and submit the search, all users in that group will be included in the database search.

To create a group, click on the `Group manager' tab at the top of the screen, enter the name of the group in the first field, and click on the `Create' button. Other fields and buttons will appear on-screen, to allow you to add users to the group, or remove them from the group.

You can create more than one group, and switch between them (or delete them) using the `Edit or delete group' menu.

4 Finally

4.1 Feedback and support

That's all folks! Check the web-page for new releases, which will also be notified on Freshmeat, COLA, etc. Fan-mail is appreciated, as are cash donations *8) and constructive criticisms. Bug reports and discussion are also invited -- email squidalyser@ababa.org. I hope the program proves to be useful, reliable and effective; let me know if I'm wrong.

4.2 Bugs and `FIXMEs'

I plan to look at these issues for a future release of the program. Items completed since earlier releases are shown in italics.

  1. The substring matching needs to accept AND and OR keywords for greater flexibility.
  2. A friendly way to configure the DB variables (rather than editing the scripts directly) would be nice.
  3. An installation script would also be cool.
  4. There is a bug-like feature when using HTTP-style logging with squid, in that some lines from the logfile may be inserted into the database twice (only a few each time). This is a fairly minor problem, but you should be aware that it might slightly skew the results you see when using squidalyser.
  5. There is a rumour of a new-format logfile for the latest versions of squid. At the moment, squidparse doesn't know about this but will soon, as a matter of urgency.
  6. squidparse.pl is slow. (Thanks to Warren in .au for the patch which has significantly improved squidparse's performance.)
  7. For larger sites, some kind of grouping mechanism is desirable. In a classroom/school situation, for example, this might allow the supervising teacher to check what a class of users was doing at a particular time -- like last hour, or during Computer Club.
  8. A system to flag up suspicious-looking keywords, preferably from a user-defined list, would cut down on the work required to detect abuses of the web.
  9. Database performance would be improved with the use of indexes on columns.
  10. Totals and summaries should be included with query results. The list of sites visited (as opposed to the list of URLs) should indicate how many times a particular site on the list has been visited.
  11. The charts are cheap and nasty, mainly owing to time constraints. More flexible (and prettier) charts are on the way with the next release. Or maybe the release after that...
  12. squidparse.pl should expire records older than a certain date from the database. The time period used for such expiry will be user configurable.

4.3 Related software

Other squid logfile analysis programs can be found at

http://www.squid-cache.org/Scripts
Sarg is recommended :-)

4.4 Case studies

Thanks to those who have contacted me about squidalyser. If you find yourself using the program on a regular basis, and if it proves useful, please let me know how you are using it. Please indicate if I may publish this information on the web-page (fully anonymised).

Apart from encouraging use of squidalyser, this information will help me to understand how the program is being used in real life. This will in turn feed into future development and, I hope, lead to a better program.


Footnotes

[1] See www.censorware.net for the arguments.

[2] See www.squid-cache.org for more information about the squid web proxy.

[3] This is more efficient than you might think, since the graphics will probably be in your squid cache anyway, at least when you are analysing recent activity.

[4] See www.squidguard.org for an example of such a program. Later versions of squidalyser will integrate more tightly with squidguard.

[5] Well, that's a fudge for the sake of simplicity -- if you install as a non-root user, you will need to hack the scripts as well.

[6] Netscape under Linux, my test environment, seems particularly poor at dealing with large tables -- which is what the script will usually create as output.

[7] I was amazed, when employed as a net.cop in a school, how many searches for pornography started with the term `porn' being entered into a search engine. However, to see such searches in a URL, refer to the squid documentation, since squid usually crops URLs after the `?' as a security measure.

[8] Hint: most photographs are stored as jpegs, with the extension .jpg or .jpeg. Most logos and buttons are stored as .gif or .png. So enter the extension into the substring match field.

Simon Burns 2002-05-25