GEnome ANalysis and protein FAMily MakER

                             Geanfammer Package


Version.

   "geanfammer_suite_1.0.1.tar.gz"

   Version 1.0.1 was placed in the LMB FTP directory on 14 Oct. 1997.
   (ftp://ftp.mrc-lmb.cam.ac.uk/genomes/Software/Geanfammer)
   It is not yet fully public.

Introduction.

   Geanfammer is a comprehensive package of programs and a Perl subroutine library.
   It is the result of an analysis of the bacterial genomes published since 1995.

   It is composed of two types of programs. The first is a single complete program
   called 'geanfammer.pl', which creates a single output file showing all the
   sequence domains (duplication modules or seqlets) present in any protein
   sequence database. A good example input is the complete set of protein
   sequences of one genome in FASTA format.

   (Usage example)

           "geanfammer.pl Your_Genome.fa E=10 e=10"

   This will produce "Your_Genome.gclu" or "sorted_cluster_file.gclu".
   The extension gclu stands for 'good clustering'. The word 'good' does not mean
   that the clustering is biologically correct, only that the domain-level
   clustering was successful. In the authors' opinion, it is biologically significant.

   The second type is a collection of programs that are essentially the parts of
   geanfammer.pl. In fact, geanfammer.pl is the integration of all the component
   programs.
   They are included in the package so that users can run the individual steps of
   geanfammer when problems are encountered.

   They carry out the following steps:

       1. Sequence search (to create MSSO files; MSSO is a generic term for several SSO files)
       2. Converting the resulting MSSO files to more useful MSP files
       3. Making single-linkage clusters (big and wrong clusters)
       4. Breaking down the single-linkage clusters into domain-level clusters
       5. Summarising the results in a single GCLU file (a summary file is also created)
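
   Step 3 can be illustrated with a minimal sketch of single-linkage clustering.
   This is not the code used in geanfammer.pl; it only shows the general idea
   that any two sequences connected by a chain of hits below the E-value cutoff
   fall into one cluster. The sequence names, the E-values and the subroutine
   name single_linkage are all invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pairwise hits: [query, match, E-value]. In a real run these
# would come from the MSP files; the names and values here are invented.
my @hits = (
    [ 'seq_A', 'seq_B', 1e-20 ],
    [ 'seq_B', 'seq_C', 1e-8  ],
    [ 'seq_D', 'seq_E', 1e-12 ],
    [ 'seq_C', 'seq_D', 5.0   ],   # above the cutoff, so ignored
);

# Group sequences so that any two joined by a chain of hits with
# E-value <= $e_cut end up in the same cluster (single linkage).
sub single_linkage {
    my ($e_cut, @pairs) = @_;
    my %parent;

    # Find the representative of a sequence (union-find with path halving).
    my $find = sub {
        my ($x) = @_;
        $parent{$x} = $x unless exists $parent{$x};
        while ($parent{$x} ne $x) {
            $parent{$x} = $parent{ $parent{$x} };
            $x = $parent{$x};
        }
        return $x;
    };

    for my $hit (@pairs) {
        my ($q, $m, $e) = @$hit;
        next if $e > $e_cut;
        my ($rq, $rm) = ($find->($q), $find->($m));
        $parent{$rq} = $rm if $rq ne $rm;
    }

    # Collect members under their representative; sort for stable output.
    my %members;
    push @{ $members{ $find->($_) } }, $_ for sort keys %parent;
    return sort { $a->[0] cmp $b->[0] } values %members;
}

my @clusters = single_linkage(0.001, @hits);
print join(' ', @$_), "\n" for @clusters;
```

   With the cutoff of 0.001 the weak seq_C/seq_D hit is dropped, so the sketch
   prints two clusters. Sequences without any hit below the cutoff never enter
   a cluster, which matches the way orphan sequences are left out of families.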


Test Run

   We have included a test file in FASTA format which contains protein families
   of 1 to 8 sequences.
   The sequences within one family are identical, so the single-linkage and
   final domain clustering results should be identical. The single sequence
   fam_1_1 will disappear because an orphan sequence is not regarded as a member
   of any family.

   To run it, type:   " geanfammer.pl geanfammer_test_FASTA_DB.fa E=30 e=30"

   The two E-values are set absurdly high because the test DB has only around
   30 sequences. E-values depend on the DB size. If you have around 1000
   sequences in your DB, you could choose 0.05 or 0.01 for clustering.
   The run produces a large number of files, so it is better to create a
   subdirectory first and run it there.
   The single-linkage clustering file is called xxxx.sclu, while the domain-level
   clustering file is called xxxx.gclu.
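
   The dependence of E-values on DB size can be made concrete with a small
   sketch. It assumes that the expected number of chance hits grows linearly
   with the number of sequences searched, which is the usual approximation for
   FASTA-style E-values. It is not a rule taken from geanfammer, and the
   subroutine name rescale_evalue is hypothetical.

```perl
use strict;
use warnings;

# Rescale an E-value from one database size to another, assuming the
# expected number of chance hits is proportional to the number of
# sequences searched. This is an approximation, not geanfammer's own rule.
sub rescale_evalue {
    my ($e_value, $old_db_size, $new_db_size) = @_;
    die "database sizes must be positive\n"
        if $old_db_size <= 0 || $new_db_size <= 0;
    return $e_value * $new_db_size / $old_db_size;
}

# The same hit reported with E = 0.01 against 1000 sequences would be
# reported with an E-value four times larger against 4000 sequences.
printf "%.2f\n", rescale_evalue(0.01, 1000, 4000);
```

   In practice this means a cutoff that worked for one genome should be
   reconsidered, not simply reused, when the DB is much larger or smaller.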

   The results of the FASTA or SSEARCH run are stored in the FA subdirectory. It is
   called FA because all the file names of the test DB look like fam_X_X and
   geanfammer takes the first 2 characters.
   Inside FA you can see xxxx.msp files. These are summaries of the search.
   For an example of the MSP file format, see:

   http://www.mrc-lmb.cam.ac.uk/genomes/msp_file_format_example.gif
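
   The subdirectory naming described above (fam_X_X going into FA) can be
   sketched as follows. The subroutine name subdir_for and its optional length
   argument are hypothetical; the sketch only assumes that the first characters
   of the name are taken and upper-cased, as the FA example suggests.

```perl
use strict;
use warnings;

# Derive a subdirectory name from a sequence (file) name by taking its
# first $n characters in upper case, e.g. 'fam_1_1' -> 'FA'.
# The default of 2 characters follows the FA example in the text.
sub subdir_for {
    my ($name, $n) = @_;
    $n = 2 unless defined $n;
    return uc(substr($name, 0, $n));
}

print subdir_for($_), "\n" for qw(fam_1_1 fam_8_3);
```

   Both test names map to FA, which is why the whole test DB ends up in a
   single subdirectory.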


Geanfammer Module

   A file called Geanfammer.pm is included. It is the product of the 'pl2pm'
   program, which is distributed with Perl. To use this module, put either
   'require' or 'use' in your Perl program. This is useful if you want to call
   subroutines from Geanfammer without copying them into your program.
   All the subroutines in geanfammer can also be found on the CPAN site:

   ftp://unix.hensa.ac.uk/mirrors/perl-CPAN/modules/by-authors/Jong_Park/


Requirements and Installation.

   1. Any operating system can be used as long as Perl version 5 is installed.
      This includes Linux, WinNT, Windows95, UNIX, Mac, and many others.

      The path of the Perl interpreter is set on the first line of the program as:

             #!/usr/bin/perl

      If your Perl is not installed or linked at that location, please change
      the path to your own.

   2. Copy geanfammer.pl and all the accompanying .pl files to a directory in your
      execution path, for example  /usr/local/bin  or  /usr/people/John_Smith/bin.

   That is all.

Add-ons.

   3.1. Faster C Binary version.

        If absolutely necessary, the Perl code can be compiled to a C binary
        to speed up the division of the wrong single-linkage clusters into
        domain-level clusters. There are many different platforms, so we cannot
        build such a binary unless we have the same OS as you. We will be happy
        to tell you how to compile it yourself; it is simple to do.

   Suggestions for improvement are welcome; please contact us at the email
   or postal address below.


References.

   http://www.mrc-lmb.cam.ac.uk/genomes/geanfammer.html


Contacts.

   Sarah A. Teichmann and Jong H. Park
   sat@mrc-lmb.cam.ac.uk,  jong@mrc-lmb.cam.ac.uk

   Dept. of Structural Studies,
   Laboratory of Molecular Biology (LMB)
   MRC Centre, Hills Road, Cambridge, CB2 2QH, UK,
   Tel: +44 01223 402479


Copyright problem.

   The code in the package is distributed under the same terms as Perl itself.
   This essentially means that, as long as you respect the developers' time and
   work, it is freely available.


Acknowledgement.

   As all scientific work is essentially community work, Jong thanks all past and
   present scientists for their devotion to science.
   Alex Bateman, Bissan Al-lazikani, Tim Hubbard, Graeme Mitchison and others have
   been helpful on many occasions.


Appendix.

   The following is the header of the program geanfammer.pl.


#________________________________________________________________________________
# Title     : geanfammer.pl
# Usage     : geanfammer.pl DATABASE(or GENOME)
#
# Function  : creates a domain level clustering file from a given FASTA format sequence
#             DB. It has been used for complete genome sequence analysis.
#
# Example   : geanfammer.pl E_cli_gnome.fa                  # simplest form of execution
#             geanfammer.pl E_cli_gnome.fa a=ssearch        # use SSEARCH instead of FASTA
#             geanfammer.pl E_cli_gnome.fa o                # for overwriting files in running
#                                                           # when you want a fresh run over old
#             geanfammer.pl E_cli_gnome.fa c                # For keeping MSSO files (fasta output)
#             geanfammer.pl E_cli_gnome.fa k=2              # changing default k tuple for FASTA to 2
#             geanfammer.pl E_cli_gnome.fa E=0.001          # set the E value for initial single linkage
#                                                           #  clustering
#             geanfammer.pl E_cli_gnome.fa e=0.001          # set the E value for domain level linkage
#             geanfammer.pl E_cli_gnome.fa e=0.001 E=0.01   # set the 2 E values
#
# Keywords  : genome_analysis_and_protein_family_maker, genome_ana_protein_fam_maker
# Options   :
#             o  for overwrite existing xxxx.fa files for search
#             c  for create MSSO or SSO file (sequence search out file)
#             d  for very simple run and saving the result in xxxx.gz format in sub dir starting with one char
#             k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
#             a= for choosing either fasta or ssearch algorithm
#             E= for Evalue cutoff for single linkage clustering $E_cut_main
#             e= for Evalue cutoff for divide_clusters subroutine.
#
#   !! Do not remove the following lines down to # Author line. This program parses them
#
#  $over_write=o by o -o
#  $k_tuple= by k=
#  $upper_expect_limit= by u=
#  $lower_expect_limit= by l=
#  $algorithm= by a=
#  $No_processing=N by N -N
#  $single_msp=s by s -s
#  $sequence_db_fasta= by DB=
#  $query_file= by File=
#  $machine_readable=M by M -M
#  $make_subdir_out=D by D
#  $make_subdir_gzipped=d by d -d
#  $direct_MSP_conversion=m by m -m
#  $verbose_opt=v by v -v
#  $sub_dir_size= by d=
#  $Evalue_cut_single_link= by E=
#  $Evalue_cut_divclus= by e=
#  $optimize=z by z -z
#
# Author    : Sarah A Teichmann, Jong Park, sat@mrc-lmb.cam.ac.uk, jong@mrc-lmb.cam.ac.uk
# Version   : 1.0
#--------------------------------------------------------------------------------
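
   The header states that the program parses the option lines of the form
   '$variable=value by switch ...', but it does not show how. The sketch below
   is one possible reading of that format, not the parser used in geanfammer.pl.
   It assumes that a switch ending in '=' takes a value from the command line
   (k=2) and that any other switch simply sets the variable to the value given
   in the spec line. The subroutine names parse_spec and read_options are
   hypothetical.

```perl
use strict;
use warnings;

# A few spec lines copied from the header above.
my @spec_lines = (
    '#  $over_write=o by o -o',
    '#  $k_tuple= by k=',
    '#  $algorithm= by a=',
    '#  $Evalue_cut_single_link= by E=',
    '#  $Evalue_cut_divclus= by e=',
);

# Build two lookup tables from the spec lines:
#   flag table:  switch (e.g. 'o', '-o') -> [variable name, value to set]
#   keyed table: key    (e.g. 'k')       -> variable name
sub parse_spec {
    my (%flag, %keyed);
    for my $line (@_) {
        next unless $line =~ /^#\s+\$(\w+)=(\S*)\s+by\s+(.+?)\s*$/;
        my ($var, $value, $switches) = ($1, $2, $3);
        for my $sw (split ' ', $switches) {
            if ($sw =~ /^(\w+)=$/) { $keyed{$1} = $var }
            else                   { $flag{$sw} = [ $var, $value ] }
        }
    }
    return (\%flag, \%keyed);
}

# Turn command-line arguments into a hash of option variables.
# Arguments that match no switch (such as the database file) are skipped.
sub read_options {
    my ($flag, $keyed, @args) = @_;
    my %opt;
    for my $arg (@args) {
        if ($arg =~ /^(\w+)=(.*)$/ && exists $keyed->{$1}) {
            $opt{ $keyed->{$1} } = $2;
        }
        elsif (exists $flag->{$arg}) {
            my ($var, $value) = @{ $flag->{$arg} };
            $opt{$var} = $value;
        }
    }
    return %opt;
}

my ($flag, $keyed) = parse_spec(@spec_lines);
my %opt = read_options($flag, $keyed, qw(E_cli_gnome.fa o k=2 E=0.001));
print "$_ = $opt{$_}\n" for sort keys %opt;
```

   Keeping the switch definitions in the header means the documentation and the
   accepted options cannot drift apart, which is presumably why the header
   warns against removing those lines.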