GEnome ANalysis and protein FAMily MakER

														 Geanfammer Package

											Cyrus Chothia group. MRC, Cambridge

			 Ref: http://cyrah.med.harvard.edu/~jong/divclus_paper_abstract.html


Version.

	"Geanfammer_V1.6.2.tar.gz"   5th/Jan/99
	 1.6.2 Fixed a bug in merge_sequence_in_msp_chunk and merge_sequence_in_msp_file
	       Disabled much of the printed output to speed up execution

	"Geanfammer_V1.6.1.tar.gz"   29th/Aug/98
	 1.6.1 Fixed a bug in convert_clu_to_msp

	"Geanfammer_V1.6.tar.gz"   28th/Aug/98
	 1.6 Fixed a bug in msp_single_link_hash,
	 1.6 Fixed a bug in msp_single_link.pl (read_file_names_only updated)

	"Geanfammer_V1.5.tar.gz"   8th/Aug/98
	 1.5 runs PSI-blast
	 1.5 had a bug in convert_bla_to_msp fixed; it now safely uses PSI-BLAST output
	 1.5 uses the 'formatdb' program of BLAST to create the DB automatically (when
			 formatdb is in the executable path)
	 1.5 was put in ftp dir of LMB,  1.5 is also found in major CPAN sites.

	"geanfammer_suite_1.4.tar.gz"   6th/July/98
	 1.4 was put in ftp dir of LMB
	 1.4 was a bug fix version. Helped mainly Lily Fu (lily@tigr.org)
	 1.4 is also found in major CPAN sites.
	"geanfammer_suite_1.3.tar.gz"
	 1.3    was put in the ftp dir of LMB on 25th Jan. 1998
	 1.0.1. was put into the ftp directory of LMB on 14th Oct. 1997.
	 (ftp://ftp.mrc-lmb.cam.ac.uk/genomes/Software/Geanfammer)
	 Not yet 100% public.

Requirements:

	 1) Perl5 must be installed. (If you do not have it, get it now, as it will save time.)

				http://www.perl.com/

	 2) The FASTA pairwise search program must be installed in the executable path.
				(You almost certainly have it if you work in biology.)
				FASTA can be found at: ftp://ftp.virginia.edu/pub/fasta/
	 3) Linux is recommended. In our experience it is the fastest and cheapest option,
				and it runs on inexpensive hardware.


Introduction.

	 Geanfammer is a comprehensive package of programs and a Perl subroutine library.
	 It grew out of an analysis of the bacterial genomes published since 1995.

	 It is composed of two types of programs. The first is a single complete program
	 called 'geanfammer.pl', which creates a single output file showing all the
	 sequence domains (duplication modules or seqlets) found in any protein
	 sequence database. A good example input would be the protein sequences of one
	 complete genome in FASTA file format.

	 (Usage example)

					 "geanfammer.pl Your_Genome_or_DB.fa"        or
					 "geanfammer.pl Your_Genome.fa E=10 e=10"

	 The above will produce "Your_Genome.gclu" or "sorted_cluster_file.gclu".
	 The extension gclu stands for 'good clustering'. 'Good' does not mean that
	 the clustering is guaranteed to be biologically correct, only that the domain
	 level clustering succeeded. In the authors' opinion, it is biologically significant.

	 The other is a collection of programs which are essentially the parts of
	 geanfammer.pl. In fact, geanfammer is the integration of all the component
	 programs.
	 These are included in the package so that users can run the steps of
	 the geanfammer program one at a time when any problems are encountered.

	 They cover the following steps:

			 1. Sequence search (to create MSSO files; MSSO is a generic term for several SSO files)
			 2. Converting the resulting MSSO files to more useful MSP files
			 3. Making the single linkage clusters (big and wrong clusters).
			 4. Breaking down the single linkage clusters into domain level linkage
			 5. Summarising the results into a single GCLU file (a summary file is also created).
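
	 Step 3 can be illustrated with a minimal sketch of single linkage clustering.
	 This is not the code used by geanfammer: it is written in Python purely for
	 illustration, the function name single_linkage and the (seq_a, seq_b, evalue)
	 hit tuples are hypothetical, and treating the cutoff as inclusive is an
	 assumption. Any two sequences joined by a hit at or below the cutoff end up in
	 the same cluster, however indirect the chain of hits.

```python
def single_linkage(hits, e_cutoff):
    """Group sequences connected by hits whose E value is <= e_cutoff.

    hits is an iterable of (seq_a, seq_b, evalue) tuples (hypothetical layout).
    Sequences without any accepted hit never appear in the result.
    """
    parent = {}

    def find(name):
        # Union-find lookup with path halving.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for seq_a, seq_b, evalue in hits:
        if evalue <= e_cutoff:  # assumption: the cutoff itself is accepted
            root_a, root_b = find(seq_a), find(seq_b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    # Drop singletons (possible only through self hits) and sort for stable output.
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


hits = [("p1", "p2", 0.001), ("p2", "p3", 0.002), ("p4", "p5", 5.0)]
print(single_linkage(hits, e_cutoff=0.1))  # [['p1', 'p2', 'p3']]
```

	 Here p1 and p3 never matched each other directly, yet they share a cluster
	 because both match p2. That chaining is what step 4 has to undo for
	 multi-domain proteins.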


Test Run

	 We have included a test FASTA format file which contains protein families of
	 1 to 8 sequences.
	 The sequences within one family are identical, so the single linkage and
	 final domain clustering results should be identical. The single sequence
	 fam_1_1 will disappear, as we do not regard an orphan sequence as a family member.

	 To run it, type:   " geanfammer.pl geanfammer_test_FASTA_DB.fa E=30 e=30"

	 The 2 E values are absurdly high; they are used here only because the test DB
	 has just around 30 sequences. E values depend on the DB size. If you have around
	 1000 sequences in your DB, you can perhaps choose 0.05 or 0.01 for clustering.
	 The run will produce a large number of files, so it is better to make a subdir first.
	 The single linkage clustering file is called xxxx.sclu, while the domain level
	 clustering file is called xxxx.gclu.
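
	 The dependence of E values on DB size can be made concrete with a rough model,
	 assumed here for illustration only: the E value is approximately the chance
	 probability of a score in one comparison multiplied by the number of sequences
	 searched. The function name and the probability below are hypothetical, and
	 the model ignores everything else the search programs take into account.

```python
def expected_chance_hits(p_single, db_size):
    """Approximate E value: per-comparison chance probability times DB size."""
    return p_single * db_size


# The same made-up per-comparison probability, searched against DBs of
# different sizes (30 is roughly the test DB, 1000 a small genome).
p_single = 1e-5
for db_size in (30, 1000, 180000):
    print(db_size, expected_chance_hits(p_single, db_size))
```

	 Under this model a match of the same quality gets an E value 6000 times larger
	 in a 180,000 sequence DB than in a 30 sequence DB, which is why a cutoff has
	 to be chosen for each database.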

	 The results of the FASTA or SSEARCH run are stored in the FA subdirectory. FA is
	 created because all the file names of the test DB look like fam_X_X, and geanfammer
	 takes the first 2 characters.
	 Inside FA, you can see xxxx.msp files. These are the summaries of the search.
	 To see what the MSP file format looks like, check out:

	 http://www.mrc-lmb.cam.ac.uk/genomes/msp_file_format_example.gif


Real Genome TEST!!

	 We have included the smallest complete genome, that of Mycoplasma genitalium
	 (MG.fa), in the distribution to play with.
	 Geanfammer should produce a domain level clustering that depends on your
	 choice of E value threshold.

	 Try:   geanfammer.pl MG.fa E=0.2 e=0.2

	 and see what it produces. The search part of the program will take
	 the most time. It will produce a subdirectory called MG in which
	 the results of search will be stored. Final results will be made in
	 the present directory. So, it is a good idea to make a new directory
	 for the test and run geanfammer inside it.
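
	 Running inside a fresh directory can be scripted. The following is a minimal
	 sketch in Python, assuming geanfammer.pl is in the executable path; the helper
	 names build_command and run_in_fresh_dir are hypothetical and not part of the
	 package. Only the E= and e= options described in this file are used.

```python
import os
import shutil
import subprocess


def build_command(fasta_name, big_e=None, small_e=None):
    """Return the geanfammer.pl argument list for the given E and e cutoffs."""
    cmd = ["geanfammer.pl", fasta_name]
    if big_e is not None:
        cmd.append(f"E={big_e}")   # single linkage cutoff
    if small_e is not None:
        cmd.append(f"e={small_e}")  # domain level cutoff
    return cmd


def run_in_fresh_dir(fasta_path, work_dir, big_e=None, small_e=None):
    """Copy the FASTA file into work_dir and run geanfammer.pl there."""
    os.makedirs(work_dir, exist_ok=True)
    local_copy = shutil.copy(fasta_path, work_dir)
    cmd = build_command(os.path.basename(local_copy), big_e, small_e)
    # All search subdirectories and result files are then created in work_dir.
    return subprocess.run(cmd, cwd=work_dir, check=True)


# Example (not executed here): run_in_fresh_dir("MG.fa", "mg_test", 0.2, 0.2)
print(build_command("MG.fa", 0.2, 0.2))
```

	 Copying the input into the working directory keeps the MG subdirectory and
	 the final result files away from the distribution files.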


Geanfammer Module

	 A file called Geanfammer.pm is included, which is the product of the 'pl2pm' program
	 distributed with Perl. To use this module, you can either 'require' or 'use' it
	 in your Perl program. This is useful if you want to call any of the
	 subroutines in Geanfammer without copying them into
	 your program. All the subroutines in geanfammer can also be found at the CPAN sites:

	 ftp://unix.hensa.ac.uk/mirrors/perl-CPAN/modules/by-authors/Jong_Park/
	 http://www.perl.com/CPAN-local//modules/by-category/23_Miscellaneous_Modules/Bio/

Installation.

	 1. Any computer operating system can be used as long as Perl version 5 is installed.
			This includes LINUX, WinNT, Windows95, UNIX, Mac, and many others.

			The Perl interpreter path is set on the first line of the program as:

						 #!/usr/bin/perl

			If your perl is not linked or installed in that place, please change
			the path to your own.

	 2. Copy geanfammer.pl and all the accompanying .pl files to a directory in your
			execution path, for example /usr/bin/, /usr/local/bin or /usr/people/John_Smith/bin.

	 That is all.

Add-ons.

	 3.1. Faster C Binary version.

				If it is absolutely necessary, we can compile the Perl code to a C binary
				to speed up the division of the wrong single linkage clusters into domain
				level clusters. There are many different platforms to cover, so
				unless we have the same OS as you, we cannot make such a binary ourselves. We
				will be happy to tell you how to compile it. It is simple to make one.

	 Any suggestions for improvement are welcome; please contact us at the following
	 email addresses or postal address.


References.

	 http://www.mrc-lmb.cam.ac.uk/genomes/geanfammer.html


Contacts.

	 Sarah A. Teichmann and Jong H. Park
	 sat@mrc-lmb.cam.ac.uk,  jong@mrc-lmb.cam.ac.uk

	 Division of Structural Studies,
	 Laboratory of Molecular Biology (LMB)
	 MRC Centre, Hills Road, Cambridge, CB2 2QH, UK,
	 Tel: +44 01223 402479


Copyright.

	 The code in the package is distributed under the same terms as Perl itself. This essentially
	 means that, as long as you respect the developers' time and work, it is freely
	 available.
	 If you are in a company, please contact Sarah or Jong before using it for
	 commercial purposes. We encourage commercial use, as it helps people
	 help others in domain finding. We charge a nominal fee to grant the rights
	 to use the package commercially.


Acknowledgement.

	 As all scientific work is essentially community work, Jong thanks all past and
	 present scientists for their devotion to science.
	 Alex Bateman, Bissan Al-lazikani, Tim Hubbard, Graeme Mitchison and others have been
	 helpful on many occasions.


Appendix.

		BIOINFORMATICS, Volume 14, Issue 2: March 1998.

		DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and
		multi-domain proteins

		Jong Park (1,2) and Sarah A. Teichmann (1)

		(1) MRC Laboratory of Molecular Biology and (2) Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK

		Abstract

		Motivation: Large-scale determination of relationships between the proteins produced by genome sequences is now common. All protein sequences are matched and
		those that have high match scores are clustered into families. In cases where the proteins are built of several domains or duplication modules, this can lead to misleading
		results. Consider the very simple example of three proteins: 1, formed by duplication modules A and B; 2, formed by duplication modules B' and C; and 3, formed
		by duplication modules C' and D. Duplication modules B and B' are homologous, as are C and C'. Matching the sequences of 1, 2 and 3 followed by
		simple single-linkage clustering would put all three in the same family, even though proteins 1 and 3 are not related. This is because the different parts of 2 match 1 and 3.
		This paper describes a procedure, DIVCLUS, that divides such complex clusters of partially related sequences into simple clusters that contain only related duplication
		modules. In the example just given, it would produce two groups of sequences: the first with domains B of sequence 1 and B of sequence 2, and the second with domain
		C of sequence 2 and C of sequence 3. DIVCLUS is part of a package called GEANFAMMER, for GEnome ANalysis and protein FAMily MakER. The package
		automates the detection of families of duplication modules from a protein sequence database.
		Results: DIVCLUS has been applied to the division of single-linkage clusters generated from the protein sequences of six completely sequenced bacterial genomes. Out
		of 12 013 genes in these six genomes, 4563 single- and multi-domain sequences formed 1071 complex clusters. Application of the DIVCLUS program resolved these
		clusters into 2113 clusters corresponding to single duplication modules.
		Availability: The perl5 program and its documentation are available at the following address: http://www.mrc-lmb.cam.ac.uk/genomes/ and by anonymous ftp at
		ftp.mrc-lmb.cam.ac.uk in the directory /pub/genomes/Software/.
		Contact: sat@mrc-lmb.cam.ac.uk; jong@mrc-lmb.cam.ac.uk
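
		The three-protein example in the abstract can be sketched in a few lines of Python. This is a
		simplified illustration of linking matched regions instead of whole sequences, not the DIVCLUS
		procedure itself. The hit layout (seq_a, start_a, end_a, seq_b, start_b, end_b), the function
		names, the residue coordinates and the min_overlap value are all hypothetical.

```python
from itertools import combinations


def overlap_fraction(r1, r2):
    """Overlap of two (seq, start, end) regions as a fraction of the shorter one."""
    if r1[0] != r2[0]:
        return 0.0
    shared = min(r1[2], r2[2]) - max(r1[1], r2[1]) + 1
    shorter = min(r1[2] - r1[1], r2[2] - r2[1]) + 1
    return max(shared, 0) / shorter


def group_regions(hits, min_overlap=0.5):
    """Link the two regions of each hit, plus regions that overlap on one sequence."""
    parent = {}

    def find(region):
        parent.setdefault(region, region)
        while parent[region] != region:
            parent[region] = parent[parent[region]]
            region = parent[region]
        return region

    def union(r1, r2):
        root1, root2 = find(r1), find(r2)
        if root1 != root2:
            parent[root2] = root1

    for seq_a, start_a, end_a, seq_b, start_b, end_b in hits:
        union((seq_a, start_a, end_a), (seq_b, start_b, end_b))
    for r1, r2 in combinations(list(parent), 2):
        if overlap_fraction(r1, r2) >= min_overlap:
            union(r1, r2)

    groups = {}
    for region in list(parent):
        groups.setdefault(find(region), []).append(region)
    return sorted(sorted(g) for g in groups.values())


# Protein 1 = A + B, protein 2 = B' + C, protein 3 = C' + D (made-up coordinates).
hits = [
    ("prot1", 101, 200, "prot2", 1, 100),   # B matches B'
    ("prot2", 101, 200, "prot3", 1, 100),   # C matches C'
]
for group in group_regions(hits):
    print(group)
```

		Because the two matched regions of protein 2 do not overlap, the regions fall into two groups
		(B with B', C with C'), whereas linking whole sequences would have put all three proteins into
		one cluster.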


The following is the header of the program geanfammer.pl.

#________________________________________________________________________
# Title     : geanfammer.pl
#
# Usage     : geanfammer.pl DATABASE(or GENOME) [e= ] [f=]
#                  * look at the Example section down below!
#
# Function  : Creates a domain level clustering file from a given
#              FASTA format sequence DB. It has been used for complete
#              genome sequence analysis.
#
#              ------------ USAGE INFORMATION -------------------
#             The parameters you set are important for building
#               meaningful protein families.
#             The most important ones are the E and e options (mostly,
#               they will have the same value).
#             The capital E sets the threshold for the single
#               linkage clustering.
#             This means that any sequence hit BELOW the threshold
#               (which is good) will be linked.
#             For example, if Seq1 matched Seq2 with a FASTA search
#              E value of
#              0.001, and you set the threshold to 0.1, then YOU
#              ordered geanfammer to regard them as a family.
#
#             The second, small e option is for dividing a complex
#              and wrong cluster into more correct
#              duplication modules. This is necessary as a
#              lot of multidomain proteins can be clustered together
#              WRONGLY by single linkage.
#             At this stage, the e value is independent of the E value
#              and you can set a higher or lower one. Or you can set
#              it the same as E (just set the 2 the same!)
#
#             Rough guide from our experience for E and e values:
#              We know that with a 1000 sequence database, 0.01
#              produces around 1% error in grouping sequences
#              according to the E value.
#              With 180,000 sequences, 0.081 gave us less than 1% error.
#             The E value of FASTA and SSEARCH is DEPENDENT on DB size,
#              so you need to experiment a little to find the best
#              E value for your OWN database or genome.
#             The best approach is:
#               1) Run geanfammer.pl on your target DB
#                  with an E value of your choice.
#               2) Check the sequence families which are clustered
#                  in the final result file xxxx.gclu and decide
#                  whether the E value is too low or too high. Lower E values
#                  make sure you do not make wrong clusters, while
#                  higher E values include more probable sequence
#                  family members.
#               3) Put all the xxxx.msp files in the subdirectory(s)
#                  created by geanfammer and run divclus.pl (which
#                  is included in the package) with different
#                  E values. Divclus does not run any search algorithm,
#                  so it can be done fairly quickly.
#
#          * Most of the subroutines are found in Bio.pl or Bio.pm
#            Bioperl library for perl.
#
#
# Example   :geanfammer.pl E_gnme.fa             # simplest form
#            geanfammer.pl E_gnme.fa a=ssearch   # use SSEARCH
#            geanfammer.pl E_gnme.fa o           # for overwriting
#                                                   when you want a
#                                                   fresh run over old results
#            geanfammer.pl E_gnme.fa c         # For keeping
#                                                 SSO files
#                                                 (fasta output)
#            geanfammer.pl E_gnme.fa k=2       # changing default
#                                                 k tuple for
#                                                 FASTA to 2
#            geanfammer.pl E_gnme.fa E=0.01     # set the E value
#                                                 for initial single
#                                                 linkage clustering
#            geanfammer.pl E_gnme.fa e=0.01    # set the E value
#                                                for domain level linkage
#       -->  geanfammer.pl E_gnme.fa e=0.01 E=0.01 # set the 2 E values
#                                                    separately (no need
#                                                    to do this)
#
# Keywords  : genome_analysis_and_protein_family_maker,
#             genome_ana_protein_fam_maker
# Options   :
#             o  for overwrite existing xxxx.fa files for search
#             c  for create SSO file (sequence search out file)
#             d  for very simple run and saving the result in
#                    xxxx.gz format in sub dir starting with one char
#             N
#             s
#             m
#             v  for debugging purpose. It says more to you while running
#             z
#             D  for making subdir like ./MG or ./FA in PWD. For clean PWD
#             L  for Lean output(removes all the intermediate
#                                     outputs to save space)
#             u  for making separate summary file (redundant now)
#
#             DB=
#             File=
#             k= for k-tuple value. default is 1 (ori. FASTA prog.
#                                                   default is 2)
#             a= for choosing either fasta or ssearch algorithm
#                    You can set absolute path like (/usr/bin/fasta)
#             E= for Evalue cutoff for single linkage clustering
#                    $E_cut_main
#             e= for Evalue cutoff for divide_clusters subroutine.
#             u=
#             l=
#             d=
#             T= for minimal domain size (default is 30 aa residue)
#
#   !! Do not remove the following lines down to # Author line.
#                This program parses them!!
#
#  $factor=                 by f=     ## overlapping factor
#  $Lean_output=L           by L -L
#  $dynamic_factor=y        by y  Y -y -Y
#  $over_write=o            by o -o
#  $create_sso_file=c       by c -c
#  $k_tuple=                by k=
#  $upper_expect_limit=     by u=
#  $lower_expect_limit=     by l=
#  $algorithm=              by a=
#  $No_processing=N         by N -N
#  $single_msp=s            by s -s
#  $sequence_db_fasta=      by DB=
#  $query_file=             by File=
#  $machine_readable=M      by M -M
#  $make_subdir_out=D       by D
#  $make_subdir_gzipped=d   by d -d
#  $direct_MSP_conversion=m by m -m
#  $verbose=v               by v -v
#  $sub_dir_size=           by d=
#  $Evalue_cut_single_link= by E=
#  $Evalue_cut_divclus=     by e=
#  $optimize=z              by z -z
#  $make_separate_summary=u by u -u
#  $length_thresh=          by T=      # minimal sequence domain length threshold
#
# Author    : Sarah A Teichmann, Jong Park, sat@mrc-lmb.cam.ac.uk,
#                                      jong@salt2.med.harvard.edu
# Version   : 1.7
#------------------------------------------------------------------