CompBio(3)     User Contributed Perl Documentation     CompBio(3)


NNNNAAAAMMMMEEEE
       CompBio - Core library for some basic methods useful in
       computational biology/bioinformatics.

SSSSYYYYNNNNOOOOPPPPSSSSIIIISSSS
       use CompBio;

       my $cbc = new->CompBio;

       $AR_faseqs = $cbc->tbl_to_fa(\@tblseqs);

       or

       use CompBio qw(tbl_to_fa);

       $AR_faseqs = tbl_to_fa(\@tblseqs);

DDDDEEEESSSSCCCCRRRRIIIIPPPPTTTTIIIIOOOONNNN
       The CompBio module set is being developed as a new
       implementation of the code base originally developed at
       the BioMolecular Engineering Research Center
       (http://bmerc-www.bu.edu). CompBio.pm is intended to take
       a number of small, commonly used methods for supporting
       bioinformatics research, and make them into a single
       package. Many of the utils included with this distribution
       are just command line interfaces to the methods contained
       herein.

       The CompBio module set is _not_ intended to replace the
       bioperl project (http://www.bioperl.org/). Although I do
       welcome suggestions for improving or adding to the methods
       available in these modules, particularly I would love any
       help with things on the TO DO list, these modules are not
       intended to provide the depth that the bioperl suite can
       provide.

       CompBio has a limited API. It expects it's input to be in
       specific formats, as described in each methods
       description, and it's output is in a format that makes the
       most sense to me for that method. It does no error
       checking by and large, so incorrect input could cause
       bizzare behavior and/or a noisy death. If you want a
       flexible interface with lots of error checking and deep
       levels of verbosity, use the CompBio::Simple manpage -
       that's its job.

       Latest version is available through SourceForge
       http://sourceforge.net/projects/compbio/, or on the CPAN
       http://www.cpan.org/authors/id/S/SE/SEANQ/CompBio-0.461.tar.gz.

       Thanks!

       Other modules available (or that will be available) in the
       CompBio set are:

       The DB module will only be imediately useful if you import
       the databases as used by us here at the
       BMERC(ftp://mcclintock.bu.edu/BMERC/mysql/) or develop
       your own on the same basic design scheme. Otherwise I hope
       you find it useful as a source of design ideas for rolling
       your own. Please note however that I intend to expand the
       methods in CompBio/Simple.pm to allow seemless access to
       data through the DB module. Although at no time will
       including the DB module be required for Simple to work, I
       think that if you have the space for it, you will find
       having the data locally and adding this module (or
       adapting it to work with databases already installed) will
       have an imense and imediate benifiacial impact. It did for
       us at least! :)

       The Profile module was designed to work with our PIMA-II
       sofware and the PIMA modules. The PIMA suite is available
       for license from Boston University for a nominal fee, and
       free for academic use. For examples and more info see
       http://bmerc-www.bu.edu/PIMA/. Unless you have or are
       interested in that package, this module will have no
       functional value.

MMMMeeeetttthhhhooooddddssss
       You may note that the majority of the methods here are for
       converting sequences from one format to another. Mainly
       this is for converting other formats to table format,
       which is used by most of these programs. This is not meant
       to be a comprhensive collection of format guessing and
       transformation methods. If you are looking for a converter
       for a format CompBio doesn't handle, I suggest you look
       into bioperls SeqIO package or the READSEQ program (java),
       found at http://iubio.bio.indiana.edu/soft/molbio/readseq/

       nnnneeeewwww

       Construct an object for invoking methods in CompBio.

       hhhheeeellllpppp

       Quits current application and uses perldoc to display the
       POD.

       cccchhhheeeecccckkkk____ttttyyyyppppeeee

       Checks a given sequence or set of sequences for it's type.
       Currently groks fasta(.fa), table(.tbl), raw genome(.raw),
       intelligenics[?](.ig) and coding dna sequence(.cdna)
       types. Each index of the referenced array should be an
       entire sequence record. * It is however nnnnooootttt recomended
       that you load up an entire raw genome into memory to do
       this test - see the perlfunc read entry elsewhere in this
       document *

       `$type = check_type(\@seqs,%parameters);'

       Possible return types are CDNA, TBL, FA, IG, RAW and
       UNKNOWN.

       Be warned, this is intended only as a quick check and only
       uses as many records as necisarry in the reference
       provided, stating with the first.  check_type assumes the
       rest of the records look the same and does not do any kind
       of deep QA on the set. If you are not sure, invoke
       check_type with a few random samples from your set, or use
       the CompBio::Simple manpage, which does that by default.

       ttttbbbbllll____ttttoooo____ffffaaaa

       Converts a sequence record in table (tab delimited,
       usually .tbl file extension) format to fasta format.

       `$aref_faseqs = $cbc-'tbl_to_fa(\@seqdat,%params);>

       Each index in the @seqdat array must contain entire record
       (loci\tsequence) for single sequence. Return is an array
       reference, still one sequence per index.

       Extra data fields in the table format will be added to the
       annotation line (>) in the fasta output.

       ttttbbbbllll____ttttoooo____iiiigggg

       Converts a sequence in table (tab delimited) format to .ig
       format. Accepts sequences in a referenced array, one
       record per index.

       `$aref_igseqs = $cbc-&gt;tbl_to_ig(\@tbl_seqs,%params);'

       Extra data fields in the table format will be placed on a
       single comment line (;) in the ig output.

       ffffaaaa____ttttoooo____ttttbbbbllll

       Accepts fasta format sequences in a referenced array, one
       complete sequence record per index. This method returns
       the _s_e_q_u_e_n_c_e(s) in table(.tbl) format contained in a
       referenced array.

       `$aref_faseqs = $cbc-'fa_to_tbl(\@fa_seq);>

       Extra data fields in the fasta format will be placed as
       extra tab delimited values after the sequence record.

       Options:

       CLEAN: Reduces signifier to the first strech of non white
       space characters found at least 4 characters long with no
       of the characters (| \ ? / ! *)

       iiiigggg____ttttoooo____ttttbbbbllll

       Accepts ig format sequences in a referenced array, one
       complete sequence record per index. This method returns
       the _s_e_q_u_e_n_c_e(s) in table(.tbl) format contained in a
       referenced array.

       `$aref_igseqs = $cbc-'ig_to_tbl(\@fa_seq);>

       Extra comment lines in the ig format will be placed as
       extra tab delimited values after the sequence record.

       ddddnnnnaaaa____ttttoooo____aaaaaaaa

       Converts a dna sequence, containing no whitespace,
       submited as a scalar reference, to the amino acid residues
       as coded by the standard 'universal' genetic code.  Return
       is a reference to a scalar. dna sequence may contain
       standard special characters (i.e. R,S,B,N, ect.).

       `$aa = dna_to_aa(\$dna_seq,%params);'

       Options:

       C: Set to a true value to indicate dna should be converted
       to it's complement before translation.

       ALTCODE: A reference to a hash containing alternate coding
       keys where the value is the new aa to code for. Stop
       codons are represented by ".". Multiple allowable values
       for the final dna in the codon may be provided in the
       format AT[TCA].

       SEQFIX: Set to true to alter first position, making V or L
       an M, and removing stop in last position.

       ccccoooommmmpppplllleeeemmmmeeeennnntttt

       Converts dna to it's complementary strand. DNA sequence is
       submitted as scalar by reference. There is no return as
       sequence is modified. To maintain original sequence, send
       a reference to a copy.

       `complement(\$dna);'

       ssssiiiixxxx____ffffrrrraaaammmmeeee

       Converts a submitted dna sequence into all 6 frame
       translations to aa. Value submitted may be a reference to
       a scalar containing the dna or a filename.  Either way dna
       must be in RAW(.raw) format (Not whitespace within
       sequence, only one newline allowed at end of record).
       Output id's have start and stop positions encoded; if
       first value is larger, translated from anti-sense.

       `$result = six_frame($raw_file,$id,$seq_len,$out_file);'

       Options:

       ALTCODE: A reference to a hash containing alternate coding
       keys where the value is the new aa to code for. Stop
       codons are represented by ".". Multiple allowable values
       for the final dna in the codon may be provided in the
       format AT[TCA].

       ID: Prefix for translated segment identifiers, default is
       SixFrame.

       SEQLEN: Minimum length of aa sequence to return, default
       is 10.

       OUTPUT: A filename to pipe results to. This is recomended
       for large dna sequences such as large contigs or whole
       chromosomes, otherwise results are stored in memory until
       process complete. If the value of OUTFILE is STDOUT
       results will be sent directly to standard out.

       wwwwuuuu____bbbbllllaaaasssstttt

       A simple interface to the Washington University
       distribution of blast. Only recently tested with full
       2.0MP-WashU release. I do however stick to using the
       compatability features so it it should work with both the
       licensed and free releases of WUblast.

       wu_blast requires two arguments, a sequence database
       filename or a reference to an array containing the
       _s_e_q_u_e_n_c_e(s) to search, and the query sequence filename or
       a reference to an array containing the sequence to query
       with. Sequence files need to be in fasta format and
       sequences contained in arrays need to be in table format.
       For more versatility, use CompBio::Simple::blast. One
       thing wu_blast will do is prepare the db for searching if
       it isn't set up in advance; this does require you have
       write permision in the location of the base sequence file,
       and sufficient room in your quota (if used).

       Options:

       METHOD: Alternate method to use for sequence comparison.
       Default is blastp.

       PARAMS: Optional parameters to hand to the blast method.
       See the your local documentation for the method chosen to
       see what options are available.

       SERVER: Machine to use as the blast server. Defaults to
       the global $CPUSERVER.  rsh is currently used to launch
       the proccess.

       BLASTPREP: Alternate method to use to prepare a sequence
       database for being searched using the chosen blast method.
       Default is setdb.

       `$blast_out = blast($fa_file,$queryfile,(PARAMS =&gt;
       "-B=0 -matrix=BLOSUM62"));'

       AAAAAAAA____HHHHAAAASSSSHHHH

       Creates a hash table with amino acid lookups by codon.
       Includes all cases where even an alternate na code (such
       as M for A or C) would return an unambiguous aa.  Also
       consistent with the complement method in this package, ie,
       lower cases in some contexts, for ease of use with
       six_frame.

        C<%aa_hash = aa_hash;>


EEEEXXXXPPPPOOOORRRRTTTT
       None by default.

HHHHIIIISSSSTTTTOOOORRRRYYYY
       0.01    Original version; created by h2xs 1.20 with
               options

                 -AXC -n CompBio


       0.20    Begin porting core function code for most initial
               methods from various sources at BMERC. Write
               protocode and placeholders for what I want in
               initial 'complete' version.

       0.44    Almost everything eneded up being rewritten
               practically from scratch.  Removed lingering
               locale assumptions and (hopefully) improving
               interface (added use of %params mainly).  Added
               OOP useability so this module should now work
               either way.

       0.45    Modifications to Simple primarilly.

       0.451   Fixed faulty statement in MANIFEST.SKIP that
               excluded Makefile.PL from distribution (thanks to
               Andreas Riechert for pointing this out to me
               almost imediately).


       0.46    Added some documentation, mostly on options.
               Added the ALTCODE option to six_frame.  Finished
               modifying tbl_to* converters to make a pass at
               handeling extra data feilds from table format and
               changing *_to_tbl converters to keep extra data
               besides id in extra fields in table format.

       0.461   Started working on bringing the utility programs
               included into using CompBio::Simple, using
               dna_to_aa as first trial. Also some small changes
               to docs.

       0.464   Major updates and alterations. Got most of the
               CompBio utils at least minimally working. Added
               $RET_CODE usage to CompBio, setting old behavior
               as default, but allowing Simple to provide as a
               parameter, so that return formating (such as
               setting dna_to_aa to return in fasta) and
               file/STDOUT behavior is done during process,
               reducing memory usage. Matching major alterations
               to Simple.

       0.466   NOTE .464 was acidently never fully distributed as
               such. Sorry!

               Project has been added to SourceForge and the
               modules list on CPAN.

               Worked out some more bugs, mostly cropping up from
               interactions with utils such as tbl_to_fa, which
               interface through Simple. Mainly found places
               where lazy programming broke using the files tied
               to an array, so had to use less of $_. Things are
               shaking out nicely, and I think all the core bits
               are now in place. Once the utils are done, I'll do
               a major code cleanup and comment run and step up
               to 0.5 and push release. Finishing adding
               converters and fixing any reported bugs (new
               converters shouldn't add any new bugs, except
               inside the new methods) will benchmark going to
               0.6.

TTTTOOOO DDDDOOOO ---- iiiinnnn nnnnoooo ppppaaaarrrrttttiiiiccccuuuullllaaaarrrr oooorrrrddddeeeerrrr
       I like the basic design so far - except where large
       sequence sets are to be munged.  There this becomes a
       seriouse memory hog. Some faster & more efficient way
       needs to be provided when dealing with files & large data
       sets.

       complement and dna_to_aa need to be loop enabled, also
       alowing &$RET_CODE(\@ret,$str); to be used.

       **Got it - needs work!** Get the full release of WashU
       Blast and make sure the blast stuff works with it as well.

       OK, rethink blast _m_e_t_h_o_d(s) completely. Old and kludged
       and needs to die.

       Add a method for handeling ncbi blast. CompBio::Simple
       should only have a blast method that will DTRT and use the
       appropriate core methods.

       Add DNA* and GCG format handelers

       Add handler for the genbank report type format: 1   attgc
       gtgct 11  gtgtg   gacaa Which, though annoying, seems a
       certain candidate for recieving in cut and paste
       operations, particularly through the web.

       Better way to handle CPUSERVER. There must be some way to
       allow the user to define (presumably during ./config?) how
       to submit jobs for more intensive computational tools such
       as Blast and PIMA. Things like rsh, submitting to a batch
       queue, etc should all be definable at install somehow.

       write configure script to populate these globals as part
       of install process.

       modify tests to look for inclusion of Profile.pm or such
       and skip testing where apropriate if the PIMA suite is not
       installed.

       Find out if there is a way using OOP to allow a user to
       only need to include this module and create new objects
       for the submodules, or even better just DTRT. Can this be
       done just through inheritence? Should it be triggered by
       requested export (like use CompBio qw(Simple DB)? Or
       through an AUTOLOAD type interface, returning the correct
       object ($cbs = CompBio->Simple("new") or some such)? I
       think that would be far more desirable than _having_ to
       use a bunch of modules all in the CompBio namespace.

       _error (all packages) needs to correctly report line where
       error occured.  Can this be done through caller or do I
       need to pass manually?

       hmmm, how about a simple tm calculator for primers? I seem
       to be able to find lots of them that work through web
       interface, or more complex primer selection software, but
       none that plug and play to a perl program, and I'm needing
       one. Not a primer selecter, just a _solid_ (more than just
       GC content), straight foward Tm calculator.

CCCCOOOOPPPPYYYYRRRRIIIIGGGGHHHHTTTT
       Developed at the BioMolecular Engineering Research Center
       at Boston University under the NHLBI's Programs for
       Genomic Applications grant.

       Copyright Sean Quinlan, Trustees of Boston University
       2000-2001.

       All rights reserved. This program is free software; you
       can redistribute it and/or modify it under the same terms
       as Perl itself.

AAAAUUUUTTTTHHHHOOOORRRR
       Sean Quinlan, seanq@darwin.bu.edu

       Please email me with any changes you make or suggestions
       for changes/additions.  Latest version is available
       through SourceForge the section on
       "/sourceforge.net/projects/compbio/" in the http: manpage,
       or on the CPAN the section on
       "/www.cpan.org/authors/id/S/SE/SEANQ/CompBio-0.461.tar.gz"
       in the http: manpage.  Thank you!

       I would like to thank the staff at the BMERC for being my
       guinee pigs for the past few years, Jim Freeman who got me
       into this and left me with flint and tinder rather than
       the ember in a clay pot he started with, and my boss and
       (tor)mentor Dr. Temple F. Smith! I would never have gotten
       this far or learned this much without them.

SSSSEEEEEEEE AAAALLLLSSSSOOOO
       _p_e_r_l(1), _C_o_m_p_B_i_o_:_:_S_i_m_p_l_e(1).


2001-11-13                 perl v5.6.0                          1