Author : J. Park, Jason Johnson, Sarah Teichmann, Alex Bateman,
Astrid Reinhardt, and anyone else who contributed. jong@salt2.med.harvard.edu
Example : require "B.pl"; However, I recommend that you copy the subroutines
you need and use them directly, or modify them in your own programs.
Function : This is a comprehensive Perl subroutine library developed
under the Bioperl project and others.
URL: http://cyrah.med.harvard.edu/Bioperlsub/
This serves as the repository for various Perl subroutines
and algorithms developed in Bioinformatics and Genome projects.
You can copy any of the subroutines in this file, modify them, and use
them in your own code.
PLEASE MODIFY AS FREELY AS YOU WANT !! Everything is under the same copyright terms as Perl.
All the subroutines have been tested in small files.
If you want such a single example program
to see how they really work, please contact me (A Biomatic).
For example, a file called 'handle_arguments.pl' exists to
test the subroutine 'handle_arguments'. Usually you can find them
at http://cyrah.med.harvard.edu/B.pl.html
Keywords : Biology, perl library, sequence handling lib
Options : nothing (used as subroutine library)
Usage : require "B.pl"; ##<-- This is very slow, so you had better
copy the subroutines into your program, or make smaller library files
classified by function (e.g., Bio_Seq.pl
for sequence handling, Bio_Array.pl for various array
subroutines), or make your own module out of this. Do whatever
you want.
Version : 1.8 (April/27/1998)
Warning : CopyLEFTed, for the enhancement of Biology, Biomatics, and Science.
This is a development companion, nothing else.
Class is for classification of my subroutines. If it is B, it can
be useful for biological sequence data handling. If it's Utility,
it can also be used for general purpose file handling stuff.
File, Array, Hash,... are my classification items.
Argument : any type, any amount
Category : general programming
Example : handle_arguments(\@array, $string, \%hash, 8, 'any_string')
Function : Sorts the input arguments passed to a subroutine and returns default
arrays of references for the various types (file, dir, hash, array, ...).
If you give (\@out, @file), it will put @out into @array as a ref,
and the contents of @out will also be dereferenced and put into
@raw_string, regardless of what is in it.
Keywords : handling arguments, parsing arguments,
Returns : the following GLOBAL variables:
$num_opt, @num_opt, @file, @dir,
$char_opt, @char_opt, %vars, @array,
@hash, @string, @raw_string, @range
$num_opt has 10,20
@num_opt has (10, 20)
@file has xxxx.ext
@dir has dir or /my/dir
$char_opt has 'A,B'
@char_opt has (A, B)
@array has (\@ar1, \@ar2)
@hash has (\%hash1, \%hash2)
@string has ('sdfasf', 'dfsf')
@raw_string has (file.ext, dir_name, 'strings', ...)
@range has values like 10-20
%vars deals with x=2, y=3 stuff.
Usage : Just put the whole box delimited by the two '###..' lines below
inside your subroutine. It will call the 'handle_arguments'
subroutine and parse all the given input arguments.
To claim the arguments, just use the variables in the box.
For example, if you passed 2 names of files existing
in your PWD (or strings that look like xxxx.ext),
you can claim them as $file[0] and $file[1] in
your subroutine.
Version : 4.8
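The sorting of arguments by type described above can be sketched roughly as follows. This is only a minimal illustration, not the real 'handle_arguments': the name classify_arguments is hypothetical, it returns one hash ref instead of setting global variables, and the patterns used to recognise numbers, ranges and file names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the library's handle_arguments: it returns one
# hash ref instead of setting globals, and covers fewer argument types.
sub classify_arguments {
    my %sorted = (
        array => [], hash => [], num_opt => [], char_opt => [],
        range => [], file => [], string  => [], vars     => {},
    );
    for my $arg (@_) {
        if    (ref $arg eq 'ARRAY')         { push @{ $sorted{array} }, $arg }
        elsif (ref $arg eq 'HASH')          { push @{ $sorted{hash} }, $arg }
        elsif ($arg =~ /^(\w+)=(.+)$/)      { $sorted{vars}{$1} = $2 }
        elsif ($arg =~ /^\d+-\d+$/)         { push @{ $sorted{range} }, $arg }
        elsif ($arg =~ /^-?\d+(?:\.\d+)?$/) { push @{ $sorted{num_opt} }, $arg }
        elsif ($arg =~ /^[A-Za-z]$/)        { push @{ $sorted{char_opt} }, $arg }
        # An existing file, or any string that merely looks like xxxx.ext
        elsif (-f $arg or $arg =~ /^[\w\/-]+\.\w+$/) { push @{ $sorted{file} }, $arg }
        else                                { push @{ $sorted{string} }, $arg }
    }
    return \%sorted;
}

my @ar1   = (1, 2);
my %hash1 = (a => 1);
my $args  = classify_arguments(\@ar1, \%hash1, 8, 'v', '10-20', 'x=2',
                               'xxxx.ext', 'any_string');
print "files:   @{ $args->{file} }\n";
print "numbers: @{ $args->{num_opt} }\n";
print "ranges:  @{ $args->{range} }\n";
```

The order of the tests matters: references are checked first, and 'x=2' and '10-20' are checked before plain numbers, so each argument lands in exactly one class.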
Function : it sorts by the 2nd column(E-value, in msp file), small comes top
Keywords : sort_by_2nd_column, sort_by_second_column, sort_by_e_values,
sort_by_evalues,
Usage : @out=@{&sort_by_E_values(\@input_line_array)};
Version : 1.0
Example : Above will sort the file xxxx.msp by its 3rd column(numerically)
small numbers will come to the top.
Function : it sorts the values of a hash by the given column, small comes top. Unless a number
is given, it sorts by the first column.
It returns an ARRAY of the keys of the input HASH!!!
It can handle gzipped files; it calls gunzip to open and sort them.
Keywords : sort_by_2nd_column, sort_by_second_column, sort_by_e_values,
sort_by_evalues, sort_hash_by_column, sort_value_by_column,
Options :
s for sorting stringwise
d for sorting by digit
n for sorting by digit(numerically)
numerically an alias of n
Usage : @out=@{&sort_by_column(\%input_line_hash, )};
Version : 1.1
Example : sort_by_column.pl 3 xxxx.msp
Above will sort the file xxxx.msp by its 3rd column(numerically)
small numbers will come to the top.
Function : it sorts by the given column, small comes top. Unless a number
is given, it sorts by the first column.
It can handle gzipped files; it calls gunzip to open and sort them.
Keywords : sort_by_2nd_column, sort_by_second_column, sort_by_e_values,
sort_by_evalues,
Options :
s for sorting stringwise
d for sorting by digit
n for sorting by digit(numerically)
Usage : @out=@{&sort_by_column(\@input_line_array, )};
Version : 1.4
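A column sort of this kind can be sketched as below. It is a simplified stand-in, not the library's sort_by_column: the name sort_lines_by_column is hypothetical, columns are assumed to be whitespace-separated and counted from 1, and gzipped input is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: sort lines by a 1-based, whitespace-separated column.
# $mode 'n' (default) compares numerically, 's' compares stringwise.
sub sort_lines_by_column {
    my ($lines, $column, $mode) = @_;
    $column ||= 1;      # default: first column
    $mode   ||= 'n';
    my $compare = $mode eq 's'
        ? sub { $_[0] cmp $_[1] }
        : sub { $_[0] <=> $_[1] };
    # Pair each line with its sort key, sort on the key, then unwrap.
    my @keyed  = map { [ (split ' ', $_)[$column - 1], $_ ] } @$lines;
    my @sorted = sort { $compare->($a->[0], $b->[0]) } @keyed;
    return [ map { $_->[1] } @sorted ];
}

my @msp = (
    '59 2.6 47 64 d2pia_3 10 30 d1erd___10-30',
    '161 1.1e-07 24 91 d2pia_3 16 85 d1frd___16-85',
    '722 0 1 106 d1put__ 1 106 d1put___1-106',
);
# Sort by the 2nd column (the E-value), smallest first.
my @out = @{ sort_lines_by_column(\@msp, 2) };
print "$_\n" for @out;
```

Extracting the key once per line, rather than splitting inside the comparison, keeps the sort cheap for large msp files.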
Function : it sorts by the 1st digit before '-' as in 2-183_cluster, 2-140_cluster,
etc.
Keywords : sort_by_columns, sort_by_text_columns, sort_by_column_numerically
sort_by_pattern
Usage : @out=@{&sort_by_cluster_size(\@input_line_array)};
Version : 1.2
Function : it sorts by the 2nd column(E-value, in msp file), small comes top
by the help of ts
Keywords : sort_by_columns, sort_by_text_columns, sort_by_column_numerically
Usage : @out=@{&sort_by_column_bigger_first(\@input_line_array, 1)};
Version : 1.1
Function : @matrix is like $matrix[1][2]=1;
This assigns number 1 to array element
If one array is given, it makes self to self matrix.
When 2 are given, make matrix for the 2
Keywords : make_matrix
Options :
$skip_gap_char = g for skipping gap char (any special char)
Usage : @matrix=@{&make_2D_identity_matrix(\@seq1, \@seq2)};
Version : 1.2
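The idea of the identity matrix can be sketched as follows. This is an illustration under assumptions, not the library code: the name identity_matrix is hypothetical, indices are 0-based, a cell is set to 1 when the two residues are identical and to 0 otherwise, and a "gap" is assumed to be any non-letter character.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two residue arrays (or one array against
# itself) and mark identical positions with 1, everything else with 0.
sub identity_matrix {
    my ($seq1, $seq2, $skip_gap) = @_;
    $seq2 ||= $seq1;                      # one array given: self to self
    my @matrix;
    for my $i (0 .. $#$seq1) {
        for my $j (0 .. $#$seq2) {
            my ($x, $y) = ($seq1->[$i], $seq2->[$j]);
            # With the gap option, a non-letter never counts as a match.
            my $is_gap = $skip_gap && $x !~ /^[A-Za-z]$/;
            $matrix[$i][$j] = ($x eq $y && !$is_gap) ? 1 : 0;
        }
    }
    return \@matrix;
}

my @seq1   = qw(A C - D);
my @seq2   = qw(A - - D);
my @matrix = @{ identity_matrix(\@seq1, \@seq2, 'g') };
print join(' ', @$_), "\n" for @matrix;
```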
Function : @matrix is like $matrix[1][2]='A'; when aa residue is identical
This assigns identical residue to array element
If one array is given, it makes self to self matrix.
When 2 are given, make matrix for the 2
Keywords : make_matrix
Usage : @matrix=@{&make_2D_aa_residue_matrix_array(\@seq)};
Version : 1.1
Function : @matrix is like $matrix[1][2]=1;
This assigns number 1 to array element
Keywords : make_matrix, make_identity_matrix
Options :
s for show axis
Usage : @matrix=@{&make_2D_identity_matrix(\$seq, [\$seq2] )};
Version : 1.2
Example :
OUTPUT looks like the following;
d1dvh__=d1fcdc1 7.1e-08
d1fcdc1=d1dvh__ 7.1e-08
d5cytr_=d351c__ 5.3e-08
d351c__=d5cytr_ 5.3e-08
d1cyi__=d2mtac_ 9.1e-06
d2mtac_=d1cyi__ 9.1e-06
d1cyi__=d5cytr_ 0.00045
d5cytr_=d1cyi__ 0.00045
:
INPUT looks like this: (the iss file format), first column is key
d1ten__(110)(0.00031) d1fna__ d1fna___1-91(578)(6.9e-37) d1ten__(110)(0.00031)
d1cfb_2(255)(7.8e-16) d1cfb_2 HSU55258_741-838(255)(5.6e-12) d1cfb_2(255)(7.8e-16)
OUTPUT looks like the following;
d1dvh__=d1fcdc1 Correct: 7.1e-08
d1fcdc1=d1dvh__ Correct: 7.1e-08
d5cytr_=d351c__ Correct: 5.3e-08
d351c__=d5cytr_ Correct: 5.3e-08
d1cyi__=d2mtac_ Wrong: 9.1e-06
Function : gets sequences which are wrongly matched from intermediate seq search
: makes a table of match with the values for E values.
Keywords :
: make_sequence_match_Evalue_table, Evalue_table, make_Evalue_table
make_iss_sequence_match_table
Options : _ for debugging.
# for debugging.
s for skip SELF to SELF match entries
w for Smith-Waterman score output rather than E-value output
r for reflexive output
Reference : http://sonja.acad.cai.cam.ac.uk/perl_for_bio.html
Usage : %seq=%{&get_false_positive_seq_matches(\%msp_1, \%msp2)};
: %sequence_match_table=%{&make_sequence_match_table(\%msp_1, \%msp2)};
Version : 1.0
: 1.5
Warning : The default is to show the best E value (that is, the lowest)
Function : writes the intermediate sequence search file.
Keywords : write_interm_seq_search_file
v for showing the output in STDOUT
Reference : http://sonja.acad.cai.cam.ac.uk/perl_for_bio.html
Usage : &write_iss_file(\%msp1, \%msp2); ## for 2 msp_x file input
Version : 1.2
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Category : statistics, search, bio
:
Keywords : get_stat00_result, get_stat_msp0_files, get_stat_single_search_result
Options :
$E_value= by e=
$verbose=v by v
$show_options=o by o
$step = by s=
$score_thresh1= by t1=
$score_thresh2= by t2=
$E_mult_factor1 = by m1=
$E_mult_factor2 = by m2=
Usage : &get_stat_FASTA_search_result_in_msp_0_files(\@file);
Version : 1.0
Example :
: %index=%{&open_sequence_index_files(\@INDEX_FILE, \@input_seq_names)};
Function :
: returns seqname with its seek pos in fasta sequence db file.
Keywords : remove_sequence_ranges, remove_sequence_name_ranges,
remove_ranges_in_sequences, strip_sequence_name_ranges,
: open_seq_index_files, open_seq_idx_files, open_idx_files,
get_sequence_index, get_seq_index, get_sequence_with_index
Options : _ for debugging.
# for debugging.
: _ or # for debugging
Usage :
: open_sequence_index_files(, );
Version : 1.0
: 1.2
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Example : &do_intermediate_sequence_search(\%pdb_seq, $owl_db_fasta, $ARGV[0], $single_msp, $over_write,
"u=$upper_expect_limit", "l=$lower_expect_limit", "k=$k_tuple" );
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
m for MSP format directly from the FASTA or Ssearch result, rather than through sso_to_msp, to save memory
s for the big single output (msp file output I mean)
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
R for adding ranges to the enquiry sequences as well.
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
Returns : the names of files created (xxxxx.msp, yyy.msp,,)
Usage : &do_intermediate_sequence_search(\%pdb_seq, $owl_db_fasta, $ARGV[0], $single_msp, $over_write,
"u=$upper_expect_limit", "l=$lower_expect_limit", "k=$k_tuple" );
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Example : &do_sequence_search(\%pdb_seq, $owl_db_fasta, $ARGV[0], $single_msp, $over_write,
"u=$upper_expect_limit", "l=$lower_expect_limit", "k=$k_tuple" );
Function : do FASTA, SSEARCH or BLASTPGP(psi-blast) search
Keywords : sequence_search
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
m for MSP format directly from the FASTA or Ssearch result, rather than through sso_to_msp, to save memory
s for the big single output (msp file output I mean)
s= for the single big msp file name
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
d for very simple run and saving the result in xxxx.gz format in sub dir starting with one char
r for reverse the query sequence
R for attaching ranges of sequences
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
d= for defining the size of subdir made. 2 means it creates
eg, DE while 1 makes D
d for $make_gz_in_sub_dir_opt, putting resultant sso files in gz format and in single char subdir
D for $make_msp_in_sub_dir_opt, convert sso to msp and put in sub dir like /D/, /S/
n for new format to create new msp file format with sso_to_msp routine
PVM= for PVM run of FASTA (FASTA only)
M for machine readable format -m 10 option
M= for machine readable format -m 10 option
N for 'NO' do not do any processing but, do the searches only.
FILE_AGE for defining the age of file in days to be overwritten.
L for Lean output(removes xxxx.fa query seq file)
Returns : the names of files created (xxxxx.msp, yyy.msp,,)
Usage : &do_sequence_search("Query_seqs=\%pdb_seq", "DB=$sequence_db_fasta",
"File=$ARGV[0]", $single_msp, $over_write,
"u=$upper_expect_limit", "l=$lower_expect_limit",
"k=$k_tuple", $No_processing );
Version : 5.1
Function : does hmm sequence search using Sean Eddy's HMMER (hmmls, hmmfs)
Keywords : do_seq_search_with_hmm, do_hmmt_sequence_search
Options :
"method=ls" for turning hmmls search option on (default)
"method=fs" for turning hmmfs search option on
method= by method=
o for overwriting existing xxxxx.hmm files
E=Enquiry_name for specifying enquiry seq name rather than 'HMM', the default
t=20 for score threshold at the level of hmmls. Default of hmmls is 0. The example shown uses 15
$evalue_cutoff= by e=
$over_write = o by -o o
Usage : &do_hmm_sequence_search(\@file, "method=$default_search_method",
$over_write, "DB=$pdbd40_seq_fasta");
Version : 1.6
Example : &divide_clusters(\@file, $verbose, $range, $merge, $sat_file,
$dindom, $indup, "T=$length_thresh", "e=$evalue", $over_write,
$optimize, "s=$score", "f=$factor");
Function : This is the main function for divclus.pl.
It divides a complex single linkage cluster into smaller,
duplication-module-level sub clusters.
Keywords : divicl, divclus, div_clus, divide clusters
Options : _ for debugging.
f= for determining the factor in filtering out non-homologous
regions, 7 = 70% now!!
l= for seqlet(duplication module) length threshold
t= for seqlet(duplication module) length threshold
(same as l opt, confusing, huh? )
s= for score threshold
e= for evalue threshold
z for activating remove_similar_sequences, rather than remove_dup....
o for overwriting
v for verbose printout (info)
D for dynamic factor
S $short_region= S by S -S # taking shorter region overlap in removing similar reg
L $large_region= L by L -L # taking larger region overlap in removing similar reg
A $average_region=A by A -A # taking average region overlap in removing similar reg
Usage : &divide_clusters(\@file);
Version : 2.8
Example : @seqlets=@{&remove_similar_seqlets(\@mrg1, $mrg2, \@mrg3)};
while @mrg1=qw(M_2-100 M_2-110 M_8-105 M_4-108 N_10-110 N_12-115);
$mrg2='Z_3-400 Z_2-420';
@mrg3=('X_2-300 X_3-300', 'X_2-300', 'X_5-300 X_2-301' );
Function : merges similar seqlets (taking the average of their starts and ends)
to reduce their number. This can also handle
names like XLBGLO2R_8-119_d1hlm__.
Keywords : merge_sequence_names, merge_seq_names, merge_sequence_ranges
merge_seq_ranges
Options : _ for debugging.
# for debugging.
f= for factor
S for shorter region matched is used
A for average region matched is used
L for larger region matched is used
Usage : @seqlets=@{&remove_similar_seqlets(\@split)};
Version : 2.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Example : @temp_show_sub=&show_subclusterings(\@out, $file, $sat_file, $dindom, $indup);
Function : This is the very final sub of divclus.pl
Keywords : print_subclusterings, sum_subclusterings, write_subclustering
show_clusterings, display_subclusterings
Options :
f for file output, eg: xxxxxxx.sat
Usage : &show_subclusterings(\@out);
Version : 2.6
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Keywords : find_dindoms, domain_inside_domain, domain_in_domain
Options : _ for debugging.
# for debugging.
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : used to make things like:
Options : _ for debugging.
# for debugging.
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : This reads MSP file regions matched for a target seq
and adds things up to plot horizontally.
Options : _ for debugging.
# for debugging.
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Options : _ for debugging.
# for debugging.
$short_region= S by S -S # taking shorter region overlapped in removing similar regions
$large_region= L by L -L # taking larger region overlapped in removing similar regions
$average_region=A by A -A # taking average region overlapped in removing similar regions
Usage : @out=@{&cluster_merged_seqlet_sets(\@lines)};
Version : 1.5
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : connects two clusters of seqlets if they share
identical or near identical seqlets
Keywords : check_link, check_relation, check_relatedness
Options : _ for debugging.
$factor = by f= # eg) "f=$factor" in the higher level sub
Version : 1.7
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : merges arrays if there are common array elements.
if @A has (1,2,3) and @B has (2, 4, 5), they share 2, so
they are merged to be (1,2,3,4,5)
Keywords : cluster_arrays_by_common_elements, merge_arrays_if_common_elements
merge_array_if_common_elements, merge_arrays_when_common_elements_occur
merge_arrays
Usage : @out=@{&merge_arrays_by_common_elements(\@ref_of_arrays)}
Version : 1.1
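The merging rule above, where (1,2,3) and (2,4,5) become (1,2,3,4,5) because they share 2, can be sketched like this. The name merge_by_common_elements is hypothetical, and returning each merged group sorted stringwise is an assumption, not something the library documents.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly merge any two arrays that share at least
# one element, until no two remaining arrays have anything in common.
sub merge_by_common_elements {
    my @groups = map { [ @$_ ] } @_;      # work on copies of the inputs
    my $merged = 1;
    while ($merged) {
        $merged = 0;
        OUTER: for my $i (0 .. $#groups - 1) {
            my %seen = map { $_ => 1 } @{ $groups[$i] };
            for my $j ($i + 1 .. $#groups) {
                next unless grep { $seen{$_} } @{ $groups[$j] };
                # Union of both groups, duplicates removed.
                my %union = map { $_ => 1 } @{ $groups[$i] }, @{ $groups[$j] };
                $groups[$i] = [ sort keys %union ];
                splice @groups, $j, 1;
                $merged = 1;
                last OUTER;               # restart: indices have shifted
            }
        }
    }
    return \@groups;
}

my @A   = (1, 2, 3);
my @B   = (2, 4, 5);
my @C   = (7, 8);
my @out = @{ merge_by_common_elements(\@A, \@B, \@C) };
print join(',', @$_), "\n" for @out;
```

Restarting the scan after every merge is simple rather than fast; it guarantees that chains such as A-B and B-C end up in one group.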
Example :
PARF file looks like this>
d1nsca_ d3nn9__ Homolog -664.92 2.43.1.1.3 2.43.1.1.2
d1dppa_ d2olba_ Homolog -617.41 3.68.1.1.6 3.68.1.1.1
d2ach.1a1 d9api.1a1 Homolog -556.38 5.2.1.1.3 5.2.1.1.4
Function : checks if given file(s) is a parf file and returns the number of
identified parf file. If you check 2 files and both are parf, you
will get (\$num_of_parf_file) value of 2.
Usage : $number_of_parf=${&check_parf_files(@input)};
Version : 1.0
Function : accepts 1 or 2 refs of arrays and checks if there is any
common(repeating) elements between the two (or inside one)
The result is either ref of 1, or 0
Keywords : is_there_common_element, if_common_elements
Usage : &check_common_elements_in_array($mother_array[$i], $mother_array[$i+1]);
Version : 1.0
Example : INPUT:
@input=( '1-30 1-40 1-50',
'2-49 4-40 2-99'....)
Function : merges ranges (10-20, 11-21, etc.) when there is any overlap
between them.
If you give a reversed range like '2000-20', it will
complain, reverse the order, and do the job after the correction.
Keywords : connect_ranges, link_overlapping_ranges, connect_overlapping_ranges
Options : _ for debugging.
Usage : @all_ranges = @{&link_ranges(@all_ranges)};
Version : 1.1
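Linking overlapping ranges can be sketched as below. This is a simplified illustration, not the library's link_ranges: the name link_overlapping_ranges is hypothetical, each argument is assumed to be a single 'start-end' string, and anything that does not look like a range is silently skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: join ranges such as 10-20 and 11-21 into 10-21.
sub link_overlapping_ranges {
    my @ranges;
    for my $range (@_) {
        my ($start, $end) = $range =~ /^(\d+)-(\d+)$/ or next;
        if ($start > $end) {
            warn "reversed range '$range', swapping start and end\n";
            ($start, $end) = ($end, $start);
        }
        push @ranges, [ $start, $end ];
    }
    my @linked;
    for my $r (sort { $a->[0] <=> $b->[0] } @ranges) {
        if (@linked and $r->[0] <= $linked[-1][1]) {
            # Overlaps the previous range: extend its end if needed.
            $linked[-1][1] = $r->[1] if $r->[1] > $linked[-1][1];
        }
        else {
            push @linked, [ @$r ];
        }
    }
    return [ map { join '-', @$_ } @linked ];
}

my @all_ranges = @{ link_overlapping_ranges('10-20', '11-21', '40-30', '50-60') };
print "@all_ranges\n";
```

Sorting by start position first means each range only has to be compared with the last linked range, so one pass is enough.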
Example : INPUT:
@input=( '1-30 1-40 1-50',
'2-49 4-40 2-99'....)
Function : merges ranges (10-20, 11-21, etc.) when there is any overlap
between them (resulting in an average start and end at each level).
If you give a reversed range like '2000-20', it will
complain, reverse the order, and do the job after the correction.
Keywords : merge_similar_regions, merge_ranges, merge_regions,
merge_sequence_ranges, merge_overlap_ranges, connect_ranges
connect_overlapping_ranges, connect_similar_ranges,
remove_similar_ranges
Options : f= for setting factor (0.7 for 70% overlap minimum)
Usage : @all_ranges = @{&merge_similar_seqlets(@all_ranges)};
Version : 1.3
Example : INPUT:
@input=( 'seq1_1-30 seq2_1-40 seq3_1-50',
'seq1_2-49 seq3_4-40 seq4_2-99'....)
@output=('seq1_1-30 seq2_1-45 seq3_2-45 seq4_2-99');
Function : merges seqlet sets which have identical
sequences and share similar regions, with a connection factor of 30%.
This means that any two seqlets from the same sequence whose
regions overlap by more than 70% are merged.
This only looks at the very first sequence in each seqlet line!!!
(so, PARTIAL MERGE !!)
Keywords : merge_similar_sequences, merge_sequence_names, merge_sequences,
merge_sequence_ranges, merge_similar_sequences_with_ranges,
merge_seqlets, merge_duplication_modules
Options :
f= for determining the factor in filtering out non-homologous
regions, 7 = 70% now!!
l= for seqlet(duplication module) length threshold
z for activating remove_similar_sequences, rather than remove_dup....
S $short_region= S by S -S # taking shorter region overlap in removing similar reg
L $large_region= L by L -L # taking larger region overlap in removing similar reg
A $average_region=A by A -A # taking average region overlap in removing similar reg
Usage : @all_seqlets = @{&merge_similar_seqlets(@all_seqlets)};
Version : 2.0
Function : sorts arrays of strings like
MJ0228_314-573 MJ1197_348-601
MJ0228_451-576 sll0078_502-594 sll1425_489-611
MJ0228_479-572 sll0078_502-594
According to the digits after seq names _314-, _451-, _479-
in the above
This only looks at the very first sequence in the string
Options : _ for debugging.
# for debugging.
Version : 1.4
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : sorts words in strings separated by ' ' or "\n"
Keywords : sort_words_in_sequences, sort_sequences_in_string,
sort_strings_in_string, sort_string_by_words, sort_elements_in_string
Options : _ for debugging.
# for debugging.
Version : 1.1
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Keywords : convert_hmmls_to_msp
Options :
S=$single_out_file_name for producing single msp file with all the hmmls contents
E=Enquiry_name for specifying enquiry seq name rather than 'HMM', the default
$bit_score_threshold= by t=
Usage : @out=@{&convert_hmmls_to_msp_files(\@file)};
Version : 1.4
Example :
Example OUT as string
slr1950 sll1920 sll0672 sll1076 sll1614 slr0797 slr0798 slr0822 slr1729
slr1729 sll1076 sll0672 sll1614 sll1920 slr0797 slr0798 slr0822 slr1950
Options : _ for debugging.
# for debugging.
Version : 1.1
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : this adds ranges to the seqnames of msp files
mmp line is msp line with additional sequences at the end
Keywords : convert_msp_to_mmp, convert_msp, convert_msp_2_mmp
change_msp_to_mmp, add_range_in_msp, convert_msp_line_to_mmp_line
Options : _ for debugging.
# for debugging.
Version : 1.5
Function : this adds ranges to the seqnames of msp files
mmp line is msp line with additional sequences at the end
Keywords : convert_msp_to_mmp, convert_msp, convert_msp_2_mmp
change_msp_to_mmp, add_range_in_msp
Options : _ for debugging.
# for debugging.
Version : 1.5
Keywords : combine_sequence_alignment, merge_sequence_alignment_pairs
merge_seq_alignment, make_interm_alignment, make_3_way_alignment
merge_alignment, combine_alignment
Options :
l= for sequence block length by print_seq_in_block subroutine
t= for specifying the length of seq names shown.
t for truncating the seq names in printing out.
s for sorting the final output lines (default anyway for print_seq_in_block)
Usage : &merge_sequence_alignments(@seq); while @seq has
@seq=(\%hash1, \%hash2); while %hash1 and %hash2 have
%hash1=qw(seq1 ANN-NTMQQRRQQQRKRRRQQQSSSSTTST seq2 --NNN--QQ--QQQ--RRRR--SSSS--);
%hash2=qw(seq2 NN-QQQQQ--RRRR----SS--SS--- seq3 -NNXQQQXQRTRRRXTTSTSSMMSSTTT);
Version : 1.3
Example : INPUT: (MSP file) ===>
59 2.6 47 64 d2pia_3 10 30 d1erd___10-30
161 1.1e-07 24 91 d2pia_3 16 85 d1frd___16-85
722 0 1 106 d1put__ 1 106 d1put___1-106
66 4.9 2 68 d1put__ 43 106 d2lbp___43-106
69 1.3 12 49 d1put__ 81 120 d1cgo___81-120
60 3.3 13 38 d1frd__ 32 57 d1orda1_32-57
65 1.7 21 58 d1frd__ 40 69 d2mtac__40-69
==== OUTPUT ===>
d1frd___1-98 d1frd___1-98_1-98 d1frd___16-85 d2pia_3_24-91_24-91
d1frd___16-85_16-85 d2pia_3_24-91
d1put___1-106 d1put___1-106_1-106
d2pia_3_1-98 d2pia_3_1-98_1-98
Keywords : mergr_seq_in_msp_file, merge_sequence_in_msp, merge_sequences_in_msp_file
Options :
$dynamic_factor = y by y -y # adjusting factor value dynamically(more seq higher factor)
$short_region = S by S -S # taking shorter region overlapped in removing similar regions
$large_region = L by L -L # taking larger region overlapped in removing similar regions
$average_region = A by A -A # taking average region overlapped in removing similar regions
Version : 2.7
Function : merges sequences which are linked by common regions
This filters the sequences by evalue and ssearch score
This is the main algorithm of merging similar sequences.
Keywords : connect_sequence_in_msp, link_sequence_in_msp_chunk
connect_sequence_in_msp_chunk, link_sequence_in_msp
merge_sequence, link_sequence, connect_sequence
Options : _ for debugging.
# for debugging.
m for merge file output format (.mrg)
t= for threshold of seqlet length eg) "t=30"
f= for overlap factor (usually between 2 to 7 )
2 means that if the two regions do not overlap
by more than HALF of the smaller region,
they will not be regarded as a common seqlet block
s= for ssearch score minimum
e= for ssearch e value maximum
S for S -S # taking shorter region overlapped in removing similar regions
L for L -L # taking larger region overlapped in removing similar regions
A for A -A # taking average region overlapped in removing similar regions
Version : 2.4
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
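The S, L and A options choose which region length the overlap is measured against. A rough sketch of that idea follows. It is only an assumption about how such a measure could be computed, not the library's algorithm: the name overlap_fraction is hypothetical, and it returns a plain fraction between 0 and 1 that the caller would compare with its own threshold.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: fraction of overlap between two 'start-end' ranges,
# relative to the Shorter, Larger, or Average of the two region lengths.
sub overlap_fraction {
    my ($range1, $range2, $basis) = @_;
    $basis ||= 'S';
    my ($s1, $e1) = split /-/, $range1;
    my ($s2, $e2) = split /-/, $range2;
    my $overlap = min($e1, $e2) - max($s1, $s2) + 1;
    return 0 if $overlap <= 0;            # the regions do not touch at all
    my ($len1, $len2) = ($e1 - $s1 + 1, $e2 - $s2 + 1);
    my $length = $basis eq 'L' ? max($len1, $len2)
               : $basis eq 'A' ? ($len1 + $len2) / 2
               :                 min($len1, $len2);
    return $overlap / $length;
}

for my $basis (qw(S L A)) {
    printf "%s: %.3f\n", $basis, overlap_fraction('1-100', '51-250', $basis);
}
```

Measuring against the shorter region is the most permissive choice, since a short seqlet lying fully inside a long one then counts as complete overlap.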
Keywords : get_overlapping_range_in_msp, get_overlapping_range_in_msp_file,
get_overlapping_seq_match_range, get_overlap_seq_match_range
Options : _ for debugging.
# for debugging.
Usage : @n1=@{&get_overlapping_range(\@ranges1, \@ranges2)};
Version : 1.1
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Options : _ for debugging.
# for debugging.
Usage : This finds the correct msp chunk with given seq name
and big original or any msp chunk
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : accepts msp file and finds the central sequence.
central sequence is in the centre of all the member
sequences in a group or cluster
Options : _ for debugging.
# for debugging.
Version : 1.1
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : write Alex's domfam file. it prints out tilde lines
if the seqlet matched are below threshold defined.
Options : _ for debugging.
# for debugging.
v for verbose STDOUT
n for NO seq start and end number display
t= for threshold (eg, t=40 for Blastp(or ssearch) score 40 threshold)
Usage : &write_dof_files(\@msps);
while @msps means msp file names
Version : 1.2
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Function : this is the core of check_genome_cluster.pl
finds good linkage seqlets in msp files
Options : _ for debugging.
# for debugging.
Version : 1.0
Function : reads in a big single linkage cluster file(or normal cluster file)
and creates a big msp file which contains all the entries in the
cluster file (usually with the extension of sclu or clu)
This normally reads in xxxx.mso, xxxx.sso like files, but if the
corresponding xxx.msp file already exists, it concatenates them to
make a bigger one.
Keywords : clu_2_sso_2_msp, cluster_to_msp, cluster_to_sso_to_msp
convert_clu_to_sso_to_msp
Options : Use convert_clu_to_sso_to_msp instead; this subroutine is obsolete now
Usage : &clu_to_sso_to_msp(\$clu);
Version : 1.7
Function : reads in a big single linkage cluster file(or normal cluster file)
and creates a big msp file which contains all the entries in the
cluster file (usually with the extension of sclu or clu)
This normally reads in xxxx.mso, xxxx.sso like files, but if the
corresponding xxx.msp file already exists, it concatenates them to
make a bigger one.
Keywords : clu_2_sso_2_msp, cluster_to_msp, cluster_to_sso_to_msp
clu_to_sso_to_msp
Usage : &clu_to_sso_to_msp(\$clu);
Version : 1.8
Example : &convert_sso_to_msp(@ARGV, 'OUT.msp', $single_out_opt);
Function : This takes sso file(s) and produces an MSP file. It
concatenates the sso file contents when more than one
sso file is given.
Options : _ for debugging.
# for debugging.
v for showing the MSP result to screen
s for making single MSP file for each sso file
as well as big MSP file which has all sso
u= for upper expectation value limit
l= for lower expect val limit
s= for single file name input eg. "s=xxxxx.msp"
n for new format (msp2 format)
r for adding range
r2 for adding ranges in all sequence names
Returns : the file names created (xxxx.msp, yyyy.msp,,,,)
Usage : &convert_sso_to_msp(@ARGV, $single_out_opt);
Version : 2.6
Warning : This capitalizes all the input file names when
producing xxxxx.msp. xxxxx.sso -> XXXX.sso
Example : &sso_to_msp(@ARGV, 'OUT.msp', $single_out_opt);
Function : This takes sso file(s) and produces an MSP file. It
concatenates the sso file contents when more than one
sso file is given.
Keywords : sso_file_to_msp_file, convert_sso_to_msp,
Options : _ for debugging.
# for debugging.
v for showing the MSP result to screen
s for making single MSP file for each sso file
as well as big MSP file which has all sso
u= for upper expectation value limit
l= for lower expect val limit
s= for single file name input eg. "s=xxxxx.msp"
n for new format (msp2 format)
r for adding range
r2 for adding ranges in all sequence names
Returns : the file names created (xxxx.msp, yyyy.msp,,,,)
Usage : &sso_to_msp(@ARGV, $single_out_opt);
Version : 2.6
Warning : This capitalizes all the input file names when
producing xxxxx.msp. xxxxx.sso -> XXXX.sso
Example : &convert_sso_to_msp(@ARGV, 'OUT.msp', $single_out_opt);
Function : This takes sso file(s) and produces an MSP file. It
concatenates the sso file contents when more than one
sso file is given.
Keywords : sso_file_to_msp_file, convert_sso_to_msp,
Options : _ for debugging.
# for debugging.
v for showing the MSP result to screen
s for making single MSP file for each sso file
as well as big MSP file which has all sso
u= for upper expectation value limit
l= for lower expect val limit
s= for single file name input eg. "s=xxxxx.msp"
n for new format (msp2 format)
r for adding range
r2 for adding ranges in all sequence names
Returns : the file names created (xxxx.msp, yyyy.msp,,,,)
Usage : &convert_sso_to_msp(@ARGV, $single_out_opt);
Version : 2.6
Warning : This capitalizes all the input file names when
producing xxxxx.msp. xxxxx.sso -> XXXX.sso
Function : matched each query seq name and if the E value is lower than
my arbitrary threshold, I put the subject and target pair
alignment into a hash.
In later iterations, the latest is replaced
Keywords : convert_bla_to_msf
Usage : @msf_file_made=@{&bla_to_msf(\@bla_file)};
Version : 1.1
Function : matched each query seq name and if the E value is lower than
my arbitrary threshold, I put the subject and target pair
alignment into a hash.
In later iterations, the latest is replaced
Keywords : convert_bla_to_msf
Usage : @msf_file_made=@{&convert_bla_to_msf(\@bla_file)};
Version : 1.1
Author : Sarah Teichmann and Jong Park, jong@salt2.med.harvard.edu
Example : %hash_out=%{&convert_bla_to_msp(\$file)};
Function : reads in PSI blast output and produces MSP file format.
Takes all the good hits below certain threshold in multiple iteration
Reports the best evalue with a given sequence name
Keywords : pbla_to_msp, blast_to_msp, bla_2_msp, blastp_to_msp_format,
blast_to_msp_format, convert_bla_to_msp, convert_bla_to_msp_files
bla_to_msp
Options :
$pdbd_seq_only d for getting dxxxx_ like seq names only(pdb40d names for examp)
$all_seq a for forcing all seq conversion
$which_iteration= by i= # choose which iteration result you want to take
$which_iteration as just a digit
$report_only_the_best=b by b -b
$take_only_the_last_iteration=l by l
$accumulative_hits_eval_thresh= by e=
$genome_seq_only=g
$nrdb_seq_only=n
$evalue_thresh= by E=
$Accumulate_matches=A by A -A
Usage : %hash_out_final=%{&convert_bla_to_msp(\$file)};
Version : 3.7
Example : @msf_file_made=@{&convert_bla_multaln_to_msf(\@bla_file,
$verbose, "i=$iteration")};
Function : matched each query seq name and if the E value is lower than
my arbitrary threshold, I put the subject and target pair
alignment into a hash.
In later iterations, the latest is replaced,
when you use m6 option for PSI blast
this adds '00x' extensions to the repeatedly occurring seq names
Keywords : psi_blast_to_msf, psi_blast_multaln_to_msf
Options :
i=$iteration
v for verbose
Usage : @msf_file_made=@{&convert_bla_multaln_to_msf(\@bla_file, [i=2])};
Version : 1.6
Example : @msf_file_made=@{&convert_bla_multaln_to_msf(\@bla_file, "i=$iteration")};
Function : matched each query seq name and if the E value is lower than
my arbitrary threshold, I put the subject and target pair
alignment into a hash.
In later iterations, the latest is replaced,
when you use m6 option for PSI blast
this adds '00x' extensions to the repeatedly occurring seq names
Keywords : psi_blast_to_msf, psi_blast_multaln_to_msf,
bla_multaln_to_msf
Usage : @msf_file_made=@{&convert_bla_multaln_to_msf(\@bla_file, [i=2])};
Version : 1.4
Function : fetches hash keys and values by giving keys to
a hash
Keywords : subhash, sub_hash, get_hash_elements, fetch_sub_hash
take_sub_hash, get_hash_by_keys, get_sub_hash_by_keys
Options : _ for debugging.
# for debugging.
Usage : %sub_hash=%{&get_sub_hash(\%FASTA, \@list)};
Version : 1.1
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
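The key lookup described for get_sub_hash can be pictured as a filtered copy of a hash. What follows is a minimal sketch, not the library's own code: the name sub_hash_sketch is hypothetical, and silently skipping keys that are absent from the source hash is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_sub_hash: returns a ref to a hash holding
# only the requested keys that actually exist in the source hash.
sub sub_hash_sketch {
    my ($hash_ref, $keys_ref) = @_;
    my %sub;
    for my $key (@$keys_ref) {
        next unless exists $hash_ref->{$key};   # assumption: skip unknown keys
        $sub{$key} = $hash_ref->{$key};
    }
    return \%sub;
}

my %FASTA = (seq1 => 'ACDE', seq2 => 'FGHI', seq3 => 'KLMN');
my @list  = qw(seq1 seq3 seq9);                 # seq9 is absent and is skipped
my %sub_hash = %{ sub_hash_sketch(\%FASTA, \@list) };
print "$_ $sub_hash{$_}\n" for sort keys %sub_hash;
```

The call mirrors the documented usage form, with a hash ref and an array ref of wanted keys.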
Function : checks the size of files and returns the smallest
one's name. If a file is not present in pwd or
specified absolute path, it ignores it.
Keywords : choose_smallest_file, smallest_file, find_smallest_file
get_the_smallest_file, choose_the_smallest_file,
fetch_smallest_file, take_smallest_file, get_smaller_file,
Options : _ for debugging.
# for debugging.
e for extract the smallest from the input array
leaving it one element less, in this case
there will be two returning refs.
Usage : $smallest_file_name=${&get_smallest_file(@ARGV)};
Version : 1.3
Function : checks the size of files and returns the largest
one's name. If a file is not present in pwd or
specified absolute path, it ignores it.
Keywords : choose_largest_file, largest_file, find_largest_file
get_the_largest_file, choose_the_largest_file, get_biggest_file
fetch_largest_file, take_largest_file, get_bigger_file, get_larger_file
Options : _ for debugging.
# for debugging.
e for extract the largest from the input array
leaving it one element less, in this case
there will be two returning refs.
Usage : $largest_file_name=${&get_largest_file(@ARGV)};
Version : 1.4
Argument : ref. of string.
Example : ${&get_sequence_complexity(\$seq)};
while $seq='TTTTTACDEFGHIKLMNPQRSTVWYAAAAACCCADFADFA'
Function : calculates the sequence complexity of a single sequence.
If the seq given is longer than 20, it divides it into
frags of 20 aa and averages their complexities.
Keywords : sequence_complexity, calc_sequence_complexity,
calc_seq_complexity, get_seq_complexity,
Options : _ for debugging.
# for debugging.
'w=' for window size as the first arg
Returns : Ref. for a scalar digit.
Usage : print "\n", ${&get_sequence_complexity(\$seq)};
Version : 1.3
Argument : gets names of sequences
eg) \@array, \%hash, \$seq, while @array=(seq1, seq2), $seq='seq1 seq1'
%hash=(seq1, xxxx, seq2, yyyy);
Example : %seq=%{&fetch_sequence_from_db(\@input, seq.fa, seq.fa.idx)};
while @input=qw( 11S3_HELAN_11-31 A1AB_CANFA A1AT_PIG )
Function : accept seq names (with or without ranges like _10-111 )
and produces hash ref.
As an option, you can write(xxxx.fa) the sequences in pwd
with the file names with sequence names.
The default database used is FASTA format OWL database.
You can change this by S (for Swissprot either fasta
or full format), P for PDB fasta format data.
If you give the path name of DB, it will look for the
DB given.
This automatically checks sequence family number as
in >d1bpi___7.6.1
and attaches the number in final %sequence output
Keywords : fetch_seq_from_db, fetch_sequence_from_database
Options : _ or # for debugging.
w for write fasta file
s= for putting source DB file name manually
d=p100 for PDB100 fasta database from ENV
d=p40 for PDB40 fasta database from ENV
d=p for PDB database (usually p100) from ENV
d=s for Swissprot database from ENV
d=o for OWL database from ENV
i= for index filename. If not specified, this looks for it in the same dir as fast ˜
t= for msp_threshold
msp_threshold=0.0005 # when MSP file is given as input for getting seq names
Returns : ref of hash
Usage : %sequence=%{&fetch_sequence_from_db($input_file, \@string)};
Version : 3.5
Argument : swissprot seqname
Example : &fetch_swiss_seq(@ARGV);
Function : fetches swissprot entry or fasta format seq with
given seq name(like SAA_HORSE, SA*HORSE, SAA,..)
you can give multi files(SAA*, SAU*) at the same
time. This uses ENV setting of 'SWDIR'
Keywords : fetch_swissprot_sequence, fetch_sequence,
find_swiss_sequence, find_sequence
Options : _ for debugging.
# for debugging.
-f for fasta format file output
-a is for ALL matched seq. (same as using glob=> *YEAST)
-c is for Creating seq.idx file
-h is for HELP!
-g is for GDF file format output
-l is for list of match entries(in 1 column)
-s is for species option (input name must be a species: YEAST, RAT, HUMAN..)
n= is for Number of seq you want to get from swissprot
s= is for Size limit. Min seq size in swiss, s=10 -> minimum 11 aa seq.
S= is for Size limit. Max seq size in swiss, S=1000 -> get less than 1000
Usage : &fetch_seq(@ARGV);
Version : 1.6
Argument : swissprot seqname
Example : &fetch_swiss_seq(@ARGV);
Function : fetches swissprot entry or fasta format seq with
given seq name(like SAA_HORSE, SA*HORSE, SAA,..)
you can give multi files(SAA*, SAU*) at the same
time. This uses ENV setting of 'SWDIR'
Keywords : fetch_swissprot_sequence, fetch_sequence,
find_swiss_sequence, find_sequence, fetch
Options : _ for debugging.
# for debugging.
-f for fasta format file output
Version : 1.0
Function : reads a database and reports how many sequences it contains.
Only FASTA format databases are accepted for now.
Keywords : count_number_of_sequence, get_number_of_sequence
get_sequence_number_in_fasta
Options : _ for debugging.
# for debugging.
Version : 1.2
Example : &write_msp_files(@sso, 's', $out_file);
Function : Writes input which is already in msp file format to
files whose names are either given or generated.
If more than one ref of hash is given, this will
concatenate all the hashes into one big one to
make one file.
When NO output xxx.msp file name is given, it creates
one named after the query sequence.
Keywords : write_msp,
Options : _ for debugging.
# for debugging.
s for each single file output for each hash input
filename for putting output to the specified filename, should be xxx.msp
Returns : if 's' option is set, it will make say,
HI001.msp HI002.msp HI003.msp rather than
HI001HI002HI003.msp
eg of one output(single file case)
1027 0.0 1 154 HI0004 1 154 HI0004
40 0.0 84 132 HI0004 63 108 HI0001
31 0.0 79 84 HI0004 98 103 HI0003
Usage : &write_msp_files(\%in1, \%in2, ['s'], [$filename],,)
Version : 2.8
Warning : When NO output xxx.msp file name is given, it creates
with the query sequence name.
Example : &write_aln(\%hash, \$out_file_name);
CLUSTAL W (1.74) multiple sequence alignment
MMAF6040_1 -----MATDD--SIIVLDD----DDEDEA-AAQP-GPSNLPPN-PASTGPGPGLSQQATG
AF015956_1 -----MATAN--SIIVLDD----DDEDEA-AAQP-GPSHPLPN-AASPGAG---------
HSAB2381_80-900 KQRLLSVTSDEGSMNAFTGRGSPDTEIKINIKQESADVNVIGNKDVVTEEDLDVFKQAQE
.* : *: .: . * * : * . : * . . .
Function : writes multiple seqs. in msf format (takes one or more than one seq.!!)
Options :
$first_sequence_name= by f= # to put a certain seq at the first in writing
Usage : two arguments: $seq_hash_reference and $output_file_name
takes a hash which has got names keys and sequences values.
uses Perl5 pointers(references).
Version : 1.1
Example : @blocks_in_hash=@{&get_seqblock(\%msf, 30)};
Keywords : find_sequence_block, get_sequence_block,
make_seq_block, make_seqblock, find_seqblock
Options : _ for debugging.
# for debugging.
m= for margin length of the seqblock
t= for threshold
l= for min seqlet length
Version : 1.3
Keywords : add_seq_columns, add_sequence_columns,
Options : _ for debugging.
# for debugging.
Version : 1.2
Warning : if the attached name is too long (over 12 chars),
it changes to 'Added_upX' where X is a number.
Argument : accepts one single ref. of hash
Example : %block_start_end=%{&get_high_score_blocks(\%input_numb_block)};
%out=%{&get_high_score_blocks(\%inp_numbs, 'v', 'b')};
Function : gets hash of key and number string and filters out the
number string region which is below certain threshold
determined inside this sub and returns a selected high
number regions
Keywords : high_scoring_regions
get_high_scoring_blocks, find_blocks, get_blocks
Options : _ for debugging.
# for debugging.
b for best_block_opt, returns best block only
v for showing the final range hash output
c for connect close blocks
c= for connect close blocks with specific closing gap size
m= for margin length of the seqblock
t= for threshold
l= for min seqlet length
Usage : get_high_score_blocks()
Version : 1.4
Warning : This assumes that the inputs are multiply aligned seq
Function : gets the name of sequence used as enquiry(target)
Keywords : get_msp_target_sequence, get_msp_enquiry_sequence_name
Options : _ for debugging.
# for debugging.
Version : 1.0
Function : gets the name of sequence used as enquiry(target)
Keywords : get_msp_matched_sequence_name
Options : _ for debugging.
# for debugging.
Version : 1.0
Example : seq1 ------------------------------
|||||||||||
seq2 --------------------------------
OUT 000000000011111111111000000000000000000
Function : opens msp file and links the sequences according
to the matches.
Keywords : link_sequence_from_msp_file, linked_sequenced_length
get_clustered_sequence_length, get_annexed_sequence_length
connect_sequences, merge_sequences, combine_sequences
Options : _ for debugging.
# for debugging.
Returns : A ref. of an array
Version : 1.0
Author : jong@salt2.med.harvard.edu sat@mrc-lmb.cam.ac.uk
Function : The content of out %average is
$averaged{$position}=[$residue1, $sec_str2, $dif_reliability];
Keywords : get_average_predator_prediction, average_predator_prediction
get_averaged_sec_prediction, get_average_prediction
Options :
$reverse_order_of_one_hash=r by r
$give_weight_with_good_match=w by w # this is to give preference to well
matching sec. str. I add '0.1'
$weight_factor= by w=
Usage : %av_for_back_pred=%{&get_averaged_prediction(\%sec1, \%sec_rv)};
Version : 1.2
Function : This is a sub used for plot_domains.pl for
genome_analysis
Options : _ for debugging.
# for debugging.
Usage : &plot_vertically(\@query);
Version : 1.1
Example : @output=@{&condense_number_string(\@input, $factor)};
with @input=qw(1 2 4 10 10 22 2 3 44 2 3); and $factor=3
Function : condenses the numbers by making an average with
given factor. If the factor is 2 on number seq
1334284425 , result will be 23543
133428442 , 23541 <-- preserved end
Factor 3 =>
133428442 , (1+3+3)/3 = 2
(4+2+8)/3 = 4,,,
Keywords : compact_number_string, compact_digits, condense
condense_string
Options : _ for debugging.
# for debugging.
Version : 1.1
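The averaging described for condense_number_string can be sketched as below. This is an illustration under stated assumptions, not the library's code: condense_sketch is a hypothetical name, the averages are truncated to integers, and a short trailing group is still divided by the full factor. Both behaviours are inferred from the '1334284425 -> 23543' and '133428442 -> 23541' examples above.

```perl
use strict;
use warnings;

# Hypothetical sketch of the condensing step.
# Sums each run of $factor numbers, divides by $factor and truncates.
# A short trailing run is also divided by $factor (inferred from the examples).
sub condense_sketch {
    my ($in_ref, $factor) = @_;
    my @in = @$in_ref;                      # work on a copy, splice is destructive
    my @out;
    while (my @group = splice(@in, 0, $factor)) {
        my $sum = 0;
        $sum += $_ for @group;
        push @out, int($sum / $factor);
    }
    return \@out;
}

my @digits = split //, '1334284425';
print join('', @{ condense_sketch(\@digits, 2) }), "\n";   # prints 23543
```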
Example :
%test=('seq1', '1234AAAAAAAAAAAaaaaa', 'seq2', '1234BBBBBBB');
@range = ('1-4', '5-8');
%out = %{&get_seq_fragments(\%test, \@range)};
%out => (seq1_5-8 AAAAA
seq2_5-8 BBBBB
seq1_1-4 1234
seq2_1-4 1234 )
Function : gets sequence(string) segments with defined
ranges.
Keywords : get_sequence_fragments,
Options : _ for debugging.
# for debugging.
l= for min seqlet length
r for adding ranges in the seq names
Usage : @seq_frag=&get_seq_fragments(\%msf, @RANGE);
Version : 1.8
Author : jong@salt2.med.harvard.edu
Class : Utility
Example : &make_standalone_subroutines(@ARGV);
Function : Creates each subroutine derived xxx.pl file from B.pl or any
given library file. If there is a file for a sub already, it
skips.
Usage : &make_standalone_subroutines(@ARGV);
Version : 1.1
Argument : Ref of Hash, Array or just filename, and wanted column numbers.
Example : For getting only necessary columns
Input: %Hash=(1, 'col1 col2 col3',
2, 'col1 col2 col3',
3, 'col1 col2 col3');
input format: &get_column(\%Hash, 3,2,1, 'k'); # k is opt
Output format: STDOUT as
1 col3 col2 col1
2 col3 col2 col1
3 col3 col2 col1
Function : Prints any specified columns, can change order of them,
can filter values of columns to filter (max or min value)
Skips blank lines.
Keywords : columns, column.pl, column, get_columns, take_columns,
Options : # for debugging.
_ for debugging.
k for Key print when hash input is given.
n for no first line display(Handy when you have title line
and wanna remove it)
?max?=xxx for filtering column numbers by maximum of xxx
?min?=yyy for filtering column numbers by minimum of yyy
(eg, min4=100000 means 4th column minimum is 100000)
(eg, 1min4=10, 2min3=10, means get 4th column values
below 10 as the first output column. Get 3rd
column values below 10 as the second out column.
$combine = 1 by -c c # c is for combining columns in different files
$ignore = 1 by -i i # i is for ignoring leng diff in columns over 1 input
Returns : Ref of
Usage : &get_column(\@ar, 1,2 ,3);
&get_column(\%ha, 1,2 ,3);
&get_column(@ARGV);
# where prompt is like: column.pl temp.txt 1 2 3 4
Version : 1.5
Example : set_debug_option # <-- at prompt.
Function : If you put '#' or '##' at the prompt of any program which uses
this sub you will get verbose printouts for the program if the program
has a lot of comments.
Options : # for 1st level of verbose printouts
## for even more verbose printouts
$debug becomes 1 by '#' or '_'
$debug2 becomes 1 by '##' or '__'
Returns : $debug
Usage : &set_debug_option;
Version : 1.8
Argument : \%ref_of_seq
Example : @out=@{&write_sdb_file(\%seq, 'v')}; ## for STDOUT as well
___________________________________________________________________________
Title : EST_YEAST.sdb
Full Name : Telomerase_yeast_699aa
Nicknames :
EMBL :
PDB :
Swissprot :
Function : gets a hash ref. and writes the SDB file with 'sprintf'
Keywords : write_sdb
Options : v for verbose representation. This will print boxes on STDOUT
n for no '#' leader.
e for Endline( '-----------------------------..' )
Usage : @out=@{&write_sdb_file(\%seq)};
Version : 1.0
Warning : if version no. is null, it automatically puts '1.0'
Argument : two references. The first should be an array ref. The 2nd can be either
scalar or array reference.
Function : returns ref. of an array for a list of non-repetitive entry.
Keywords : add_if_not_already, add_element_if_not_already, if_not_already
add_element_if_not_already, push_element_if_not_already,
if_no_already_push, put_element_if_not_already, add_new_element
add_new_items_only, push_new_items_only, push_new_elements_only
put_if_not_already,
Returns : a ref. of an array.
Usage : @out=@{&push_if_not_already(\@mother_array, \@adding_array )};
@out=@{&push_if_not_already(\@mother_array, $adding_scalar)};
Version : 1.3
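A common way to get the non-repetitive list described for push_if_not_already is a %seen hash. The sketch below is only an illustration: push_unique_sketch is a hypothetical name, and accepting a plain scalar, a scalar ref or an array ref as the second argument is an assumption based on the Argument and Usage lines above.

```perl
use strict;
use warnings;

# Hypothetical sketch: appends new items to a copy of the mother array,
# keeping the first occurrence of every value and preserving order.
sub push_unique_sketch {
    my ($mother_ref, $adding) = @_;
    my @new = ref($adding) eq 'ARRAY'  ? @$adding
            : ref($adding) eq 'SCALAR' ? ($$adding)
            :                            ($adding);
    my %seen;
    my @out = grep { !$seen{$_}++ } @$mother_ref, @new;
    return \@out;
}

my @mother_array = qw(a b c);
my @adding_array = qw(b d);
my @out = @{ push_unique_sketch(\@mother_array, \@adding_array) };
print "@out\n";   # prints a b c d
```

Note that this version also drops duplicates already present in the mother array, which may or may not match the original sub.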
Argument : eg=> (\%ref_hash, 4)
Example : %stat=%{&get_peptide_occurance(\%pro_sequence, $size)};
while %pro_sequence has one or more sequences like
seq1 AAAAAAAAAAAA, seq2 BBBBBBBBBBBBBB, ...
$size is number. For dipeptide=2, tripeptide=3, tetrapep=4...
Function : gets the number of occurrences of peptides (of the given size) in
any number of sequences given.
Version : 1.2
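The counting described for get_peptide_occurance can be sketched with substr over every start position. This is a hedged illustration only: peptide_counts_sketch is a hypothetical name, and counting overlapping peptides (a window that moves one residue at a time) is an assumption the source does not confirm.

```perl
use strict;
use warnings;

# Hypothetical sketch: tallies every peptide of length $size in all
# sequences of the hash, using overlapping windows (assumption).
sub peptide_counts_sketch {
    my ($seq_ref, $size) = @_;
    my %count;
    for my $seq (values %$seq_ref) {
        # the range is empty when the sequence is shorter than $size
        for my $i (0 .. length($seq) - $size) {
            $count{ substr($seq, $i, $size) }++;
        }
    }
    return \%count;
}

my %pro_sequence = (seq1 => 'AAAA', seq2 => 'ABAB');
my %stat = %{ peptide_counts_sketch(\%pro_sequence, 2) };
print "$_ $stat{$_}\n" for sort keys %stat;
```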
Argument : \@array
Function : This produces a hash ref. of the value which is supposed to be most probable
according to the given array. It divides the array into halves and
keeps the more probable half until it gets one single number.
Keywords : get_frequent_halves,
Version : 1.0
Function : divides any array by the denominator given.
If you give array of 100 elem, with 5, you will
get 5 arrays with 20 elem each.
Keywords : split_array_into_pieces, split_array, chop_array,
fragment_array,
Options : s= for dividing the array with sub array size
eg) to get 20 elem length sub arrays from
a big array
@ar_ref=@{&divide_array(\@array, 's=20')};
Usage : &show_array(&divide_array(\@input, 6));
Version : 1.4
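The '100 elements by 5 gives 5 arrays of 20' behaviour can be sketched with splice. This is an illustration, not the library's divide_array: divide_array_sketch is a hypothetical name, and the handling of lengths that do not divide evenly (ceiling-sized chunks, so fewer pieces than requested are possible) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: cuts an array into $pieces chunks of equal size.
# For uneven lengths the chunk size is rounded up (assumption), so the
# last chunk may be shorter and fewer chunks may come back.
sub divide_array_sketch {
    my ($array_ref, $pieces) = @_;
    my @copy = @$array_ref;                              # splice is destructive
    my $size = int((@copy + $pieces - 1) / $pieces);     # ceiling division
    my @chunks;
    push @chunks, [ splice(@copy, 0, $size) ] while @copy;
    return \@chunks;
}

my @input  = (1 .. 100);
my @ar_ref = @{ divide_array_sketch(\@input, 5) };
printf "%d chunks, %d elements each\n", scalar(@ar_ref), scalar(@{ $ar_ref[0] });
```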
Example : &show_array( &divide_string(\%input, 3) );
where %input is ('seq', '12345789ABCDEFHIJKLMN')
The output will be 'seq_1_half', '1234578'
'seq_2_half', '9ABCDEF'
'seq_3_half', 'HIJKLMN'
Function : divides any string by the denominator given.
Keywords : divide_string, split_string, chop_string, divide_sequence
split_sequence(look at separate split_sequence sub),
Options :
$reverse_second_half=S by S -S
$reverse_first_half =F by F -F
$reverse_rest =R by R -R ## reversing all except the first
$reverse_all =A by A -A # reverse all the fragments
Usage : %out=%{&split_sequence(\%input, 2 )};
Version : 1.3
Example : &show_array( &divide_string(\$input, 3) );
while $input is '12345789ABCDEFHIJKLMN'
The output will be '1234578 9ABCDEF HIJKLMN'
Function : divides any string by the denominator given.
Keywords : divide_string, split_string, chop_string, divide_sequence
split_sequence(look at separate split_sequence sub),
Usage : &show_array(&divide_string(\$input, 6));
Version : 1.4
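The string division in the example above (21 characters into 3 pieces of 7) can be sketched with a single global match. This is not the library's divide_string: divide_string_sketch is a hypothetical name, and rounding the piece length up for uneven strings is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: splits a string into $pieces fragments of equal
# length; the piece length is rounded up for uneven strings (assumption).
sub divide_string_sketch {
    my ($str_ref, $pieces) = @_;
    my $len  = length $$str_ref;
    my $size = int(($len + $pieces - 1) / $pieces);   # ceiling division
    return [] if $size == 0;                          # empty input string
    my @frags = $$str_ref =~ /(.{1,$size})/gs;
    return \@frags;
}

my $input = '12345789ABCDEFHIJKLMN';
my @frags = @{ divide_string_sketch(\$input, 3) };
print "@frags\n";   # prints 1234578 9ABCDEF HIJKLMN
```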
Function : write html format headbox explanation with
given hashes of headbox content.
Keywords : write_headbox_html, write headbox in html,
write_headbox_files
Options : 'd' for date inclusion at the top of the page
f= for default ftp dir name
Usage : &write_html_headbox($outfilename, \%entries);
Version : 1.7
Warning : It takes off the last '/' when $URL has it
Argument : One or None. If you give an argu. it should be a ref. of an ARRAY
or a filename, or ref. of a filename.
If no arg is given, it reads SELF, ie. the program itself.
Example : Output is something like
('Title', 'read_head_box', 'Tips', 'Use to parse doc', ...)
Keywords : read_sdb_files,read_sdb,
Options : 'b' for remove blank lines. This will remove all the entries
with no descriptions
Returns : A hash ref.
Usage : %entries = %{&open_sdb_files(\$file_to_read )};
Version : 1.1
Argument : 1 hash ref which has model name and template name -> (\%hash)
where %hash is (modelname, templatename)
Example :
$modelname = 'gfct';
$template = '1ovt';
%hash=($modelname, $template);
&write_modeller_top_file(\%hash);
Function : Writes Modeller command file format.
Options : v for verbose. You will get STDOUT of the result as well as file
Returns : a file of xxxx.top form.
Usage : &write_modeller_top_file(\%hash, [v]);
Version : 1.0
Argument : 2 ref. of hash for seq. and optional output.name and option(s).
If second input hash (for template) has 3rd and 4th element which are
numbers they are regarded as the starting and ending number of the
template(i.e. pdb file seq)
Example :
$out = 'test.ali';
%model = qw(model AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAccccccccccc);
%template = qw(templ CCAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCC 3 42);
&write_modeller_ali_file(\%model, \%template, \$out);
Function : Writes Modeller alignment format.
Options : You can put 2 numbers for the second set of key and element for
the second hash input as the starting and ending points of
template(i.e. pdb file seq). Otherwise, I calculate the size of seq.
By default, it reads PDB file defined by ENV setting of 'PDB' and
gets the starting number of pdb. If starting number is defined
explicitly at input hash, the given starting number is used instead
of PDB's.
v for verbose. You will get STDOUT of the result as well as file
Returns : a file of xxxx.ali form.
Usage : &write_modeller_ali_file(\%model, \%template, [\$outfilename], [v]);
Version : 1.0
Function : makes template of sec. str. like: 'H5 E4 E2' out of '__HHHHH__EEEE__EE__'
Usage : %target = %{&make_template_from_sec_str(\%seq)};
Version : 1.1
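The conversion from '__HHHHH__EEEE__EE__' to 'H5 E4 E2' amounts to run-length encoding of the H and E runs. A minimal sketch follows, assuming only H and E count as segment letters; sec_str_template_sketch is a hypothetical name and works on one string rather than on a hash as make_template_from_sec_str does.

```perl
use strict;
use warnings;

# Hypothetical sketch: run-length encodes H and E segments of one
# secondary structure string; every other character is ignored.
sub sec_str_template_sketch {
    my ($sec_str) = @_;
    my @segments;
    # $1 is the whole run, $2 its letter; \2* extends the run of that letter
    while ($sec_str =~ /(([HE])\2*)/g) {
        push @segments, $2 . length($1);
    }
    return join ' ', @segments;
}

print sec_str_template_sketch('__HHHHH__EEEE__EE__'), "\n";   # prints H5 E4 E2
```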
Author : jong@salt2.med.harvard.edu
Function : Writes subroutine file xxxx.psub with given headbox including
hash
Usage : @out_file=@{&write_subroutines(\%head_box)};
Version : 1.0
Function : returns ALL subroutines with the keys as subroutine names
with version like ('show_array2.2' => 'subroutine in one string')
It reports the subroutines not found in searched file(s)
Options : 'nv' for no version attachment in the keys of returning hash of subroutines
'r' for getting remnant file content rather than the sub routines
't' for leaving the original file without the sub routines taken.
$separate_hash_entry_opt=s by s
Usage : @out_subs=@{&read_subroutines(\@file, $separate_hash_entry_opt)}; or
%out_subs=%{&read_subroutines(\@file)};
Version : 1.2
Function : returns subroutines with the keys as subroutine names with version
like in the form( 'show_array2.2' => 'subroutine in one string')
It reports the subroutines not found in searched file(s). This
requires the names of sub you want while read_subroutines will
read any subroutines with their headbox to a hash.
Options : 'nv' for no version attachment in the keys of returning hash of subroutines
'r' for getting remnant file content rather than the sub routines
't' for leaving the original file without the sub routines taken.
'h' for headbox only output.
Version : 2.5
Example : &update_subroutines($file, \%fetched_subs);
Function : replaces subroutines of given file(s) with supplied subs.
If the given subroutine versions are not higher than the
ones in the program, no upgrade would happen.
This can read version information from '# Version : 1.0' line
or sub xxxxx{ # Version : 1.0 line
Keywords : upgrade_subroutines,
Usage : &update_subroutines(\@file, \%fetched_subs);
Version : 2.8
Function : returns subroutines with the keys as subroutine names with version
like in the form( 'show_array2.2' => 'subroutine in one string')
It reports the subroutines not found in searched file(s)
fetch_subroutines also has this feature.
Keywords : take_out_subroutines, take_subroutines, cut_subroutines,
cutout_subroutines, remove_subroutines
Options : 'nv' for no version attachment in the keys of returning hash of subroutines
'r' for getting remnant file content rather than the sub routines
Version : 1.5
Warning : If there is no headbox and version no. It thinks the version
is 1.0
Function : gets all the subroutine calls( like &show_hash ) in the given
file name or array of lines which is the content of a file,
text etc. If there is no input arg, it reads the running
program as default input
Keywords : get_sub_names,get_subroutine_names, get_sub_calls,
get_subroutine_calls, find_sub_calls, find_subroutine_calls
Usage : @sub_name_array= @{&get_subroutine_calls(\@AR)};
Version : 2.2
Argument : Nothing in a program.
Example : &set_special_options.pl ## <-- at prompt.
Function : If you put special chars like '#' or '##', '###..' at the
prompt of any program which uses
this sub you will get verbose printouts for the program if
the program has a lot of comments.
Options : # for 1st level of debugging printouts
## for even more debugging printouts
+ for more outputs(more calculations are shown, like statistics)
++ even more outputs.
$DEBUG becomes 1 by '#'
$DEBUG2 becomes 1 by '##'
$VERBOSE becomes 1 by '+'
$VERBOSE2 becomes 1 by '++'
Returns : $debug, $verbose
Usage : &set_special_options;
Version : 1.0
generalized debug var is added for more verbose printouts.
Example : set_debug # <-- at prompt.
Function : If you put '#' or '##' at the prompt of any program which uses
this sub you will get verbose printouts for the program if the program
has a lot of comments.
Options : # for 1st level of verbose printouts
## for even more verbose printouts
$debug becomes 1 by '#' or '_'
$debug2 becomes 1 by '##' or '__'
Returns : $debug
Usage : &set_debug;
Version : 1.8
generalized debug var is added for more verbose printouts.
Function : This is the core part of any window (of sequences)
scanning function.
Keywords : scan_sequence, scan_window
Usage : @out_array = @{&do_window_scan(\@input_array, $win_size)};
Often, bioters (Bio Computer Scientists) need to scan long sequences
of DNA or protein (like ABADFAFASDFASFASDFDFA or 109384717817947) to
calculate something out of them.
This routine provides such a scanning
function in perl.
Version : 1.3
Function : scans any given length window of sequence and computes something.
Options : average for getting average of given window size.
sum for getting sum of given window size.
Version : 1
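The window scanning idea, with the 'average' and 'sum' options, can be sketched as follows. This is an illustration under assumptions, not the library's do_window_scan: window_scan_sketch is a hypothetical name, the window moves one position at a time, and the mode is passed as a third argument.

```perl
use strict;
use warnings;

# Hypothetical sketch: slides a window of $win_size over a list of numbers
# and returns the sum or the average of every window position.
sub window_scan_sketch {
    my ($in_ref, $win_size, $mode) = @_;
    $mode ||= 'average';
    my @out;
    # the range is empty when the window is longer than the input
    for my $start (0 .. @$in_ref - $win_size) {
        my $sum = 0;
        $sum += $_ for @{$in_ref}[ $start .. $start + $win_size - 1 ];
        push @out, $mode eq 'sum' ? $sum : $sum / $win_size;
    }
    return \@out;
}

my @input_array = (1, 2, 3, 4, 5);
print join(' ', @{ window_scan_sketch(\@input_array, 3, 'sum') }), "\n";   # prints 6 9 12
print join(' ', @{ window_scan_sketch(\@input_array, 3) }), "\n";          # prints 2 3 4
```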
Example :
- - - - - Example of blastp file - - - - - - - - - - - - - - - - - - - - - - - - -
BLASTP 1.4.8 [19-Dec-94] [Build 16:06:14 Jul 26 1995]
Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.
215:403-10.
Query= 1mbs
(153 letters)
Database: /nfs/ind4/ccpe1/people/A Biomatic /jpo/align/all_in_fasta.fas
406 sequences; 77,134 total letters.
Searching..................................................done
WARNING: -hspmax 100 was exceeded with 13 of the database sequences, with as
many as 173 HSPs being found at one time.
                                                       Smallest
                                                         Sum
                                                 High  Probability
Sequences producing High-scoring Segment Pairs:  Score  P(N)      N
1mbs                                               804  2.0e-109  1
1pmb                                               718  1.4e-97   1
1ymb                                               707  4.7e-96   1
2xxx                                                31  0.55      1
Function : This reads the output of the blastp program (xxxx.bla or whatever file extension
you attached) and produces the names of found sequences which are
better than (smaller in probability than) a certain threshold in the blast result.
For example, it will produce a reference of an array (@hits, in the code)
which contains (1mbs, 1pmb, 1ymb) from the example in this header box
when you give a threshold of, say, 0.0001.
Keywords : bla2fasta, take_blast_hits
Usage : @array_of_names = @{&read_blast_hits(\$file_name, \$threshold)};
Version : 1.1
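The threshold filtering described above can be sketched on the hit table from the example. This is not the library's read_blast_hits: blast_hits_sketch is a hypothetical name, it takes the report as an array of lines rather than a file name, and it assumes the simple four-column rows (name, score, P(N), N) shown in the example, without description text.

```perl
use strict;
use warnings;

# Hypothetical sketch: collects hit names whose P(N) is below the
# threshold from the summary table of a blastp report.
sub blast_hits_sketch {
    my ($lines_ref, $threshold) = @_;
    my (@hits, $in_table);
    for my $line (@$lines_ref) {
        if ($line =~ /^Sequences producing/) { $in_table = 1; next; }
        next unless $in_table;
        # columns: name, score, P(N), N
        if ($line =~ /^\s*(\S+)\s+(\d+)\s+([\d.]+(?:e[-+]?\d+)?)\s+(\d+)\s*$/i) {
            push @hits, $1 if $3 < $threshold;
        }
    }
    return \@hits;
}

my @report = (
    'Query= 1mbs',
    'Sequences producing High-scoring Segment Pairs: Score P(N) N',
    '1mbs 804 2.0e-109 1',
    '1pmb 718 1.4e-97 1',
    '1ymb 707 4.7e-96 1',
    '2xxx 31 0.55 1',
);
my @array_of_names = @{ blast_hits_sketch(\@report, 0.0001) };
print "@array_of_names\n";   # prints 1mbs 1pmb 1ymb
```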
Argument : 3 arg. One is the string, second is the interval number, third is
the gap separater
Example : "1234567890123456789012345678901234567890" will be
"1234567890 1234567890 1234567890 1234567890"
with
&put_gaps_every_x_position_in_string(\$test, 10, ' ')
Keywords : put_space_in_sequence, put_gaps_in_sequence, put_gaps,
put_space
Returns :
every char.
Version : 1.1
Warning : it does not return a reference
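The gap insertion shown in the example can be sketched with a list-context match and join. This is an illustration only: put_gaps_sketch is a hypothetical name. Like the documented sub, it returns a plain string rather than a reference.

```perl
use strict;
use warnings;

# Hypothetical sketch: inserts $gap after every $interval characters
# (no trailing gap) and returns a plain string.
sub put_gaps_sketch {
    my ($str_ref, $interval, $gap) = @_;
    return join $gap, $$str_ref =~ /(.{1,$interval})/gs;
}

my $test = '1234567890' x 4;
print put_gaps_sketch(\$test, 10, ' '), "\n";
# prints 1234567890 1234567890 1234567890 1234567890
```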
Argument : hash(es) and Matrix or table for conversion.
Example :
IN => to transform E and H to 9 and 4
1cdg_6taa -------EEE-----------HH--HHHH------EE---------EEE-
1cdg_2aaa -------EEE-----------HH--HHHH------EE---------EEE-
2aaa_6taa -------EEEEE------EE-HHHHHHHH----EEEE-------EEEEE-
OUT
1cdg_6taa -------999-----------44--4444------99---------999-
1cdg_2aaa -------999-----------44--4444------99---------999-
2aaa_6taa -------99999------99-44444444----9999-------99999-
Function : transform any value to another value with given table, matrix..
This is used to transform Amino Acid to its various propensities
If you feed a sequence 'ACDEDA', this transforms it to
'124741' if the table given is 'A->1, C->2, D->4, E->7'
Returns : hash(es)
Sheraga_alpha_matrix
Richardson_alpha_matrix or any conversion table made in a hash.
Usage : Used in predict_secondary_structure
Version : 1.0
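The table-driven conversion ('ACDEDA' to '124741') can be sketched with a hash lookup per character. This is a hedged illustration: convert_by_table_sketch is a hypothetical name, it works on one string instead of whole hashes, and leaving characters that are missing from the table unchanged is an assumption suggested by the '-' columns in the example.

```perl
use strict;
use warnings;

# Hypothetical sketch: replaces every character found in the table by its
# table value; characters without an entry are kept as they are.
sub convert_by_table_sketch {
    my ($seq, $table_ref) = @_;
    return join '',
        map { exists $table_ref->{$_} ? $table_ref->{$_} : $_ } split //, $seq;
}

my %table = (A => 1, C => 2, D => 4, E => 7);
print convert_by_table_sketch('ACDEDA', \%table), "\n";   # prints 124741
```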
Argument : Two references of hashes. One for error rate the other for sec.
assignment.
Example : First block is for the first hash input
and Second is for the second hash input.
1cdg_6taa 00000442222222222242222222222777700000007000000000
1cdg_2aaa 00000442222222222242222222222777700000007000000000
2aaa_6taa 00000000000000000000000000000000000000000000000000
1cdg_6taa -------EEE-----------EE--EEEE------EE---------EEE-
1cdg_2aaa -------EEE-----------EE--EEEE------EE---------EEE-
2aaa_6taa -------EEEEE------EE-EEEEEEEE----EEEE-------EEEEE-
2aaa_6taa -------00000---------00000000----0000-------00000-
1cdg_6taa -------442---------------2222-----------------000-
1cdg_2aaa -------222---------------2222-----------------000-
2aaa_6taa 0%
1cdg_6taa 67%
1cdg_2aaa 67%
Function : calculates the secondary structure segment shift rate.
Options : 'p' or 'P' for percentage term(default)
'r' or 'R' for ratio term (0.0 - 1.0), where 1 means all the
segments were wrongly aligned.
's' or 'S' for Shift rate (it actually calculates the position shift
rate for the secondary structure segment).
'h' or 'H' for position Shift rate (it actually calculates the position
shift rate for helical segments). If this is the only option, it
will show the default percentage term rate for helical segments.
If used with 'r', it will give you ratio (0.0 - 1.0) for helical
segment. If used with 's' option, it will give you position shift
rate for only helical segments.
'e' or 'E' for position Shift rate (it actually calculates the position
shift rate for beta segments). If this is the only option, it will
show the default percentage term rate for beta segments. If used
with 'r', it will give you ratio (0.0 - 1.0) for beta. If used
with 's' option, it will give you position shift rate for only
beta segments.
Usage : &get_segment_shift_rate(\%hash_for_errors, \%hash_for_sec_str);
Version : 1.1
Example : hash of 3 keys and values.
2aaa_6taa -------00000---------00000000----0000-------00000-
1cdg_6taa -------442---------------2222-----------------000-
1cdg_2aaa -------222---------------2222-----------------000-
In the above there are two segments wrong in 3 segment blocks = 2/3
Argument : hashes and [options]. No options result in default of 'H3', 'E3'
Example : print_seq_in_block(&tidy_secondary_structure_segments(\%hash, 'e4', 'h4'), 's');
1cdg_2aaa -------EEE-----------EE--EEEE------EE---------EEE-
1cdg_6taa -------EEE-----------EE--EEEE------EE---------EEE-
2aaa_6taa -------EEEEE------EE-EEEEEEEE----EEEE-------EEEEE-
1cdg_6taa -------------------------EEEE---------------------
1cdg_2aaa -------------------------EEEE---------------------
2aaa_6taa -------EEEEE---------EEEEEEEE----EEEE-------EEEEE-
Function : receives any secondary structure assignment hashes and
tidies them up. That is, it removes very short secondary structure
regions like ( --HH--, -E-, -EE- ) according to the minimum
segment lengths (thresholds) you give.
Options : something like 'H3' or 'E3' for minimum segment length set to 3 positions.
Returns : array of references of hashes.
Usage : print_seq_in_block(&tidy_secondary_structure_segments(\%hash, 'e4', 'h4'), 's');
Version : 1.0.0
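The removal of short segments can be sketched with a substitution that blanks every run shorter than the minimum. This is not the library's tidy_secondary_structure_segments: tidy_segments_sketch is a hypothetical name, it handles one string, and it takes the minimum lengths as a (letter => length) list instead of options like 'e4' and 'h4'.

```perl
use strict;
use warnings;

# Hypothetical sketch: replaces runs of a segment letter that are shorter
# than the given minimum by '-' characters, keeping the string length.
sub tidy_segments_sketch {
    my ($sec_str, %min) = @_;              # e.g. (E => 4, H => 4)
    for my $type (keys %min) {
        $sec_str =~ s/($type+)/length($1) >= $min{$type} ? $1 : '-' x length($1)/ge;
    }
    return $sec_str;
}

my $before = '--EEE--EE-EEEE--HH-HHHHH-';
my $after  = tidy_segments_sketch($before, E => 4, H => 4);
print "$before\n$after\n";
```

With both minimums set to 4, only the EEEE and HHHHH runs should survive, as in the 'e4', 'h4' example above.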
Argument : hashes and [options]. No options result in default of 'H3', 'E3'
Example : print_seq_in_block(&define_secondary_structure_segments(\%hash, 'e4', 'h4'), 's');
1cdg_2aaa -------EEE-----------EE--EEEE------EE---------EEE-
1cdg_6taa -------EEE-----------EE--EEEE------EE---------EEE-
2aaa_6taa -------EEEEE------EE-EEEEEEEE----EEEE-------EEEEE-
1cdg_6taa -------------------------EEEE---------------------
1cdg_2aaa -------------------------EEEE---------------------
2aaa_6taa -------EEEEE---------EEEEEEEE----EEEE-------EEEEE-
Function : receives any secondary structure assignment hashes and
tidies them up. That is, it removes very short secondary structure
regions like ( --HH--, -E-, -EE- ) according to the given minimum
lengths of segments.
Options : something like 'H3' or 'E3' for minimum segment length set to 3 positions.
Returns : array of references of hashes.
Usage : print_seq_in_block(&define_secondary_structure_segments(\%hash, 'e4', 'h4'), 's');
Version : 1.0
Argument : 2 ref for hash of identical keys and value length.
Example : %out =%{&overlay_seq_by_certain_chars(\%hash1, \%hash2, 'E')};
output> with 'E' option >>> "name1 --HHH--1232-"
Function : (name1 000000112324)+(name1 ABC..AD..EFDK ) => (name1 000..00..12324)
(name2 000000112324)+(name2 --HHH--EEEE-- ) => (name1 ---000--1123--)
uses the second hash as a template for the first sequences. gap_char is
'-' or '.' or any given char or symbol.
To insert gaps rather than overlap, use insert_gaps_in_seq_hash
Keywords : Overlap, superpose hash, overlay, superpose_seq_hash
Options : E for replacing all 'E' occurrences in ---EEEE--HHHH----, etc.
: H for replacing all 'H' " " "
Returns : one hash ref.
Usage : %out =%{&overlay_seq_by_certain_chars(\%hash1, \%hash2, 'HE')};
Version : 1.0
Warning : If gap_chr ('H',,,) is not given, it replaces all the
non-gap chars (normal alphabet), ie,
it becomes 'superpose_seq_hash'
Argument : one pdb coordinate file reference
Example :
The INPUT example >
ATOM 191 CA ALA 195 -2.566 8.099 42.827 1.00 12.42 1ENG 256
ATOM 192 CA ARG 196 -1.401 11.546 41.629 1.00 8.63 1ENG 257
ATOM 193 CA THR 197 -4.073 13.846 43.107 1.00 9.93 1ENG 258
The OUTPUT example >
ATOM 1 CA ALA 1 -2.566 8.099 42.827 1.00 12.42 1ENG 256
ATOM 2 CA ARG 2 -1.401 11.546 41.629 1.00 8.63 1ENG 257
ATOM 3 CA THR 3 -4.073 13.846 43.107 1.00 9.93 1ENG 258
<2nd file, called xxxx2.atm >
ATOM 1 CA THR 1 -4.073 13.846 43.107 1.00 9.93 1ENG 258
ATOM 2 CA ARG 2 -1.401 11.546 41.629 1.00 8.63 1ENG 257
ATOM 3 CA ALA 3 -2.566 8.099 42.827 1.00 12.42 1ENG 256
Function : reorders the lines of any pdb files, but takes only C alpha positions.
Options : None
Returns : directly writes two output files xxxx1.atm xxxx2.atm
Usage : &rev_lines_pdb(\$ARGV[0]);
Version : 1.0
Warning : A Biomatic
Argument : (\%hash1, \%hash2) or optionally (\%hash1, \%hash2, ['n', 'i', 'p', 'a'])
'n' => normalizing, 'p' => percentage out, 'i' => make int out, 'a'=> averaged
Example : you put two hash refs. (associative arrays) as args (\%hash1, \%hash2)
The hashes are like; hash1 (name1, 0000011111, name2, 0000122222 );
hash2 (name3, 1324..1341, name4, 13424444.. );
1) The resulting 1st hash output is (0, 20, 1, 13, 2, 12)
which means that 0 added up to 20 in the second arg hash positions
1 added up to 13 in the second arg hash positions
2 added up to 12 in the second arg hash positions
2) The resulting 2nd hash output is (0, 9, 1, 6, 2, 5)
which means that 0 occurred 9 times in the first input hash
1 occurred 6 times in the first input hash
2 occurred 5 times in the first input hash
'p' option only works with 'n' or 'a'
Function : Makes hashes of tallied occurrences and summed up values for digits in
positions.
calculates the occurrences or occurrence rates of CS rate positions.
The hashes should have numbers.
Keywords : tally two hashes of numbers.
Options : [a n i p]
Returns : ($ref1, $ref2), ie, two references of hash
averaging option causes division of 20(added up value)
by 9(occurrence) in the above
for '0' of the first hash, so (0, 2.222, 1, 2.1666, 2, 2.4 )
Average is the average of numbers
average value in 0-9 scale (or 0-100 with 'p' option)
So, if there are
seq1 00111110000, The 'a' value of 0 and 1 as in the seq2
seq2 33000040000 is 0-> 6/6, 1-> 4/5, while the 'n'
calc would be, 0-> 6 (60%), 1-> 4(40%)
Usage : ($ref1, $ref2) = &tally_2_hashes(\%hash1, \%hash2, ['n', 'a', 'p', 'i']);
%tally_addedup=%{$ref1}; '0' position had addedup value of 1000
%tally_occurances=%{$ref2}; '0' position had occurred 100 times,
'0' on average had 10 in its
corresponding hash positions
Version : 1.2
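A minimal sketch of the tally described above, not the library's own code. The
name tally_2_hashes_sketch is hypothetical, the two hashes are paired by sorted
key order (an assumption, since the example keys differ), and the n/a/p/i
options are left out. A gap in the second hash counts as an occurrence but adds
nothing to the sum, which is what reproduces the sums (0, 20, 1, 13, 2, 12) and
the averages 2.222, 2.1666 and 2.4 quoted in this entry.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\%added_up, \%occurrences).
sub tally_2_hashes_sketch {
    my ($h1, $h2) = @_;
    my (%added_up, %occurrences);
    my @k1 = sort keys %$h1;    # assumption: pair hashes by sorted key order
    my @k2 = sort keys %$h2;
    for my $i (0 .. $#k1) {
        last if $i > $#k2;
        my @class = split //, $h1->{ $k1[$i] };
        my @value = split //, $h2->{ $k2[$i] };
        for my $p (0 .. $#class) {
            next unless $class[$p] =~ /^\d$/;
            $occurrences{ $class[$p] }++;
            # Gaps ('.' or '-') in the second hash add nothing to the sum.
            $added_up{ $class[$p] } += $value[$p]
                if defined $value[$p] && $value[$p] =~ /^\d$/;
        }
    }
    return (\%added_up, \%occurrences);
}

my %hash1 = (name1 => '0000011111', name2 => '0000122222');
my %hash2 = (name3 => '1324..1341', name4 => '13424444..');
my ($sum_ref, $occ_ref) = tally_2_hashes_sketch(\%hash1, \%hash2);
for my $digit (sort keys %$occ_ref) {
    printf "%s: sum=%d n=%d avg=%.3f\n", $digit, $sum_ref->{$digit},
        $occ_ref->{$digit}, $sum_ref->{$digit} / $occ_ref->{$digit};
}
```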
Argument : 2 refs. for hash of identical keys and value length and gap_chr.
Function : (name1 000000112324)+(name1 ABC..AD..EFD ) => (name1 000..01..324)
uses the second hash as a template for the first sequences. gap_char is
'-' or '.'
To insert gaps rather than overlap, use insert_gaps_in_seq_hash
Keywords : overlay sequence, overlay alphabet, superpose sequence,
Returns : one hash ref.
Usage : %out =%{&superpose_seq_hash(\%hash1, \%hash2)};
Version : 1.0
Warning : Accepts only two HASHes and many possible gap_chr. Default gap is '-'
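The overlay rule in the Function line can be sketched as follows. This is an
illustration under stated assumptions, not the library's code: superpose_sketch
is a hypothetical name, gap characters default to '-' and '.', and keys missing
from the template hash are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: wherever the template has a gap character, the gap is
# copied into the output; everywhere else the first sequence's char is kept.
sub superpose_sketch {
    my ($seq_ref, $tmpl_ref, @gap_chars) = @_;
    @gap_chars = ('-', '.') unless @gap_chars;
    my %is_gap = map { $_ => 1 } @gap_chars;
    my %out;
    for my $name (keys %$seq_ref) {
        next unless exists $tmpl_ref->{$name};
        my @seq  = split //, $seq_ref->{$name};
        my @tmpl = split //, $tmpl_ref->{$name};
        my $last = $#seq < $#tmpl ? $#seq : $#tmpl;
        my $line = '';
        for my $p (0 .. $last) {
            $line .= $is_gap{ $tmpl[$p] } ? $tmpl[$p] : $seq[$p];
        }
        $out{$name} = $line;
    }
    return \%out;
}

my %out = %{ superpose_sketch({ name1 => '000000112324' },
                              { name1 => 'ABC..AD..EFD' }) };
print "name1 $out{name1}\n";
```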
Argument : 2 refs. for hash of identical keys and value length and gap_chr.
Function : (name1 000000112324)+(name1 ABC..AD..EFD ) => (name1 000..01..324)
uses the second hash as a template for the first sequences. gap_char is
'-' or '.'
To insert gaps rather than overlap, use insert_gaps_in_seq_hash
Returns : one hash ref.
Usage : %out =%{&overlay_seq_hash(\%hash1, \%hash2)};
Version : 1.0
Warning : Accepts only two HASHes and many possible gap_chr. Default gap is '-'
Argument : 2 ref for hash of identical keys and value length.
Function : superpose two hashes of the same sequence or same seq. length sequences,
but unlike 'superpose_seq_hash', this inserts gaps and extends the
sequences.
(name1_sec hHHHHHH EEEEEEE) +
(name1_seq .CDEABC..AD..EFD..EKST) => (name1_ext .hHHHHH..H...EEE..EEEE)
In the example, the undefined sec. str. position is replaced as gaps('.')
Uses the second hash as a template for the first sequences. gap_char is
'-' or '.'
One rule is that the SECOND hash contains gaps!!
There are two types of hash input. One is simple seq hash(both args)
The other is from secondary structure prediction. The hash has contents
like: $averaged{$position}=[$residue1, $sec_str2, $dif_reliability];
Keywords : superposing sequences with gaps, interpolate_sequences, interpolate_gaps
Returns : one hash ref.
Usage : %out_extended_seq =%{&insert_gaps_in_seq_hash(\%hash1, \%hash2)};
Version : 1.3
Warning : coded by A Biomatic
Example : input hash: ( seq1, '13241234141234234', (2 or more sequences accepted)
seq2, '1341324123413241234')
input winsize : 5;
output hash; (seq1, 1234123413241234);
output hash; (seq2, 1344234123412341);
The numbers are ratios(compos/seqid) with given window size.
Usage : %out1 = %{&scan_win_get_av(\%input, \$window_size, \%input2,,,,)};
The order of the arguments doesn't matter.
Version : 1.0
Argument : One ref. for hash, one ref. for a scalar.
Example : input hash: ( seq1, 'ABCDEFG.HIK', (2 or more sequences accepted)
seq2, 'DFD..ASDFAFS',
seq3, 'DDDDD..ASDFAFS' );
input winsize : 5;
output hash; (seq1seq2, 1,2,2,2,1,1,2,2); <-- joined by ',';
output hash; (seq1seq3, 1,2,2,2,1,1,2,2); <-- joined by ',';
The numbers are ratios(compos/seqid) with given window size.
Function : scans input sequences(arg1) in a given(arg2) window size and gets
each composition and sequence identity rate(sc_rate) of the window.
sc rate = Sequence Id(%)/ Composition Id(%)
Returns : a reference of a hash.
Usage : %out1 = %{&scan_win_and_get_sc_rate_pairs(\%input, \$window_size)};
Version : 1.1
Warning : when $seqid is zero the rate becomes $compos_id/10 !!!
Argument : (\@input, \$window_size); @input => ('ABCDEFG.HIK', 'DFD..ASDFAFS', 'ASDFASDFASAS');
Input ar => ( 'ABCDEFG.HIK'
'DFD..ASDFAFS'
'ASDFASDFASAS' ) as the name of @sequences.
Author : A Biomatic
Function : actual working part of scan_windows_and_get_compos_seqid_rate
Returns : \@ratio_array, \$ratio_whole_seq
Usage : @out_rate = @{&get_windows_compos_and_seqid_rate_array(\@seq, \$win_size)};
Version : 1.0
Argument : One ref. for hash, one ref. for a scalar.
Example : input hash: ( seq1, 'ABCDEFG.HIK', (2 or more sequences accepted)
seq2, 'DFD..ASDFAFS',
seq3, 'DDDDD..ASDFAFS' );
input winsize : 5;
output hash; (seq1seq2, 1,2,2,2,1,1,2,2); <-- joined by ',';
output hash; (seq1seq3, 1,2,2,2,1,1,2,2); <-- joined by ',';
The numbers are ratios(compos/seqid) with given window size.
Function : scans input sequences(arg1) in a given(arg2) window size and gets
each composition and sequence identity rate(cs_rate) of the window.
CS rate = Composition Id / Sequence Id
Returns : a reference of a hash.
It is getting the entropy of the column and calculates something after.
Usage : %out1 = %{&scan_win_and_get_cs_rate_pairs(\%input, \$window_size)};
Version : 1.0
Warning : when $seqid is zero the rate becomes $compos_id/10 !!!
Argument : Takes a ref. for hash which have positions of residues of sequences.
Function : This is the final step in error rate getting.
gets a ref. of a hash and calculates the absolute position diffs.
Options : 'L' for limiting the error rate to 9 to make one digit output
$LIMIT becomes 'L' by L, l, -l, -L
Returns : one ref. for an array of differences of input arrays. array context.
---Example input (a hash with sequences); The values are differences after
comparison with structural and sequential alignments.
%diffs =('seq1', '117742433441...000', <-- input (can be separated by '' or ',').
'seq2', '12222...99999.8888',
'seq3', '66222...44444.8822',
'seq4', '12262...00666.772.');
example output;
seq3_seq4 '0,1,0,0,0,.,.,.,,.,0,,0,0,,0,0,,.,0,,0,0,.'
seq1_seq2 '0,1,0,1,1,.,.,.,,.,2,,2,2,,2,2,,.,.,,2,2,1'
seq1_seq3 '0,1,0,1,1,.,.,.,,.,1,,1,1,,0,.,,.,.,,1,1,1'
seq1_seq4 '0,1,0,,1,1,.,.,.,,.,1,,1,1,0,.,.,,.,1,,2,2'
seq2_seq3 '0,1,0,,0,0,,.,.,,.,0,,1,0,,0,0,,.,0,,0,0,0'
seq2_seq4 '0,0,0,,1,0,,.,.,,.,0,,1,0,,0,0,,.,0,,0,0,.'
Usage : %position_diffs =%{&get_residue_error_rate(\@seq_position1, \@seq_position2)};
Version : 1.1
Warning : split and join char is ',';
Argument : Takes a ref. for hash which have positions of residues of sequences.
Function : This is the final step in error rate getting.
gets a ref. of a hash and calculates the position diffs.
Options : 'L' for limiting the error rate to 9 to make one digit output
$LIMIT becomes 'L' by L, l, -l, -L
Returns : one ref. for an array of differences of input arrays. array context.
---Example input (a hash with sequences); The values are differences after
comparison with structural and sequential alignments.
%diffs =('seq1', '117742433441...000', <-- input (can be separated by '' or ',').
'seq2', '12222...99999.8888',
'seq3', '66222...44444.8822',
'seq4', '12262...00666.772.');
example output;
seq3_seq4 '0,1,0,0,0,.,.,.,,.,0,,0,0,,0,0,,.,0,,0,0,.'
seq1_seq2 '0,1,0,1,1,.,.,.,,.,2,,2,2,,2,2,,.,.,,2,2,1'
seq1_seq3 '0,1,0,1,1,.,.,.,,.,1,,1,1,,0,.,,.,.,,1,1,1'
seq1_seq4 '0,1,0,,1,1,.,.,.,,.,1,,1,1,0,.,.,,.,1,,2,2'
seq2_seq3 '0,1,0,,0,0,,.,.,,.,0,,1,0,,0,0,,.,0,,0,0,0'
seq2_seq4 '0,0,0,,1,0,,.,.,,.,0,,1,0,,0,0,,.,0,,0,0,.'
Usage : %position_diffs =%{&get_each_posi_diff_hash(\@seq_position1, \@seq_position2)};
Version : 1.0
Warning : split and join char is ',';
Argument : %{&get_posi_rates_hash_out(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment.
Output >>
seq1_seq2 1110...222...2222
seq2_seq3 1111....10...1111
seq1_seq3 1111....0000.0000
Returns : \%final_posi_diffs;
Usage : %rate_hash = %{&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.0
Warning : split and join char is ','; (space)
Argument : %{&get_posi_rates_hash_out(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment.
Output >> something like below but, without gaps, so final one is;
seq1_seq2 1110...222...2222 seq1_seq2 11102222222
seq2_seq3 1111....10...1111 -> seq2_seq3 1111101111
seq1_seq3 1111....0000.0000 seq1_seq3 111100000000
Returns : \%final_posi_diffs_compact; Compare with 'get_posi_rates_hash_out_jp'
Usage : %rate_hash = %{&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.0
Warning : split and join char is ','; (space)
Argument : %{&get_posi_rates_hash_out_jp(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment. get_posi_rates_hash_out_jp
results in jp template sequence, while get_posi_rates_hash_out_msf does
in msf template sequence.
Output >>
seq1_seq2 1110...222...2222 <--- the alignment template is JPO's
seq2_seq3 1111....10...1111 (ie structural)
seq1_seq3 1111....0000.0000
Returns : \%final_posi_diffs;
Usage : %rate_hash = %{&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.0
Warning : split and join char is ','; (space)
Argument : %{&get_posi_rates_hash_out(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Output >>
seq1_seq2 1110...222...2222
seq2_seq3 1111....10...1111
seq1_seq3 1111....0000.0000
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment.
Returns : \%final_posi_diffs;
Usage : %rate_hash = %{&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.0
Warning : split and join char is ','; (space)
Argument : (\%hash1, %hash2, \%hash3, ....)
Example : Input hash> Output hash>
( '1-2', '12,.,1,2,3,4', ( '1-2', '9,.,0,1,2,3',
'2-3', '12,.,1,5,3,4', '2-3', '9,.,0,4,2,3',
'4-3', '12,3,1,2,3,4', '3-1', '9,3,.,.,2,3',
'3-1', '12,4,.,.,3,4' ); '4-3', '9,2,0,1,2,3' );
Function : with given numbers in hashes, it makes a scale of 0-9 and puts
all the elements in the scale. Also returns the average of the numbers.
Returns : (\%norm_hash1, \%norm_hash2, \%norm_hash3,.... )
Usage : %output=%{&normalize_numbers(\%hash1)};
originally made to normalize the result of get_posi_rates_hash_out
in 'scan_compos_and_seqid.pl'
Version : 1.0
Argument : One ref. for hash, one ref. for a scalar.
Example : input hash: ( seq1, 'ABCDEFG.HIK', (2 or more sequences accepted)
seq2, 'DFD..ASDFAFS',
seq3, 'DDDDD..ASDFAFS' );
input winsize : 5;
output hash; (seq1seq2, 1,2,2,2,1,1,2,2); <-- joined by ',';
The numbers are ratios(compos/seqid) with given
window size.
Function : scans input sequences(arg1) in a given(arg2) window size and gets
each composition and sequence identity rate of the window.
Returns : a reference of a hash.
Usage : %out1 =%{&scan_windows_and_get_compos_seqid_rate(\%input, \$window_size)};
Warning : when $seqid is zero the rate becomes $compos_id/10 !!!
Argument : (\@input, \$window_size); @input => ('ABCDEFG.HIK', 'DFD..ASDFAFS', 'ASDFASDFASAS');
Input ar => ( 'ABCDEFG.HIK'
'DFD..ASDFAFS'
'ASDFASDFASAS' ) as the name of @sequences.
Function : actual working part of scan_windows_and_get_compos_seqid_rate
Returns : \@ratio_array, \$ratio_whole_seq
Usage : @out_rate = @{&get_windows_cs_rate_array(\@seq, \$win_size)};
Version : 1.0
Argument : one or more refs. for scalars.
Example : (*out1, *out2) =&read_any_seq_files(\$input1, \$input2);
: (@out_ref_array)=@{&read_any_seq_files(\$input1, \$input2)};
: (%one_hash_out) =%{&read_any_seq_files(\$input1)};
Function : Tries to find the given input file whether or not it is a full pathname and
with or without an extension. If not in pwd, it searches the dirs exhaustively.
Keywords : open_any_seq_files,
Returns : 1 ref. for a HASH of sequence ONLY if there was one hash input
1 array (not REF.) of references for multiple hashes.
Usage : %out_seq=%{&read_any_seq_files(\$input_file_name)};
Version : 1.1
Function : given an array and a start and end length,
return an array of regular expressions, where each element of the original
array has been expanded to a set of regular expressions that match the
original exactly num times, for num between the start and end length
Returns : a ref. of an array for
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
Function : remove all but one string of each set of rotations
(reverse of rotated_seq )
Returns : a ref. for
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall
Function : given a string, return all the rotations of that string
e.g. given 'abcd', return ('abcd','bcda','cdab','dabc')
Returns : a ref. for reverse complement
Usage : @out_array=@{&rotate_seq($string)};
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall ##### RevCom
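The rotation entries above can be illustrated with a short sketch. It is not
Tisdall's original code: rotations_sketch and unique_rotations_sketch are
hypothetical names, and choosing the alphabetically smallest rotation as the
representative of each set is an assumption made here.

```perl
use strict;
use warnings;

# All rotations of a string, starting with the string itself.
sub rotations_sketch {
    my ($string) = @_;
    my @rot = map { substr($string, $_) . substr($string, 0, $_) }
              0 .. length($string) - 1;
    return \@rot;
}

# Keep only the first string seen from each set of rotations. Two strings are
# in the same set when their smallest rotations are equal.
sub unique_rotations_sketch {
    my (@strings) = @_;
    my (%seen, @kept);
    for my $s (@strings) {
        my ($canonical) = sort @{ rotations_sketch($s) };
        $canonical = '' unless defined $canonical;
        push @kept, $s unless $seen{$canonical}++;
    }
    return \@kept;
}

my @rotations = @{ rotations_sketch('abcd') };
my @unique    = @{ unique_rotations_sketch(qw(abcd bcda xy yx abdc)) };
print "@rotations\n@unique\n";
```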
Argument : a scalar for RNA sequence data
Function : translate RNA seq to protein seq.
Keywords : rna2protein, rna_2_protein, RNA2protein, translate_rna
dna2protein, convert_RNA_to_protein, RNA_2_PROTEIN, RNA_2_protein
Returns : a ref. of an array for protein translation
Version : 1.1
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall
Argument : a scalar for DNA sequence data
Function : translate DNA or RNA seq to protein seq.
Keywords : dna2protein, dna_2_protein, DNA2protein, translate_dna
dna2protein, convert_DNA_to_protein, translate_nucleic_acid
rna2protein, rna_2_protein, RNA2protein, translate_rna
dna2protein, convert_RNA_to_protein
Returns : a ref. of an array for protein translation
Version : 1.2
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall
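A minimal translation sketch using the standard genetic code. It is not the
Tisdall routine documented above: translate_sketch is a hypothetical name, it
reads a single frame from the first base and returns a plain string (the
library entry returns an array ref), and unknown codons become 'X', which is an
assumption. RNA input is handled by mapping U to T first.

```perl
use strict;
use warnings;

# Build the 64-codon table from the conventional T, C, A, G ordering.
my @bases = qw(T C A G);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $index++, 1);
        }
    }
}

# Translate DNA or RNA in frame 1; '*' marks a stop codon.
sub translate_sketch {
    my ($nucleic) = @_;
    my $seq = uc $nucleic;
    $seq =~ tr/U/T/;
    my $protein = '';
    for (my $p = 0; $p + 3 <= length $seq; $p += 3) {
        my $codon = substr($seq, $p, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

print translate_sketch('ATGGCCTAA'), "\n";
print translate_sketch('augaaaugg'), "\n";
```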
Returns : a ref. of an array for GCG-Genbank formatted sequence record
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall
Argument : two scalars.
Function : (This is DNA seq handling routine!)
Returns : a ref. of an array for Genbank formatted sequence record
Usage : @out = @{&write_genbank_file($sequ, $header)};
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
stolen from Tisdall
Argument : \%input
Example : @out = (
$out[0] => ">name",
$out[1] => "ABCDEABCDEBCDEABCDEABCDEABCDEABCDEBCDEABCDE",
$out[2] => "TTTTTTTTDEBCDEABCDEABCDEABCDEABCDEBCDEABCDE",
$out[3] => "ABCDEABCDEBCDEABCDEABCDEABCDEABCDEBCDEABCDE",
);
Function : take Single sequence and produce single output array of fasta
Returns : ref. for an array of FASTA formatted sequence record
Usage : @output = @{&put_fasta($sequence, $name)};
Version : 1.0
Warning : Copyright (C) 1993-1994 by James Tisdall
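The FASTA layout in the Example can be sketched like this. put_fasta_sketch is a
hypothetical name and the default line width of 60 is an assumption; the entry
above does not state the width the library uses.

```perl
use strict;
use warnings;

# Returns a ref to an array: the ">name" header followed by the sequence cut
# into lines of at most $width characters.
sub put_fasta_sketch {
    my ($sequence, $name, $width) = @_;
    $width ||= 60;    # assumed default
    my @out = (">$name");
    push @out, $sequence =~ /(.{1,$width})/g;
    return \@out;
}

my @fasta = @{ put_fasta_sketch('ABCDEFGHIJ', 'name', 4) };
print "$_\n" for @fasta;
```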
Argument : (\$input_file_name) while $input_file_name can be 'xxx.xxx', or '/xxx/xxx/xxx/xxy.yyy'
or just directory name like 'aat' for /nfs/ind4/ccpe1/people/A Biomatic /jpo/align/aat
then, it tries to find a file with stored seq file extensions like msf, jp, pir etc
to make aat.msf, aat.jp, aat.pir ... and searches for these files.
Example : $found_file=${&find_seq_files(\$input_file_name)};
Function : (similar to find.pl) used in 'read_any_seq_file.pl'
seeks given test file in pwd, specified dir, default path etc.
If not found yet, it looks at all the subdirectories of path and pwd.
PATH environment dirs, then returns full path file name.
Keywords : find_anyj_seq_files, find any seq files, find seq files
Returns : return( \$final );
Usage : $found_file = ${&find_seq_files(\$input_file_name)};
Version : 1.0
Argument : gets a ref. of a scalar (dir name) and returns nothing(void).
Function : open dir and process all files in the dir if you wish,
and then go in any other sub
if any file(dir) is linked, it skips that file.
Usage :
$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.0
Warning : the final var $found_from_search_files_in_subdir mustn't be 'my'ed.
Argument : one ref. for SCALAR
Function : seeks text file in pwd. If not found it looks at
PATH environment dirs
Returns : one ref. for SCALAR of a full path filename.
Usage : $found_file=${&find_seq_file_old(\$input_file_name)};
Version : 1.0
Warning : << This is READABLE old version of find_seq_file
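The lookup order described above (pwd first, then the PATH directories) can be
sketched as below. This is an illustration only: find_file_sketch is a
hypothetical name, it returns a ref to an empty string when nothing is found
(an assumption), and the demo file name is made up. The demo writes a temporary
file and points PATH at its directory.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Look in the current directory, then in each PATH directory, and return a
# ref. to the first matching file name.
sub find_file_sketch {
    my ($name_ref) = @_;
    for my $dir (File::Spec->curdir, File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $$name_ref);
        return \$candidate if -f $candidate;
    }
    my $not_found = '';
    return \$not_found;
}

# Demo with a throwaway directory placed on PATH.
my $dir    = tempdir(CLEANUP => 1);
my $target = File::Spec->catfile($dir, 'demo_b_pl_sketch.msf');
open(my $fh, '>', $target) or die "cannot write $target: $!";
close $fh;
local $ENV{PATH} = $dir;
my $found = ${ find_file_sketch(\'demo_b_pl_sketch.msf') };
print "found: $found\n";
```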
Argument : a ref. for scalar of "jp file name"
Example : jp file == seq1 ABDSF--DSFSDFS <- true sequence
seq2 T--kdf-GAGGGASF (aligned)
sst files ==> 'seq1.sst', 'seq2.sst' (in the same dir)
original sst format: seq1 hHHHHHttEEEE <-- No gaps!
seq2 hHHHHHHEEhh
After this sub ==>
(final out hash = ( seq1 hHHHH--HttEEEE <-- inserted
seq2 h--HHH-HHHEEEhh ) gaps
Function : gets the name of a file(jp file) with its absolute dir path
reads the sequence names in the jp file and looks up all
the sst files in the same directory. Puts sst sequences
in a hash with keys of sequence names.
Returns : a ref. for a hash
Usage : %out_sst_hash =%{&open_sst_files_with_gap(\$jp_file_dir_and_name)};
Version : 1.0
Warning : $jp_file_dir_and_name should be absolute dir and file name
>> This gets JP file not SST file as input !!!!
Argument : 2 hash references.
Returns : one hash reference.
Usage : %out=%{&put_gaps_in_hash(\%hash_with_gap, \%hash_sans_gap)};
%hash1=('1ctx', '111111111111111', <-- hash input without gaps
'2ctx', '2222222222222222',
'3ctx', '3333333333');
%hash2=('1ctx', 'AAA--AAAAAAAAAAAA-', <-- hash input with template gaps
'2ctx', 'BBBBBBBBBBBB-BBBB',
'3ctx', 'CCCCCC----CCCC');
>> resulting out hash;
%hash3=('1ctx', '111--111111111111-',
'2ctx', '222222222222-2222',
'3ctx', '333333----3333');
Version : 1.0
Warning : The keys for hashes should be the same and the two sequences
should be identical.
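A minimal sketch of the gap insertion shown in the example, not the library's
code. put_gaps_sketch is a hypothetical name and takes the ungapped hash first
and the gapped template second, which is an assumption about argument order;
'-' and '.' are treated as gaps.

```perl
use strict;
use warnings;

# Walk the template: copy each gap, otherwise consume the next ungapped char.
sub put_gaps_sketch {
    my ($ungapped_ref, $template_ref) = @_;
    my %out;
    for my $name (keys %$template_ref) {
        next unless exists $ungapped_ref->{$name};
        my @residues = split //, $ungapped_ref->{$name};
        my $gapped = '';
        for my $char (split //, $template_ref->{$name}) {
            if ($char eq '-' or $char eq '.') {
                $gapped .= $char;
            } elsif (@residues) {
                $gapped .= shift @residues;
            }
        }
        $out{$name} = $gapped;
    }
    return \%out;
}

my %no_gap   = ('3ctx' => '3' x 10, 'demo' => '12345');
my %with_gap = ('3ctx' => ('C' x 6) . ('-' x 4) . ('C' x 4),
                'demo' => 'AB--CD-E');
my %gapped = %{ put_gaps_sketch(\%no_gap, \%with_gap) };
print "$_ $gapped{$_}\n" for sort keys %gapped;
```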
Argument : 1 ref. of array eg)=( ABCDE--EF--GH ) while '-' is for gap.
Example : for a string '--iu--sdf-j--', it will output -2 -1 2 3 7 9 10
Function : gets gap positions of seq. and stores in an array
Keywords : get_gap_positions_in_seq, get_seq_gap_positions get_gap_positions_in_array
Options : p for all positive gaps numbering. No negatives for '---STRING--'
Returns : 1 ref. of array eg)=(2,3,7,8,10,100,122);
Usage : @gap_pos=@{&get_gap_positions(\@string1)}; <- ('A','C','D','E')
@gap_pos=@{&get_gap_positions(\$string1)}; <- ( ACDE )
Version : 1.4
Warning : uses References.
Argument : one ref. of hash
Example : @output=($ref1, $ref2, ....$refn)
each $ref is the reference of a hash of a pair of sequence
>> %pair1 = %{$ref1}; %pair2 = %{$ref2}; %pair3 = %{$ref3};
%pair1 is like; %pair1 is like; %pair3 is like;
seq1 ABCDEFAD seq1 ABCDEFAD seq2 SDFSFSDF
seq2 SDFSFSDF seq3 SDFSFSDF seq3 SDFSFSDF
Function : returns all the possible pairs of a set of sequences in
an array of references;
Returns : one ref. of array for references for hashes.
Usage : @output =@{&make_pairs_from_hash(\%input_sequence_hash)};
Input example
%input = seq1 ABCDEFAD
seq2 SDFSFSDF
seq3 SDFSFSDF
Version : 1.0
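The pairing described above can be sketched as follows. make_pairs_sketch is a
hypothetical name, and sorting the sequence names to get a stable pair order is
an assumption; the library's own ordering is not stated.

```perl
use strict;
use warnings;

# Returns a ref. to an array of hash refs, one hash per pair of sequences.
sub make_pairs_sketch {
    my ($seq_ref) = @_;
    my @names = sort keys %$seq_ref;
    my @pairs;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            push @pairs, +{
                $names[$i] => $seq_ref->{ $names[$i] },
                $names[$j] => $seq_ref->{ $names[$j] },
            };
        }
    }
    return \@pairs;
}

my %input = (seq1 => 'ABCDEFAD', seq2 => 'SDFSFSDF', seq3 => 'SDFSFSDF');
my @pairs = @{ make_pairs_sketch(\%input) };
print join('+', sort keys %$_), "\n" for @pairs;
```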
Argument : takes 2 refs. of scalars for dir name (protein group name)
and threshold for rms
Example : (0.284994272623139 0.166781214203895)
The first figure is for error rate without rms consideration
The second is for after applying threshold.
Returns : two refs. of scalar values (rates)
Usage : just type get_posi_shift_rms_whole.pl
Version : 1.0
Function : gets a ref(s) for hash and prints the content in lines of 60 char
Returns : Nothing, i.e. STDOUT
Usage : &write_jp(\%input_hash1,\%input_hash2, \%input_hash3.... );
Version : 1.0
Warning : derived from print_in_block
Argument : two references, one for hash one for scalar for threshold
Example : A hash => name1 10012924729874924792742749748374297
name2 10012924729874924792710012924729874
A threshold => 4
!! if numbers are smaller than 4, they become 1 (or true).
Outputhash => name1 11111011011111011111011011110101111
name2 11111011010001011001011010010101100
($ref1, $ref2)=&convert_num_to_0_or_1_hash(\%hash, \%hash, \$threshold);
above is the example when with more than 2 input hashes.
Function : changes all the numbers into 0 or 1 according to threshold given.
convert_num_0_or_1_hash converts threshold and bigger nums. to
'0' while convert_num_0_or_1_hash_opposite converts to '1'.
Usage : with a variable for threshold ->
%out = %{&convert_num_to_0_or_1_hash(\%input_hash, \$threshold, \%input_hash2..)};
Version : 1.0
Warning : Threshold value is set to 0 as well as all values smaller than that.
Argument : two references, one for hash one for scalar for threshold
Example : A hash => name1 10012924729874924792742749748374297
name2 10012924729874924792710012924729874
A threshold => 4
!! if numbers are smaller than 4, they become 1 (or true).
Outputhash => name1 11111011011111011111011011110101111
name2 11111011010001011001011010010101100
($ref1, $ref2)=&convert_num_to_0_or_1_hash(\%hash, \%hash, \$threshold);
above is the example when with more than 2 input hashes.
Function : changes all the numbers into 0 or 1 according to threshold given.
convert_num_0_or_1_hash converts threshold and bigger nums. to
'0' while convert_num_0_or_1_hash_opposite converts to '1'.
Usage : with a variable for threshold ->
%out = %{&convert_num_0_or_1_hash_opposite(\%input_hash, \$threshold)};
Version : 1.0
Warning : Threshold value is set to 0 as well as all values smaller than that.
Argument : one reference of HASH.
Example : A hash => name1 ABCDSSFDSF..ASDFSD.....ADFASDF...AA
name2 ASDFSD.....ADFBCDSSFDSF..ASASDF...A
Outputhash => name1 00000000001100000011111000000011100
name2 00000011111000000000000110000001110
Function : changes all the chars into 1, gaps are to 0
Keywords : convert_char, translate_char, convert_char_to_digit,
convert_char_to_number
Returns : A ref. of a hash
Usage : with a variable for threshold ->
%out = %{&convert_char_0_or_1_hash(\%input_hash)};
Version : 1.2
Argument : one reference of HASH.
Example : A hash => name1 ABCDSSFDSF..ASDFSD.....ADFASDF...AA
name2 ASDFSD.....ADFBCDSSFDSF..ASASDF...A
Outputhash => name1 00000000001100000011111000000011100
name2 00000011111000000000000110000001110
Function : changes all the chars into 1, gaps are to 0
Keywords : convert_char, translate_char, convert_char_to_digit,
convert_char_to_number, digitize_sequence, digitize_char
digitize_hash
Returns : A ref. of a hash
Usage : with a variable for threshold ->
%out = %{&digitize_char(\%input_hash)};
Version : 1.1
Argument : Takes two ref. for hash
Function : gets two ref. of hashes and calculates the position diffs.
Returns : one ref. for an array of differences of input arrays. array context.
---Example input (a hash with numbers); The values are differences after comparison
with structural and sequential alignments.
%diffs =('seq1', '112342431111
'seq2', '12222...09011.1122',
'seq3', '13222...00011.1122',
'seq4', '12262...00011.112.');
%rms_corrected_0_or_1 => seq1_seq2 0111011111011101011110100101101010011
seq1_seq3 01111.....111110111111111111100001011
example output;
seq3_seq4 01040...00000.000.
seq1_seq2 01012...1810...122
seq1_seq3 02012...1110...122
seq1_seq4 01032...1110...12.
seq2_seq3 01000...09000.0000
seq2_seq4 00040...09000.000.
Usage : %position_diffs =%{&get_posi_diff_hash(\%diffs, \%rms_corrected)};
Version : 1.0
Warning : split and join char is ",";
Argument : takes 4 hash REFERENCES for (one seq. and one struc. alignment(2nd arg)
Returns : two refs. for scalar values of shift rate of positions for proteins.
first scalar is rate without correcting rms deviation
second scalar is rate with correcting rms deviation
>> example of xx
1cdg APDTSVSNKQ NFSTDVIYQI FTDRFSDGNP ANNPTGAAFD GTC.TNLRLY
2aaa ......LSAA SWRTQSIYFL LTDRFGR... ....TDNSTT ATCNTGNEIY
>> example of xx
2aaa ------lsaasWrtqSIYFLLTDRFGrtdns-------ttatCntgneiy
1cdg apdtsvsnkqnFSTDVIYQIFTDRFsdgnpannptgaafdgtCtn-lrly
>> example of xx
1cdg APDTSVSNKQ NFSTDVIYQI FTDRFSDGNP ANNPTGAAFD GTCTN-LRLY
2aaa ------LSAA SWRTQSIYFL LTDRFGRTDN S-------TT ATCNTGNEIY
1cdg_2aaa ------7774 2221210000 0000000148 9-------99 41114-4000
1cdg_6taa ------8674 2232220000 0000011059 9-------99 52114-3000
Usage : ($rate1_ref,$rate2_ref) =${&get_posi_shift_rms_hash(\%msf_hash, \%jp_hash,
\%rms_file_hash, \$threshold)};
Version : 1.0
Argument : up to 3 args. 1st one is for the ref. of an array. 2nd for min
element no. 3rd for max element no. 2nd and 3rd are optional.
Returns : a ref. of a hash.
Usage : %final_out_hash=%{&steve_permute_array(\@list, \2, \4)};
Above is for pairs, 3 seqs, and 4 seqs.
Version : 1.0
Argument : gets a ref. of a scalar (dir name) and returns nothing(void).
Example : as in my 'indexing.pl' for perl file indexer.
Function : open dir and process all files in the dir if you wish,
and then go in any other sub
if any file(dir) is linked, it skips that file.
Keywords : open_dir_and_go_in_and_do_something,
go in there do something, get into subdir and do something.
go_in_subdir_and_do_something, recursive execution
Usage : &opendir_and_go_in_and_do_something(\$input_dir);
$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.1
Warning : Seems to work fine., !! Change the name of this sub to shorter one
!! for your own purpose.
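The traversal described above (process the files in a directory, descend into
subdirectories, skip anything that is a link) can be sketched as below. This is
not the library's code: walk_dir_sketch is a hypothetical name and the callback
argument is an assumption standing in for "do something". The demo builds a
small temporary tree.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Call $callback on every plain file under the directory, depth first.
sub walk_dir_sketch {
    my ($dir_ref, $callback) = @_;
    opendir(my $dh, $$dir_ref) or return;
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    for my $entry (sort @entries) {
        my $path = File::Spec->catfile($$dir_ref, $entry);
        next if -l $path;    # linked files and dirs are skipped
        if (-d $path) {
            walk_dir_sketch(\$path, $callback);
        } elsif (-f $path) {
            $callback->($path);
        }
    }
    return;
}

# Demo tree: <top>/a.pl and <top>/sub/b.pl
my $top = tempdir(CLEANUP => 1);
mkdir(File::Spec->catdir($top, 'sub')) or die "mkdir failed: $!";
for my $file (File::Spec->catfile($top, 'a.pl'),
              File::Spec->catfile($top, 'sub', 'b.pl')) {
    open(my $fh, '>', $file) or die "cannot write $file: $!";
    close $fh;
}
my @seen;
walk_dir_sketch(\$top, sub { push @seen, $_[0] });
print "$_\n" for @seen;
```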
Argument : gets a ref. of a scalar (dir name) and returns nothing(void).
Example : as in my 'indexing.pl' for perl file indexer.
Function : open dir and process all files in the dir if you wish,
and then go in any other sub
if any file(dir) is linked, it skips that file.
Usage : &opendir_and_go_in_and_do_something(\$input_dir);
$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.0
Warning : Seems to work fine., !! Change the name of this sub to shorter one
!! for your own purpose.
Argument : Two references of hashes.
Returns : one reference of hash. (eg, 0=>1000, 1=>888, 2=>83, ...
0,1,2... are position shift types
1000, 888, 83... are occurrences in
the comparison between str. and seq.
alignments.)
Usage : for single protein group
Version : 1.0
Argument : one ref. of hash (seq1 alsdfjlsj
seq2 asldfjsld
seq3 owiurouou);
Function : gets the numbers of occurrences for 1, 2, 3 ... position shifts.
If hash is given, it only looks at the values.
If multiple string, array, hash or combinations of these
are given, it will add up to one single result
Keywords : composition of chars, composition table making,
make_composition, make composition table
occurances_of_char, get_char_occurances, occurances
get_percentage_occurances_of_char, percentage_occurances_of_char
Options : 'p' for percentage output of the char among others
'n' for NO name option when HASH input is given
Returns : one ref. of hash (a =>5, b=>6, c=>4,,,,,)
Usage : %occurances_shft_type=%{&get_occurances_of_char(\%final_posi_diffs)};
%char_occur=%{&get_occurances_of_char(\@ref_array_of_chars)};
%char_occur=%{&get_occurances_of_char(\$ref_string_of_chars)};
%char_occur=%{&get_occurances_of_char($string_of_chars)};
Version : 1.3
Argument : one ref. of hash (seq1 alsdfjlsj
seq2 asldfjsld
seq3 owiurouou);
Function : gets the numbers of occurrences for 1, 2, 3 ... position shifts.
Keywords : composition of chars, composition table making, make composition table
make_composition_table, get_composition, get_amino_acid_composition
protein_composition, make_aa_composition_tablem, aa_composition
Returns : one ref. of hash (a =>5, b=>6, c=>4,,,,,)
Usage : %occurances=%{&make_compos_table(\%key_and_value_for_seq)};
Version : 1.2
Argument : one ref. of hash (seq1 alsdfjlsj
seq2 asldfjsld
seq3 owiurouou);
Function : gets ratio of the numbers of occurrences for any chars.
Keywords : composition table, composition of chars, composition table making,
make composition table, make_composition_table
Returns : one ref. of hash (a =>0.05, b=>0.06, c=>0.04,,,,,)
Usage : %occurances=%{&make_compos_ratio_table(\%final_posi_diffs)};
Version : 1.0
Warning : This pools all the sequences, so it does not give a separate composition
for each seq. if you put in more than one seq.
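A minimal sketch of a pooled composition table, with counts and ratios taken
over all sequences together as the Warning says. compos_ratio_sketch is a
hypothetical name, and skipping the gap characters '-' and '.' is an assumption.

```perl
use strict;
use warnings;

# Returns (\%count, \%ratio) pooled over every sequence in the hash.
sub compos_ratio_sketch {
    my ($seq_ref) = @_;
    my (%count, $total);
    for my $seq (values %$seq_ref) {
        for my $char (split //, $seq) {
            next if $char eq '-' or $char eq '.';    # assumed: gaps not counted
            $count{$char}++;
            $total++;
        }
    }
    my %ratio = map { $_ => $count{$_} / $total } keys %count;
    return (\%count, \%ratio);
}

my ($count_ref, $ratio_ref) = compos_ratio_sketch({ seq1 => 'aab', seq2 => 'b.c-' });
printf "%s => %d (%.2f)\n", $_, $count_ref->{$_}, $ratio_ref->{$_}
    for sort keys %$count_ref;
```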
Argument : one or more ref. of hash (seq1 alsdfjlsj
seq2 asldfjsld
seq3 owiurouou);
Function : gets ratio of the numbers of occurrences for any chars.
Keywords : composition table, composition of chars, composition table making,
make composition table, make_composition_table
aa_composition_ratio, composition_ratio, protein_composition,
get_composition_ratio, get_aa_composition_ratio
Returns : one ref. of hash ('seq_name', { a =>0.05, b=>0.06, c=>0.04,,,,, } )
Usage : %rate=%{&make_compos_ratio_table(\%hash1, \%hash2, ,,,)};
Version : 1.3
Warning : This produces each composition ratio table for each seq
Argument : %{&get_position_shift_rate(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Example : my(%error_rate)=%{&get_position_shift_rate(\%input, \%input2)};
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment. Takes two file names of seq.
Output >>
seq1_seq2 1110...222...2222
seq2_seq3 1111....10...1111
seq1_seq3 1111....0000.0000
Options : 'ss' for secondary structure regions(Helix and Beta region only
calculation for error rate). There is specialized sub called
get_segment_shift_rate for sec. str. only handling.
$ss_opt becomes ss by ss, SS, -ss, -SS # for secondary structure only
$H = 'H' by -H or -h or H # to retrieve only H segment
$S becomes 'S' by -S or S # to retrieve only S segment
$E becomes 'E' by -E or E # to retrieve only E segment
$T becomes 'T' by -T or -t or T or t # to retrieve only T segment
$I becomes 'I' by -I or I # to retrieve only I segment
$G becomes 'G' by -G or -g or G or g # to retrieve only G segment
$B becomes 'B' by -B or -b or B or b # to retrieve only B segment
$HELP becomes 1 by -help # for showing help
$simplify becomes 1 by -p or P or -P, p
$simplify becomes 1 by -simplify or simplify, Simplify SIMPLIFY
$comm_col becomes 'C' by -C or C or common
$LIMIT becomes L by -L, L # to limit the error rate to 9 .
Returns : \%final_posi_diffs;
Usage : %rate_hash = %{&get_position_shift_rate(\%hash_msf, \%hash_jp)};
Version : 1.5
Warning : split and join char is ','; (space)
Argument : %{&get_posi_rates_hash_out(\%msfo_file, \%jpo_file)};
Whatever the names, it takes one TRUE structural and one ALIGNED hash.
Function : This is to get position specific error rate for line display rather than
actual final error rate for the alignment.
Output >>
seq1_seq2 1110...222...2222
seq2_seq3 1111....10...1111
seq1_seq3 1111....0000.0000
Returns : \%final_posi_diffs;
Usage : %rate_hash = %{&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.0
Warning : split and join char is ','; (space)
Argument : Takes a ref. for hash which have positions of residues of sequences.
Function : gets a ref. of a hash and calculates the position diffs.
Returns : one ref. for an array of differences of input arrays. array context.
---Example input (a hash with sequences); The values are differences after comparison
with structural and sequential alignments.
%diffs =('seq1', '112342431111
'seq2', '12222...09011.1122',
'seq3', '13222...00011.1122',
'seq4', '12262...00011.112.');
example output;
seq3_seq4 01040...00000.000.
seq1_seq2 01012...1810...122
seq1_seq3 02012...1110...122
seq1_seq4 01032...1110...12.
seq2_seq3 01000...09000.0000
seq2_seq4 00040...09000.000.
Usage : %position_diffs =%{&get_posi_diff_hash(\@seq_position1, \@seq_position2)};
Version : 1.0
Warning : split and join char is ','; # used in 'get_posi_shift_hash'
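The pairwise differences in the example output can be reproduced with the
sketch below. It is an illustration, not the library's code: posi_diff_sketch is
a hypothetical name, it takes one hash ref of digit strings (an assumption, the
Usage line passes array refs), treats every char as one position, and writes
'.' wherever either sequence has a non-digit.

```perl
use strict;
use warnings;

# For every pair of sequences, the absolute difference at each position.
sub posi_diff_sketch {
    my ($pos_ref) = @_;
    my @names = sort keys %$pos_ref;
    my %diff;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            my @first  = split //, $pos_ref->{ $names[$i] };
            my @second = split //, $pos_ref->{ $names[$j] };
            my $last = $#first < $#second ? $#first : $#second;
            my $line = '';
            for my $p (0 .. $last) {
                if ($first[$p] =~ /^\d$/ && $second[$p] =~ /^\d$/) {
                    $line .= abs($first[$p] - $second[$p]);
                } else {
                    $line .= '.';
                }
            }
            $diff{ join('_', $names[$i], $names[$j]) } = $line;
        }
    }
    return \%diff;
}

my %positions = (seq2 => '12222...09011.1122',
                 seq3 => '13222...00011.1122',
                 seq4 => '12262...00011.112.');
my %position_diffs = %{ posi_diff_sketch(\%positions) };
print "$_ $position_diffs{$_}\n" for sort keys %position_diffs;
```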
Argument : takes two hash REFERENCES (one for seq. alignment and one for struc. alignment (2nd arg))
Returns : One scalar value of shift rate of position for proteins.
Usage : $rate_final = ${&get_posi_shift_hash(\%hash_msf, \%hash_jp)};
Version : 1.1
Warning : split and join char is ','; (space)
Function : gets a ref(s) for hash and prints the content in lines of 60 char
Returns : Nothing, STDOUT
Usage : &print_seq_in_block (\%input_hash1,\%input_hash2, \%input_hash3.... );
Version : 1.0
Warning : derived from print_in_block
Argument : (\%input1, \%input2, \%input3.....);
Function : fills the ending gaps or space of sequences (shorter ones)
Returns : (\%hash1,..... )
Usage : (*out, *out2, *out3)=&fill_ending_space(\%input1, \%input2, \%input3);
&print_seq_in_block(\%out,\%out2,\%out3); <-- if you want printout.
Version : 1.0
Argument : one or more refs. for hash
if there is more than one hash input, it produces output like this:
Name1 THIS.IS.from.hash.one
Name2 This
Name1 THIS
Name2 This.is.from.hash.two
Function : gets a ref(s) for hash (single key and value)
and prints the content in lines of 60 char
Returns : Nothing, STDOUT
Usage : &print_seq_in_block_old (\%input_hash1,\%input_hash2, \%input_hash3.... );
Version : 1.0
Warning : This is more or less for debugging. Use print_seq_in_block
Argument : one or more refs. for array
if there is more than one array input, it produces output like this:
Example out)
THIS.IS.from.array.one
This.is.from.array.two
THIS.IS.from.array.one
This.is.from.array.two
Function : gets a ref(s) for array and prints the content in lines of 60 char
Returns : Nothing, STDOUT
Usage : &print_in_block (\@input_array,\@input_array2, \@input_array3.... );
Version : 1.0
Warning : This is more or less for debugging. Use print_seq_in_block
Argument : Takes two ref. for arrays which have positions of residues.
Example : @compacted_posi_dif =(1 ,2, 1, 1, '.' ,2, 1, 1, '.');
@compacted_posi_dif2=(4 ,2, 1, 1, ,2, 1, '.' ,3, 1);
output ==> ( 3 0 0 0 . 1 . 2 .) (it ignores positions which have non-digits.)
output ==> (-3 0 0 0 . 1 .-2 .) when abs is not used.
Returns : one ref. for an @array of differences of input arrays. array context.
Usage : @position_diffs =&get_posi_diff(\@seq_position1,\@seq_position2);
Version : 1.4
Argument : Takes two ref. for arrays which have positions of residues.
Example : @compacted_posi_dif =(1 ,2, 1, 1, '.' ,2, 1, 1, '.');
@compacted_posi_dif2=(4 ,2, 1, 1, ,2, 1, '.' ,3, 1);
output ==> ( 3 0 0 0 . 1 . 2 .) (it ignores positions which have non-digits.)
output ==> (-3 0 0 0 . 1 .-2 .) when abs is not used.
Returns : one ref. for an @array of differences of input arrays. array context.
Usage : @position_diffs =&get_posi_diff_abs(\@seq_position1,\@seq_position2);
Version : 1.0
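A minimal sketch of the element-wise position difference described in the two entries above. The name posi_diff_sketch and its third (abs) argument are hypothetical, and this is not the library's own code; it only assumes that the difference is taken as first minus second and that any position holding a non-digit gives '.'.

```perl
use strict;
use warnings;

# Hypothetical helper, not the library's get_posi_diff / get_posi_diff_abs.
# Takes two array refs of positions and returns a ref to an array of
# element-wise differences (first minus second). Any column where either
# value is missing or is not a plain integer yields '.'.
# A true third argument asks for absolute values.
sub posi_diff_sketch {
    my ($first, $second, $use_abs) = @_;
    my $last = $#$first > $#$second ? $#$first : $#$second;
    my @diffs;
    for my $i (0 .. $last) {
        my ($x, $y) = ($first->[$i], $second->[$i]);
        if (defined $x && defined $y && $x =~ /^\d+$/ && $y =~ /^\d+$/) {
            my $d = $x - $y;
            push @diffs, $use_abs ? abs($d) : $d;
        } else {
            push @diffs, '.';
        }
    }
    return \@diffs;
}

my @p1 = (1, 2, 1, 1, '.', 2, 1);
my @p2 = (4, 2, 1, 1, '.', 3, '.');
print join(' ', @{ posi_diff_sketch(\@p1, \@p2, 1) }), "\n";   # abs
print join(' ', @{ posi_diff_sketch(\@p1, \@p2, 0) }), "\n";   # signed
```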
Argument : takes two refs for arrays (one for char, the other for digits)
Example : @string_from_struct=('X', 'T', 'A' ,'B' , '.' ,'F', 'G', '.' , 'O' ,'P', '.');
@compacted_posi_dif=(1 ,2, 1, 1, ,2, 1, 1, 1);
Returns : a ref. for an array
Usage : @result =@{&put_position_back_to_str_seq(\@string_from_struct, \@compacted_posi_dif)};
Version : 1.0
Function : calculates the error rate of seq after filtering according to
rms deviation.
Usage : $result=${&get_posi_shift_hash_rm(\%h1, \%h2, \%h3)};
Version : 1.0
Warning : Not complete yet.
Function : reads an xxx.fil file, which shows which regions of sequences
have to be discarded because their RMS deviation is too large.
Returns : a ref. for a hash(associative array).
Usage : %out = %{&open_fil_file(\$input_seq_file)};
Version : 1.0
Warning : !!! not yet complete !!!
Example :
send_mail ( $to, $subject, @lines );
#-# i -- $to = email address
#-# i -- $subject = string to be put in the Subject: line
#-# i -- @lines = lines to be mailed - must not have \n
-- DISCUSSION:
Uses /usr/lib/sendmail to mail a bunch of lines to the email address
specified. The @lines should not have terminating \n characters: they
will be supplied.
-- EXAMPLE:
&P10::mail ( 'schip@lmsc.lockheed.com', 'Test 34', @mylines );
-- END
: Could some one share their knowledge of how to mail a message from
: within a Perl script with a novice Perl user?
Function : mail a bunch of @lines to a user
Version : 1.0
Function : This subroutine returns an alphabetic string of the
length specified by an argument.
Keywords : randomize words, makes random words, scramble_word,
shuffle_words,
Usage : $word = ${&rand_word(7)};
print "sub rand_word gives $word\n";
Version : 1.0
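A minimal sketch of such a random word generator. The name rand_word_sketch is hypothetical, and the choice of lowercase a-z as the alphabet is an assumption; the library's rand_word may pick letters differently.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's rand_word: returns a ref to a
# string of $length random lowercase letters (a-z is an assumption).
sub rand_word_sketch {
    my ($length) = @_;
    my @letters = ('a' .. 'z');
    my $word = join '', map { $letters[ int rand @letters ] } 1 .. $length;
    return \$word;
}

my $word = ${ rand_word_sketch(7) };
print "rand_word_sketch gives $word\n";
```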
Example : $inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
&opendir_and_go($inputdir);
Function : opens a dir, processes all its files if you wish, and then goes into each
sub dir of it, using recursion. Created by A Biomatic.
If any file is linked, it skips that file.
Usage : &opendir_and_go_rand_fasta_and_clustal(\$input_dir); #$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.0
Warning : Seems to work fine.
Example : $inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
&opendir_and_go($inputdir);
Function : opens a dir, processes all its files if you wish, and then goes into each
sub dir of it, using recursion. Created by A Biomatic.
If any file is linked, it skips that file.
Usage : &opendir_and_go_rand_fasta(\$input_dir); #$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.0
Function : gets a ref. of an array of strings and reverses the elems.
Returns : one ref. of mul_array, eg. ('jfkdj', 'kdfjsdj', 'jjjkk')
Usage : @out = @{&rev_sequence_mul_array(\@input_mul_seq_array)};
Version : 1.0
Warning : This reverses sequences!
Function : shuffles the elements of array
Keywords : randomise_array, randomize_array, shuffle_array
Usage : @in=@{&scramble_array(\@in)};
Version : 1.4
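A minimal sketch of array shuffling, using the Fisher-Yates shuffle. The name scramble_array_sketch is hypothetical, and the library's scramble_array may use a different method; this only illustrates the same ref-in, ref-out calling style.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's scramble_array.
# Fisher-Yates shuffle: walk from the last index down and swap each
# element with a randomly chosen element at or before it.
# The input array is copied, so the caller's array is left untouched.
sub scramble_array_sketch {
    my ($in) = @_;
    my @out = @$in;
    for (my $i = $#out; $i > 0; $i--) {
        my $j = int rand($i + 1);
        @out[$i, $j] = @out[$j, $i];
    }
    return \@out;
}

my @in = ('e', 'b', 'c', 'd');
@in = @{ scramble_array_sketch(\@in) };
print "@in\n";
```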
Argument : one ref. of mul_array, eg. ('lsjdfj', 'kdfjsdj', 'jjjkk')
Function : gets a ref. of an array of strings and scrambles the elems.
Keywords : scramble_sequence_mul_array, shuffle_sequence_mul_array
Returns : one ref. of mul_array, eg. ('jfkdj', 'kdfjsdj', 'jjjkk')
Usage : @out = @{&rand_sequence_mul_array(\@input_mul_seq_array)};
Version : 1.1
Warning : This scrambles sequences!!
Argument : one ref. of string, eg ( 'ldkfjlsdjfsdjflj' )
Function : gets a ref. of a string, scrambles the elem.
Returns : one ref. of string,
Usage : $out = ${&rand_sequence_one_string(\$input_seq_string)};
Version : 1.0
Warning : This scrambles sequences!!
Argument : one ref. of array, eg ('e', 'b', 'c', 'd')
Function : gets a ref. of an array, scrambles the elem.
Returns : one ref. of array,
Usage : @out = @{&rand_sequence_one_array(\@input_seq_array)};
Version : 1.0
Warning : This scrambles sequences!!
Argument : 1 200 [-p] [@array_of_array_refs]
1 = num of seq, 200=leng of seq, -p =option, @arr.. = option
You can optionally give amino acid matrices
Example : $out=${&make_random_sequence(@ARGV)}; While @ARGV can be '1 200 -p'
Function : gets one or more numbers for seq length and makes random sequences
It can handle proportional random sequences according to the
amino acid occurrence matrix.
Keywords : scramble_sequence, make_scrambled_sequence, shuffle_sequence
random_sequence, make_random_sequence, generate_random_protein_seq
create_random_sequene create_random_aa_sequence
Options : 'p' for proportional random sequence option
'f' for fasta format output (returns one ref. of HASH)
Returns : one or more scalar references according to the input numbers.
Usage : $protein = ${&make_random_sequence(1, 400)};
Version : 1.4
Argument : (343) or (\$length)
Function : gets one or more numbers for seq length and makes random sequences
Returns : one or more scalar references according to the input numbers.
Usage : $DNA = ${&rand_DNA_seq_generate(400)};
Version : 1.0
Argument : (343) or (\$length)
Function : gets one or more numbers for seq length and makes random sequences
Returns : one or more scalar references according to the input numbers.
Usage : $DNA = ${&rand_RNA_seq_generate(400)};
Version : 1.0
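A minimal sketch covering both the DNA and the RNA generator above. The name rand_nucleotide_seq_sketch and its optional alphabet argument are hypothetical, and uniform base frequencies are an assumption; the library's subs may weight bases differently.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's rand_DNA_seq_generate or
# rand_RNA_seq_generate. Returns a ref to a random sequence of $length
# bases drawn uniformly from $alphabet (default 'ACGT').
sub rand_nucleotide_seq_sketch {
    my ($length, $alphabet) = @_;
    my @bases = split //, ($alphabet || 'ACGT');
    my $seq = join '', map { $bases[ int rand @bases ] } 1 .. $length;
    return \$seq;
}

my $dna = ${ rand_nucleotide_seq_sketch(400) };
my $rna = ${ rand_nucleotide_seq_sketch(400, 'ACGU') };
print length($dna), " bases of DNA, ", length($rna), " bases of RNA\n";
```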
Argument : reference of one array of file names in pwd
Function : finds patterns of text and replaces them in multiple input files
Keywords : replace_txt, change_text,
Returns : nothing
Usage : &replace_text(\@input_array_of_filenames);
Version : 1.4
Warning : This produces a temporary file and renames it...
Argument : one hash reference for sequences.
Function : gets hash of sequence, compares lengths, and outputs the average.
Returns : one ref. for scalar digit.
Usage : $std_devi_of_lengths = &get_av_seq_length(\%hash_ref);
Version : 1.0
Warning : uses a sub &array_average(\@lengths);
Argument : gets one hash reference,
Returns : one scalar digit
Usage : $result = &get_sd_of_length_diff(\%input);
Version : 1.0
Warning : removes all non-char(.-, space....) in the input string
Argument : Two hash references for sequences.
Function : gets ref of hash of sequence, compares lengths, and outputs the average.
Returns : Two scalar digits.
Usage : $get_av_and_sd_seq_length= &get_av_seq_length(\%hash_ref);
Version : 1.0
Warning : uses a sub &array_average(\@lengths);
Argument : one scalar variable input of sequence string.
Returns : the positions of residues after removing gaps(but keeps pos).
used for analysis of shifted positions of seq. comparison.
Usage : @seq_position1 = &get_posi_sans_gaps($string1);
Version : 1.0
Argument : takes two file names for seq. and struc. alignment.
: Assumes the files are in the pwd.
Returns : one ref. for scalar value of shift rate of position for proteins.
Usage : $rate_final = &get_posi_shift_rate("perl.msf", "perl.jp");
Version : 1.0
Warning : sub hash_common was unstable.
Function : read hssp file and put sequences in a hash
Usage : %anyarray = &read_hssp_no_inserts ($any_sequence_file_hssp_form);
Version : 1.0
Warning : It produces incomplete sequences when hssp seqs. have insertions.
Example : %out = %{&open_pdbg_files(@ARGV)};
while @ARGV at prompt was: 'pdb_40.pdbg'
Function : open pdb group files and put scopclass in a hash.
PDB group file format is like this;
>d1bia_1 1.4.3.1.1 (1-63) Biotin repressor, N-terminal domain [Escherichia coli]
>d1baba_ 1.1.1.1.15 Hemoglobin, alpha-chain [human (Homo sapiens)]
>d1cpcb_ 1.1.1.2.1 C-phycocyanin [cyanobacterium (Fremyella diplosiphon)]
>d1fcdc2 1.3.1.3.1 (81-174) Flavocytochrome c sulfide dehydrogenase, FCSD, cytochrome subunit [Purple phototrophic bacterium (Cromatium vinosum)]
This can also return the sizes of sequences rather than seqs.
Keywords : open_pdbg_files, open_pdb_group_files
Options : any digit for the minimum seq length
b for simple style reading (this reads in the name of pdbg file as it is)
Usage : %seq=%{&open_pdbg_files($tim_seq_file, ['1fcdc1'], [s] )};
if you put additional seq name as 1fcdc1 it will
fetch that scopclass only in the database file.
Any digit will be used as minimum seq size to be fetched.
Version : 1.5
Example : %out = %{&open_fasta_files(@ARGV)};
%out2=%{&open_fasta_files('seq.fa', \%index)};
%out3=%{&open_fasta_files('seq.fa', \%range)};
%seq=%{&open_fasta_files($PDB40_FASTA, \@seq_to_fetch)};
while @ARGV at prompt was: 'GMJ.pep MJ0084'
Function : open fasta files and put sequences in a hash
If you pass hash(es) holding sequence names and their seek positions
from the index file, it seeks to those positions in the input FASTA file
to fetch the sequences. This is useful for big FASTA DBs.
If the seq name has ranges like XXXXXX_1-30, it will only
return 1-30 of XXXXXX sequence.
FASTA sequence file format is like this;
> 1st-seq
ABCDEFGHIJKLMOPABCDEFGHIJKLMOPABCDEFGHIJKLMOPABCDEFG
> 2nd.sequ
ABCDEFGHIJKLMOYYUIUUIUIYIKLMOPABCDEFGHIJKLMOPABCDEFG
>owl|P04439|1A03_HUMAN HLA CLASS I HISTOCOMPATIBILITY ANTIGEN, A-3 ALPHA CHAIN PRECURSOR....
MARGDQAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDT
This can also return the sizes of sequences rather than seqs.
This ignores any dup entrynames coming later.
Keywords : open_fasta, open_fa_files, open_FASTA_files,
Options : Seq name to fetch the specified seq only.
as open_fasta_files.pl MY_SEQ_NAME Swissprot.fasta
-d for giving back desc as well as the name. so it
gives 'HI0002 This is the description part'
as the key
If you put hash which is like('seq_name', ['20-30', '30-44',..])
it will produce hash which has got:
( seq_name_20-30 'asdfasdfasdfasdfasd',
seq_name_30-44 'kljkljkjkjljkjljkll',
.... .... )
-s for returning sequence size only
$reverse_seq=r by r ## to reverse seq.
Usage : %fasta_seq=%{&open_fasta_files($fasta_file, ['MJ0084'])};
if you put additional seq name as MJ0084 it will
fetch that sequence only in the database file.
%out=%{&open_fasta_files(@ARGV, \%index)};
while %index has (seq indexpos seq2 indexpos2,,,)
In this case, the fasta file should have xxxx.fa format
Version : 4.1
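A minimal sketch of reading FASTA into a hash, in the spirit of the entry above. The name read_fasta_sketch is hypothetical and none of the options (-d, -s, r, ranges, index hashes) are implemented; it only assumes that the sequence name is the first word after '>' and, as stated above, that a duplicate entry name coming later is ignored.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's open_fasta_files.
# $source is a file name, or a reference to a string holding FASTA text
# (Perl's open treats a scalar ref as an in-memory file).
# Returns a ref to a hash of name => sequence.
sub read_fasta_sketch {
    my ($source) = @_;
    open my $fh, '<', $source or die "Cannot open FASTA input: $!";
    my (%seq, $name, $skip);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;
        if ($line =~ /^>\s*(\S*)/) {        # header: name is the first word
            $name = $1;
            $skip = exists $seq{$name};      # later duplicate: ignore it
            $seq{$name} = '' unless $skip;
        } elsif (defined $name && !$skip) {
            $line =~ s/\s+//g;               # drop spaces inside sequence lines
            $seq{$name} .= $line;
        }
    }
    close $fh;
    return \%seq;
}

my $text = "> 1st-seq\nABCDEFG\nHIJK\n> 2nd.sequ\nABCDEFGHIJ\n";
my %fasta = %{ read_fasta_sketch(\$text) };
print "$_ => $fasta{$_}\n" for sort keys %fasta;
```

Passing a scalar ref keeps the example free of any file on disk; with a real database you would pass the file name instead.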
Function : gets 2 references (one for %hash the other for group $name)
uses &msf_permute_array_write(\%hash, \$group_name)
the second arg is for output file name. can be anything.
Usage : &msf_permute_hash_write(\%hash, $group_name); # void
Version : 1.0
Argument : gets 2 references
Function :
the second arg is for output file name. can be anything.
used in &msf_permu_hash_write
Usage : &msf_permu_array_write(\%hash, \$group_name); # void
Version : 1.0
Function : gets a reference of hash which has names and sequences as keys and values.
uses &pir_permute_array_write
the second arg is for output file name. can be anything.
Usage : &pir_permute_hash_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets a reference of a hash which has names and sequences as keys and values.
uses &fasta_permute_array_write
the second arg is for output file name. can be anything.
Usage : &fasta_permute_hash_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets a reference of an array which has names and sequences as keys and values.
the second arg is for output file name. can be anything.
used in &fasta_permu_hash_write
Usage : &fasta_permu_array_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets a reference of hash which has names and sequences as keys and values.
uses &ssp_permute_array_write
the second arg is for output file name. can be anything.
Usage : &ssp_permute_hash_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets a reference of hash which has names and sequences as keys and values.
the second arg is for output file name. can be anything.
used in &pir_permu_hash_write
Usage : &pir_permu_array_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets a reference of hash which has names and sequences as keys and values.
the second arg is for output file name. can be anything.
used in &ssp_permu_hash_write
ssp file is for PHD secondary structure prediction service.
Usage : &ssp_permu_array_write($hash_ref, $group_name); # void
Version : 1.0
Function : gets permutated array elements except single char elements.
fastest
Usage : &permute(\@array);
Version : 1.0
Warning : (from : Kenneth Albanowski CIS: 70705,126)
Example : &ssp_write($hash_pointer, $out_file_name);
Function : writes multiple seqs. in ssp format (takes one or more than one seq.!!)
ssp is PHD server format.
Usage : two arguments: $seq_hash_reference and $output_file_name
takes a hash with names as keys and sequences as values.
uses Perl5 pointers(references).
Version : 1.0
Example : &pir_write($hash_pointer, $out_file_name);
Function : writes multiple seqs. in PIR format (takes one or more than one seq.!!)
Usage : two arguments: $seq_hash_reference and $output_file_name
takes a hash with names as keys and sequences as values.
uses Perl5 pointers(references).
Version : 1.0
Example : &pir_write($hash_pointer, $out_file_name);
Function : writes multiple seqs. in fasta format (takes one or more than one seq.!!)
Usage : two arguments: $seq_hash_reference and $output_file_name
takes a hash with names as keys and sequences as values.
uses Perl5 pointers(references).
Version : 1.0
Example : &write_msp3_files(\@files); # while @files has G*.pdbg
Function : opens two files. Gx.msp_1 and Gx.msp_2 to create Gx.msp3 file
you can set the msp3 file extension by e= option,
for example, e=interm will make G1.interm instead of G1.msp3
Keywords : make_msp3_files, create_msp3_files
Options :
$upper_expect_limit2= by u2= # u2 is for msp_2 files (eg, 0.0006)
$upper_expect_limit1= by u1= # u1 is for msp_1 files (eg, 0.081 )
$lower_expect_limit1= by l1=
$lower_expect_limit2= by l2=
R for NOT adding ranges in seq names.
e= for extension name
n for no sort by columns in output
e for sorting columns by E values (first first and then second)
E for sorting columns by E values but reverse order
Returns : returns the names of msp3 files
Usage : &write_msp3_files(\@files);
Version : 1.8
Function : takes xxxx.msp files and writes xxxx.parf file
Keywords : write_parf
Options :
$pdbd_seq_only=d by d -d
$sam_571_seq_only=571 by 571 -571
$pdb95d_2092_seq =2092 by 2092 -2092
$ISS_2nd_Eval_factor= by E= ## "E=$eval"
$PDB40D_935_FASTA= 935 by 935
$use_raw_score=r
$use_eval_but_show_raw_score=e by e -e ## eval order but only raw score is shown.
This is to make a special graph
requested by David Haussler
Usage : &write_parf_files(@ARGV);
Version : 2.4
Author : jong@salt2.med.harvard.edu
Function : This produces EVSS file(Error VS Score) from PARF file
Keywords : get_score_vs_error_from_parf_files.pl
Options :
d=$query_number for dividing the errors by all the query number
$negate_score=n by n -n # to make sign change for PSI scores
$error_per_query=q by q -q # divide the error by query number
$log_of_errors=l by l -l
$log_of_evalue_or_score=e by e -e
$get_log_base_10=t by t -t
Usage : @files_produced=@{write_evss_files(\@files)};
Version : 1.5
Argument :
$sort_seq_names=s by s ## sequences are written sorted by name
$write_rv_seq_as_well=R by R # write reverse seq as well as forward seq
Example : &write_fasta(\%in1, \$out_file_name, \%in2, \%in3,..., );
<< The order of the hash and scalar ref. doesn't matter. >>
Function : writes multiple seqs. in fasta format (takes one or more seq.!!)
This needs a hash with names as keys and the actual sequences as values.
To print out each fasta seq into each single file, use write_fasta_seq_by_seq
This can rename seq names
Keywords : write_fasta_file, print_fasta_file, write fasta file, fasta_write
show_fasta, write_sequence_fasta, write_fasta_files,
Options : v for STD out.
r for renaming the sequences so that Clustalw would not complain about the 10 char limit
so result would be: 0 ->ASDFASDF, 1->ASDFASFASF, 2->ADSFASDFA
$write_pure_seq_only=o by o -o ## writing only the seq (no gap chars or space)
Usage : many arguments: $seq_hash_reference and $output_file_name
takes a hash with names as keys and sequences as values.
Version : 3.0
Warning : The default output file name is 'default_out.fa' if you do not
specify output file name.
OUTput file should have xxxxx.fa or xxxx.any_ext NOT just 'xxxxx'
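A minimal sketch of turning a name/sequence hash into FASTA text. The name fasta_text_sketch is hypothetical, none of the options (v, r, o, s, R) are implemented, and the 60-char line width is an assumption; the library's write_fasta also handles the output file itself, which this sketch leaves to the caller.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's write_fasta.
# Takes a ref to a hash of name => sequence and an optional line width
# (default 60, an assumption). Returns the FASTA text as one string,
# with entries in sorted name order.
sub fasta_text_sketch {
    my ($seq_ref, $width) = @_;
    $width ||= 60;
    my $text = '';
    for my $name (sort keys %$seq_ref) {
        $text .= ">$name\n";
        my $seq = $seq_ref->{$name};
        # cut the sequence into chunks of at most $width chars
        $text .= "$1\n" while $seq =~ /(.{1,$width})/g;
    }
    return $text;
}

my %in1 = (xxxx => 'ASDFASDFASDFASDFASDFASDFASDF',
           yyyy => 'ASDFASDFASDFASDFASDFASDFSDAFSD');
print fasta_text_sketch(\%in1);
```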
Example : with >xxxx
ASDFASDFASDFASDFASDFASDFASDF
>yyyy
ASDFASDFASDFASDFASDFASDFSDAFSD
You will get two files (xxxx.fa, yyyy.fa)
Function : accepts one hash of multiple sequences and writes many files
of single sequences by using the names as file names.
If $extension is provided, it writes an output as in
the below example (seq1_sc.fasta). If not, it just attaches
'fa' to files.
This needs a hash with names as keys and the actual sequences as values.
Keywords : write_each_fasta, write_single_fasta, write_fasta_single
single_fasta_write, write_fasta_files_seq_by_seq,
write_single_fasta_files,
Options : can specify extension name.
e for checking whether the fasta file already exists, and skipping it if so
r for renaming the sequences so that Clustalw would not complain about the 10 char limit
so result would be: 0 ->ASDFASDF, 1->ASDFASFASF, 2->ADSFASDFA
$write_rv_seq_as_well=R by R # write reverse seq as well as forward seq
$extension= by E=
Returns : nothing. default OUTPUT file name is '$key.fa' !!
Usage : &write_fasta_seq_by_seq(\%hash, [$extension], [\$output_filename]);
Version : 2.1
Example : &write_pred_files(\%gapped_av_for_back_pred, $final_output_pred_name,
$graphical_rep_of_str, "$put_reliability_line");
Keywords : write_predator_short_out_file, write_pred_file, write_prd_file
Options :
$put_reliability_line=r by r
$omit_coil_region=c by c
$protein_name= by n=
$graphical_rep_of_str=g by g
$show_on_screen_only=s by s
$seq_block_size= by b=
Version : 1.4
Example : &show_in_fasta(\%hash);
Function : shows multiple seqs. in fasta format (takes one or more seq.!!)
Usage : &show_hash_in_fasta(\%in1, \%in2, \%in3, .... );
takes a hash with names as keys and sequences as values.
uses Perl5 pointers(references).
Version : 1.0
Function : a hash of one letter to 3 letter amino acid code , returns a hash
Keywords : 1_to_3
Usage : %one_letter = %{&One_To_Three_Letter}; # takes no arguments (void).
Version : 1.0
Function : a hash of one letter to 3 letter amino acid code , returns a hash
Keywords : 1_to_3, convert_1_to_3_letter
Usage : %one_letter = %{&ONE_TO_THREE_LETTER }; # takes no arguments (void).
Version : 1.0
Function : a hash of one letter to 3 letter amino acid code , returns a hash
Usage : %one_letter = %{&one_to_three_letter}; # takes no arguments (void).
Version : 1.0
Function : a hash of 3 letter to one letter amino acid code , returns a hash
Keywords : 321, 3to1 3_to_1 THREE_TO_ONE_LETTER Three_To_One_Letter
convert_3_to_1, convert_3_to_1_aa_name
Usage : %three_letter = &three_to_one_letter ; # takes no arguments (void).
Version : 1.1
Function : a hash of 3 letter to one letter amino acid code , returns a hash
Keywords : 321, 3to1 3_to_1 THREE_TO_ONE_LETTER Three_To_One_Letter
convert_3_to_1, convert_3_to_1_aa_name
Usage : %three_letter = &three_to_one_letter ; # takes no arguments (void).
Version : 1.1
Function : a hash of one letter to 3 letter amino acid code , returns a hash
Keywords : 123, 1to3 1_to_3 one_TO_three_LETTER One_To_Three_Letter
convert_1_to_3, convert_1_to_3_aa_name
Usage : %three_letter = &three_to_one_letter ; # takes no arguments (void).
Version : 1.1
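A minimal sketch of the two code tables described in the entries above, for the 20 standard amino acids only. The names one_to_three_sketch and three_to_one_sketch are hypothetical, and the capitalisation of the 3 letter codes ('Ala' rather than 'ALA') is an assumption; the library's tables may include more codes.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's one_to_three_letter.
# Returns a ref to a hash of one letter => 3 letter amino acid codes.
sub one_to_three_sketch {
    my %code = (
        A => 'Ala', R => 'Arg', N => 'Asn', D => 'Asp', C => 'Cys',
        Q => 'Gln', E => 'Glu', G => 'Gly', H => 'His', I => 'Ile',
        L => 'Leu', K => 'Lys', M => 'Met', F => 'Phe', P => 'Pro',
        S => 'Ser', T => 'Thr', W => 'Trp', Y => 'Tyr', V => 'Val',
    );
    return \%code;
}

# The reverse table is built by swapping keys and values; this is safe
# because every 3 letter code above is unique.
sub three_to_one_sketch {
    my %three_to_one = reverse %{ one_to_three_sketch() };
    return \%three_to_one;
}

my %one_letter   = %{ one_to_three_sketch() };
my %three_letter = %{ three_to_one_sketch() };
print "W is $one_letter{W}, Lys is $three_letter{Lys}\n";
```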
Argument : hash of at least 2 sequences.
Function : gets amino acid composition identity of any given
number of sequences(at least 2).
Keywords : get_amino_acid_composition, get_protein_composition, composition
Usage : $percent = &amino_acid_compos_id_percent (%any_hash_with_sequences);
The way identity(composition) is derived is;
Version : 1.1
Function : produces amino acid composition identity of any given number of sequences.
Keywords : get_percent_composition_identity, seq_composition_identity,
percent_sequence_composition_id
Usage : $percent = &seq_id_percent_array(@any_array_sequences);
The way identity(pairwise) is derived is;
Version : 1.0
Warning : This can handle 'common gaps' in the sequences
Function : produces amino acid composition identity of any given number of sequences.
Usage : $percent = &compos_id_percent_array(@any_array_sequences);
The way identity(composition) is derived is;
Version : 1.0
Function : gets amino acid composition identity of any given number of sequences.
Keywords : get_amino_acid_composiiton
Usage : $percent = &compos_id_percent_hash(%any_hash_with_sequences);
The way identity(composition) is derived is;
Version : 1.0
Argument : two references of hash of sequences.
Example : ('A', 200, 'C', 191, 'D', 99)
('A', 290, 'C', 199, 'D', 100)
uses only two sequences.
Function : actual calculation of identity
Returns : ref. of a scalar (in percent) eg) 95
Usage : %hash = &common_compos_hash(\%any_hash1, \%any_hash1);
Version : 1.0
Argument : two references of hash of sequences.
Example : ('A', 200, 'C', 191, 'D', 99)
('A', 290, 'C', 199, 'D', 100)
uses only two sequences.
Function : actual calculation of identity
Returns : ref. of a scalar (in percent) eg) 95
Usage : %hash = &calc_compos_hash(\%any_hash1, \%any_hash1);
Version : 1.0
Argument : ref. for Scalar string or Array of chars or Hash AND 'the target char'
Example : if the string is 'seq ABCDEEEEEFFEFE' given in a hash
if you put 'A' as one argument, it counts the occurrences of 'A'
and gets the percentage of it.
Function : calculates the percentage content of any single char over the whole
length of strings in it.
Keywords : get_percentage_of_char
Options : None yet.
Returns : Numerical Percentage
Usage : %out= %{&get_percentage(\%result, '1')};
Version : 1.0
Warning : This converts array and string input as ref. into arbitrary hash and
returns hash
programmed by A Biomatic
Function : takes a ref. of a hash of names and sequences, returns
percent identity.
Usage : $identity = ${&pairwise_percent_id(%arrayinput)};
Version : 1.0
Argument : hash(es) of sequences.
Function : takes a ref. of a hash of names and sequences, returns
percent identity. NOT composition identity.
Keywords : get_sequence_identity
Usage : $identity = ${&get_seq_identity(%arrayinput)};
Version : 1.0
Argument : two sequence files which have identical sequence names.
Function : accepts two files and prints out the sequence identities of the alignment.
Options : h # for help
v # for verbose printouts(prints actual sequences)
Returns : reference of Scalar for percentage correct alignment(for already
aligned sequences)
Usage : &get_correct_percent_alignment_rate(\$file1, \$file2);
Warning : Alpha version, A Biomatic , made for Bissan
Function : returns a table of alphabet with occurrences.
can handle any char, this converts char to upper case.
Returns : %hash1 = ('A',3, 'C',2, 'D',1, 'Q',2, 'S',1), %hash2,,,
Usage : %output = %{&compos_table(@input_array1, @input_array2,,,,)};
example input
Warning : converts all SMALL letters to Capital letters before counting!!
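A minimal sketch of such a composition table. The name compos_table_sketch is hypothetical; unlike the library's compos_table, which is documented as returning one hash per input array, this sketch pools all of its arguments into a single table.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's compos_table.
# Takes a list of strings (or single chars), converts everything to
# upper case, and returns a ref to a hash of char => number of occurrences.
# Every char is counted, including gap chars such as '.' or '-'.
sub compos_table_sketch {
    my %count;
    $count{$_}++ for map { split //, uc $_ } @_;
    return \%count;
}

my %table = %{ compos_table_sketch('aAc', 'DqACQs') };
print join(', ', map { "$_ $table{$_}" } sort keys %table), "\n";
```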
Argument : two references of hash of sequences.
Example : common gaps means only '.' (dots, not alphabets!!)
AAA....BBCB
AAAB..B.BCC --> A.A.....BC. (as in an array)
A.AAA...BCA
Returns : a hash (string1, number1, string2, number2, string3, number3, ...)
Usage : %hash = &common_compos_hash(\%any_hash1, \%any_hash1);
Example : common gaps means only '.' (dots, not alphabets!!)
AAA....BBCB
AAAB..B.BCC --> A.A.....BC. (as in an array)
A.AAA...BCA
The resulting array XXXXX..XXXX is literally like so.
This is to detect absurd gaps in the above.
Usage : @array = &pair_percent_id_trend (%arrayinput);
Example : will return 5 with &smaller_one(5, 50);
Function : gets smaller value of the two inputs
Usage : $smaller = & smaller_one($var, $var2);
Warning : gets only digits!!
Function : takes only ARRAY and counts the number of char. Each elem should be
a single char.
Usage : $num_char = &count_num_of_char(@input_array_of_single_char);
Argument : accepts reference for a hash.
Example : seq1 ABCDE------DDD seq1 ABCDE--DDD
seq2 ABCDEE-----DD- ==> seq2 ABCDEE-DD-
seq3 ---DEE----DDE- seq3 ---DEEDDE-
^^^^
from above the 4 columns of gap will be removed
To remove absurd gaps in multiple sequence alignment
Returns : a ref. of a hash.
Usage : %new_string = %{&remov_com_column2(\%input_hash)};
Version : 1.0
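A minimal sketch of the common gap column removal shown in the example above. The name remove_common_gap_columns_sketch and its optional gap char argument are hypothetical; it assumes all sequences in the hash have the same length, and it is not the library's own code.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's remov_com_column2.
# Takes a ref to a hash of name => aligned sequence (all values are
# assumed to have the same length) and an optional gap char (default '-').
# Returns a ref to a new hash with every all-gap column removed.
sub remove_common_gap_columns_sketch {
    my ($in, $gap) = @_;
    $gap = '-' unless defined $gap;
    my @seqs = values %$in;
    return +{} unless @seqs;
    my @keep;    # indices of columns where at least one sequence has a residue
    for my $col (0 .. length($seqs[0]) - 1) {
        push @keep, $col if grep { substr($_, $col, 1) ne $gap } @seqs;
    }
    my %out;
    for my $name (keys %$in) {
        $out{$name} = join '', map { substr($in->{$name}, $_, 1) } @keep;
    }
    return \%out;
}

my %input_hash = (seq1 => 'ABCDE------DDD',
                  seq2 => 'ABCDEE-----DD-',
                  seq3 => '---DEE----DDE-');
my %new_string = %{ remove_common_gap_columns_sketch(\%input_hash) };
print "$_ $new_string{$_}\n" for sort keys %new_string;
```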
Argument : 2 or more ref for hash of identical keys and value length.
One optional arg for replacing space char to the given one.
Author : jong@salt2.med.harvard.edu
Class : get_common_column, get_common_column_in_seq, get common column in sequence
for secondary structure only representation.
Example : %out =%{&get_common_column(\%hash1, \%hash2, '-')};
output> with 'E' option >>> "name1 --HHH--1232-"
Following input will give;
%hash1 = ('s1', '--EHH-CHHEE----EHH--HHEE----EHH--HHEE----EHH-CHHEE--');
%hash2 = ('s2', '--EEH-CHHEE----EEH-CHHEE----EEH-CHHEE----EEH-CHHEE--');
%hash3 = ('s3', '-KEEH-CHHEE-XX-EEH-CHHEE----EEH-CHHEE----EEH-CHHEE--');
%hash4 = ('s4', '-TESH-CHEEE-XX-EEH-CHHEE----EEH-CHHEE----EEH-CHHEE--');
s1_s2_s3_s4 --E-H-CH-EE----E-H--HHEE----E-H--HHEE----E-H-CHHEE--
Function : (name1 --EHH--HHEE-- )
(name2 --HHH--EEEE-- ) ==> result is;
(name1_name2 -- HH-- EE-- )
to get the identical chars in hash strings of sequences.
Keywords : Overlap, superpose hash, overlay identical chars, superpose_seq_hash
get_common_column, get_com_column, get_common_sequence,
get_common_seq_region, multiply_seq_hash, get_common_column_in_sequence
Returns : one hash ref. of the combined key name (i.e., name1_name2). Combined by '_'
Usage : %out =%{&get_common_column(\%hash1, \%hash2, '-')};
Version : 1.6
Warning : This takes 2 or more hashes, not fewer.
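A minimal sketch of the column overlay described above. The name common_column_sketch is hypothetical and the 'E' option is not implemented; it assumes one sequence per input hash and equal sequence lengths, and uses a space as the default fill char for columns that differ.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the library's get_common_column.
# Arguments: two or more hash refs (one name => sequence pair each is
# assumed, all sequences of equal length) and an optional fill char.
# Returns a ref to a hash with one entry: the names joined by '_' as key,
# and a string that keeps a char only where all sequences agree.
sub common_column_sketch {
    my @hashes = grep { ref($_) eq 'HASH' } @_;
    my ($fill) = grep { !ref($_) } @_;
    $fill = ' ' unless defined $fill;
    my %out;
    return \%out unless @hashes;
    my (@names, @seqs);
    for my $h (@hashes) {
        my ($name) = sort keys %$h;    # only the first entry of each hash is used
        push @names, $name;
        push @seqs,  $h->{$name};
    }
    my $common = '';
    for my $col (0 .. length($seqs[0]) - 1) {
        my $char = substr $seqs[0], $col, 1;
        my $same = !grep { substr($_, $col, 1) ne $char } @seqs;
        $common .= $same ? $char : $fill;
    }
    $out{ join '_', @names } = $common;
    return \%out;
}

my %hash1 = (name1 => '--EHH--HHEE--');
my %hash2 = (name2 => '--HHH--EEEE--');
my %out = %{ common_column_sketch(\%hash1, \%hash2, '-') };
print "$_ $out{$_}\n" for keys %out;
```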
Argument : 2 ref for hash of identical keys and value length. One optional arg for
replacing space char to the given one.
Example : %out =%{&overlay_seq_for_identical_chars(\%hash1, \%hash2, '-')};
output> with 'E' option >>> "name1 --HHH--1232-"
Function : (name1 --EHH--HHEE-- )
(name2 --HHH--EEEE-- ) ==> result is;
(name1_name2 -- HH-- EE-- )
to get the identical chars in hash strings of sequences.
Keywords : Overlap, superpose hash, overlay identical chars, superpose_seq_hash
Returns : one hash ref. of the combined key name (i.e., name1_name2). Combined by '_'
Usage : %out =%{&overlay_seq_for_identical_chars(\%hash1, \%hash2, '-')};
Version : 1.0
Warning : Works only for 2 sequence hashes.
Argument : accepts reference for hash(es) and array(s).
Function : removes common gap column in seq.
Keywords : remove_com_column, remove_common_column,
remove_common_gap_column, remov_common_gap_column,
remove com column
Returns : a ref. of hash(es) and array(s).
name1 ABCDE....DDD name1 ABCDE..DDD
name2 ABCDEE..DD.. --> name2 ABCDEEDD..
name3 ...DEE..DDE. name3 ...DEEDDE.
(ABC....CD, ABCD...EE) --> (ABC.CD, ABCDEE)
from above the two columns of dots will be removed
To remove absurd gaps in multiple sequence alignment. for nt6-hmm.pl
Usage : %new_string = %{&remov_com_column(\%hashinput)};
@out=@{&remov_com_column(\@array3)};
Example : Given XXX...XXX and AAA.....BBBB, the common gap positions 3, 4, 5 are deleted:
the dots of XXX...XXX will be removed in AAA.....BBBB --> AAA..BBBB
XXX...XXX is an @array, while AAA.....BBBB is a value of the input hash
Function : takes XXX...XXX (an array) and a hash input. Removes all the common gaps (dots) in targets.
Usage : %result = &remov_common_gap (*common_pos_arr, *target_hash_of_sequence);
Version : 1.0
Argument : gets a ref. of a hash of sequences
Example : common gaps means only '.' (dots, not alphabets!!)
AAA....BBBB
AABB....BBC --> XXXXX..XXXX (as in an array)
..AAA...BCA
This is to detect absurd gaps in the above.
Function : returns X...XXXX, as an array. '.' means common elements.
Keywords : common_gap_pos_hash
Usage : @array = @{&com_elem_pos_hash(%arrayinput)};
Version : 1.0
Example : common gaps means only '.' (dots, not alphabets!!)
AAA....BBCB
AAAB..B.BCC --> A.A.....BC. (as in an array)
A.AAA...BCA
The resulting array XXXXX..XXXX is literally like so.
This is to detect absurd gaps in the above.
Usage : @array = &pairwise_iden_pos(%arrayinput);
Version : 1.0
Argument : one ref. for an inputfile (absolute
>>> PDB example >>>
SEQRES 1 A 284 MET ASP ALA ILE LYS LYS LYS MET GLN MET LEU LYS LEU 2TMA 51
SEQRES 2 A 284 ASP LYS GLU ASN ALA LEU ASP ARG ALA GLU GLN ALA GLU 2TMA 52
Function : Convert a PDB structure file to FASTA format sequences.
Keywords : read_pdb_files, read pdb files, open pdb files
Returns : One ref. for a hash of sequences (DNA, RNA, PROTEIN (in diff chains)).
If the two chains are identical, it gets rid of one of them and returns
a name without the chain note --> 2tma, not 2tmaA and 2tmaB
Usage : %out = %{&open_pdb_files(\$VAR)};
Version : 1.7
Warning : (read the sequences only)
Argument : one ref. for an inputfile (absolute
>>> PDB example >>>
SEQRES 1 A 284 MET ASP ALA ILE LYS LYS LYS MET GLN MET LEU LYS LEU 2TMA 51
SEQRES 2 A 284 ASP LYS GLU ASN ALA LEU ASP ARG ALA GLU GLN ALA GLU 2TMA 52
SEQRES 3 A 284 ALA ASP LYS LYS ALA ALA GLU ASP ARG SER LYS GLN LEU 2TMA 53
Function : Convert a PDB structure file to FASTA format sequences.
Returns : One ref. for a hash of sequences (DNA, RNA, PROTEIN (in diff chains)).
If the two chains are identical, it gets rid of one of them and returns
a name without the chain note --> 2tma, not 2tmaA and 2tmaB
Usage : %out = %{&open_brk_files(\$VAR)};
Function : makes two hashes from ...msf and ..jp files. %array1 is for msf
Usage : &open_msf_jp_files($file1, $file2);
Warning : !!! not very general; better not to use it.
msf file is meant to be seq
jp file is meant to be structural alignment (correct seq
msf format is
cofi_human ATFVKM
ici2_horvu RVRLFVDKLD NIA
ici3_horvu RVRLFVDRLD NIA
jp format is;
ycah_ecoli RNVEIV----VID-GVRRFGNIA
icis_vicfa RVRLYVDESNKVV-RAAPIGNIA
ier1_lyces RVRLFVNLLDIVV-QTPKVGNIA
Function : sorts files by creation time, oldest first.
Keywords : sort_by_time, sort_files_chronically
Options : _ for debugging.
# for debugging.
Usage : @files = @{&sort_files_by_time(\@files)};
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
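A rough sketch of sorting files chronologically. This is an illustration, not the library's sort_files_by_time: files_oldest_first is a hypothetical name, and it assumes the modification time from stat stands in for the creation time, which Unix stat does not record.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a ref. to an array of file names and
# returns a ref. to an array sorted oldest first (by mtime).
sub files_oldest_first {
    my ($files) = @_;
    my %mtime;
    for my $file (@$files) {
        my @st = stat $file;
        next unless @st;           # skip files that cannot be stat'ed
        $mtime{$file} = $st[9];    # element 9 of stat is the mtime
    }
    return [ sort { $mtime{$a} <=> $mtime{$b} }
             grep { exists $mtime{$_} } @$files ];
}
```

Files that do not exist are silently dropped from the result in this sketch.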
Function : sorts any hash by its values and returns a ref. to an array of the
sorted values with their keys attached. So, if the input pair is
key1 value1, the resulting element is the string
'value1 key1'.
Keywords : sort_hash_by_value, sort_hash, sort_by_values,
Options : -n for numerical sort(not working yet)
Usage : @values_sorted =@{&sort_hash_by_value_and_make_array(\%assoc)};
Version : 1.1
Warning : Keys that share the same value overwrite each other, so only one of them is kept.
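A short sketch of the 'value key' output described above. values_with_keys is a hypothetical name, and unlike the behaviour noted in the warning, this version keeps every key even when values repeat. It sorts as strings, since the numerical option is documented as not working yet.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash ref., returns a ref. to an array of
# 'value key' strings sorted by value (ties broken by key).
sub values_with_keys {
    my ($hash) = @_;
    my @out = map  { "$hash->{$_} $_" }
              sort { $hash->{$a} cmp $hash->{$b} or $a cmp $b }
              keys %$hash;
    return \@out;
}
```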
Function : sorts any hash by its values and returns ref. of sorted hash values
Keywords : sort_hash_by_value, sort_hash, sort_by_values, sort_by_value
Usage : @values_sorted =@{sort_by_by_values(\%assoc)};
Version : 1.1
Warning : Keys that share the same value overwrite each other, so only one of them is kept.
Function : sorts any hash by its values and returns ref. of sorted hash values
Keywords : sort_hash_by_keys, sort_hash, key_sort
Usage : @values_sorted =@{sort_by_by_values(\%assoc)};
Version : 1.0
Function : sorts any hash by its values and returns ref. of sorted hash values
Keywords : sort_hash_by_keys, sort_hash, key_sort
Usage : @values_sorted =@{sort_by_values(\%assoc)};
Version : 1.0
Function : sorts any hash by its values and returns ref. of sorted hash values
Keywords : sort_hash_by_value, sort_hash, value_sort,
Usage : @values_sorted =@{sort_hash_by_values(\%assoc)};
Version : 1.0
Function : sorts the strings in an array by length,
longest first.
Keywords : sort_array_by_length, sort_str_by_length, sort_array_string_by
sort_string_by_leng, sort_by_length, sort_by_leng,
sort_array_by_string_length, sort_array_elements_by_string_length
Options : -r reverse the order
Usage : @output = @{&sort_string_by_length(@any_input_strings, [-r], @more)};
Version : 1.2
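A minimal sketch of sorting by string length with the '-r' option. strings_by_length is a hypothetical name, and breaking ties alphabetically is an assumption made here so that the output is deterministic.

```perl
use strict;
use warnings;

# Hypothetical helper: takes strings (and optionally '-r' anywhere in the
# list), returns a ref. to an array sorted longest first, or shortest
# first when '-r' is given.
sub strings_by_length {
    my @args    = @_;
    my $reverse = grep { $_ eq '-r' } @args;
    my @strings = grep { $_ ne '-r' } @args;
    my @sorted  = sort { length($b) <=> length($a) or $a cmp $b } @strings;
    @sorted = reverse @sorted if $reverse;
    return \@sorted;
}
```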
Example : ($name,$aliases,$addrtype,$length,@addrs)=&get_host_by_addr($var); where $var = "13.13.12.12";
Keywords : get_host_by_address, get_hostname_by_address
Usage : ($name,$aliases,$addrtype,$length,@addrs)=&get_host_by_addr('131.111.137.11'); or
Version : 1.0
Example : ($name,$aliases,$addrtype,$length,@addrs)=&get_host_by_name($var);
where $var = "ind4";
Usage : ($name,$aliases,$addrtype,$length,@addrs)=&get_host_by_name('ind4'); or
Version : 1.0
Warning : ! not working yet.
Returns : The string with newlines replacing spaces in appropriate places.
Usage : &word_wrap($line_to_format)
Version : 1.0
Warning :
The following subroutine does word wrapping on a text string
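A sketch of greedy word wrapping, where newlines replace spaces once a line would exceed a given width. wrap_line and its default width of 72 are assumptions for illustration; the library's word_wrap may break lines differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $text with newlines replacing spaces so
# that no line is longer than $width (unless a single word is longer).
sub wrap_line {
    my ($text, $width) = @_;
    $width ||= 72;
    my @lines;
    my $current = '';
    for my $word (split ' ', $text) {
        if ($current eq '') {
            $current = $word;
        }
        elsif (length($current) + 1 + length($word) <= $width) {
            $current .= " $word";
        }
        else {
            push @lines, $current;
            $current = $word;
        }
    }
    push @lines, $current if $current ne '';
    return join "\n", @lines;
}
```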
Example : Output: item1
Output: item2
Output: item3
Function : for debugging purpose. Shows any array elem line by line.
Options : -h for horizontal display of elements
c for compact (do not put new line between array chunk)
s for putting new line between arrays
Usage : &show_array(\@input_array);
Version : 2.4
Warning : can handle scalar ref, too.
Example : Output: item1
Output: item2
Output: item3
Function : for debugging purpose. Shows any array elem line by line.
the line is 60 elements long (uses recursion)
Options : -s or -S or s or S for spaced output. Eg)
seq1 1 1 1 1 1 1 1 1 1 1 1 1
instead of
seq1 111111111111
-h or -H or h or H for horizontal line of '---------...'
Usage : &show_hash(\@input_array);
Version : 1.7
Warning : There is a global variable: $show_hash_option
It tries to detect any given string which is joined by ','
Example :
There are 2 types of output. The short output:>
> MOZ_HUMAN_part
. . . . .
1 LDHKTLYYDVEPFLFYVLTQNDVKGCHLVGYFSKEKHCQQKYNVSCIMIL 50
___EEEEEE__HHHHHHH_______EEE____________EEEEEEEEE_
(-l option for long output)
NAME MOZ_HUMAN_part
HEADER |- Residue -| Pred Rel NAli Asn
PRED 1 MET M c 0.000 0 ?
PRED 2 ALA A c 0.000 0 ?
Function : gets the sec. str. prediction of predator and puts it in a hash.
If the 's' option is given, it also returns a sequence hash ref.
as the second output ref. This can handle both
output formats of predator, so the output
differs according to the input.
Keywords : open_prd_files, open_pred_files, predator, open_prdl_files
open_pre_files, secondary structure prediction file
Options : 's' for sequence output as well (\%sec_str, \%seq)
'p' for percentage of the sec. str.
'a' for accumulated percentage. This will
set 'p' automatically
'n' for NO name when outputting Percentage of chars with
HASH input to get_occurances_of_char sub.
$reverse_residue_order=r by r
Version : 1.8
Argument : one or more file names and options. Files should be PHD server's result.
Function : opens phd files and puts the sequences in hash(es). (Run open_phd_files.pl to
get some idea of how this works: type 'open_phd_files.pl xxx.phdo s' and
it will produce 5 different hashes of secondary structure pred.)
Options : $secondary, $access, $PHD_sec, $Rel_sec, $prH_sec, $prE_sec, $prL_sec,
$prL_sec, $SUB_sec, $P_3_acc, $PHD_acc, $Rel_acc, $SUB_acc);
$attach_class_info_in_seq_name=c by c ## this makes seq_name seq_name_PHD_s
$simple_seq_with_name_hash=s by s
Returns : one or more hashes(ref.) secondary structure prediction of PHD server
--- The PHD secondary server output which are read by open_phd_files -----
1 => PHD sec | HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHH|
2 => Rel sec |987544342178899999999987678999998478999999999995679771688999|
3 => prH sec |001222323478899999999987778999998678999999999986110115788999
4 => prE sec |000010000101000000000000010000000000000000000000000000010000
5 => prL sec |987666565410000000000001110000001211000000000002789774100000
6 => SUB sec |LLLL
7 => P_3 acc |eeeeeeeeee bbeeebbbebbbbebeeee b bbebbebb eebeebe eee eebbeb|
8 => PHD acc |988787787630066600060000606667515007007005760671847885760160
9 => Rel acc |979685546222352421667053233245604127749164753790316552446141
0 => SUB acc |eeeeeeeee
types of PHD output, like 1 for 'PHD sec', 2 for 'Rel sec' etc.
Usage : &open_phd_files(\$file_name, $options,,,,,);
Version : 1.6
Warning : All the spaces are converted to '_'
Function : open swiss files and puts ONLY the sequences in a hash(s)
Keywords : open_swiss_seq_files, open_swiss_seq, read_swissprot_seq_files,
read_swiss_seq, get_swissprot_seq, take_swissprot_seq,
Options : 'v' for STDOUT printout as well.
Version : 1.2
Warning : ONLY the seq.
Example : Clu file eg)
Cluster 7360103
1 1 SLL1058 7-255 2 Origin: 3 736 Sub:3
1 1 MJ0422 17-283 2 Origin: 3 736 Sub:3
1 1 HI1308 3-245 2 Origin: 3 736 Sub:3
Keywords : open_cluster_files,
Options : _ for debugging.
# for debugging.
b to get just the names ($simple_clu_reading)
r for adding ranges in the names
U for making sequence names uppercase
Returns : a ref of hash of $clus{"$clus_size\-$id"}.=$m."\n";
Actual content:
3-133 => 'HI00111 HI00222 MG1233 '
Usage : %clus=%{&open_clu_files(\$input)};
Version : 1.9
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
This automatically converts lower to upper letters
Argument : (\$inputfile1, \$inputfile2, .... )};
Function : open msf files and put sequences in a hash(s)
Options :
$no_gap_char_included=n by n ## to remove gaps noted by '.'
$reverse_seq=r by r
$produce_seq_oder_info=o by o
Returns : (*out, *out2) or (@out_array_of_refs)
Usage : (*out, *out2) = @{&open_msf_files(\$inputfile1, \$inputfile2)};
: %hash_seq = %{&open_msf_files(\$inputfile1)};
: (@out) = @{&open_msf_files(\$inputfile1, \$inputfile2)};
---------- Example of MSF ---
PileUp
MSF: 85 Type: P Check: 5063 ..
Version : 1.7
Function : hmmls matches the full-length model to the target seq., while hmmfs
matches fragments as well.
Options :
t=$thresh for bits score threshold
e=$evalue_thresh for E-value threshold
r for adding ranges
m for making MSP file format output
E=Enquiry_name for specifying enquiry seq name rather than 'HMM', the default
Usage : %out=%{&open_hmmls_files(\@file)};
Version : 1.5
Example : %out = %{&open_seq_files(@ARGV)};
while @ARGV at prompt was: 'pdb_40.seq'
%seq=%{&open_seq_files(@ARGV, '1cgpa_140-197')};
to fetch 1cgpA, but only in the range 140-197
Function : open seq files and put sequences in a hash
seq sequence file format is like this;
1l94 162 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRTFRTGTWDAYK
1lye 162 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRTFRTGTWDAYK
1lyj 162 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRTFRTGTWDAYK
1mngA 203 PYPFKLPDLGYPYEALEPHIDAKTMEIHHQKHHGAYVTNLNAALEKYPYLHGVLNWDVAEEFFKKA
This can also return the sizes of sequences rather than seqs.
Keywords : open_pdbs_files
Options : any digit for the minimum seq length
Usage : %seq=%{&open_seq_files($tim_seq_file, ['MJ0084'], [15] )};
if you put additional seq name as MJ0084 it will
fetch that sequence only in the database file.
Any digit will be used as minimum seq size to be fetched.
Version : 1.6
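A small sketch of reading the 'name length sequence' layout shown above with a minimum length filter. parse_seq_text is a hypothetical helper that works on text already read from a file; it filters on the actual length of the sequence string, which is an assumption (the second column could be used instead).

```perl
use strict;
use warnings;

# Hypothetical helper: takes the text of a seq file and an optional
# minimum length, returns a ref. to a hash of name => sequence.
sub parse_seq_text {
    my ($text, $min_len) = @_;
    $min_len = 0 unless defined $min_len;
    my %seq;
    for my $line (split /\n/, $text) {
        my ($name, $size, $residues) = split ' ', $line;
        next unless defined $residues;            # skip malformed lines
        next if length($residues) < $min_len;     # apply the size filter
        $seq{$name} = $residues;
    }
    return \%seq;
}
```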
Example :
717 0 0.343 16 373 EC1260_16-373 74 434 YBL6_YEAST_74-434
348 9e-16 0.500 113 233 EC1260_113-233 27 146 YDBG_ECOLI_27-146
472 2.9e-08 0.271 13 407 EC1260_13-407 148 567 YHJ9_YEAST_148-567
459 1.9e-22 0.260 1 407 EC1260_1-407 65 477 YLQ6_CAEEL_65-477
452 4.5e-14 0.275 1 407 EC1260_1-407 103 537 YSCPUT2_103-537
1131 0 0.433 1 407 EC1260_1-407 112 519 ZMU43082_112-519
Input SSO file example)-> below
>>MG032 ATP-dependent nuclease (addA) {Bacillus subtilis (666 aa)
Z-score: 88.3 expect() 1.9
Smith-Waterman score: 77; 27.143% identity in 70 aa overlap
30 40 50 60 70 80
MJ0497 RSAGSKGVDLIAGRKGEVLIFECKTSSKTKFYINKEDIEKLISFSEIFGGKPYLAIKFNG
: .. ... . .:.:::. :: : ..:
MG032 HDKVRYAFEVKFNIALVLSINKSNVDFDFDFILKTDNFSDIENFNEIFNRKPALQFRFYT
200 210 220 230 240 250
90 100 110 120 130
MJ0497 EMLFINPFLLSTNGK------NYVIDERIKAIAIDFYEVIGRGKQLKIDDLI
. :: :: ::. : ....... . ::. . :
MG032 K---INVHKLSFNGSDSTYIANILLQDQFNLLEIDLNKSIYALDLENAKERFDKEFVQPL
260 270 280 290 300 310
Parseable form -m 10 option =========================================
>>>MJ0497.fa, 133 aa vs GMG.fa library
; pg_name: Smith-Waterman (PGopt)
; pg_ver: 3.0 June, 1996
; pg_matrix: BL50
; pg_gap-pen: -12 -2
>>MG032 ATP-dependent nuclease (addA) {Bacillus subtilis
; sw_score: 77
; sw_z-score: 88.3
; sw_expect 1.9
; sw_ident: 0.271
; sw_overlap: 70
>MJ0497 ..
; sq_len: 133
; sq_type: p
; al_start: 58
; al_stop: 121
; al_display_start: 28
Function : This reads the parseable( -m 10 option)
and non-parseable form of ssearch program output
If you give 5 files, it produces 5 hashes as a ref of array.
This understands xxxx.gz files.
This reads FASTA -m 10 output, too.
Keywords : open_ssearch_output_files, ssearch_output, ssearch, FASTA,
Options : _ for debugging.
# for debugging.
u= for upper E value limit
l= for lower E value limit
r for attaching ranges to out seq names (eg> HI0001_1-20 as a key)
U for making the matched seqname uppercase
L for making the matched seqname lowercase
R for attaching ranges to out seq names for both TARGET and MATCH
n for new format (msp2)
a for getting alignments of the pair
Usage : @sso=@{&open_sso_files(@file, $add_range, $add_range2, "u=$upper_expect_limit",
"l=$lower_expect_limit", "m=$margin", $new_format)};
Version : 4.5
Warning : By default, the SW score comes first
If expect value is not found, it becomes '0'
By default, the offset of seq match with a seq name like seq_30-40
will be 30 not 1.
It ignores special chars like , : .prot in the name (eg, AADF_FASDF: will be AADF_FASDF)
Example : Example output(with 'n' opt):
d1bi6h1 d1bi6h1_1-24 IBR1_ANACO_20-42 IBR2_ANACO_19-42
e1bi6.1h1 IBR1_ANACO_38-52 e1bi6.1h1_1-18 IBR2_ANACO_38-52
Function : opens Erik Sonnhammer's MSPcrunch file output(default).
This looks up xxxxx.fa files in the pwd (with S opt) and see
if it can get the sequences as well.
With 'n' option you can just get the matched sequence
names with ranges.
Keywords : exchange_msp_file_columns,
Options :
s -s for size return only
S -S for the sequences are fetched if equivalent xxxx.fa files are in pwd
n -n for matched seq NAMEs with ranges only (eg: HI0001_1-12,,), hash ref is out
R for NO range attachment in Name only return option (n)
e= for evalue threshold, if e=1, ignores all which are over 1
t= for score threshold, if t=100, ignores all which are less than 100
l= for match length threshold.
x for exchange query with matched seqs. eg) 12 0.09 1 30 QUERY 1 29 MATCH
becomes 12 0.09 1 30 MATCH 1 29 QUERY
This returns the same lines as input only with exchanged query and match seqs
Usage : %seq=%{&open_msp_files(@file, $names_only)};
Version : 2.8
Argument : file names like (6taa, 6taa.dssp). If you put just '6taa' without an extension, it
searches for a '6taa.dssp' in both PWD and the directory set in the $DSSP env. variable.
---------- Example of dssp ---
**** SECONDARY STRUCTURE DEFINITION BY THE PROGRAM DSSP, VERSION JUL
REFERENCE W
HEADER RIBOSOME-INACTIVATING PROTEIN 01-JUL-94 1MRG
COMPND ALPHA-MOMORCHARIN COMPLEXED WITH ADENINE
SOURCE BITTER GOURD (CUCURBITACEAE MOMORDICA CHARANTIA) SEEDS
AUTHOR Q
246 1 0 0 0 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN) .
112 95.0 ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2) .
171 69.5 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J) , SAME NUMBER PER 100 RESIDUES .
12 4.9 TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
36 14.6 TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
1 0.4 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES .
1 0.4 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES .
74 30.1 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES .
5 2.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** HISTOGRAMS OF *** .
0 0 0 0 1 1 0 2 0 0 1 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 RESIDUES PER ALPHA HELIX .
1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PARALLEL BRIDGES PER LADDER .
2 0 1 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ANTIPARALLEL BRIDGES PER LADDER .
2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LADDERS PER SHEET .
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 1 D 0 0 132 0, 0.0 2,-0.3 0, 0.0 49,-0.2 0.000 360.0 360.0 360.0 153.4 44.0 96.9 -23.8
2 2 V E -a 50 0A 10 47,-1.5 49,-2.8 2, 0.0 2,-0.3 -0.889 360.0-163.3-115.9 151.4 43.1 100.4 -22.5
3 3 S E -a 51 0A 63 -2,-0.3 2,-0.3 47,-0.2 49,-0.2 -0.961 10.3-172.8-131.0 152.3 44.8 103.7 -23.4
4 4 F E -a 52 0A 8 47,-2.2 49,-2.3 -2,-0.3 2,-0.4 -0.985 6.9-161.2-143.2 139.5 45.0 107.2 -22.0
5 5 R E -a 53 0A 144 -2,-0.3 4,-0.2 47,-0.2 49,-0.2 -0.993 9.7-156.0-121.0 125.9 46.6 110.2 -23.6
6 6 L S S+ 0 0 1 47,-2.3 2,-0.5 -2,-0.4 3,-0.4 0.644 73.2 90.9 -73.3 -22.4 47.5 113.2 -21.4
7 7 S S S+ 0 0 81 47,-0.3 3,-0.1 1,-0.2 -2,-0.1 -0.695 106.0 5.2 -75.5 121.0 47.4 115.6 -24.4
8 8 G S S+ 0 0 72 -2,-0.5 -1,-0.2 1,-0.3 5,-0.1 0.269 97.6 147.8 90.2 -10.7 43.9 117.0 -24.7
9 9 A + 0 0 10 -3,-0.4 -1,-0.3 -4,-0.2 -3,-0.1 -0.256 16.8 166.8 -58.8 142.4 42.9 115.2 -21.5
(\$inputfile1, \$inputfile2, .... )};
Function : open dssp files and put sequences in a hash(s)
It can take options for specific secondary structure types. For example,
if you put an option $H in the args of the sub with the value of 'H'
open_dssp_files will only read secondary structure whenever it sees 'H'
in xxx.dssp file ignoring any other sec. str. types.
If you combine the options of 'H' and 'E', you can get only Helix and long
beta strand sections defined as segments. This is handy to get sec. str. segments
from any dssp files to compare with pdb files etc.
With 'simplify' option, you can convert only all the 'T', 'G' and 'I' sec. to
'H' and 'E'.
Options : H, S, E, T, I, G, B, P, C, -help
$H becomes 'H' by -H or -h or H or h # to retrieve 4-helix (alpha helical)
$S becomes 'S' by -S or -s or S or s # to retrieve bend
$E becomes 'E' by -E or -e or E or e # to retrieve Extended strand, participates in B-ladder
$T becomes 'T' by -T or -t or T or t # to retrieve H-bonded turn
$I becomes 'I' by -I or -i or I or i # to retrieve 5-helix (Pi helical) segment output
$G becomes 'G' by -G or -g or G or g # to retrieve 3-helix (3-10 helical)
$B becomes 'B' by -B or -b or B or b # to retrieve residue in isolated Beta-bridge
$simplify becomes 1 by -p or P or -P, p
$comm_col becomes 'c' by -c or c or C or -C or common
$HELP becomes 1 by -help # for showing help
Returns : (*out, *out2) or (@out_array_of_refs)
Usage : (*out, *out2) = @{&open_dssp_files(\$inputfile1, \$inputfile2, \$H, \$S,,,,)};
(@out) = @{&open_dssp_files(\$inputfile1, \$inputfile2, \$H, \$S,,,,)};
Version : 2.9
$debug feature has been added to make it produce error messages with '#' option.
Warning : 6taa.dssp and 6taa are regarded as the same.
Argument : (\$inputfile1, \$inputfile2, .... )};
Function : opens JPO's xxxx.tem file, stores in 5 hashes. (usually one tem file)
Options : -n, n, or N for removing any gaps in the sequences.
-s, s, or S for getting only the sequences.
Returns : ($r1, $r2, $r3, $r4, $r5) <= these are references for hashes.
Usage : ($r1, $r2, $r3, $r4, $r5)=&open_tem_files(\$infile1, \$inputfile2..);
---------- Example of xxxx
>P1;1cdg
sequence
APDTSVSNKQNFSTDVIYQIFTDRFSDGNPANNPTGAAFDGTCTN-LRLYCGGDWQGIINKINDGYLTGMGVTAI
>P1;1cdg
secondary structure and phi angle
CCCCCCCCCCCCCCCCEEECCHHHHCCCCHHHCCCPHHCCCCPCC-CCCCCPCCHHHHHHHHHCPHHHHHPCCEE
>P1;1cdg
solvent accessibility
TTTTTTTTTTTFFFFFFFFFFFFFFTTTTTTTTTTTTTTTTTFTT-TTTTFFFFFTFFTTTFTTTFFTTFTFTFF
>P1;1cdg
DSSP
CCCCCCCCCCCCCCCCEEECCHHHHCCCCGGGCCCGGGCCCCCCC-CCCCCCCCHHHHHHHHHCCHHHHHCCCEE
>P1;1cdg
percentage accessibility
67523272360000000000000002213792129b722248085-14110000030015105660028040200
2ltn ----TETTSFLITKFSPDQQNLIFQGDGYTT-KEKLTLTK------AVKNTVGRALYSSP
1loe ----TETTSFSITKFGPDQQNLIFQGDGYTT-KERLTLTK------AVRNTVGRALYSSP
2ltn ----CEEEEEEECCCCCCCCCEEEEPCCEEP-PPCEEEEC------CCCPCEEEEEECCC
1loe ----CEEEEEEECCCCCCCCCEEEEPCCEEE-PPEEEEEC------CCCPCEEEEEECCC
2ltn ----TTTTTTTTTTFTTTTTTFTTTTTFTFT-TTTFTFFT------TTTTTTFFFFTTTT
1loe ----TTTTTTTTTTFTTTTTTFTTTTTFTFT-TTTFFFFT------TTTTTTFFFFTTTT
2ltn ----CEEEEEEECCCCCCCCCEEEEECCEEC-CCCEEEEC------CCCCCEEEEEECCC
1loe ----CEEEEEEECCCCCCCCCEEEEECCEEE-CCEEEEEC------CCCCCEEEEEECCC
2ltn ----543251b16504681c50422650502-75201006------35681200001453
1loe ----6532e1508a07981b50422750404-8a200006------36672200001453
Version : 1.0
Function :
Example of hlx file (For Bo Nielson)
Residue Frame Score Probability
1 M a 1.00563E+00 2.05479E-03
2 T b 1.01814E+00 2.52053E-03
3 R c 1.01814E+00 2.52053E-03
Returns : list of ref. for hash(es)
Function : reads jp files and stores results in a hash.
Returns : a reference of a hash for names and their sequences.
Usage : %out_hash=%{&open_jp_files(\$file_name)};
Version : 1.1
Warning : All the spaces are converted to '-' !!!
Function : open fasta files and put sequences in a hash
FASTA sequence file format is like this;
>P1;1abp
structureX:1abp: 1 : : 306 : :L-arabinose-binding protein:Escherichia coli: 2.40:-1.00
ENLKLGFLVKQPEEPWFQTEWKFADKAGKDLG-FEVIKIAV-PDGEKTLNAIDSLAASGAKGFVICTPDPKLGSA
TEGQGFKAADIIGIGINGVDAVSELSKAQATGFYGSLLPSPDVHGYKSSEMLYNWVAK--------DVEPPKFTE
VTDVVLITRDNFKEELEKKGLGGK*
>P1;2gbp
structureX:2gbp: 1 : : 309 : :D-galactose/D-glucose-bind:Escherichia coli: 1.90:14.60
ADTRIGVTIYKYDDNFMSVVRKAIEQDAKAAPDVQLLMNDSQNDQSKQNDQIDVLLAKGVKALAINLVDPAAAGT
LKAHNKS-SIP-VFGVDA--LPEALALVKSGALAGTVLNDANNQAKATFDLAKNLADGKGAADGTNWKIDNKVVR
VP-YVGVDKDNLAEFSKK------*
Usage : %anyhash = %{&open_ali_files(\$filename)};
Function : open fasta files and put sequences in a hash
FASTA sequence file format is like this;
>P1;1abp
structureX:1abp: 1 : : 306 : :L-arabinose-binding protein:Escherichia coli: 2.40:-1.00
ENLKLGFLVKQPEEPWFQTEWKFADKAGKDLG-FEVIKIAV-PDGEKTLNAIDSLAASGAKGFVICTPDPKLGSA
VTDVVLITRDNFKEELEKKGLGGK*
>P1;2gbp
structureX:2gbp: 1 : : 309 : :D-galactose/D-glucose-bind:Escherichia coli: 1.90:14.60
LKAHNKS-SIP-VFGVDA--LPEALALVKSGALAGTVLNDANNQAKATFDLAKNLADGKGAADGTNWKIDNKVVR
VP-YVGVDKDNLAEFSKK------*
Usage : %anyhash = &open_pir_files($any_sequence_file_fasta_form);
Version : 1.2
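A minimal sketch of reading the '>P1;name' layout shown above into a hash. parse_pir_text is a hypothetical helper, not the library's open_pir_files; it assumes each header is followed by exactly one description line and that the sequence ends with '*'.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the text of a file in the '>P1;name' layout,
# returns a ref. to a hash of name => sequence (gaps kept, '*' removed).
sub parse_pir_text {
    my ($text) = @_;
    my (%seq, $name, $skip_description);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>\w\w;(\S+)/) {
            $name = $1;
            $seq{$name} = '';
            $skip_description = 1;    # the line after the header is a description
            next;
        }
        next unless defined $name;
        if ($skip_description) {
            $skip_description = 0;
            next;
        }
        $line =~ s/\s+//g;    # drop any whitespace inside the sequence
        $line =~ s/\*$//;     # drop the terminating asterisk
        $seq{$name} .= $line;
    }
    return \%seq;
}
```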
Function : reads CLUSTALW aln files and stores results in a hash.
Returns : a reference of a hash for names and their sequences.
Usage : %out_hash=%{&open_aln_files(\$file_name)};
Version : 1.1
Argument : (\$inputfile1, \$inputfile2, .... )};
Function : open various sequence alignment files and put sequences in a hash(s)
Returns : (*out, *out2) or (@out_array_of_refs)
Usage : (*out, *out2) = @{&open_seq_alignment_files(\$inputfile1, \$inputfile2)};
: %hash_seq = %{&open_seq_alignment_files(\$inputfile1)};
: (@out) = @{&open_seq_alignment_files(\$inputfile1, \$inputfile2)};
Version : 1.0
Argument : a ref. to a scalar holding the "jp file name"
Example : jp file == seq1 ABDSF--DSFSDFS <- true sequence
seq2 lkdf-jlsjlsjf
sst files == seq1.sst, seq2.sst
output hash == seq1 hHHHHHHHttEEEEEEEE
seq2 hHHHHHHHHHEEEEEEhh
Function : gets the name of a file(jp file) with its absolute dir path
reads the sequence names in the jp file and looks up all
the sst files in the same directory. Puts sst sequences
in a hash with keys of sequence names.
Returns : a ref. for a hash
Usage : %out_sst_hash =%{&open_sst_files(\$jp_file_dir_and_name)};
Warning : $jp_file_dir_and_name should be absolute dir and file name
Argument : a ref. to a scalar holding the "jp file name"
Example : jp file == seq1 ABDSF--DSFSDFS <- true sequence
seq2 lkdf-jlsjlsjf
sst files == seq1.sst, seq2.sst
output hash == seq1 hHHHHHHHttEEEEEEEE
seq2 hHHHHHHHHHEEEEEEhh
Function : gets the name of a file(jp file) with its absolute dir path
reads the sequence names in the jp file and looks up all
the sst files in the same directory. Puts sst sequences
in a hash with keys of sequence names.
Returns : a ref. for a hash
Usage : %out_sst_hash =%{&read_sst_files(\$jp_file_dir_and_name)};
Warning : $jp_file_dir_and_name should be absolute dir and file name
Argument : takes one ref. for a file.
Example : selex file (foo.slx) looks like this:
#=SQ GLB_TUBTU 5.9393 - - 0..0::0 -
#=SQ GGZLB 20.9706 - - 0..0::0 -
#=RF x.....x.xxxx.xxx.xxxxxx....xxxxxxxxxxxxxxx.xxxx
HAHU ......VLSPADKTNVKAAWGKVGA......HAGEYGAEALERMFLS
HBA3_PANTR ......VLSPADKTNVKAAWGKVGA......HAGZYGAEALERMFLS
Function : open slx files and put sequences in a hash
Returns : a ref. of a hash
Usage : %anyarray = &open_slx_files(\$any_sequence_file_slx_form);
Version : 1.0
Warning : Sequences in slx format should be at least 30 residues long.
Argument : takes one ref. for a file.
>>Out file looks like this===>
3aat mfe aapadp----adlfraderpGk gigvY--etgktpvltS
1ama sswwshvemgppdp krdtns--kkMnLG---YrddngkpyvLnC-
Function : open out files and put their sequences in a hash
Returns : a ref. of a hash
Output example in a hash(fills the space)
3aat --mfe---aapadp----adlfraderpGk---gigvY--etgktpvltS
1ama ---eamiaakkmdkeylpiaGladFtraSA----eAfksgryVTV
Usage : %anyarray = &open_out_files(\$any_out_file);
Warning : well tested. It skips lines starting with blank, lines with '-' in them.
Function : gets number of days between two dates ( "05/15/94" )
Usage : $output = &diff_dates("05/15/1994", "05/15/1995")
Version : 1.0
Warning : modified (originally from reb@serf.nsc.com (Edward Brown))
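A sketch of counting the days between two "mm/dd/yyyy" dates with the core Time::Local module rather than the Julian-day routines credited above. days_between is a hypothetical name, and the sketch assumes four-digit years.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert "mm/dd/yyyy" to epoch seconds at midnight UTC.
sub epoch_of_date {
    my ($date) = @_;
    my ($month, $day, $year) = split m{/}, $date;
    return timegm(0, 0, 0, $day, $month - 1, $year);
}

# Hypothetical helper: number of days from $date1 to $date2
# (negative when $date2 is earlier).
sub days_between {
    my ($date1, $date2) = @_;
    return (epoch_of_date($date2) - epoch_of_date($date1)) / 86400;
}
```

Working in UTC keeps every day exactly 86400 seconds long, so daylight saving changes do not affect the result.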
Example : print &fromJulian(34469), "\n";
Function : taking the days between two dates.
Version : 1.0
Warning : got from reb@serf.nsc.com (Edward Brown)
require "julian
$Value1 = &toJulian("05/15/1994"); # Assign $Value1 a Julian Day
print "$Value1\n";
$Value2 = &toJulian("05/20/1994"); # Assign Value2 a Julian Day
print "$Value2\n";
$Days = $Value2 - $Value1; #Difference in Days
print "$Days\n";
print &fromJulian(34469), "\n"; # Give a Julian Day, give the date
print &fromJulian(34474), "\n";
What is the Date 25 Days from Today? (You can get format from `date`)
$Value = &toJulian("05/16/1995");
$Value += 25;
print &fromJulian($Value), "\n";
Example : $Value1 = &toJulian("05/15/94"); print "$Value1\n";
Function : taking the days between two dates.
Version : 1.0
Warning : got from reb@serf.nsc.com (Edward Brown)
Example : as in my 'indexing.pl' for perl file indexer.
Function : opens a dir, processes all its files if you wish, and then descends into each sub
dir of it, using recursion. created by A Biomatic
If a file is a link, it is skipped.
Usage : &opendir_and_go($input_dir); #$inputdir='/nfs/ind4/ccpe1/people/A Biomatic /jpo/align';
Version : 1.0
Warning : Seems to work fine.
Function : a comparison sub for sort, to order things by number of occurrences, highest first.
Usage : sort occurances (@any_array_with_repeating_element);
Version : 1.0
Warning : This is from 21 DAYS book, page 373.
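A short sketch of ordering elements by how often they occur. It is written as a standalone sub rather than a sort comparison routine; elements_by_occurrence is a hypothetical name and the alphabetical tie-break is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list with repeating elements, returns a
# ref. to an array of the distinct elements, most frequent first.
sub elements_by_occurrence {
    my %count;
    $count{$_}++ for @_;
    return [ sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count ];
}
```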
Function : extracts the seqs. that come from the struc. alignment only, to be analysed.
After a mul. alignment with added seqs., you can extract the original str.
seqs. by using this. The output always has the ...msff ext.
*array_ali is the JPO's or true alignment hash.
Usage : &extract_ori_seq($input_file, $output_file, $out_seq_no, *array2);
Version : 1.0
Function : get pair wise seq. identity of any two strings, outputs a scalar (%)
Usage : $homology_out = ${&get_pair_homol(\@any_array_of_2_elem)}; eg) @ar=(ABCDE..., CDEGA..)
Version : 1.0
Warning : reliable, but input seq. strings shouldn't contain spaces.
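A sketch of one way to compute pairwise identity (%) between two aligned strings. percent_identity is a hypothetical name; skipping columns where either string has a gap ('.' or '-') and dividing by the number of compared columns are assumptions, since the library's exact denominator is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of identical residues over the columns
# where neither string has a gap character.
sub percent_identity {
    my ($seq1, $seq2) = @_;
    my $len = length($seq1) < length($seq2) ? length($seq1) : length($seq2);
    my ($same, $compared) = (0, 0);
    for my $i (0 .. $len - 1) {
        my $x = substr $seq1, $i, 1;
        my $y = substr $seq2, $i, 1;
        next if $x =~ /[.-]/ || $y =~ /[.-]/;    # skip gap columns
        $compared++;
        $same++ if uc($x) eq uc($y);
    }
    return $compared ? 100 * $same / $compared : 0;
}
```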
Function : get pair wise seq. identity(%) of any two strings put in as a hash
Usage : $homology_out = &get_pair_homol_hash(%any_hash); , eg) %hash = (name1, ABCDE..., name2, CDEGA..)
Version : 1.0
Warning : reliable, but input seq. strings shouldn't contain spaces.
Function : returns the size of any single testing file
Usage : $outputfilesize = &file_size($input_file_name);
Version : 1.0
Warning : Q is for quality of this sub. This can't be wrong.
Function : gets string seq COMPOSITION identities (a to z). Takes an array
of strings and returns an array of % numbers
Usage : @outarray = &seq_comp_percent2(@any_input_string_array);
Version : 1.0
Example : returns 'jong' with the input of '/nfs/ind5/A Biomatic '
Function : returns present working dir name
Usage : $dir = &pwd_dir($any_absolute_path_dir);
Version : 1.0
Warning : well tested.
Example : with 'jong' it gives '/nfs/ind5/jong', '/nfs/ind4/ccep1/people/A Biomatic '...
when 'jong' is in /nfs/ind4/jong/Perl, it returns /nfs/ind4/A Biomatic
Function : returns full path dir names with given short dir names.
Usage : @full_path_dirs = @{&get_full_path_dir_names(@short_dir_name)};
Version : 1.0
Warning : when 'jong' is in /nfs/ind4/jong/Perl, it returns /nfs/ind4/A Biomatic
Keywords : get_file_extension, get_extension, get_file_ext, get_ext_names
get_file_extensions
Options : _ for debugging.
# for debugging.
Usage : @ext=@{&get_file_extensions(\@file)} or
$ext=${&get_file_extensions(\$file)}
Version : 1.2
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Keywords : get_file_extension, get_extension, get_file_ext, get_ext_names
get_extension_names
Options : _ for debugging.
# for debugging.
Usage : @ext=@{&get_file_extensions(\@file)} or
$ext=${&get_file_extensions(\$file)}
Version : 1.2
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Argument : handles both ref and non-ref.
Example : $base => 'test' with 'test.txt' or '/home/dir/of/mine/text.txt'
Function : produces the file base name(eg, "evalign" out of "evalign.pl" ).
when xxxx.xx.gz form file is given, it removes gz as well
Keywords : get_base_name, base_name, file_base_name , get_file_base_name
get_basename, basename, get_root_name, base , root, get_file_root
Usage : $base =${&get_base_names(\$file_name)};
: or @bases = &get_base_names(\@files); # <-- uses `pwd` for abs directory
Version : 1.5
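A minimal sketch of producing a base name with the core File::Basename module, including the '.gz' handling described above. file_base is a hypothetical name and this is not the library's get_base_names.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: 'evalign.pl' -> 'evalign', 'seq.fa.gz' -> 'seq'.
sub file_base {
    my ($path) = @_;
    my $name = basename($path);      # drop the directory portion
    $name =~ s/\.gz$//;              # drop a compression suffix first
    $name =~ s/(?<=.)\.[^.]+$//;     # then the last extension (dot files are kept)
    return $name;
}
```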
Example : @all_files=@{&read_file_names_only(\$abs_path_dir_name, ..)};
@all_files=@{&read_file_names_only(\$dir1, '.pl', '.txt')};
@all_files=@{&read_file_names_only(\$dir1, '.', \$dir2, \$dir3, 'e=pl')};
@all_files=@{&read_file_names_only(\$abs_path_dir_name, 'G1_*.txt')};
@all_files=@{&read_file_names_only(\$abs_path_dir_name, \@target_file_names)};
Function : read any file names and REMOVES the '.', '..' and dir entries.
And then put in array. This checks if anything is a real file.
You can use 'txt' as well as '.txt' as extension
You can put multiple file extension (txt, doc, ....)
and multiple dir path (/usr/Perl, /usr/local/Perl....)
It will fetch all files wanted in all the direc specified
It can handle file glob eg)
@all_files=@{&read_file_names_only(\$abs_path_dir_name, 'G1_*.txt')};
for all txt files starting with 'G1_'
Keywords : filename only, filename_only, read_files_only, read files
get_file_names_only, get_files_only, read_files_only
Options : "extension name". If you put , 'pl' as an option, it will show
files only with '.pl' extension.
'-p' for path also included resulting in '/path/path/file.ext'
rather than 'file.ext' in output @array
'-s' for sorting the results
e='xxx' for extension xxx
'.pl' for files extended by '.pl'
'pl' for files extended by 'pl', same as above
D= for dir name input
Usage : @all_files=@{&read_file_names_only(, [extension])};
Version : 2.8
Warning : This does not report '.', '..'
Only file names are reported. Compare with &read_any_dir
extension size should be less than 15 char.
It sorts the results!
Function : reads only extension names. It returns the ext as keys
and the occurrences of them as values of the keys.
Keywords : read_file_ext_only, read_file_ext_names_only, read_ext_names_only,
read_ext_only
Usage : %file_ext=%{&read_file_extension_names_only('.')};
Version : 1.1
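A sketch of the extension-counting idea: extensions become keys and their occurrences become values. count_extensions is a hypothetical helper that works on a list of names instead of reading a directory, and it treats dot files such as '.hidden' as having no extension, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: takes file names, returns a ref. to a hash of
# extension => number of occurrences.
sub count_extensions {
    my @names = @_;
    my %count;
    for my $name (@names) {
        # Require a character other than '.' or '/' before the final dot.
        next unless $name =~ m{[^./]\.([^./]+)$};
        $count{$1}++;
    }
    return \%count;
}
```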
Argument : takes one or more scalar references. ('.', \$path, $path, ... )
Example : @files=@{&read_dir_names_only('n', "s=1", '.')};
Function : reads any dir names and then puts them in an array. If no argument
for the target directory, it opens PWD automatically
You can specify the length of dir names to choose.
Keywords : read_dir_only, get_dir_names, get_dir_names_only, get_subdir_names,
Options : n for names only reading(not the full path) , default is full path
s= for the size of dirs name. If you want all the dir names
with a size of 1 char, s=1
Returns : one ref. of array.
Usage : @all_dirs_list = @{&read_dir_names_only(\$absolute_path_dir_name, ....)};
Version : 3.4
Warning : This does not report '.', '..'
Only file names are reported. Compare with &read_any_dir
Example : will return file.name from /dir/dir/file.name
Function : takes file name portion from long dir/filename
Keywords : get_file_name_only, extract_file_name, take_file_name_only
Options : _ for debugging.
# for debugging.
Usage : $base_portion =${&take_file_name(\'/dir/file.name')};
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
Example : /dir/file.name
=> /dir/
Function : returns the dir portion of long filename.
If file does not have dir portion it returns './'
Keywords : get_file_dir_name, take_file_dir_name, take_file_dir_names
Options : _ for debugging.
# for debugging.
Version : 1.0
Warning : You MUST NOT delete '# options : ..' entry
as it is read by various subroutines.
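A short sketch of splitting a long file name into its dir portion and file name portion with the core File::Basename module. split_dir_and_name is a hypothetical name; fileparse returns the directory with a trailing slash and './' when there is no directory, which matches the behaviour described above.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: '/dir/file.name' -> ('/dir/', 'file.name').
sub split_dir_and_name {
    my ($path) = @_;
    my ($name, $dir) = fileparse($path);
    return ($dir, $name);
}
```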
Argument : takes one or more scalar references. ('.', \$path, $path, ... )
Function : reads any dir names and then puts them in an array.
Returns : one ref. of array.
Usage : @all_dirs_list = @{&get_dir_names_only(\$absolute_path_dir_name, ....)};
Version : 3.0
Warning : This does not report '.', '..'
Only file names are reported. Compare with &read_any_dir
Argument : takes one scalar reference.
Function : read any dir and REMOVES the '.' and '..' entries. And then put in array.
Returns : one ref. of array.
Usage : @file_list = @{&read_any_dir(\$absolute_path_dir_name)};
Version : 1.1
Argument : takes one scalar reference.
Function : read any dir and REMOVES the '.' and '..' entries.
And then put in array.
Returns : one ref. of array. for the files in the given directory.
Usage : @file_list = @{&read_any_dir(\$absolute_path_dir_name)};
Version : 1.2
Argument : takes one or more scalar references.
Function : read any dir and REMOVES the '.' and '..' entries. And then put in array.
Returns : one ref. of array.
Usage : @file_list = @{&read_any_dir(\$absolute_path_dir_name, ....)};
Version : 1.0
Warning : This does not report '.', '..', '#xxxx', ',xxxx', etc. only legitimate
file and dir names are reported.
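A minimal sketch of reading a directory while dropping '.' and '..', using only opendir/readdir. list_dir_sketch is a hypothetical name; the sorting and the die on failure are assumptions of this sketch, not documented behaviour of read_any_dir.

```perl
use strict;
use warnings;

# Hypothetical helper: list the entries of one directory, skipping '.' and '..'.
# Returns an array reference, like the subs documented above.
sub list_dir_sketch {
    my ($dir_ref) = @_;
    my $dir = ref($dir_ref) ? $$dir_ref : $dir_ref;    # accept \$dir or $dir
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return \@entries;
}
```

Usage would follow the same pattern as above: @file_list = @{ list_dir_sketch(\$absolute_path_dir_name) };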
Function : gets the largest 'string' length in values of any one hash
Usage : $largest_str_length_of_values = &max_value_hash(%any_hash);
Version : 1.0
Function : gets the largest 'string' length in values of any one hash
Keywords : get_max_hash_value, get_largest_hash_value, get_max_hash_key_value
get_max_hash_num_value, max_hash_value
Usage : $largest_str_length_of_values = &max_value_hash(%any_hash);
Version : 1.1
Function : gets the largest 'string' length in keys of any one hash
Keywords : largest key length,
Usage : $largest_str_length_of_values = &max_value_hash(%any_hash);
Version : 1.0
Function : gets the smallest 'string' length in values of any one hash
Usage : $small_str_length_of_values = &min_str_value_hash(%any_hash);
Version : 1.0
Function : gets the smallest 'string' length in values of any one hash
Usage : $small_str_length_of_values = &min_str_value_hash(%any_hash);
Version : 1.0
Function : prints fasta format output which is using $mul_factor
$seq is the whole sequence number(largest).
$dir.$mul_factor.fasta can be any output name,
Usage : &fasta_output($dir.$mul_factor.fasta, $whole_seq, *array_ali, *array1);
Version : 1.0
Function : prints fasta format output with specified seq no from whole seq. no.
$seq is the whole sequence number(largest). $out_seq_no is the target
Usage : &fasta_out_seq_no($dir, $out_seq_no, $seq, *array2, *array1);
Version : 1.0
Example : 30-Nov-1995
Function : returns date: $date6d (6 digit format) and
$datec (dd-mmm-yyyy format), Tim's version is 'getdate' in th_lib.pl
Keywords : get_present_date,
Returns : ref of an array for (1-May-1995 and 010595)
Usage : @outformat = &get_date; eg result > (010595 1-May-1995)
Version : 1.1
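The two date formats named above (6 digit and dd-mmm-yyyy) can be produced as in the following sketch. date_formats_sketch is a hypothetical name; using gmtime (UTC) and an optional epoch argument are assumptions made so that the example is reproducible.

```perl
use strict;
use warnings;

my @MONTH_NAMES = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: return a ref to (ddmmyy, d-Mon-yyyy) for an epoch time.
sub date_formats_sketch {
    my ($epoch) = @_;
    $epoch = time() unless defined $epoch;
    my ($day, $month, $year) = (gmtime($epoch))[3, 4, 5];
    my $date6d = sprintf("%02d%02d%02d", $day, $month + 1, $year % 100);
    my $datec  = sprintf("%d-%s-%d", $day, $MONTH_NAMES[$month], $year + 1900);
    return [$date6d, $datec];
}

# 799286400 is 1 May 1995, 00:00:00 UTC
my ($date6d, $datec) = @{ date_formats_sketch(799_286_400) };
print "$date6d $datec\n";    # 010595 1-May-1995
```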
Function : checks the last modification date of the given file and compares it with
the present time. Subtracts the two and returns the actual difference in days.
Keywords : how_old_file, how_old, is_file_older_than_x_days, file_age,
file_age_in_days, if_older_than_x_days,
Returns : the actual days older, so NON-ZERO, otherwise, 0
Usage : if( ${&if_file_older_than_x_days($ARGV[0], $days)} > 0){
Version : 1.3
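A sketch of the age check described above, based on the modification time from stat. file_age_if_older_sketch is a hypothetical name, and it returns a plain number rather than the scalar reference shown in the Usage line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the age of $file in days if it is older than
# $limit days, otherwise 0.
sub file_age_if_older_sketch {
    my ($file, $limit) = @_;
    die "No such file: $file" unless -e $file;
    my $mtime       = (stat($file))[9];           # last modification time
    my $age_in_days = (time() - $mtime) / 86400;  # 86400 seconds per day
    return $age_in_days > $limit ? $age_in_days : 0;
}
```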
Argument : gets one ref. of array.
Example : This is used only with subs which accepts array inputs.
Function : checks if any inputting array is empty or with one element.
Keywords : array_check
Returns : nothing, prints out messages to STDOUT
Usage : &array_chk(\@any_array_to_chk);
Version : 1.0
Argument : one ref. of an array
Function : get string seq identities(a to z). gets an array of strings and outputs an array of % numbers
Returns : one ref. of an array
Usage : @outarray = &seq_comp_percent1(@any_input_string_array);
Function : gets the % id of any two sequences, returns in 100.0% format.
Usage : $id = &get_id_among_2(*charcount1, *charcount2) <- hashes
Version : 1.0
Argument : takes one array reference.
Function : (the same as average_array)
Keywords : get_array_average, av_array, average_array, get_average_array
average_of_array, average_array
Returns : single scalar digit.
Usage : $output = &array_average(\@any_array);
Version : 1.2
Warning : If divided by 0, it will automatically replace it with 1
Argument : takes one array reference.
Function : (the same as array_average)
Returns : single scalar digit.
Usage : $output = &average_array(\@any_array);
Version : 1.0
Warning : If divided by 0, it will automatically replace it with 1
Argument : takes one array reference.
Options : -int to make the resultant numbers shown in integer
Returns : single scalar digit.
Usage : $output = &average_of_array(\@any_array);
Version : 2.0
Warning : If divided by 0, it will automatically replace it with 1
'$item == 0' does not work !!! in the following
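The averaging subs above share two documented behaviours: a zero divisor is replaced by 1, and -int gives an integer result. A minimal sketch of both follows. average_sketch is a hypothetical name, the second argument stands in for the -int option, and skipping undefined or non-numeric elements is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: average of the numeric elements of an array ref.
# A count of 0 is replaced by 1 so that the division never fails.
sub average_sketch {
    my ($array_ref, $as_int) = @_;
    my @nums  = grep { defined($_) && /^-?\d+(?:\.\d+)?$/ } @$array_ref;
    my $count = @nums || 1;
    my $sum   = 0;
    $sum += $_ for @nums;
    my $average = $sum / $count;
    return $as_int ? int($average) : $average;
}

print average_sketch([1, 2, undef, 3]), "\n";    # 2
```

Matching numbers with a pattern rather than with '== 0' avoids treating an empty or non-numeric string as a legitimate zero.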
Example : %in=(1, "13242442", 2, "92479270", 3, "2472937439");
Returns : %out =(1, 2.13242, 2, 5.2702, 3, 1.72937439); <-- something like
numbers. So, undefined array element is not counted
This is more correct.
Usage : %out=%{&hash_average(\%in)}; or
($out1, $out2)=&hash_average(\%in,\%in2);
Version : 1.0
Keywords : get_values_average, get_average_hash_value, get_average_value
Returns : %out =(1, 2.13242, 2, 5.2702, 3, 1.72937439); <-- something like
numbers. So, undefined array element is not counted
This is more correct.
Usage : %out=%{&get_hash_value_average(\%in)}; or
($out1, $out2)=&hash_average(\%in,\%in2);
Version : 1.0
Example : %in =(1, "13242442", 2, "92479270", 3, "2472937439");
%in2=(1, "28472", 2, "23423240", 3, "123412342423439");
%in =(name1, "1,3,2,4,2,4,4,2", name2, "9,2,4,7,9,2,7,0");
Function : gets the min, max, av, sum for the whole values of ALL the
hashes put in. (grand statistics)
Returns : normal array of ($min, $max, $sum, $av)
Example out:> | min max sum av
-----------------------------------
of the whole | 0 9 110 6
Usage : %out=%{&hash_average(\%in, \%in2,..)};
Version : 1.0
Function : accepts ref of array, scalar and normal digits to
find the min. Only gets numbers. If you put something
like 'H333333', it gets digits '333333' only and returns it.
this uses RECURSION.
Usage : $min = &min (37, 24, 3,1,5, \@array, @array2, \$arr_ref);
Version : 1.0
Function : accepts ref of array, scalar and normal digits to
find the min. Only gets numbers. If you put something
like 'H333333', it gets digits '333333' only and returns it.
this uses RECURSION.
Usage : $max = &max (37, 24, 3,1,5, \@array, @array2, \$arr_ref);
Version : 1.0
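The recursive approach described for min and max (plain numbers, array refs and refs to refs mixed in one call) can be sketched as follows. min_sketch is a hypothetical name; accepting a leading minus sign and a decimal part is an assumption, since the box only says that digits are taken.

```perl
use strict;
use warnings;

# Hypothetical helper: smallest number among scalars, array refs and
# scalar refs, descending into references recursively.
sub min_sketch {
    my $min;
    for my $arg (@_) {
        my $candidate;
        my $type = ref($arg);
        if ($type eq 'ARRAY') {
            $candidate = min_sketch(@$arg);
        }
        elsif ($type eq 'SCALAR' || $type eq 'REF') {
            $candidate = min_sketch($$arg);
        }
        elsif (defined($arg) && $arg =~ /(-?\d+(?:\.\d+)?)/) {
            $candidate = $1;    # 'H333333' contributes 333333
        }
        next unless defined $candidate;
        $min = $candidate if !defined($min) || $candidate < $min;
    }
    return $min;
}
```

A max version would differ only in the final comparison.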
Argument : gets one reference of an array of strings.
Function : get_longest_str_size in an array. eg. get ABCDE among (A, CAB, CDE, ABCDE)
When hash is given it processes the values of it.
Keywords : get_the_largest_string_size, get_largest_string_size,
get_largest_str_size, largest_string_size, get_largest_string_size_hash
get_long_str_size, get_longest_string_size, lonest_string_size
Usage : $long_str_size = ${&get_long_str_size (\@any_array_of_string)};
Version : 1.2
Argument : gets one reference of an array of strings.
Function : get_shortest_str_size in an array. eg. get A among (A, CAB, CDE, ABCDE)
Keywords : get_short_str_size, get_short_string_size, shortest_string_size,
Usage : $short_str_size = &get_short_str_size (\@any_array_of_string);
Version : 1.0
Warning : once debugged. 1st May/95
Argument : gets two references of hashes of chars and their occurrences.
Example : %hash1=('A', 30, 'B', 99, 'C', 15 .....)
Function : gets the % id of any two sequences
Usage : $id = &get_id_among_2(\%charcount1, \%charcount2) <- hashes
Version : 1.0
Function : extract only numbers(including negatives) from a string and put into an array
Usage : @my_outarray = &extract_num_to_array($any_input_string);
Version : 1.0
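Extracting numbers, including negatives, from a string can be done with one global match. The sketch below is an illustration only; extract_numbers_sketch is a hypothetical name and the handling of decimal parts is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return every number found in a string, in order.
sub extract_numbers_sketch {
    my ($string) = @_;
    my @numbers = $string =~ /(-?\d+(?:\.\d+)?)/g;
    return @numbers;
}

my @found = extract_numbers_sketch('a12b-3.5c7');
print "@found\n";    # 12 -3.5 7
```

One caveat of this pattern: a range such as '10-70' yields 10 and -70, because the hyphen is read as a minus sign.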
Argument : ref. of an array of numbers.
Function : sum of all the elements of an array .
Keywords : get_array_sum get_sum_array, get sum of array
Returns : a ref. of a scalar.
Usage : $out = ${&sum_array(\@anyarray)};
Argument : ref. of an array of numbers.
Function : sum of all the elements of an array .
Options : -int for integerised output.
Returns : a ref. of a scalar.
Usage : $out = ${&sum_of_array(\@anyarray)};
Version : 1.0
Argument : ref. of an array of numbers.
Function : sum of all the elements of an array .
Returns : a ref. of a scalar.
Usage : $out = ${&sum_array(\@anyarray)};
Version : 1.0
Example : %hashinput= ( name1, '12..3e',
name2, '...234');
$result = 1+2+3+2+3+4 = 15 (from above example)
Function : sum of all the numbers in values of a hash
Keywords : sum_hash_string_values, get_sum_hash_string_values, get_hash_value_sum
Usage : $out = &sum_hash_values_of_string(\%anyhash);
Version : 1.1
Warning : It only gets digits in the input strings and sums them up.
Example : %hashinput= ( name1, '12..3e',
name2, '...234');
$result = 1+2+3+2+3+4 = 15 (from above example)
Function : sum of all the numbers in values of a hash
Keywords : sum_hash_number_values, get_sum_hash_values, get_hash_value_sum
Usage : $out = &sum_hash_values(%anyhash);
Version : 1.0
Warning : It only gets digits in the input strings and sums them up.
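The digit-by-digit sum in the Example above (1+2+3+2+3+4 = 15) can be reproduced with the sketch below. sum_digits_in_values_sketch is a hypothetical name and takes a hash reference.

```perl
use strict;
use warnings;

# Hypothetical helper: add up every single digit found in the values of a hash.
sub sum_digits_in_values_sketch {
    my ($hash_ref) = @_;
    my $sum = 0;
    for my $value (values %$hash_ref) {
        next unless defined $value;
        $sum += $_ for $value =~ /\d/g;    # each digit is summed on its own
    }
    return $sum;
}

my %hashinput = (name1 => '12..3e', name2 => '...234');
print sum_digits_in_values_sketch(\%hashinput), "\n";    # 15
```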
Function : detects keyboard input without reading it
Returns :
You should check out the Frequently Asked Questions list in
comp.unix.* for things like this: the answer is
essentially the same.
It's very system dependent. Here's one solution that
works on BSD systems:
Version : 1.0
Function : gives rounded integer numbers. 9.5 will be 10, 9.4 will be 9
Usage : @output=@{&round_numbers(\@input_numbs)};
or $output=${&round_numbers(\$input_numbs)};
Version : 1.0
Example : given num array( 1.33333, 3.555242424, 0.2342324, 4.9234723747)
>>> (1.33, 3.56, 0.23, 4.92 )
Function : gives trimmed numbers (not rounded)
Usage : @output=@{&trim_numbers(\@input_numbs, \$size_of_posi)};
Version : 1.0
Warning : If you put '1' with trimming value of 2 it will be '1.00'
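A sketch of the two number formats described above: rounding to the nearest integer and keeping a fixed number of decimal places. Both sub names are hypothetical. Note that sprintf rounds the last digit, which agrees with the 3.56 in the Example above even though the Function line says 'not rounded'; which of the two the library really does is not settled here.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: 9.5 becomes 10, 9.4 becomes 9.
sub round_half_up_sketch {
    my ($number) = @_;
    return floor($number + 0.5);
}

# Hypothetical helper: keep $places decimals; '1' with 2 places gives '1.00'.
sub fixed_decimals_sketch {
    my ($number, $places) = @_;
    return sprintf("%.${places}f", $number);
}

my @trimmed = map { fixed_decimals_sketch($_, 2) } (1.33333, 3.555242424, 0.2342324, 4.9234723747);
print "@trimmed\n";    # 1.33 3.56 0.23 4.92
```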
Argument : numerical arrays
Function : gets the smallest element of any array of numbers.
Returns : one or more ref. for scalar numbers.
Usage : ($out1, $out2)=@{&min_elem_array(\@array1, \@array2)};
($out1) =${&min_elem_array(\@array1) };
Version : 1.0
Argument : numerical arrays
Function : gets the largest element of any array of numbers.
Returns : one or more ref. for scalar numbers.
Usage : ($out1, $out2)=@{&max_elem_array(\@array1, \@array2)};
($out1) =${&max_elem_array(\@array1) };
Version : 1.0
Argument : numerical arrays
Function : gets the largest string length of element of any array of numbers.
Keywords : largest string length of array
Returns : one or more ref. for scalar numbers.
Usage : ($out1, $out2)=@{&max_elem_array(\@array1, \@array2)};
($out1) =${&max_elem_array(\@array1) };
Version : 1.0
Argument : numerical arrays
Function : gets the largest string length of element of any array of numbers.
Keywords : shortest string length of array
Returns : one or more ref. for scalar numbers.
Usage : ($out1, $out2)=@{&max_elem_array(\@array1, \@array2)};
($out1) =${&max_elem_array(\@array1) };
Version : 1.0
Function : If strings are given, it gets the largest string elem(by leng)
If numbers are given, it gets the largest number elem
It automatically checks if string is given
Keywords : get_largest_value, get_biggest_value,
get_maximum_element, get_largest_number,
get_largest_number_element, get_longest_element,
get_longest_string
Options : _ for debugging.
# for debugging.
s for string input (as the second input argument!)
Usage : $max=${&get_largest_element(\@array_input)};
Version : 1.3
Function : sums up multiplied items of two arrays .
one to one multiplication(elem 1 of array 1 x elem 1 of array2)
Usage : $out = &sum_x_mul_y_arrays(*array1,*array2);
Version : 1.1
Function : synonym of corelation_coefficient
Usage : $cc = &cc(\@array_not_hash1, \@array_not_hash2);
Version : 1.0
Warning : uses references for ARRAY
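For reference, a correlation coefficient over two array references can be computed as in the sketch below. It assumes the Pearson coefficient is meant, which the box does not state; pearson_sketch is a hypothetical name. The sum of x*y products inside the loop is the quantity sum_x_mul_y_arrays describes.

```perl
use strict;
use warnings;

# Hypothetical helper: Pearson correlation coefficient of two number arrays.
sub pearson_sketch {
    my ($x, $y) = @_;
    my $n = @$x;
    die "Arrays must have the same non-zero length" unless $n && $n == @$y;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];    # one to one multiplication
    }
    my $denominator = sqrt(($n * $sxx - $sx ** 2) * ($n * $syy - $sy ** 2));
    return $denominator ? ($n * $sxy - $sx * $sy) / $denominator : 0;
}
```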
Argument : array references are accepted. outputs scalar single val.
Keywords : standard deviation, get_standard_deviation,
standard_deviation, get_SD, get_sd, stdev
Returns : a ref. of a scalar
Usage : $sd=${&sd(\@array_of_numbers)};
Version : 1.2
Argument : ref. for an array.
Function : gets standard error of any given array
Keywords : standard error, get_standard_error, sterr
Usage : $se=${&se(\@array_of_numbers)};
Version : 1.1
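A sketch of the standard deviation and standard error calculations named above. It assumes the sample form (division by n-1), which the boxes do not specify; sd_sketch and se_sketch are hypothetical names and return plain numbers rather than scalar references.

```perl
use strict;
use warnings;

# Hypothetical helper: sample standard deviation (n - 1 in the divisor).
sub sd_sketch {
    my ($numbers) = @_;
    my $n = @$numbers;
    return 0 if $n < 2;
    my $mean = 0;
    $mean += $_ / $n for @$numbers;
    my $sum_of_squares = 0;
    $sum_of_squares += ($_ - $mean) ** 2 for @$numbers;
    return sqrt($sum_of_squares / ($n - 1));
}

# Hypothetical helper: standard error = sd / sqrt(n).
sub se_sketch {
    my ($numbers) = @_;
    my $n = @$numbers;
    return $n ? sd_sketch($numbers) / sqrt($n) : 0;
}
```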
Function :
outs line numbers with lines
Usage : To randomize th_lib.pl just type &random_lines(300,500,"th_lib.pl");
&random_lines(300, 50, "th_lib.pl"); <-- to get 300 lines
from 50 numbers
Version : 1.0
Example : in signature rotation or FVWM rc file menu color rotation.
Function : randomly pick any num of pairs of hash elements.
outs line numbers with lines
Default pick number is 1.
Keywords : choose_random_hash_pairs
Returns : ARRAY ref not HASH ref
Usage : @array = @{&pick_random_hash_pairs(\%hash1, \$xx)};
Version : 1.3
Example : @array=@{&pick_random_files(\@files, \$num_of_pick)};
Function : randomly pick any num of files given.
Keywords : choose_random_files pick_files_randomly
Returns : ARRAY ref not HASH ref
Usage : @array = @{&pick_random_files(\@files, \$num_of_pick)};
Version : 1.0
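Random picking without repeats can be sketched with shuffle from the core List::Util module. pick_random_sketch is a hypothetical name; it takes the count as a plain number, defaulting to 1 as the box for pick_random_hash_pairs describes.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: pick $count distinct items at random from an array ref.
sub pick_random_sketch {
    my ($items, $count) = @_;
    $count = 1 unless defined $count;
    my @shuffled = shuffle(@$items);
    $count = @shuffled if $count > @shuffled;    # never ask for more than exist
    return [ @shuffled[0 .. $count - 1] ];
}
```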
Example : Following will produce (A K C);
@array1= qw( A B K B B C);
@array2= qw( B E D);
@subs = @{&substract_array(\@array1, \@array2)};
Function : removes from the first input array any occurrences of
the elements found in the second input array.
Keywords : array_subtract, substract_array, ary1_minus_ary2
Usage : @subs = @{&substract_array(\@array1, \@array2)};
Version : 1.6
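The Example above, (A B K B B C) minus (B E D) giving (A K C), can be reproduced with a lookup hash and grep. subtract_array_sketch is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: elements of the first array that are not in the second.
sub subtract_array_sketch {
    my ($from, $remove) = @_;
    my %drop = map { $_ => 1 } @$remove;
    return [ grep { !$drop{$_} } @$from ];
}

my @array1 = qw(A B K B B C);
my @array2 = qw(B E D);
my @subs   = @{ subtract_array_sketch(\@array1, \@array2) };
print "@subs\n";    # A K C
```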
Function : superposes hash keys and values onto another hash. %target
is the superposing hash (new ones will have the values of
this target hash). For example, if you superpose
(1, 123, 2, 343)
to (1, 111, 2, 2222, 3, 3333), you will get
(1, 123, 2, 343, 3, 3333) as the result.
The template provides blank key entries.
Usage : %output = %{&superpose_hash(\%template, \%target)};
Version : 1.0
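The superposing example above amounts to a hash merge in which the second hash wins on shared keys. A minimal sketch, with the hypothetical name superpose_sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: copy %$base, then let %$overlay overwrite shared keys.
sub superpose_sketch {
    my ($base, $overlay) = @_;
    my %out = (%$base, %$overlay);
    return \%out;
}

my %template = (1, 111, 2, 2222, 3, 3333);
my %target   = (1, 123, 2, 343);
my %output   = %{ superpose_sketch(\%template, \%target) };
print join(', ', map { "$_ => $output{$_}" } sort keys %output), "\n";
# 1 => 123, 2 => 343, 3 => 3333
```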
Argument : accepts only two references of hashes
Example : %hashout= %hash1 - %hash2, ==> (4,4)=(2,2, 4,4) - (2,2)
Returns : a ref of a hash.
Usage : %output = &hash_common($ref1, $ref2);
Version : 1.0
Warning : NOT working
Argument : one or more hash ref.
Example : If %input was
(1,1, 2,1, 3,1);
The values are the same, so the last key value (3 1) will
be the result.
If %input was
(1,1, 2,1, 3,1, 4,2, 5,2)
result=(3 1, 5 2)
Function : removes the duplicate values of any hashes
Keywords : remove_dupplicate_values_in_hash, remove_duplicate_values,
remov_hash_dup, remove_duplication_in_hash
Returns : one or more hash ref.
Usage : %out=%{&remove_dup_in_hash(\%input_hash)};
Version : 1.0
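The duplicate-value removal in the Example above keeps the last key for each value. Because Perl hashes have no fixed order, the sketch below visits keys in sorted order to make 'last' well defined; that ordering and the name remove_duplicate_values_sketch are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: keep one key per distinct value (the last in sort order).
sub remove_duplicate_values_sketch {
    my ($hash_ref) = @_;
    my %key_for_value;
    # Later keys overwrite earlier ones that carry the same value.
    $key_for_value{ $hash_ref->{$_} } = $_ for sort keys %$hash_ref;
    my %out = reverse %key_for_value;    # back to key => value
    return \%out;
}

my %input = (1, 1, 2, 1, 3, 1, 4, 2, 5, 2);
my %out   = %{ remove_duplicate_values_sketch(\%input) };
print join(', ', map { "$_ $out{$_}" } sort keys %out), "\n";    # 3 1, 5 2
```

The same 'reverse' on a hash is also the usual way to exchange keys and values.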
Argument : one or more hash ref.
Function : exchanges the value and key of any hashes
Keywords : invert_hash, inverse_hash
Returns : one or more hash ref.
Usage : %out=%{&reverse_hash(\%input_hash)};
Version : 1.0
Warning : Takes ALIGNED sequences.
Returns : the VALUES OF THE FIRST HASH which occur in later hashes
are returned
Usage : %hash1_value = %{&hash_common(\%hash1, \%hash2,...)};
Version : 1.0
Returns : the VALUES OF THE FIRST HASH which occur in later hashes
are returned
Usage : %hash1_value = %{&hash_common_by_keys(\%hash1, \%hash2,...)};
Version : 1.0
Example : %hashout= %hash1 - %hash2, ==> (4,4)=(2,2, 4,4) - (2,2)
Function : removes overlapping entries in hashes.
Usage : %output = &hash_catenate(*hash1, *hash2);
Version : 1.0
Warning : surely working, This grep version is faster than for and defined loop.
Argument : SCALAR or ARRAY refs. and delimiter ('/', '.', '-'.....)
delimiter can be multiple characters => '#$%/=.'
default delimiter is space ' ';
Example : @new_lines=shift_word_recursively(\@lines, '/-', 2); to chop lines
off two words with the two delimiters of '/' and '-'.
/jong1/perl-jong2/perl-jong3 will become /perl-jong2/perl-jong3
/bin/-kkk/-jjj/-jj will become /-kkk/-jjj/-jj
@out=@{&shift_word_recursively($testline, '/-', 2)};
You can use perl regexp patterns for $delimiter as it is directly
used in a pattern matching in the sub. So, you can use '\W'
Function : shift lines word by word. This needs delimiter like '/' or '.'
and stores the resulting arrays. This is to get all the possible
directories.
For example, with /nfs/A Biomatic /perl/temp/here input, you get
( /A Biomatic /perl/temp/here, /perl/temp/here ,
temp/here, /here, ) in an array.
Usage : @new_lines=shift_word_recursively(\@lines, '/'); or
@new_lines=shift_word_recursively(\@lines, '\W'); or
@new_lines=shift_word_recursively(\@lines, 'a-zA-Z'); or
@new_lines=shift_word_recursively(\@lines, '/', 2); <--- for multiple chop unit
or $new_line = shift_word_recursively(\$line, '.'); <--- for scalar input.
Version : 1.0
Argument : SCALAR or ARRAY refs. and delimiter ('/', '.', '-'.....)
delimiter can be multiple characters => '#$%/=.'
default delimiter is space ' ';
Example : @new_lines=shift_word(\@lines, '/-', 2); to shift off lines two words
with the two delimiters of '/' and '-'.
/jong1/perl-jong2/perl-jong3 will become /jong1/perl-jong2
/bin/-kkk/-jjj/-jj will become /jong1/perl-jong2 by
@out=@{&shift_word($testline, '/-', 2)};
You can use perl regexp patterns for $delimiter as it is directly
used in a pattern matching in the sub. So, you can use '\W'
Function : shift lines word by word. This needs delimiter like '/' or '.'
Usage : @new_lines=shift_word(\@lines, '/'); or
@new_lines=shift_word(\@lines, '\W'); or
@new_lines=shift_word(\@lines, 'a-zA-Z'); or
@new_lines=shift_word(\@lines, '/', 2); <--- for multiple chop unit
or $new_line = shift_word(\$line, '.'); <--- for scalar input.
Version : 1.0
Argument : SCALAR or ARRAY refs. and delimiter ('/', '.', '-'.....)
delimiter can be multiple characters => '#$%/=.'
default delimiter is space ' ';
Example : @new_lines=chop_word(\@lines, '/-', 2); to chop off lines two words
with the two delimiters of '/' and '-'.
/jong1/perl-jong2/perl-jong3 will become /jong1/perl-jong2
/bin/-kkk/-jjj/-jj will become /jong1/perl-jong2 by
@out=@{&chop_word($testline, '/-', 2)};
You can use perl regexp patterns for $delimiter as it is directly
used in a pattern matching in the sub. So, you can use '\W'
Function : chop lines word by word. This needs delimiter like '/' or '.'
Keywords : chop_word_recursively, remove_word, chop_word_one_by_one
Options : -w, w, Word, etc, for getting the chopped off word(s) rather
than the original lines minus the word.
Usage : @new_lines=chop_word(\@lines, '/'); or
@new_lines=chop_word(\@lines, '\W'); or
@new_lines=chop_word(\@lines, 'a-zA-Z'); or
@new_lines=chop_word(\@lines, '/', 2); <--- for multiple chop unit
or $new_line = chop_word(\$line, '.'); <--- for scalar input.
Version : 2.0
Warning : The returning value is not the chopped off word.
Argument : two references. The first should be an array ref. The 2nd can be either
scalar or array reference.
Function : returns ref. of an array for a list of non-repetitive entry.
Returns : a ref. of an array.
Usage : @out=@{&push_if_not_already(\@mother_array, \@adding_array )};
@out=@{&push_if_not_already(\@mother_array, \$adding_scalar)};
Version : 1.0
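A sketch of building a non-repetitive list from an array reference plus either an array or a scalar reference, as the Argument line above describes. push_unique_sketch is a hypothetical name; leaving the input array unmodified is a choice of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return a ref to @$target plus any new items not yet in it.
sub push_unique_sketch {
    my ($target, $new) = @_;
    my @candidates = ref($new) eq 'ARRAY'  ? @$new
                   : ref($new) eq 'SCALAR' ? ($$new)
                   :                         ($new);
    my %seen = map { $_ => 1 } @$target;
    my @out  = @$target;
    for my $item (@candidates) {
        push @out, $item unless $seen{$item}++;    # skip anything already present
    }
    return \@out;
}
```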
Function : insert lines anywhere in any txt files. Without any
position options(Before, After), it attaches the line
Keywords : insert_text, insert_lines, insert_something,
attach_lines_in_text, attach_lines, insert_text_lines
Options :
$adding_line= by a=
$pattern_match_line= by p=
$option_before_or_after= by o=
Usage : &insert_lines_anywhere(\@files, \$inst_str,'after', \@match_str);
Version : 1.4
Warning : Case Insensitive by default.
Argument : NONE
Example : my(@default_env_dirs) = @{&get_all_dirs_from_ENV}; in handle_arguments
Function : extracts all the directories from %ENV setting.
Options : None
Returns : a ref. of an array of directories.
Usage : my(@default_env_dirs) = @{&get_all_dirs_from_ENV};
Version : 1.0
Warning : produces repetitive paths (ie, can output identical path several times)
Argument : NONE
Example : my(@default_env_dirs) = @{&get_path_dirs_from_ENV}; in handle_arguments
Function : extracts path directories from %ENV setting.
Options : None
Returns : a ref. of an array of directories.
Usage : my(@default_env_dirs) = @{&get_path_dirs_from_ENV};
Version : 1.0
Warning : Replaces '.' to $pwd.
Argument : one single ref. (\@input_args);
Function : Sub argument handling for opening files with options. General
form of 'handle_arguments_xxxx', while xxxx can be files, hashes, arrays,,,,
Options : None yet, extendable by adding refs. of something.
Returns : an array of refs for file names, hashes, arrays and the option string
Usage : my(@in)=&handle_arguments_old(\@input_args); Do not dereference it.
Version : 1.0
Argument : 2 references of file name or 2 file names.
Author : Larry Wall, Jong
Example : mv("mv.pl", *STDOUT); # This will print mv.pl contents to your screen.
Function : moves files fast, replacement of 'system("mv xxx xxxx"); '
Keywords : move files fast. mv_file, mv_files, move_files, move_file
Usage : &mv( \$srcFile, \$dstFile); or &mv( $srcFile, $dstFile);
or &mv(FILEHANDLE1, FILEHANDLE2), or &mv(FILEHANDLE1, $output)
Version : 1.4
Warning : 27 times slower than 'mv' at prompt. using system is 32 times slower
Argument : 2 references of file name or 2 file names.
Author : Larry Wall, Jong
Example : cp("cp.pl", *STDOUT); # This will print cp.pl contents to your screen.
Function : copies files fast, replacement of 'system("cp xxx xxxx"); '
Keywords : copy files fast. cp_file, cp_files, copy_files, copy_file
Usage : &cp( \$srcFile, \$dstFile); or &cp( $srcFile, $dstFile);
or &cp(FILEHANDLE1, FILEHANDLE2), or &cp(FILEHANDLE1, $output)
Version : 1.4
Warning : 27 times slower than 'cp' at prompt. using system is 32 times slower
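For comparison, the core File::Copy module provides copy and move functions that avoid calling system. The wrappers below are a sketch only: cp_sketch and mv_sketch are hypothetical names, and they are not the implementation of the cp and mv subs documented above.

```perl
use strict;
use warnings;
use File::Copy qw(copy move);

# Hypothetical wrappers: accept a plain file name or a reference to one,
# as the Usage lines above allow, and die with a message on failure.
sub cp_sketch {
    my ($src, $dst) = map { ref($_) eq 'SCALAR' ? $$_ : $_ } @_;
    copy($src, $dst) or die "copy $src -> $dst failed: $!";
    return 1;
}

sub mv_sketch {
    my ($src, $dst) = map { ref($_) eq 'SCALAR' ? $$_ : $_ } @_;
    move($src, $dst) or die "move $src -> $dst failed: $!";
    return 1;
}
```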
Argument : one or more files.
Example : condense_script.pl th_lib.pl th-test.pl xxx xxxx ....
Function : makes compact size subroutines of developed perl codes
Options : None
Returns : xxxxxx.pl.out but sub routines condensed.
Usage : condense_script.pl xxxxxx.pl
Version : 1.0
Warning : The only condition is that you need to have 'sub xxxxx' from the
first column and the last '}' should be again at the first column
This is due to the pattern matching for any sub routines.
Argument : None
Function : initializes all developing codes by putting in the Header section information
Returns : None
Usage : &initialze_code;
Version : 1.0
Warning : This writes over the program you run (itself). temp file is ini_code.temp
Argument : uses @ARGV
Example : &parse_arguments(1);
@files=@{&parse_arguments(1)};
Function : Parse and assign any types of arguments on prompt in UNIX to
the various variables inside of the running program.
This is more visual than getopt and easier.
just change the option table_example below for your own variable
settings. This program reads itself and parses the arguments
according to the setting you made in this subroutine or
option table in anywhere in the program.
It also imports the ENV variables to your program.
Keywords : pass_arguments
Options : '0' to specify that there is no argument to sub, use
&parse_arguments(0);
parse_arguments itself does not have any specific option.
'#' at prompt will make a var $debug set to 1. This is to
print out all the print lines to make debugging easier.
'e=xxxx' for filtering input files by extension xxxx
Returns : Filenames in a reference of array
and input files in an array (file1, file2)=@{&parse_arguments};
Usage : &parse_arguments; or (file1, file2)=@{&parse_arguments};
Version : 2.1
Warning : HASH and ARRAY mustn't be like = (1, 2,3) or (1,2 ,3)
Argument : None.
Example : When you want to set 'a' char to a variable called '$dummy' in
the program, you put a head box commented line
'# $dummy becomes a by -a '
Then, the parse_arguments and this sub routine will read the head
box and assigns 'a' to $dummy IF you put an argument of '-a' in
the prompt.
Function : Assigns the values set in head box to the variables used in
the programs according to the values given at prompt.
This produces global values.
When numbers are given at prompt, they go to @num_opt
global variable. %vars global option will be made
Options : '#' at prompt will make a var $debug set to 1. This is to
print out all the print lines to make debugging easier.
Returns : Some globally used variables according to prompt options.
@num_opt,
Usage : &assign_options_to_variables(\$input_line);
Version : 2.8
Warning : This is a global vars generator!!!
Argument : One or None. If you give an argu. it should be a ref. of an ARRAY
or a filename, or ref. of a filename.
If no arg is given, it reads SELF, ie. the program itself.
Example : Output is something like
('Title', 'read_head_box', 'Tips', 'Use to parse doc', ...)
Function : Reads the introductory header box(the one you see on top of sub routines of
Jong's programs.). Make a hash(associative array) to put entries
and descriptions of the items. Newlines '\n' are attached to the hash
values, so that later write_head_box just sorts Title to the top
and prints without much calculation.
This is similar to read_head_box, but
This has one long straight string as value(no \n inside)
There are two types of ending line one is Jong's #---------- ...
the other is Astrid's #*************** ...
Keywords : open_head_box, open_headbox, read_headbox
Options : 'b' for remove blank lines. This will remove all the entries
with no descriptions
Returns : A hash ref.
Usage : %entries = %{&read_head_box([\$file_to_read, \@BOXED ] )};
Version : 2.7
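The header box format itself ('# Entry : text', with unnamed following lines attached to the previous entry, closed by a '#-----' or '#*****' rule) can be parsed roughly as in the sketch below. parse_head_box_sketch is a hypothetical name; joining continuation lines with a single space and requiring entry names to start with a capital letter are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical parser: turn header-box lines into a hash of entry => text.
sub parse_head_box_sketch {
    my @lines = @_;
    my (%entry, $current);
    for my $line (@lines) {
        # A rule of '-' or '*' closes the box once some entries were read.
        last if %entry && $line =~ /^#\s*(?:-{10,}|\*{10,})/;
        (my $text = $line) =~ s/^#\s?//;    # drop the '#' leader, if any
        $text =~ s/\s+$//;
        if ($text =~ /^\s*([A-Z]\w*)\s*:(?:\s+(.*))?$/) {
            $current = $1;
            $entry{$current} = defined($2) ? $2 : '';
        }
        elsif (defined($current) && $text =~ /\S/) {
            # A line without an entry name belongs to the previous entry.
            $text =~ s/^\s+//;
            $entry{$current} .= ($entry{$current} eq '' ? '' : ' ') . $text;
        }
    }
    return \%entry;
}
```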
Function : Reads the header box(the one you see on top of sub routines of
Jong's programs.)
There are two types of ending line one is Jong's #---------- ...
the other is Astrid's #*************** ...
Usage : %entries = %{&read_first_head_box(\$file_to_read )};
Version : 2.0
Argument : one or more filenames
Example : @hashes = @{&read_head_boxes(@ARGV)};
$num_of_sub = @hashes;
print "\n Number of subs was $num_of_sub\n";
Function : Reads the introductory header box(the one you see on top of sub routines of
Jong's programs.). Make a hash(associative array) to put entries
and descriptions of the items.
Returns : A hash ref.
Usage : %entries = %{&read_head_box(\$file_to_read, ,,, )};
Version : 1.2
Argument : One or None. If you give an argu. it should be a ref. of an ARRAY
or a filename, or ref. of a filename.
If no arg is given, it reads SELF, ie. the program itself.
Example : Output is something like
('Title', 'read_head_box', 'Tips', 'Use to parse doc', ...)
Function : Reads the header box(the one you see on top of sub routines of
Jong's programs.). This is similar to read_head_box, but
This has one long straight string as value(no \n inside)
There are two types of ending line one is Jong's #---------- ...
the other is Astrid's #*************** ...
Options : 'b' for remove blank lines. This will remove all the entries
with no descriptions
Returns : A hash ref.
Usage : %entries = %{&read_head_box(\$file_to_read )};
Version : 1.5
Function : Reads the header boxes(the one you see on top of sub routines of
Jong's programs.)
There are two types of ending line one is Jong's #---------- ...
the other is Astrid's #*************** ...
Usage : %entries = %{&read_all_head_box(\$file_to_read )};
Version : 1.0
Argument : a filename
Example : correct_head_box.pl Bio.pl
Function : Puts the headbox into the right, updated format. The most
up-to-date headbox format is this very headbox. So, to
change all other headbox formats, change this first.
Usage : just type correct_head_box.pl with a file name.
Version : 1.1
Function : This reads correct_head_box only.
Keywords : read_update_head_box, read update headbox
Options : v for verbose message printing.
Version : 1.0
Function : gets a hash ref. and writes the head box for a subroutine
Keywords : write_headbox
Options : v for verbose representation. This will print boxes on STDOUT
n for no '#' leader.
Version : 2.2
Example : &show_default_help2; &show_default_help2(\$arg_num_limit); &show_default_help2( '3' );
1 scalar digit for the minimum number of arg (optional),
or its ref. If this is defined, the program exits,
reporting the minimum number of arguments.
Function : Prints usage information and others when invoked. You need to have
sections like this explanation box in your perl code. When invoked,
show_default_help routine reads the running perl code (SELF READING) and
displays what you have typed in this box.
After one entry names like # Function :, the following lines without
entry name (like this very line) are attached to the previous entry.
In this example, to # Function : entry.
Keywords : default_help
Returns : formatted information
Usage : &show_default_help2; usually with 'parse_arguments' sub.
Version : 3.4
Warning : this uses format and references
Example : &show_default_help2; &show_default_help2(\$arg_num_limit); &show_default_help2( '3' );
1 scalar digit for the minimum number of arg (optional),
or its ref. If this is defined, the program exits,
reporting the minimum number of arguments.
Function : Prints usage information and others when invoked. You need to have
sections like this explanation box in your perl code. When invoked,
show_default_help routine reads the running perl code (SELF READING) and
displays what you have typed in this box.
After one entry names like # Function :, the following lines without
entry name (like this very line) are attached to the previous entry.
In this example, to # Function : entry.
Keywords : default_help
Returns : formatted information
Usage : &show_default_help2; usually with 'parse_arguments' sub.
Version : 3.4
Warning : USE show_default_help, This is not action oriented
Argument : 1 scalar digit for the minimum number of arg (optional),
or its ref. If this is defined, the program exits,
reporting the minimum number of arguments.
Example : &show_default_help; &show_default_help(\$arg_num_limit); &show_default_help( '3' );
Function : prints usage information and others when invoked. You need to have
sections like this explanation box in your perl code. When invoked,
show_default_help routine reads the running perl code (self reading) and
displays what you have typed in this box.
After one entry names like # Function :, the following lines without
entry name (like this very line) are attached to the previous entry.
In this example, to # Function : entry.
Returns : formatted information
Usage : &show_default_help; usually with 'parse_arguments' sub.
Version : 2.0
Warning : this uses format and references
Argument : many refs for hashes (one for bottom, one for top, etc.; the top hash is usually
used to denote certain calculations or results of the bottom one)
Example : If there are 3 hashes output will be; (in the order of \%hash3, \%hash2, \%hash1)
>> 1st Hash >> 2nd Hash >> 3rd Hash
Name1 THIS-IS- Name123 eHHHHHHH Name123 12222223
You will get;
Name1 THIS-IS-
Name123 eHHHHHHH
Name123 12222223
Example of ( no option, DEFAULT ) # Example of ('i' or 'I' option,
INTERLACE )
6taa ----ATPADWRSQSIY # 6taa ------ATPADWRSQSIY
2aaa ------LSAASWRTQS # 6taa ------CCHHHHCCCCEE
1cdg APDTSVSNKQNFSTDV # 6taa ------563640130000
6taa ------CCHHHHCCCC # 2aaa ------LSAASWRTQSIY
2aaa ------CCHHHHCCCC # 2aaa ------CCHHHHCCCCEE
1cdg CCCCCCCCCCCCCCCC # 2aaa ------271760131000
6taa ------5636401300 # 1cdg APDTSVSNKQNFSTDVIY
2aaa ------2717601310 # 1cdg CCCCCCCCCCCCCCCCEE
1cdg 6752327236000000 # 1cdg 675232723600000000
Example of('s' or 'S' option,SORT) # Example of ('o' or 'O' option,
ORDERED by input hashes )
1cdg APDTSVSNKQNFSTDV # 6taa ------ATPADWRSQSIY
2aaa ------LSAASWRTQS # 2aaa ------LSAASWRTQSIY
6taa ------ATPADWRSQS # 1cdg APDTSVSNKQNFSTDVIY
1cdg CCCCCCCCCCCCCCCC # 6taa ------CCHHHHCCCCEE
2aaa ------CCHHHHCCCC # 2aaa ------CCHHHHCCCCEE
6taa ------CCHHHHCCCC # 1cdg CCCCCCCCCCCCCCCCEE
1cdg 6752327236000000 # 6taa ------563640130000
2aaa ------2717601310 # 2aaa ------271760131000
6taa ------5636401300 # 1cdg 675232723600000000
Function : gets many refs for one scalar or hashes and prints
the contents in lines of \$block_leng(the only scalar ref. given) char.
Keywords : print_sequence_in_block print_alignment_in_block
Options : 'o' or 'O' => ordered hash print,
'n' or'N' => no space between blocks.
's' or 'S' => printout sorted by seq names.
'i' or 'I' => interlaced print.(this requires identical names in hashes)
'v' or 'V' => show sequence start number at each line
'g' or 'G' => with gap chars between aa residues
l= for block length. Default is 60 char
t= for specifying the length of seq names shown
t for truncating seq names shown to 12 chars.
f= for file output eg. f=XXXXXX.issa
r=digit-digit (eg. 10-70) to take only the defined region of sequences
digit-digit (eg. 10-70) to take only the defined region of sequences
just digit for block length
(all options can be given like \$sort
while $sort has 's' as value. A naked number like 100 will be the
block_length.) 'i' or 'I' => interlaced print (this requires
identical names in hashes)
Usage : &print_seq_in_block (\$block_leng, 'i',\%h1, 'sort', \%h2, \%hash3,,,);
Version : 1.6
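The block layout described above can be sketched as follows. This is only a minimal illustration, not the library's own code: the helper name seq_block_lines_sketch, its argument order, and the name column width are assumptions, and it covers only the plain sorted print (no 'i', 'v', 'g', or 'r=' handling).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): builds the lines of a plain,
# non-interlaced block print for one hash of name => sequence.
sub seq_block_lines_sketch {
    my ($seqs, $block_len, $name_width) = @_;
    $block_len  ||= 60;   # documented default block length
    $name_width ||= 12;   # assumed width of the name column
    my @names = sort keys %$seqs;   # comparable to the 's' option
    my $longest = 0;
    for my $name (@names) {
        my $len = length($seqs->{$name});
        $longest = $len if $len > $longest;
    }
    my @lines;
    for (my $start = 0; $start < $longest; $start += $block_len) {
        for my $name (@names) {
            next if $start >= length($seqs->{$name});
            push @lines, sprintf('%-*s %s', $name_width, $name,
                                 substr($seqs->{$name}, $start, $block_len));
        }
        push @lines, '';   # blank line between blocks
    }
    pop @lines if @lines;  # no trailing blank line
    return \@lines;
}

my %seq = ('1cdg' => 'APDTSVSNKQNFSTDVIY', '2aaa' => '------LSAASWRTQSIY');
print "$_\n" for @{ seq_block_lines_sketch(\%seq, 16, 4) };
```

With a block length of 16, each 18-residue sequence is split into a 16-character line and a 2-character line, grouped block by block.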
Argument : many refs for hashes (one for bottom, one for top, etc.; the top hash usually
denotes certain calculations or results for the bottom one)
Example : With command 'print_seq_in_columns.pl c2 s2', you get:
name1 11111111 name1 22222
name2 11 name2 2222222
name3 1111111 name3 22222
name4 11111 name4 2222
name5 11111 name5 222
name1 3333 name1 4444
name2 3333 name2 444
name3 333 name3 4
name4 333 name4 4444
name5 3333 name5 4444444
Function : takes one scalar ref. (\$block_leng) plus any number of hash refs. and prints
the hash contents in lines of $block_leng characters.
Options : c, i, s
Usage : &print_seq_in_block (\$block_leng, 'i',\%h1, 'sort', \%h2, \%hash3,,,);
Version : 1.1
Argument : one or more ref. of arrays
Example : &print_seq_in_block(&convert_arr_and_str_2_hash(\@input,\@input2,\@input3 ));
&convert_arr_and_str_2_hash(\$input1,\$input2, '2' );
results in; (ordering starts from the given '2')
array_2 input1 arraystring
array_3 input2 arraystring
one more exam
string_6 This is st and 3 strings)
string_10 This is st
array_2 111233434242
array_6 111233434242
array_10 111243424224
Function : makes hash(es) out of array(s).
If ordering digit(s) are given, the keys are numbered according to them.
If successive ordering digits do not increase by one, their difference
is used as the increment. With no ordering digits the keys are
array_1, array_2, array_3...
Returns : one or more ref. of hashes.
Usage : ($hash1, $hash2)=&convert_arr_and_str_2_hash(\$input, \$input2, '1', '2'.. );
* This is the combination of convert_string_to_hash & convert_array_to_hash
Version : 1.0
Argument : one or more refs. of scalars (strings)
Example : &print_seq_in_block(&convert_string_to_hash(\$input,\$input2,\$input3 ));
&convert_string_to_hash(\$input1,\$input2, '2' );
results in; (ordering starts from the given '2')
string_2 input1 string
string_3 input2 string
Function : makes hash(es) out of string(s).
If ordering digit(s) are given, the keys are numbered according to them.
If successive ordering digits do not increase by one, their difference
is used as the increment. With no ordering digits the keys are
string_1, string_2, string_3...
Returns : one or more ref. of hashes.
Usage : ($hash1, $hash2)=&convert_string_to_hash(\$input, \$input2, '1', '2'.. );
Version : 1.0
Argument : one or more ref. of arrays
Example : &print_seq_in_block(&convert_array_to_hash(\@input,\@input2,\@input3 ));
&convert_array_to_hash(\$input1,\$input2, '2' );
results in; (ordering starts from the given '2')
array_2 input1 arraystring
array_3 input2 arraystring
Function : makes hash(es) out of array(s).
If ordering digit(s) are given, the keys are numbered according to them.
If successive ordering digits do not increase by one, their difference
is used as the increment. With no ordering digits the keys are
array_1, array_2, array_3...
Returns : one or more ref. of hashes.
Usage : ($hash1, $hash2)=&convert_array_to_hash(\$input, \$input2, '1', '2'.. );
Version : 1.0
Argument : one or more refs for arrays or one array.
Example : (1,1,1,1,3,3,3,3,4,4,4,3,3,4,4); --> (1,3,4);
Function : removes duplicate entries in an array. You can sort the
result if you wish by 's' opt. Otherwise, result will keep
the original order
Keywords : merge array elements, remove_repeting_elements,
remove_same_array_elements, remove_redundancy, remove_redundant_elements
remove_duplication_in_array
Options :
s for sorting the array output
Returns : one or more references.
Usage : @out2 = @{&remove_dup_in_array(\@input1, \@input2,,,,)};
@out1 = &remove_dup_in_array(\@input1 );
Version : 1.6
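A common Perl idiom for this kind of order-preserving duplicate removal is a %seen hash inside grep. The sketch below is an illustration under assumptions: the name uniq_in_order_sketch and the boolean second argument (standing in for the 's' option) are hypothetical and do not reproduce the library's argument handling.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): keeps the first occurrence of
# each element; sorts the result only when $sort is true.
sub uniq_in_order_sketch {
    my ($array_ref, $sort) = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } @$array_ref;
    @unique = sort @unique if $sort;   # plain string sort, an assumption
    return \@unique;
}

my @out = @{ uniq_in_order_sketch([1,1,1,1,3,3,3,3,4,4,4,3,3,4,4]) };
print "@out\n";   # prints "1 3 4"
```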
Argument : reference of one array of file names in pwd
Author : jong
Function : finds patterns of text and replaces them in multiple input files
Returns : nothing
Usage : &remove_text(\@input_array_of_filenames);
Version : 1.3
Warning : This produces a temporary file and renames it...
Argument : one or more refs for arrays. The first array is always the
only target.
Example : @TARGET=qw(1 % $ ^ # A B 4444 44 4 4 3 33 3 11 A 3 4 4 7 AB);
@remove=qw(\W); # removes all the non word stuff
@remove2=qw(\d );
@out=@{&remove_elements_by_pattern(\@TARGET, \@remove,\@remove2)};
Function : removes elements by pattern in the array
Keywords : remove_this_elements, remove_these_elements, remove_elements
remove_elements_by_position, kill_array_elements, kill_elements
take_away_elements, remove_array_elements
Returns : one or more references.
Usage : @out2 = @{&remove_elements_by_pattern(\@input1, \@input2,,,,)};
@out1 = @{&remove_elements_by_pattern(\@input1 )};
Version : 1.2
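Pattern-based removal can be sketched with nested grep calls. This is a minimal illustration, assuming every array after the first holds regex strings and that an element is dropped when it matches any of them; the name remove_by_pattern_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): the first array ref is the
# target, all following array refs hold regex patterns as strings.
sub remove_by_pattern_sketch {
    my ($target, @pattern_lists) = @_;
    my @patterns = map { @$_ } @pattern_lists;
    my @kept = grep {
        my $element = $_;
        !grep { $element =~ /$_/ } @patterns;   # keep if no pattern matches
    } @$target;
    return \@kept;
}

my @target  = ('1', '%', 'A', 'B', '4444', 'AB');
my @remove  = ('\W');   # non-word characters
my @remove2 = ('\d');   # digits
my @out = @{ remove_by_pattern_sketch(\@target, \@remove, \@remove2) };
print "@out\n";   # prints "A B AB"
```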
Argument : one or more refs for arrays. The first array is always the
only target. The removing elements can be scalar ref or
just scalar.
Example : ( two input: (1,2,3,4,4,4,5,5,6,7), (1,3,4) --> (2,5,5,6,7);
Function : removes elements by name in the array
Keywords : remove_this_elements, remove_these_elements, remove_elements
remove_elements_by_position, kill_array_elements, kill_elements
take_away_elements
Returns : one or more references.
Usage : @out2 = @{&remove_elements_by_name(\@input1, \@input2)};
@out1 = @{&remove_elements_by_name(\@input1, \$name )};
Version : 1.1
Argument : one or more refs for arrays. The first array is always the
only target.
Example : ( two input: (1,2,3,4,5,6,7), (1,3,4) --> (2 5 6 7);
Function : removes elements by position in the array
Keywords : remove_this_elements, remove_these_elements, remove_elements
remove_elements_by_position, kill_array_elements, kill_elements
take_away_elements
Returns : one or more references.
Usage : @out2 = @{&remove_elements_by_position(\@input1, \@input2,,,,)};
@out1 = @{&remove_elements_by_position(\@input1 )};
Version : 1.1
Warning : Position 1 means $array[0]
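The 1-based position convention from the warning above can be sketched like this. The helper name remove_by_position_sketch is hypothetical, and taking exactly one position list is an assumption made for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): positions are 1-based, so
# position 1 refers to $array[0].
sub remove_by_position_sketch {
    my ($target, $positions) = @_;
    my %drop;
    $drop{ $_ - 1 } = 1 for @$positions;   # convert to 0-based indices
    my @kept = map { $target->[$_] } grep { !$drop{$_} } 0 .. $#$target;
    return \@kept;
}

my @out = @{ remove_by_position_sketch([1,2,3,4,5,6,7], [1,3,4]) };
print "@out\n";   # prints "2 5 6 7"
```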
Argument : one or more refs for arrays.
Example : (1,1,1,1,3,3,3,3,4,4,4,3,3,4,4); --> (1,3,4);
Function : removes duplicate entries in an array. If you give
more than one array as input, it returns one reference per input
array, each merged singly. The resulting arrays are independent.
CF. merge_many_arrays
Keywords : merge array elements, merge_array_elements,
Returns : one or more references.
Usage : @out2 = @{&merge_array(\@input1, \@input2,,,,)};
@out1 = @{&merge_array(\@input1 )};
Version : 1.1
Argument : Two or more refs for arrays.
Example : (1,2,3,4,5),(6,7,8,9,10)-----> (1,2,3,4,5,6,7,8,9,10)
Function : makes one array from several
Keywords : make_one_array, make_one_from_several
Returns : An array reference
Usage : @array_one=@{&make_one_array(\@input_array_1, \@input_array_2)};
Version : 1.0
Warning : This does not remove duplicate entries.
Argument : one or more refs for arrays.
Example : (1,1,1,1,3,3,3,3,4,4,4,3,3,4,4); --> (1,3,4);
if you put two arrays(1,1,1,3,3, 100) and (2,2, 4,4, 100), you will get
references of arrays( 1,3) and (2,4) ignoring single array entries.
Function : Gets any multiple array entry in a given array. If more than
one array is given, each array will have a reference return.
Keywords : multiple entry array, get_common_entry_array
Returns : one or more references.
Usage : @out2 = @{&merge_array(\@input1, \@input2,,,,)};
@out1 = @{&merge_array(\@input1 )};
Argument : one or more refs for arrays.
Example : (1,1,1,2,3,3,3,4) --> (1,3);
(1,2,3) (1,2,3,4,5) --> (1,2,3);
(1,2,3,4,5) (1,2,3,4,5) (3,4,5,6) --> (3,4,5);
Function : Gets the entries common to all the given arrays. If one single array
is given, the entries occurring more than once in it are returned.
Keywords : multiple entry array, get_common_entry_array, multiply array,
get_common_array_elements, get_common_array_element,
get_dup_array_elements,
Returns : one or more references.
Usage : @out2 = @{&get_common_array_entry(\@input1, \@input2,,,,)};
@out1 = @{&get_common_array_entry(\@input1 )};
Version : 1.1
Warning : accepts only references of arrays(others are ignored).
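Both behaviours described above (repeated entries of a single array, common entries of several arrays) can be sketched with counting hashes. This is an illustration under assumptions: the name common_entries_sketch is hypothetical and the result order (first appearance in the first array) is a choice made here, not something the documentation states.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl). Non-array arguments are
# ignored, as the warning above describes.
sub common_entries_sketch {
    my @refs = grep { ref($_) eq 'ARRAY' } @_;
    return [] unless @refs;
    if (@refs == 1) {
        # Single array: report each entry that occurs more than once.
        my (%count, @repeated);
        for my $item (@{ $refs[0] }) {
            push @repeated, $item if ++$count{$item} == 2;
        }
        return \@repeated;
    }
    # Several arrays: count in how many arrays each entry appears.
    my %arrays_with;
    for my $ref (@refs) {
        my %in_this;
        $arrays_with{$_}++ for grep { !$in_this{$_}++ } @$ref;
    }
    my %emitted;
    return [ grep { $arrays_with{$_} == @refs && !$emitted{$_}++ } @{ $refs[0] } ];
}

print "@{ common_entries_sketch([1,1,1,2,3,3,3,4]) }\n";     # prints "1 3"
print "@{ common_entries_sketch([1,2,3], [1,2,3,4,5]) }\n";  # prints "1 2 3"
```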
Argument : one or more refs for arrays. or just arrays.
Example : (1,1,1,1,3,3), (1,3,3,4,4,4,3,3,4,4); --> (1,3,4);
Function : removes duplicate entries in multiple array inputs.
Keywords : merge array elements from multiple arrays. merge_array_elements
Returns : one reference.
Usage : @out2 = @{&merge_many_arrays(\@input1, @inputX, \@input2,,,,)};
@out1 = @{&merge_many_arrays(\@input1 )};
Version : 1.0
Warning : synonym of remove_dup_in_array
Argument : one or more refs for arrays.
Example : (1,1,1,1,3,3,3,3,4,4,4,3,3,4,4); --> (1,3,4);
Function : removes duplicate entries in an array. If you give
more than one array as input, it returns one reference per input
array, each merged singly. The resulting arrays are independent.
CF. merge_many_arrays
Keywords : remove_dup_in_array, merge array elements, remove_duplicates,
Returns : one or more references.
Usage : @out2 = @{&remove_repetitives_in_array(\@input1, \@input2,,,,)};
@out1 = @{&remove_repetitives_in_array(\@input1 )};
Version : 1.0
Warning : synonym of remove_dup_in_array
Function : returns hash refs. after filtering with threshold value.
Usage : ($ref1, $ref2, $ref3)=&filter_hash_by_num_value(\%h1, \$thres,...);
Version : 1.0
Argument : One Ref. for a scalar.
Function : With given full path or single name for a dir. it returns
the full path dir name. If it fails to find in pwd or given
specified path, it tries to search PATH, HOME etc..
Returns : one Ref. for a scalar.
Usage : $output_best_possible_dir = ${&dir_search_single(\$input_name)};
Version : 1.0
Argument : One Ref. for a scalar.
Function : With given full path or single name for a dir. it returns
the full path dir name. If it fails to find in pwd or given
specified path, it tries to search PATH, HOME etc..
Returns : one Ref. for an array.
Usage : @output_possible_dirs = @{&dir_search(\$input_name)};
Version : 1.0
Keywords : exchange_msp_columns, open_and_exchange_query_with_match_in_msp,
open_msp_files_with_exchange_of_columns
Options :
R for NO range attachment in Name only return option (n)
e= for evalue threshold, if e=1, ignores all which are over 1
s= for score threshold, if s=100, ignores all which are less than 100
Usage : @exchanged_msp=@{&exchange_query_with_match_in_msp(\@file)};
Version : 1.1
Author : Jong Park, jong@salt2.med.harvard.edu, for commercial use, ask me.
Keywords : run_ssearch_sequence_search, do_fasta_sequence_search
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
i= to get file base(root) name. same as File=
m for MSP format directly from FASTA or Ssearch result rather than through sso_to_msp to save mem
s for the big single output (msp file output I mean)
s= for the single big msp file name
O= for Out file name, same as s=
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
d for very simple run and saving the result in xxxx.gz format in sub dir starting with one char
r for reverse the query sequence
R for attaching ranges of sequences
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
d= for defining the size of subdir made. 2 means it creates
eg, DE while 1 makes D
d for $make_gz_in_sub_dir_opt, putting resultant sso files in gz format and in single char subdir
D for $make_msp_in_sub_dir_opt, convert sso to msp and put in sub dir like /D/, /S/
n for new format to create new msp file format with sso_to_msp routine
PVM= for PVM run of FASTA (FASTA only)
M for machine readable format -m 10 option
M= for machine readable format -m 10 option
N for 'NO' do not do any processing but, do the searches only.
FILE_AGE for defining the age of file in days to be overwritten.
Usage : $gzipped_msp_file=${&run_fasta_sequence_search("a=$algorithm",
"O=$out_file_msp_name", "File=$temp_file_name", "e=$E_val",
"DB=$sequence_DB", "k=$k_tuple", "$machine_readable")};
Version : 1.1
Keywords : run_blastp, run_blastp_seq_search, blastp_seq_search, blastp_search,
do_blast_search
Options :
r for reversing enquiry sequences
T= for Blastp T param
S= for Blastp S param
B= for Blastp B param
V= for Blastp V param
E= for Blastp E param
Usage : &do_blastp_search(\@file);
Version : 1.3
Example : &self_self_search(\@file, $over_write, $msp_directly_opt, $create_sso, $single_big_msp);
Function : self_to_self input database search with reverse query as an option
Keywords : do_self_self_search, self_self_sequence_search, self_self_seq_search,
self_to_self_search, self_to_rev_self_search, self_to_reversed_self_search
search_self, search_self_seq, search_self
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
m for MSP format directly from FASTA or Ssearch result rather than through sso_to_msp to save mem
s for the big single output (msp file output I mean)
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
r for reverse the query sequence
R for attaching ranges of sequences
b for doing in batch. Reads all the seqs in memory at one time
m10 for machine readable form
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
d= for defining the size of subdir made. 2 means it creates
eg, DE while 1 makes D
d for $make_gz_in_sub_dir_opt, putting resultant sso files in gz format and in single char subdir
D for $make_msp_in_sub_dir_opt, convert sso to msp and put in sub dir like /D/, /S/
n for new format (msp2 format)
Usage : &self_self_search(\@file, $over_write, $msp_directly_opt, $create_sso, $single_big_msp);
Version : 2.2
Example : &search_self(\@file, $over_write, $msp_directly_opt,
$create_sso, $single_big_msp);
Function : self_to_self input database search with reverse query as an option
Keywords : do_search_self, self_self_sequence_search, self_self_seq_search,
self_to_self_search, self_to_rev_self_search, self_to_reversed_self_search
search_self, search_self_seq
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
m for MSP format directly from FASTA or Ssearch result
rather than through sso_to_msp to save mem
s for the big single output (msp file output I mean)
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
r for reverse the query sequence
R for attaching ranges of sequences
b for doing in batch. Reads all the seqs in memory at one time
m10 for machine readable form
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
d= for defining the size of subdir made. 2 means it creates
eg, DE while 1 makes D
d for $make_gz_in_sub_dir_opt, putting resultant sso files
in gz format and in single char subdir
D for $make_msp_in_sub_dir_opt, convert sso to msp and
put in sub dir like /D/, /S/
n for new format (msp2 format)
Usage : &search_self(\@file, $over_write, $msp_directly_opt,
$create_sso, $single_big_msp);
Version : 2.5
Function : gets all the clu files and produces one xxxx.cdf file
CDF file is a fasta database file with all the clu domains are
Keywords : make_cdf_file_with_clu, clu_to_cdf, clu_2_cdf
Usage : @file=@{&parse_arguments(1)};
Version : 1.1
Function : creates xxxx.fa.idx file and makes a link to pwd. If @file contains
names with .idx extension already, it will not put another idx
index to it.
Keywords : make_fasta_seq_index_file, create_seq_index_file, make_idx_file,
create_idx_file, create_seq_idx_file, make_index_file, create_index_file
make_sequence_index_file, create_sequene_index_file
Usage : @idx_files_made=@{&make_seq_index_file(\@file)};
Version : 1.4
Keywords : find_palindromes, get_palindromes, find_palindrome, GetPalindrom
: filter_hash_by_string_length
Options :
min= for minimum palindrome size
p for putting the position of the start of the palindrome
:
cutoff_min= by cutoff_min=
c= by c= # the same as cutoff_min
Usage : search_palindromes(\%seq, [\%seq2]);
: %out=%{&filter_by_string_length(\%hash, [100], ["cutoff=100"])};
Version : 1.2
: 1.0
Function : sorts files by size and returns a ref. of the sorted array
Keywords : sort_file_by_size
Usage : @sorted=@{&sort_files_by_size(\@files)};
Version : 1.0
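Sorting file names by size can be sketched with the -s file test. The name sort_files_by_size_sketch is hypothetical, and ascending order with a name tie-break is an assumption, since the documentation does not state the direction.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of B.pl): ascending size order is assumed.
sub sort_files_by_size_sketch {
    my ($files) = @_;
    my %size;
    $size{$_} = (-s $_) || 0 for @$files;   # missing or empty files count as 0
    return [ sort { $size{$a} <=> $size{$b} || $a cmp $b } @$files ];
}

# Demo with three throw-away files in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my %content = (big => 'x' x 300, mid => 'x' x 20, small => 'x');
for my $name (keys %content) {
    open my $fh, '>', "$dir/$name" or die "cannot write $dir/$name: $!";
    print {$fh} $content{$name};
    close $fh;
}
my @sorted = @{ sort_files_by_size_sketch([ map { $dir . '/' . $_ } keys %content ]) };
print "$_\n" for @sorted;   # small, then mid, then big
```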
Example : &make_fasta_files_from_msp_1_files(\@files, "E=0.081", "l=0");
&make_fasta_files_from_msp_1_files(\@files,
"E=$E_thresh",
"Seq_Source_DB=$seq_source_db",
"l=$lower_expect_limit",
"i=$cut_off_increase_factor",
$over_write_file);
Function : creates fasta files for each query seq in xxxx.msp_1 file
Options :
$seq_source_db= by "Seq_Source_DB=xxxxx.fa"
$E_thresh = by E= # E value cutoff
u= for $upper_expect_limit
l= for $lower_expect_limit
$over_write_file=o by o -o
$cut_off_increase_factor = by i=
s for selfless fasta output (removes the original self seq among intermediates)
Usage : &make_fasta_files_from_msp_1_files(\@files);
Version : 1.2
Function : filters intermediate sequences according to the E value
thresholds and returns the lines in an array
Usage : @filtered_msp3=@{&filter_intermediates_by_E_value(\@msp3,
"E1=$E_value1", "E2=$E_value2")};
Version : 1.0
Function : extracts intermediate sequences from OWL fasta database to
make intermediate seq library
This looks for /gn0/jong/DB/PDB/PDB95D_against_OWL/E/$msp_file_gz
and /gn0/jong/DB/PDB/PDB95D_against_OWL/D/$msp_file_gz
Keywords : make_interm_library_for_each_group, make_interm_lib,
make_intermediate_library, compile_interm_library, create_interm_library,
Options :
'FASTA_DB' for sequence source fasta file eg: "FASTA_DB=$source_db_fasta"
o for overwrite option (overwrites 1.2.3.fa like file)
MSP_DIR= for msp seq file result directory
m= for msp seq file result direc (same as MSP_DIR)
e= for E value thresh
$pdbg_file= by p=
E= for E value thresh
s= for score thresh
Usage : &make_intermediate_sequence_library(\@files, "FASTA_DB=$owl_db_fasta");
where @files holds either pdbs or pdbg files (PDB grouping files)
Version : 1.7
Keywords : read_m10_sso_lines read_msso_lines
Options : a c r r2 n
u= for upper E value limit
l= for lower E value limit
Usage : @out_refs=@{&read_machine_readable_sso_lines(\@SSO, $get_alignment,
$create_sso, $upper_expect_limit,$new_format, $lower_expect_limit,
$attach_range_in_names, $attach_range_in_names2)};
Version : 1.5
Function : Main subroutine for open_sso_files. This calls either machine
readable or unreadable form parsing subroutine
Keywords : read_sso_lines_in_array
Options : a c r r2 n
u= for upper E value limit
l= for lower E value limit
Usage : &read_sso_lines([@sso], $create_sso, $attach_range_in_names,
$attach_range_in_names2, $new_format, $get_alignment);
Version : 1.4
Example :
seq3 ---lasdkfjklsdjfkldjklfj----
seq4 dfasdfasdfadsfsadfsaas
will result in
seq3 ---lasdkfjklsdjfkldjklfj----
seq4 dfasdfasdfadsfsadfsaas------
Function : creates hashes with values of equal lengths.
Keywords : make_alignment_length_even, equalise_seq_alignments
Usage : @out=@{&make_seq_alignment_length_even(\%hash1, \%hash2)};
@out=@{&make_seq_alignment_length_even(\%hash1)};
Version : 1.0
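The padding shown in the example above can be sketched as follows. This is a minimal illustration for a single hash, assuming '-' is the gap character used for padding (as in the example); the name pad_alignment_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): right-pads every sequence
# with '-' up to the length of the longest one.
sub pad_alignment_sketch {
    my ($seqs) = @_;
    my $max = 0;
    for my $seq (values %$seqs) {
        $max = length($seq) if length($seq) > $max;
    }
    my %padded;
    for my $name (keys %$seqs) {
        $padded{$name} = $seqs->{$name} . '-' x ($max - length($seqs->{$name}));
    }
    return \%padded;
}

my %aln = (
    seq3 => '---lasdkfjklsdjfkldjklfj----',
    seq4 => 'dfasdfasdfadsfsadfsaas',
);
my %even = %{ pad_alignment_sketch(\%aln) };
print "$_ $even{$_}\n" for sort keys %even;
```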
Function : Returns a unique temporary filename.
Reasonably robust but not completely immune to race conditions
with other processes simultaneously requesting a tempname.
Usage : $tmp=&tempname;
Version : 1.0
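Where the race condition mentioned above matters, the core File::Temp module is one alternative, because it creates and opens the file in a single step. The sketch below swaps in File::Temp and is not how tempname itself works; the name tempname_sketch and the 'tmpXXXXXX' template are assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of B.pl): returns the name of a freshly
# created, empty temporary file. The caller removes it when done.
sub tempname_sketch {
    my ($fh, $name) = tempfile('tmpXXXXXX', TMPDIR => 1, UNLINK => 0);
    close $fh;
    return $name;
}

my $tmp = tempname_sketch();
print "$tmp\n";
unlink $tmp;   # clean up the demo file
```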
Author : Sarah A. Teichmann
Date : 19th September 1997
Example : &fasta_kt1_search ($qdb_main, $tdb_main, $fastaver_main);
Function : searches one database against the other using fasta with
ktup=1 (the default program is simply "fasta"). The results are stored in sub dirs
named after the first 2 chars of the query sequence.
Keywords : fasta_search, fasta_database_search
Usage : &fasta_kt1_search($query_database, $target_database, $fasta_version_to_use);
Version : 1.1
Author : Sarah A. Teichmann with thanks to Alex Bateman
Function : To make a hash with all the genes in the msp files as the keys,
which are linked at or below the E-value threshold,
with the values denoting the cluster number
Keywords : single_linkage, msp_single_linkage, msp_single_linkage_hash
Usage : %hash=%{&msp_single_link_hash(\@msp_files, E-value)};
Version : 1.3
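Single linkage clustering of this kind can be sketched with a small union-find structure. This is an illustration under stated assumptions, not the library's method: the input is a list of [query, match, E-value] triples instead of msp files, unlinked genes are kept as singleton clusters, and the name single_link_hash_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): returns a hash ref mapping
# each gene to a cluster number.
sub single_link_hash_sketch {
    my ($hits, $e_cut) = @_;
    my %parent;
    my $find = sub {            # find the cluster root, with path halving
        my $x = shift;
        $parent{$x} = $x unless exists $parent{$x};
        while ($parent{$x} ne $x) {
            $parent{$x} = $parent{ $parent{$x} };
            $x = $parent{$x};
        }
        return $x;
    };
    for my $hit (@$hits) {
        my ($query, $match, $evalue) = @$hit;
        my ($root_q, $root_m) = ($find->($query), $find->($match));
        next if $evalue > $e_cut;                 # link only at or below the threshold
        $parent{$root_q} = $root_m if $root_q ne $root_m;
    }
    my (%cluster_of, %number);
    my $next = 0;
    for my $gene (sort keys %parent) {
        my $root = $find->($gene);
        $number{$root} = ++$next unless exists $number{$root};
        $cluster_of{$gene} = $number{$root};
    }
    return \%cluster_of;
}

my @hits = (['A', 'B', 0.001], ['B', 'C', 0.05], ['D', 'E', 0.5]);
my %cluster = %{ single_link_hash_sketch(\@hits, 0.1) };
print "$_ $cluster{$_}\n" for sort keys %cluster;
```

Here A, B and C end up in one cluster because A-B and B-C are both below 0.1, while D and E stay apart because their hit is above the threshold.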
Author : Sarah A. Teichmann
Function : To print out a file in cluster file format from an input hash containing the genes as keys and the cluster number as values.
Keywords : print_single_linkage_cluster, print_cluster_file
Usage : &print_clusfile_from_hash(\%hash)
Version : 1.2
Author : Sarah A. Teichmann
Date : 19th September 1997
Function : to make a summary file of a sorted cluster file
Keywords : summary, make_cluster_summary, subclustering summary
Usage : &make_summ($sorted_cluster_file)
Version : 1.5
Author : Sarah A. Teichmann, modified by Jong
Date : 19th September 1997
Function : to make a "sorted_cluster_file" from the .clu files in a directory
Keywords : make_cluster_file, sort_clu_files
Usage : &create_sorted_cluster
Version : 1.7
Example : &interm_lib_search(\@file, $over_write, $msp_directly_opt, $create_sso, $single_big_msp);
Function : self_to_self input database search with reverse query as an option
Keywords : do_interm_lib_search, self_self_sequence_search, self_self_seq_search,
self_to_self_search, self_to_rev_self_search, self_to_reversed_self_search,
Options :
Query_seqs= for enquiry sequences eg) "Query_seqs=$ref_of_hash"
DB= for target DB "DB=$DB_used"
File= to get file base(root) name. "File=$file[0]"
m for MSP format directly from FASTA or Ssearch result rather than through sso_to_msp to save mem
s for the big single output (msp file output I mean)
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
r for reverse the query sequence
b for doing in batch. Reads all the seqs in memory at one time
m10 for machine readable form
k= for k-tuple value. default is 1 (ori. FASTA prog. default is 2)
u= for $upper_expect_limit
l= for $lower_expect_limit
a= for choosing either fasta or ssearch algorithm
d for $make_gz_in_sub_dir_opt, putting resultant sso files in gz format and in single char subdir
D for $make_msp_in_sub_dir_opt, convert sso to msp and put in sub dir like /D/, /S/
n for new format (msp2 format)
FILE_AGE for defining the age of file in days to be overwritten.
Usage : &interm_lib_search(\@file, $over_write, $msp_directly_opt, $create_sso, $single_big_msp);
&interm_lib_search(\%seq, $over_write, $msp_directly_opt, $create_sso, $single_big_msp);
Version : 1.8
Author : Sarah A Teichmann, Jong Park, sat@mrc-lmb.cam.ac.uk,
jong@salt2.med.harvard.edu
Example : geanfammer.pl E_gnme.fa # simplest form
geanfammer.pl E_gnme.fa a=ssearch # use SSEARCH
geanfammer.pl E_gnme.fa o # for overwriting when you want a fresh run over old
geanfammer.pl E_gnme.fa c # for keeping SSO files (fasta output)
geanfammer.pl E_gnme.fa k=2 # changing default k tuple for FASTA to 2
geanfammer.pl E_gnme.fa E=0.01 # set the E value for initial single linkage clustering
geanfammer.pl E_gnme.fa e=0.01 # set the E value for domain level linkage
--> geanfammer.pl E_gnme.fa e=0.01 E=0.01 # set the 2 E values separately (no need to do this)
Function : Creates a domain level clustering file from a given
FASTA format sequence DB. It has been used for complete
genome sequence analysis.
------------ USAGE INFORMATION -------------------
The parameters you give determine how meaningful the resulting
protein families are.
The most important ones are the E and e options (mostly,
they can have the same value).
Large E sets the threshold for the single
linkage clustering.
This means any sequence hit BELOW the threshold
(which is good) will be linked.
For example, if Seq1 matched Seq2 with a FASTA search
E value of
0.001, and you set the threshold to 0.1, then YOU
ordered geanfammer to regard them as a family.
The second, small e option is for dividing a complex
and wrong cluster into more correct
duplication modules. This is necessary as a
lot of multidomain proteins can be clustered together
WRONGLY by single linkage.
At this stage, the e value is independent of the E value
and you can set a higher or lower one. Or you can set
it the same as E.
Rough guide from our experience for E and e values:
We know that with a 1000 sequence database, 0.01
produces around 1% error in grouping sequences
according to the evalue.
With 180,000 sequences, 0.081 gave us less than 1% error.
The evalue of FASTA and SSEARCH is DEPENDENT on DB size,
so you need to experiment a little to find the best
E value for your database or genome.
The best approach is :
1) Run geanfammer.pl on your target DB
with an E value of your choice.
2) Check the sequence families which are clustered
in the final resultant file xxxx.gclu and decide
if the E value is too low or too high. Lower evalues
make sure you do not make wrong clusters, while
higher evalues include more probable sequence
family members.
3) Put all the xxxx.msp files in subdirectory(s)
created by geanfammer and run divclus.pl (which
comes with the package) with different
Evalues. Divclus does not run any search algorithm,
so this can be done fairly quickly.
Keywords : genome_analysis_and_protein_family_maker,
genome_ana_protein_fam_maker
Options :
o for overwrite existing xxxx.fa files for search
c for create SSO file (sequence search out file)
d for very simple run and saving the result in
xxxx.gz format in sub dir starting with one char
N
s
m
v
z
D
y for dynamic factor
L for Lean output(removes all the intermediate
outputs to save space)
u for making separate summary file (redundant now)
DB=
File=
k= for k-tuple value. default is 1 (ori. FASTA prog.
default is 2)
a= for choosing either fasta or ssearch algorithm
E= for Evalue cutoff for single linkage clustering
$E_cut_main
e= for Evalue cutoff for divide_clusters subroutine.
u=
l=
d=
!! Do not remove the following lines down to # Author line.
This program parses them
$Lean_output=L by L -L
$dynamic_factor=y by y Y -y -Y
$over_write=o by o -o
$create_sso_file=c by c -c
$k_tuple= by k=
$upper_expect_limit= by u=
$lower_expect_limit= by l=
$algorithm= by a=
$No_processing=N by N -N
$single_msp=s by s -s
$sequence_db_fasta= by DB=
$query_file= by File=
$machine_readable=M by M -M
$make_subdir_out=D by D
$make_subdir_gzipped=d by d -d
$direct_MSP_conversion=m by m -m
$verbose=v by v -v
$sub_dir_size= by d=
$Evalue_cut_single_link= by E=
$Evalue_cut_divclus= by e=
$optimize=z by z -z
$make_separate_summary=u by u -u
$length_thresh= by T=
Usage : &geanfammer(\@your_genome_or_db_to_analyse_file,
$verbose);
Version : 2.5
Function : This does ISS and makes HTML file to return to HTTPD server
Keywords : intermdediate_sequence_search_server
Options : z S s
Usage : &ISS_server(\%seq, "e=$E_val", "k=$ktuple", "a=$algorithm", "t=$leng_thresh",
"$which_score", $show_raw_result, $segged_ISSL );
Version : 1.3
Function : You can use any ENV set variables directly in your
program. So, you can say $USER instead of $ENV{'USER'}
Keywords : import_Env_vars, import_ENV_variables
Version : 1.1
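The same convenience is available from Perl's core Env module, which ties package variables to %ENV entries. The sketch below uses that module rather than the subroutine documented above; the variable name B_PL_DEMO_DIR is hypothetical and is set only for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical environment variable, set here only for the demo.
BEGIN { $ENV{B_PL_DEMO_DIR} = '/tmp/demo' }

use Env qw(B_PL_DEMO_DIR);   # ties $B_PL_DEMO_DIR to $ENV{'B_PL_DEMO_DIR'}

print "$B_PL_DEMO_DIR\n";            # same value as $ENV{'B_PL_DEMO_DIR'}
$B_PL_DEMO_DIR = '/tmp/other';       # assignments are written through to %ENV
print "$ENV{B_PL_DEMO_DIR}\n";
```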
Argument : takes one or more scalar references.
Function : reads any dir, REMOVES the '.' and '..' entries, and
then puts the rest in an array.
Keywords : read_any_dir_for_dir_command
Returns : one ref. of array.
Usage : @file_list = @{&read_any_dir(\$absolute_path_dir_name, ....)};
Version : 1.1
Warning : This does not report '.', '..', '#xxxx', ',xxxx', etc. only legitimate
: file and dir names are reported.
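A directory read that skips these entries can be sketched with opendir and grep. This is a minimal illustration for a single directory given as a plain string; the name read_dir_sketch is hypothetical, and treating every name starting with '#' or ',' as non-legitimate is an interpretation of the warning above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of B.pl): returns a ref to the sorted
# list of entries, without '.', '..', '#xxxx' and ',xxxx' names.
sub read_dir_sketch {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @entries = sort grep { !/^\.\.?$/ && !/^[#,]/ } readdir $dh;
    closedir $dh;
    return \@entries;
}

# Demo in a temporary directory with one real file and one '#' file.
my $demo_dir = tempdir(CLEANUP => 1);
for my $name ('a.fa', '#junk') {
    open my $fh, '>', "$demo_dir/$name" or die "cannot write $name: $!";
    close $fh;
}
print "$_\n" for @{ read_dir_sketch($demo_dir) };   # prints "a.fa"
```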
Function : Makes similarity matrix hash (reflexive, so it has AT as well as TA)
%matrix looks like this: $matrix{X}{Y}= 4
Keywords : get_2D_aa_matrix, read_seq_matrix
Options :
$reflexive_combi=r by r -r
Usage : %matrix=%{&read_seq_matrix_files(\@file)};
Version : 1.2
Example : @file=@{&check_input_file_extension('msp', \@file)};
Usage : @file=@{&check_input_file_extension('msp', \@file)};
or @file=@{&check_input_file_extension('msp,nhco', \@file)};
for multiple extension allowance
Version : 1.1
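The extension check can be sketched as a simple filter. This is an illustration under assumptions: the name filter_by_extension_sketch is hypothetical, and files with other extensions are silently dropped here, whereas the documented routine may complain or exit instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of B.pl): keeps only the file names
# whose extension is in the comma-separated list.
sub filter_by_extension_sketch {
    my ($ext_list, $files) = @_;
    my %allowed;
    $allowed{$_} = 1 for split /,/, $ext_list;
    return [ grep { /\.([^.\/]+)$/ && $allowed{$1} } @$files ];
}

my @file = qw(a.msp b.fa c.nhco d);
my @ok = @{ filter_by_extension_sketch('msp,nhco', \@file) };
print "@ok\n";   # prints "a.msp c.nhco"
```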
Options :
$source_DB_file= by d= s=
$input_seq_file= by i=
$Eval_limit= by E=
$iteration_limit= by j=
$step_evalue= by h= e=
$over_write=o by o
$make_msp_in_sub_dir_opt=D by D
Usage : &do_psi_blast_search(\@files, "d=$source_DB_file",
"i=$input_seq_file",
$over_write,
$make_msp_in_sub_dir_opt);
Version : 1.3
Author : jong@salt2.med.harvard.edu
Function : checks if all the args are defined
Keywords : check_if_exists, check_if_file_exist
Usage : $defined=&complain_if_not_defined($var, $file);
Version : 1.0
Author : jong@salt2.med.harvard.edu
Function : checks if all the args are present
Keywords : die_unless_present, die_unless_file_present
Usage : &die_if_file_not_present($var, $file);
Version : 1.0
Author : jong@salt2.med.harvard.edu, On commercial use issue, Email me.
Function : asks for an env var and writes it to the appropriate shell
RC file (UNIX only)
Keywords : write_ENV_vars, write_env_vars
Usage : &ask_for_ENV_vars('BLAST_DIR');
Version : 1.0