RestrictionDigest DESCRIPTION RestrictionDigest can simulate the reference genome digestion and generate comprehensive information of the simulation. It can simulate single-enzyme digestion, double-enzyme digestion and size selection process. It can also analyze multiple genomes at one run and generates concise comparison of enzyme(s) performance across the genomes. INSTALLATION To install this module, run the following commands: perl Makefile.PL make make test make install METHODS Class RestrictionDigest::SingleItem::Double Functions new add_ref add_enzyme_pair new_enzyme change_range change_lengths_distribution_parameters add_output_dir double_digest add_SNPs count_SNPs_at_fragments add_gff all_frags_coverage_ratio frags_in_range_covearge_ratio Script Example: use RestrictionDigest; my $double_digest=RestrictionDigest::SingleItem::Double->new(); $double_digest->add_ref(-reference =>’ path to the reference file’); $double_digest->add_enzyme_pair(-front_enzyme =>’EcoRI’, -behind_enzyme =>’HinfI’); $double_digest->change_range(-start =>301,-end =>500); $double_digest->change_lengths_distribution_parameters(-front =>200,-behind =>800,-step =>25); $double_digest->add_output_dir(-output_dir=>’ path to the output directory’); $double_digest->double_digest(); $double_digest->add_SNPs(-SNPs =>’ path to the SNPs file’); $double_digest->count_SNPs_at_fragments(-sequence_type =>’125SE’, -sequence_end =>’front_enzyme’); $double_digest->add_gff(-gff =>’ path to the GFF file’); $double_digest->frags_in_range_coverage_ratio(); Class RestrictionDigest::SingleItem::Single Functions new add_ref add_single_enzyme new_enzyme change_range change_lengths_distribution_parameters add_output_dir single_digest add_SNPs count_SNPs_at_fragments add_gff all_frags_coverage_ratio frags_in_range_covearge_ratio Script Example: use RestrictionDigest; my $single_digest=RestrictionDigest::SingleItem::Single->new(); $single_digest->add_ref(-reference=>’path to the reference file’); $single_digest->add_single_enzyme(-enzyme =>’EcoRI’); $single_digest->change_range(-start =>301,-end =>500); $single_digest->change_lengths_distribution_parameters(-front =>200,-behind =>800,-step =>25); $single_digest->add_output_dir(-output_dir=>’path to the output directory’); $single_digest->single_digest(); $single_digest->add_SNPs(-SNPs =>’ path to the SNPs file’); $single_digest->count_SNPs_at_fragments(-sequence_type =>’150SE’); $single_digest->add_gff(-gff =>’ path to the GFF file’); $single_digest->frags_in_range_coverage_ratio(); Class RestrictionDigest::MultipleItems::Double Functions new add_refs add_enzyme_pair new_enzyme change_lengths_distribution_parameters add_output_dir digests_and_compare Script Example: use RestrictionDigest; my $multiple_double_digest=RestrictionDigest::MultipleItems::Double->new(); $multiple_double_digest->add_refs(-ref1 =>’path to reference 1 file’, -ref2 =>’path to reference 2 file’…); $multiple_double_digest->add_enzyme_pair(-front_enzyme=>’EcoRI’, -behind_enzyme => ‘HinfI’); $multiple_double_digest->change_lengths_distribution_parameters(-front => 200, -behind => 800, -step => 25); $multiple_double_digest->add_output_dir (-output_dir => ‘path to the output directory’); $multiple_double_digest->digests_and_compare(); Class RestrictionDigest::MultipleItems::Single Functions new add_refs add_single_enzyme new_enzyme change_lengths_distribution_parameters add_output_dir digests_and_compare Script Example: use RestrictionDigest; my $multiple_double_digest=RestrictionDigest::MultipleItems::Single->new(); $multiple_single_digest->add_refs(-ref1 =>’path to reference 1 file’, -ref2 =>’path to reference 2 file’…); $multiple_single_digest->add_enzyme_pair(-front_enzyme=>’EcoRI’, -behind_enzyme => ‘HinfI’); $multiple_single_digest->change_lengths_distribution_parameters(-front => 200, -behind => 800, -step => 25); $multiple_single_digest->add_output_dir (-output_dir => ‘path to the output directory’); $multiple_single_digest->digests_and_compare(); FORMATS OF THE INPUT REFERENCE FILE, GFF FILE, AND THE SNP COORDINATES FILE 1) RestrictionDigest is designed to process the reference file in the FASTA format. It should be in the following format: >ChromsomeName1 ATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCG >ChromsomeName2 ATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCG ATCGATCGATCGATCGATCGATCGATCGATCG 2) For the "GFF" file, it should be in the standard GFF format, here is an example: ############################################################################################## scaffold39120 GLEAN mRNA 36514 50111 . + . ID=OYG_10003188_10008591; scaffold39120 GLEAN Exon 36514 36534 . + . Parent=OYG_10003188_10008591; scaffold39120 GLEAN Exon 43276 43353 . + . Parent=OYG_10003188_10008591; scaffold39120 GLEAN Exon 43766 43868 . + . Parent=OYG_10003188_10008591; scaffold39120 GLEAN Exon 44710 44741 . + . Parent=OYG_10003188_10008591; scaffold39120 GLEAN Exon 49875 50111 . + . Parent=OYG_10003188_10008591; scaffold838 GLEAN mRNA 36580 65627 . - . ID=OYG_10003281_10026064; ############################################################################################### The gff file should contain 9 columns and these columns must be seperated by '\t'. 3) For the SNP coordinates file, it should contain 3 columns, here is an example: scaffold18356 19 R scaffold18356 20 Y scaffold18356 55 G scaffold18356 60 Y scaffold18356 88 Y scaffold18356 97 M scaffold18356 114 M EXPLANATIONS OF RESULT FILES 1. Files produced by RestrictionDigest::SingleItem::Double. 1) "position_FB_BF_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "position_FF_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "position_BB_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" The positions of digested fragments(all fragments or fragments in range) are saved in these file. There are 7 columns whose descriptions are as follows: '>Name of fragments', 'strand','FRONT_ENZYME/behind', 'start position', 'FRONT_ENZYME/behind', 'end position', 'length of fragment'. a) >Name of fragments. The name of this fragment, the format consists 3 parts. The 1st part is '>'. The 2nd part is the name of scaffold in the reference file User provides. The 3rd part is the serial num of this fragment in all fragments produced in this scaffold. b) strand. The location of this fragment in the reference sequence, + or -. c) FRONT_ENZYME/behind. The corresponding enzyme of the overhang of front end of this fragments. If the strand is "+" strand, the enzyme is behind enzyme, else, the enzyme is front enzyme. d) start position. The start position of this fragment on the specific scaffold. It is the start position of the overhang, not the recognition base position. e) FRONT_ENZYME/behind. Like the item 'c', but the opposite one to it. f) end postion. The end position of this fragment on the specific scaffold. It is also the end position of the overhang, not the recognition base position. g) length of fragment. The base length of this fragment. 2) "seq__FB_BF_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "seq_FF_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "seq_BB_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" The seqences of digested fragments(all fragments or fragments in range) and their names are saved in these files. There are many double-line. The first line of the double_line is the name of the fragment in the ">chromsomeName-[0-9]+" format. The second line of the double_line is the sequence of this fragment. 3) "coverage_ratio_every_chromosome_REFERENCE_by_ENZYME". The coverage ratios of every chromosome covered by three kinds of fragments are saved in this file. chromosome_name all_type_overall_rate all_type_in_range_rate FB_BF_overall_rate FB_BF_in_range_rate FF_overall_rate FF_in_range_rate BB_overall_rate BB_in_range_rate 4) "digestion_summary_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME". loci_number_of_FRONT_ENZYME loci_number_of_BEHIND_ENZYME all_3types_fragments_number all_3types_fragments_coverage all_FB_BF_fragments_number all_FB_BF_fragments_coverage all_FF_fragments_number all_FF_fragments_coverage 3types_fragments_in_range_number 3types_fragments_in_range_coverage FB_BF_fragments_in_range_number FB_BF_fragments_in_range_coverage FF_fragments_in_range_number FF_fragments_in_range_coverage BB_fragments_in_range_number BB_fragments_in_range_coverage Length_range three_types_number three_types_ratio FB_BF_number FB_BF_ratio FF_number FF_ratio BB_number BB_ratio 0bp-200bp 201bp-250b 251bp-300bp 301bp-350bp 351bp-400bp 401bp-450bp 451bp-500bp 501bp-550bp 551bp-600bp 601bp-650bp 651bp-700bp 701bp-750bp 751bp-800bp 801bp-850bp 851bp-900bp 901bp-950bp 951bp-1000bp 1001bp-longer Expected_SNPs_at_all_fragments_via_SEQUNCE_TYPE: Expected_SNPs_at_fragments_in_range_via_SEQUNCE_TYPE: 5) "genome_coverage_by_FF_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "genome_coverage_by_FB_BF_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" "genome_coverage_by_BB_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" The coverage ratios of different genomic structures of every chromosome covered by these three kinds of fragments(all fragments or fragments in range) are saved in these files. If User apply the 'all_frags_coverage_ratio' or the 'frags_in_range_coverage_ratio' method, this file will be produced. There are 13 columns in this file. They are: 'IntergenicLength', 'IntergenicMapLength','IntergenicMapRatio', 'GenesLength','GenesMapLength', 'GenesMapRatio', 'ExonLength', 'ExonMapLength', 'ExonMapRatio', 'IntronLength', 'IntronMapLength','IntronMapRatio'. The first line is header. a) GeneLength, ExonLength, IntronLength. The length of these different parts, if some part does not exist, N/A is given. b) GeneMapLength, ExonMapLength, IntronMapLength. The length of mapped part by all digested fragments or fragments in range of these parts. If some part does not exists, 0 is given. c) GeneMapRatio, ExonMapRatio, IntronMapRatio. The ratio of mapped part by all digested fragments or fragments in range of these parts. If some part does not exits, 0 is given. 2. Files produced by RestrictionDigest::SingleItem::Single. 1) "position_frags(_in_range_)?_REFERENCE_by_ENZYME". The positions of digested fragments(all fragments or fragments in range) are saved in this file. There are 7 columns whose descriptions are as follows: '>Name of fragments', 'start position', 'end position', 'length of fragment'. a) >Name of fragments. The name of this fragment, the format consists 3 parts. The 1st part is '>'. The 2nd part is the name of scaffold in the reference file User provides. The 3rd part is the serial num of this fragment in all fragments produced in this scaffold. b) start position. The start position of this fragment on the specific scaffold. It is the start position of the overhang, not the recognition base position. c) end postion. The end position of this fragment on the specific scaffold. It is also the end position of the overhang, not the recognition base position. d) length of fragment. The base length of this fragment. 2) "seq_frags(in_range_)?_REFERENCE_by_ENZYME". The seqences of digested fragments(all fragments or fragments in range) and their names are saved in this file. There are many double-line. The first line of the double_line is the name of the fragment in the ">chromsomeName-[0-9]+" format. The second line of the double_line is the sequence of this fragment. 3) "coverage_ratio_every_chromosome_REFERENCE_by_ENZYME". The length ratio of fragments in every scaffold are saved in this file. There are three columns in this file. The first column contains the names of scaffolds. And the second column contains the length ratio of all fragments in this scaffold. And the third column contains the length ratio of fragments in range in this scaffold. 4) "digestion_summary_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME". loci_number_of_ENZYME all_fragments_number all_fragments_coverage fragments_in_range_number fragments_in_range_coverage Length_range fragments_number fragments_ratio 0bp-200bp 201bp-250b 251bp-300bp 301bp-350bp 351bp-400bp 401bp-450bp 451bp-500bp 501bp-550bp 551bp-600bp 601bp-650bp 651bp-700bp 701bp-750bp 751bp-800bp 801bp-850bp 851bp-900bp 901bp-950bp 951bp-1000bp 1001bp-longer Expected_SNPs_at_all_fragments_via_SEQUNCE_TYPE: Expected_SNPs_at_fragments_in_range_via_SEQUNCE_TYPE: 5) "genome_coverage_ratio(_in_range)?_REFERENCE_by_ENZYME". If User apply the 'all_frags_coverage_ratio' or the 'frags_in_range_coverage_ratio' method, this file will be produced. There are 13 columns in this file. They are: 'IntergenicLength','IntergenicMapLength','IntergenicMapRatio', 'GenesLength', 'GenesMapLength', 'GenesMapRatio', 'ExonLength', 'ExonMapLength', 'ExonMapRatio', 'IntronLength', 'IntronMapLength','IntronMapRatio'. The first line is header. a) b) GeneLength, ExonLength, IntronLength, UTRLength. The length of these different parts, if some part does not exist, N/A is given. c) GeneMapLength, ExonMapLength, IntronMapLength, UTRMapLength. The length of mapped part by all digested fragments or fragments in range of these parts. If some part does not exists, 0 is given. d) GeneMapRatio, ExonMapRatio, IntronMapRatio, UTRMapRatio. The ratio of mapped part by all digested fragments or fragments in range of these parts. If some part does not exits, 0 is given. 3. Files produced by RestrictionDigest::MultipleItems::Double. 1) "digestion_summary_multiple_genomes_by_FRONT_ENZYME_and_BEHIND_ENZYME" Number_of_enzyme_loci REFERENCE1 REFERENCE2 .... Loci_number_of_FRONT_ENZYME Loci_number_of_BEHIND_ENZYME Length_range three_types_number_REFERENCE1 three_types_ratio_REFERENCE1 FB_BF_number_REFERENCE1 FB_BF_ratio_REFERENCE1 FF_number_REFERENCE1 FF_ratio_REFERENCE1 BB_number_REFERENCE1 BB_ratio_REFERENCE1 three_types_number_REFERENCE2 three_types_ratio_REFERENCE2 FB_BF_number_REFERENCE2 FB_BF_ratio_REFERENCE2 FF_number_REFERENCE2 FF_ratio_REFERENCE2 BB_number_REFERENCE2 BB_ratio_REFERENCE2 0bp-200bp 201bp-250b 251bp-300bp 301bp-350bp 351bp-400bp 401bp-450bp 451bp-500bp 501bp-550bp 551bp-600bp 601bp-650bp 651bp-700bp 701bp-750bp 751bp-800bp 801bp-850bp 851bp-900bp 901bp-950bp 951bp-1000bp 1001bp-longer 4. Files produced by RestrictionDigest::MultipleItems::Single. 1) "digestion_summary_multiple_genomes_by_ENZYME" Number_of_enzyme_loci REFERENCE1 REFERENCE2 .... Loci_number_of_ENZYME Length_range fragments_number_REFERENCE1 fragments_ratio_REFERENCE1 fragments_number_REFERENCE2 fragments_ratio_REFERENCE2 0bp-200bp 201bp-250b 251bp-300bp 301bp-350bp 351bp-400bp 401bp-450bp 451bp-500bp 501bp-550bp 551bp-600bp 601bp-650bp 651bp-700bp 701bp-750bp 751bp-800bp 801bp-850bp 851bp-900bp 901bp-950bp 951bp-1000bp 1001bp-longer SUPPORT AND DOCUMENTATION After installing, you can find documentation for this module with the perldoc command. perldoc RestrictionDigest You can also look for information at: RT, CPAN's request tracker (report bugs here) http://rt.cpan.org/NoAuth/Bugs.html?Dist=RestrictionDigest AnnoCPAN, Annotated CPAN documentation http://annocpan.org/dist/RestrictionDigest CPAN Ratings http://cpanratings.perl.org/d/RestrictionDigest Search CPAN http://search.cpan.org/dist/RestrictionDigest/ LICENSE AND COPYRIGHT Copyright (C) 2015 Jinpeng Wang, Li Li, Haigang Qi, Xuedi Du, and Guofan Zhang This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at: L Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license. If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license. This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder. This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed. Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.