← Index
NYTProf Performance Profile   « block view • line view • sub view »
For bin/pan_genome_post_analysis
  Run on Fri Mar 27 11:43:32 2015
Reported on Fri Mar 27 11:45:27 2015

Filename/Users/ap13/perl5/lib/perl5/Bio/Tools/GFF.pm
StatementsExecuted 18 statements in 4.66ms
Subroutines
Calls P F Exclusive
Time
Inclusive
Time
Subroutine
1112.87ms16.6msBio::Tools::GFF::::BEGIN@152Bio::Tools::GFF::BEGIN@152
1112.28ms31.9msBio::Tools::GFF::::BEGIN@151Bio::Tools::GFF::BEGIN@151
111389µs37.5msBio::Tools::GFF::::BEGIN@150Bio::Tools::GFF::BEGIN@150
11129µs78µsBio::Tools::GFF::::BEGIN@147Bio::Tools::GFF::BEGIN@147
11117µs46µsBio::Tools::GFF::::BEGIN@148Bio::Tools::GFF::BEGIN@148
11112µs601µsBio::Tools::GFF::::BEGIN@154Bio::Tools::GFF::BEGIN@154
0000s0sBio::Tools::GFF::::DESTROYBio::Tools::GFF::DESTROY
0000s0sBio::Tools::GFF::::PRINTBio::Tools::GFF::PRINT
0000s0sBio::Tools::GFF::::READLINEBio::Tools::GFF::READLINE
0000s0sBio::Tools::GFF::::TIEHANDLEBio::Tools::GFF::TIEHANDLE
0000s0sBio::Tools::GFF::::_feature_idx_by_seq_idBio::Tools::GFF::_feature_idx_by_seq_id
0000s0sBio::Tools::GFF::::_from_gff1_stringBio::Tools::GFF::_from_gff1_string
0000s0sBio::Tools::GFF::::_from_gff2_stringBio::Tools::GFF::_from_gff2_string
0000s0sBio::Tools::GFF::::_from_gff3_stringBio::Tools::GFF::_from_gff3_string
0000s0sBio::Tools::GFF::::_gff1_stringBio::Tools::GFF::_gff1_string
0000s0sBio::Tools::GFF::::_gff25_stringBio::Tools::GFF::_gff25_string
0000s0sBio::Tools::GFF::::_gff2_stringBio::Tools::GFF::_gff2_string
0000s0sBio::Tools::GFF::::_gff3_stringBio::Tools::GFF::_gff3_string
0000s0sBio::Tools::GFF::::_incrementGFF3IDBio::Tools::GFF::_incrementGFF3ID
0000s0sBio::Tools::GFF::::_parse_headerBio::Tools::GFF::_parse_header
0000s0sBio::Tools::GFF::::_parse_sequenceBio::Tools::GFF::_parse_sequence
0000s0sBio::Tools::GFF::::_seq_by_id_hBio::Tools::GFF::_seq_by_id_h
0000s0sBio::Tools::GFF::::features_attached_to_seqsBio::Tools::GFF::features_attached_to_seqs
0000s0sBio::Tools::GFF::::fhBio::Tools::GFF::fh
0000s0sBio::Tools::GFF::::from_gff_stringBio::Tools::GFF::from_gff_string
0000s0sBio::Tools::GFF::::get_seqsBio::Tools::GFF::get_seqs
0000s0sBio::Tools::GFF::::gff_stringBio::Tools::GFF::gff_string
0000s0sBio::Tools::GFF::::gff_versionBio::Tools::GFF::gff_version
0000s0sBio::Tools::GFF::::ignore_sequenceBio::Tools::GFF::ignore_sequence
0000s0sBio::Tools::GFF::::newBio::Tools::GFF::new
0000s0sBio::Tools::GFF::::newFhBio::Tools::GFF::newFh
0000s0sBio::Tools::GFF::::next_featureBio::Tools::GFF::next_feature
0000s0sBio::Tools::GFF::::next_segmentBio::Tools::GFF::next_segment
0000s0sBio::Tools::GFF::::unescapeBio::Tools::GFF::unescape
0000s0sBio::Tools::GFF::::write_featureBio::Tools::GFF::write_feature
Call graph for these subroutines as a Graphviz dot language file.
Line State
ments
Time
on line
Calls Time
in subs
Code
1#
2# BioPerl module for Bio::Tools::GFF
3#
4# Please direct questions and support issues to <bioperl-l@bioperl.org>
5#
6# Cared for by the Bioperl core team
7#
8# Copyright Matthew Pocock
9#
10# You may distribute this module under the same terms as perl itself
11
12# POD documentation - main docs before the code
13
14=head1 NAME
15
16Bio::Tools::GFF - A Bio::SeqAnalysisParserI compliant GFF format parser
17
18=head1 SYNOPSIS
19
20 use Bio::Tools::GFF;
21
22 # specify input via -fh or -file
23 my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 2);
24 my $feature;
25 # loop over the input stream
26 while($feature = $gffio->next_feature()) {
27 # do something with feature
28 }
29 $gffio->close();
30
31 # you can also obtain a GFF parser as a SeqAnalasisParserI in
32 # HT analysis pipelines (see Bio::SeqAnalysisParserI and
33 # Bio::Factory::SeqAnalysisParserFactory)
34 my $factory = Bio::Factory::SeqAnalysisParserFactory->new();
35 my $parser = $factory->get_parser(-input => \*STDIN, -method => "gff");
36 while($feature = $parser->next_feature()) {
37 # do something with feature
38 }
39
40=head1 DESCRIPTION
41
42This class provides a simple GFF parser and writer. In the sense of a
43SeqAnalysisParser, it parses an input file or stream into SeqFeatureI
44objects, but is not in any way specific to a particular analysis
45program and the output that program produces.
46
47That is, if you can get your analysis program spit out GFF, here is
48your result parser.
49
50=head1 GFF3 AND SEQUENCE DATA
51
52GFF3 supports sequence data; see
53
54http://www.sequenceontology.org/gff3.shtml
55
56There are a number of ways to deal with this -
57
58If you call
59
60 $gffio->ignore_sequence(1)
61
62prior to parsing the sequence data is ignored; this is useful if you
63just want the features. It avoids the memory overhead in building and
64caching sequences
65
66Alternatively, you can call either
67
68 $gffio->get_seqs()
69
70Or
71
72 $gffio->seq_id_by_h()
73
74At the B<end> of parsing to get either a list or hashref of Bio::Seq
75objects (see the documentation for each of these methods)
76
77Note that these objects will not have the features attached - you have
78to do this yourself, OR call
79
80 $gffio->features_attached_to_seqs(1)
81
82PRIOR to parsing; this will ensure that the Seqs have the features
83attached; ie you will then be able to call
84
85 $seq->get_SeqFeatures();
86
87And use Bio::SeqIO methods
88
89Note that auto-attaching the features to seqs will incur a higher
90memory overhead as the features must be cached until the sequence data
91is found
92
93=head1 TODO
94
95Make a Bio::SeqIO class specifically for GFF3 with sequence data
96
97=head1 FEEDBACK
98
99=head2 Mailing Lists
100
101User feedback is an integral part of the evolution of this and other
102Bioperl modules. Send your comments and suggestions preferably to one
103of the Bioperl mailing lists. Your participation is much appreciated.
104
105 bioperl-l@bioperl.org - General discussion
106 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
107
108=head2 Support
109
110Please direct usage questions or support issues to the mailing list:
111
112I<bioperl-l@bioperl.org>
113
114rather than to the module maintainer directly. Many experienced and
115reponsive experts will be able look at the problem and quickly
116address it. Please include a thorough description of the problem
117with code and data examples if at all possible.
118
119=head2 Reporting Bugs
120
121Report bugs to the Bioperl bug tracking system to help us keep track
122the bugs and their resolution. Bug reports can be submitted the web:
123
124 https://github.com/bioperl/bioperl-live/issues
125
126=head1 AUTHOR - Matthew Pocock
127
128Email mrp-at-sanger.ac.uk
129
130=head1 CONTRIBUTORS
131
132Jason Stajich, jason-at-biperl-dot-org
133Chris Mungall, cjm-at-fruitfly-dot-org
134Steffen Grossmann [SG], grossman at molgen.mpg.de
135Malcolm Cook, mec-at-stowers-institute.org
136
137=head1 APPENDIX
138
139The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _
140
141=cut
142
143# Let the code begin...
144
145package Bio::Tools::GFF;
146
147256µs2127µs
# spent 78µs (29+49) within Bio::Tools::GFF::BEGIN@147 which was called: # once (29µs+49µs) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 147
use vars qw($HAS_HTML_ENTITIES);
# spent 78µs making 1 call to Bio::Tools::GFF::BEGIN@147 # spent 49µs making 1 call to vars::import
148244µs275µs
# spent 46µs (17+29) within Bio::Tools::GFF::BEGIN@148 which was called: # once (17µs+29µs) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 148
use strict;
# spent 46µs making 1 call to Bio::Tools::GFF::BEGIN@148 # spent 29µs making 1 call to strict::import
149
1502180µs137.5ms
# spent 37.5ms (389µs+37.1) within Bio::Tools::GFF::BEGIN@150 which was called: # once (389µs+37.1ms) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 150
use Bio::Seq::SeqFactory;
# spent 37.5ms making 1 call to Bio::Tools::GFF::BEGIN@150
1512147µs131.9ms
# spent 31.9ms (2.28+29.6) within Bio::Tools::GFF::BEGIN@151 which was called: # once (2.28ms+29.6ms) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 151
use Bio::LocatableSeq;
# spent 31.9ms making 1 call to Bio::Tools::GFF::BEGIN@151
1522202µs116.6ms
# spent 16.6ms (2.87+13.7) within Bio::Tools::GFF::BEGIN@152 which was called: # once (2.87ms+13.7ms) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 152
use Bio::SeqFeature::Generic;
# spent 16.6ms making 1 call to Bio::Tools::GFF::BEGIN@152
153
15424.01ms21.19ms
# spent 601µs (12+589) within Bio::Tools::GFF::BEGIN@154 which was called: # once (12µs+589µs) by Bio::Roary::GeneNamesFromGFF::BEGIN@19 at line 154
use base qw(Bio::Root::Root Bio::SeqAnalysisParserI Bio::Root::IO);
# spent 601µs making 1 call to Bio::Tools::GFF::BEGIN@154 # spent 589µs making 1 call to base::import
155
1561500nsmy $i = 0;
15716µsmy %GFF3_ID_Tags = map { $_ => $i++ } qw(ID Parent Target);
158
159# for skipping data that may be represented elsewhere; currently, this is
160# only the score
16111µsmy %SKIPPED_TAGS = map { $_ => 1 } qw(score);
162
163
164=head2 new
165
166 Title : new
167 Usage : my $parser = Bio::Tools::GFF->new(-gff_version => 2,
168 -file => "filename.gff");
169 or
170 my $writer = Bio::Tools::GFF->new(-gff_version => 3,
171 -file => ">filename.gff3");
172 Function: Creates a new instance. Recognized named parameters are -file, -fh,
173 and -gff_version.
174 Returns : a new object
175 Args : named parameters
176 -gff_version => [1,2,3]
177
178=cut
179
180{ # make a class variable such that we can generate unique ID's over
181 # a session, no matter how many instances of GFF.pm we make
182 # since these have to be unique within the scope of a GFF file.
183
1842800ns my $gff3_featureID = 0;
185
186 sub _incrementGFF3ID {
187 my ($self) = @_;
188 return ++ $gff3_featureID;
189 }
190}
191
192
193sub new {
194 my ($class, @args) = @_;
195 my $self = $class->SUPER::new(@args);
196
197 my ($gff_version, $noparse) = $self->_rearrange([qw(GFF_VERSION NOPARSE)],@args);
198
199 # initialize IO
200 $self->_initialize_io(@args);
201 $self->_parse_header() unless $noparse;
202
203 $gff_version ||= 2;
204 if( ! $self->gff_version($gff_version) ) {
205 $self->throw("Can't build a GFF object with the unknown version ".
206 $gff_version);
207 }
208 $self->{'_first'} = 1;
209 return $self;
210}
211
212
213=head2 _parse_header
214
215 Title : _parse_header
216 Usage : $gffio->_parse_header()
217 Function: used to turn parse GFF header lines. currently
218 produces Bio::LocatableSeq objects from ##sequence-region
219 lines
220 Returns : 1 on success
221 Args : none
222
223
224=cut
225
226sub _parse_header{
227 my ($self) = @_;
228
229 my @unhandled;
230 local $^W = 0; # hide warnings when we try and parse from a file opened
231 # for writing - there isn't really a better way to do
232 # AFAIK - cannot detech if a FH is read or write.
233 while(my $line = $self->_readline()){
234 my $handled = 0;
235 next if /^\s+$/;
236 if($line =~ /^\#\#sequence-region\s+(\S+)\s+(\S+)\s+(\S+)\s*/){
237 my($seqid,$start,$end) = ($1,$2,$3);
238 push @{ $self->{'segments'} }, Bio::LocatableSeq->new(
239 -id => unescape($seqid),
240 -start => $start,
241 -end => $end,
242 -length => ($end - $start + 1), ## make the length explicit
243 );
244 $handled = 1;
245 } elsif($line =~ /^(\#\#feature-ontology)/) {
246 #to be implemented
247 $self->warn("$1 header tag parsing unimplemented");
248 } elsif($line =~ /^(\#\#attribute-ontology)/) {
249 #to be implemented
250 $self->warn("$1 header tag parsing unimplemented");
251 } elsif($line =~ /^(\#\#source-ontology)/) {
252 #to be implemented
253 $self->warn("$1 header tag parsing unimplemented");
254 } elsif($line =~ /^(\#\#\#)/) {
255 #to be implemented
256 $self->warn("$1 header tag parsing unimplemented");
257 } elsif($line =~ /^(\#\#FASTA)/) {
258 # initial ##FASTA is optional - artemis does not use it
259 $line = $self->_readline();
260 if ($line !~ /^\>(\S+)/) {
261 $self->throw("##FASTA directive must be followed by fasta header, not: $line");
262 }
263 }
264
265 if ($line =~ /^\>(.*)/) {
266 # seq data can be at header or footer
267 my $seq = $self->_parse_sequence($line);
268 if ($seq) {
269 $self->_seq_by_id_h->{$seq->primary_id} = $seq;
270 }
271 }
272
273
274 if(!$handled){
275 push @unhandled, $line;
276 }
277
278 #looks like the header is over!
279 last unless $line =~ /^\#/;
280 }
281
282 foreach my $line (@unhandled){
283 $self->_pushback($line);
284 }
285
286 return 1;
287}
288
289sub _parse_sequence {
290 my ($self, $line) = @_;
291
292 if ($line =~ /^\>(.*)/) {
293
294 my $seqid = $1;
295 $seqid =~ s/\s+$//;
296 my $desc = '';
297 if ($seqid =~ /(\S+)\s+(.*)/) {
298 ($seqid, $desc) = ($1,$2);
299 }
300 my $res = '';
301 while (my $line = $self->_readline) {
302 if ($line =~ /^\#/) {
303 last;
304 }
305 if ($line =~ /^\>/) {
306 $self->_pushback($line);
307 last;
308 }
309 $line =~ s/\s//g;
310 $res .= $line;
311 }
312 return if $self->ignore_sequence;
313
314 my $seqfactory = Bio::Seq::SeqFactory->new('Bio::Seq');
315 my $seq = $seqfactory->create(-seq=>$res,
316 -id=>$seqid,
317 -desc=>$desc);
318 $seq->accession_number($seqid);
319 if ($self->features_attached_to_seqs) {
320 my @feats =
321 @{$self->_feature_idx_by_seq_id->{$seqid}};
322 $seq->add_SeqFeature($_) foreach @feats;
323 @{$self->_feature_idx_by_seq_id->{$seqid}} = ();
324 }
325 return $seq;
326 }
327 else {
328 $self->throw("expected fasta header, not: $line");
329 }
330}
331
332
333=head2 next_segment
334
335 Title : next_segment
336 Usage : my $seq = $gffio->next_segment;
337 Function: Returns a Bio::LocatableSeq object corresponding to a
338 GFF "##sequence-region" header line.
339 Example :
340 Returns : A Bio::LocatableSeq object, or undef if
341 there are no more sequences.
342 Args : none
343
344
345=cut
346
347sub next_segment{
348 my ($self,@args) = @_;
349 return shift @{ $self->{'segments'} } if defined $self->{'segments'};
350 return;
351}
352
353
354=head2 next_feature
355
356 Title : next_feature
357 Usage : $seqfeature = $gffio->next_feature();
358 Function: Returns the next feature available in the input file or stream, or
359 undef if there are no more features.
360 Example :
361 Returns : A Bio::SeqFeatureI implementing object, or undef if there are no
362 more features.
363 Args : none
364
365=cut
366
367sub next_feature {
368 my ($self) = @_;
369
370 my $gff_string;
371
372 # be graceful about empty lines or comments, and make sure we return undef
373 # if the input's consumed
374 while(($gff_string = $self->_readline()) && defined($gff_string)) {
375 if ($gff_string =~ /^\#\#\#/) {
376 # all forward refs have been seen; TODO
377 }
378 next if($gff_string =~ /^\#/ || $gff_string =~ /^\s*$/ ||
379 $gff_string =~ m{^//});
380
381 while ($gff_string =~ /^\>(.+)/) {
382 # fasta can be in header or footer
383 my $seq = $self->_parse_sequence($gff_string);
384 if ($seq) {
385 $self->_seq_by_id_h->{$seq->primary_id} = $seq;
386 $gff_string = $self->_readline;
387 last unless $gff_string;
388 }
389 }
390 last;
391 }
392 return unless $gff_string;
393
394 my $feat = Bio::SeqFeature::Generic->new();
395 $self->from_gff_string($feat, $gff_string);
396
397 if ($self->features_attached_to_seqs) {
398 push(@{$self->_feature_idx_by_seq_id->{$feat->seq_id}},
399 $feat);
400 }
401
402 return $feat;
403}
404
405sub _feature_idx_by_seq_id {
406 my $self = shift;
407 $self->{__feature_idx_by_seq_id} = shift if @_;
408 $self->{__feature_idx_by_seq_id} = {}
409 unless $self->{__feature_idx_by_seq_id};
410 return $self->{__feature_idx_by_seq_id};
411}
412
413
414=head2 from_gff_string
415
416 Title : from_gff_string
417 Usage : $gff->from_gff_string($feature, $gff_string);
418 Function: Sets properties of a SeqFeatureI object from a GFF-formatted
419 string. Interpretation of the string depends on the version
420 that has been specified at initialization.
421
422 This method is used by next_feature(). It actually dispatches to
423 one of the version-specific (private) methods.
424 Example :
425 Returns : void
426 Args : A Bio::SeqFeatureI implementing object to be initialized
427 The GFF-formatted string to initialize it from
428
429=cut
430
431sub from_gff_string {
432 my ($self, $feat, $gff_string) = @_;
433
434 if($self->gff_version() == 1) {
435 return $self->_from_gff1_string($feat, $gff_string);
436 } elsif( $self->gff_version() == 3 ) {
437 return $self->_from_gff3_string($feat, $gff_string);
438 } else {
439 return $self->_from_gff2_string($feat, $gff_string);
440 }
441}
442
443
444=head2 _from_gff1_string
445
446 Title : _from_gff1_string
447 Usage :
448 Function:
449 Example :
450 Returns : void
451 Args : A Bio::SeqFeatureI implementing object to be initialized
452 The GFF-formatted string to initialize it from
453
454=cut
455
456sub _from_gff1_string {
457 my ($gff, $feat, $string) = @_;
458 chomp $string;
459 my ($seqname, $source, $primary, $start, $end, $score,
460 $strand, $frame, @group) = split(/\t/, $string);
461
462 if ( !defined $frame ) {
463 $feat->throw("[$string] does not look like GFF to me");
464 }
465 $frame = 0 unless( $frame =~ /^\d+$/);
466 $feat->seq_id($seqname);
467 $feat->source_tag($source);
468 $feat->primary_tag($primary);
469 $feat->start($start);
470 $feat->end($end);
471 $feat->frame($frame);
472 if ( $score eq '.' ) {
473 #$feat->score(undef);
474 } else {
475 $feat->score($score);
476 }
477 if ( $strand eq '-' ) { $feat->strand(-1); }
478 if ( $strand eq '+' ) { $feat->strand(1); }
479 if ( $strand eq '.' ) { $feat->strand(0); }
480 foreach my $g ( @group ) {
481 if ( $g =~ /(\S+)=(\S+)/ ) {
482 my $tag = $1;
483 my $value = $2;
484 $feat->add_tag_value($1, $2);
485 } else {
486 $feat->add_tag_value('group', $g);
487 }
488 }
489}
490
491
492=head2 _from_gff2_string
493
494 Title : _from_gff2_string
495 Usage :
496 Function:
497 Example :
498 Returns : void
499 Args : A Bio::SeqFeatureI implementing object to be initialized
500 The GFF2-formatted string to initialize it from
501
502
503=cut
504
505sub _from_gff2_string {
506 my ($gff, $feat, $string) = @_;
507 chomp($string);
508
509 # according to the Sanger website, GFF2 should be single-tab
510 # separated elements, and the free-text at the end should contain
511 # text-translated tab symbols but no "real" tabs, so splitting on
512 # \t is safe, and $attribs gets the entire attributes field to be
513 # parsed later
514
515 # sendu: but the tag value pair can (should?) be separated by a tab. The
516 # 'no tabs' thing seems to apply only to the free text that is allowed for
517 # the value
518
519 my ($seqname, $source, $primary, $start,
520 $end, $score, $strand, $frame, @attribs) = split(/\t+/, $string);
521 my $attribs = join ' ', @attribs;
522
523 if ( !defined $frame ) {
524 $feat->throw("[$string] does not look like GFF2 to me");
525 }
526 $feat->seq_id($seqname);
527 $feat->source_tag($source);
528 $feat->primary_tag($primary);
529 $feat->start($start);
530 $feat->end($end);
531 $feat->frame($frame);
532 if ( $score eq '.' ) {
533 # $feat->score(undef);
534 } else {
535 $feat->score($score);
536 }
537 if ( $strand eq '-' ) { $feat->strand(-1); }
538 if ( $strand eq '+' ) { $feat->strand(1); }
539 if ( $strand eq '.' ) { $feat->strand(0); }
540
541
542 # <Begin Inefficient Code from Mark Wilkinson>
543 # this routine is necessay to allow the presence of semicolons in
544 # quoted text Semicolons are the delimiting character for new
545 # tag/value attributes. it is more or less a "state" machine, with
546 # the "quoted" flag going up and down as we pass thorugh quotes to
547 # distinguish free-text semicolon and hash symbols from GFF control
548 # characters
549
550 my $flag = 0; # this could be changed to a bit and just be twiddled
551 my @parsed;
552
553 # run through each character one at a time and check it
554 # NOTE: changed to foreach loop which is more efficient in perl
555 # --jasons
556 for my $a ( split //, $attribs ) {
557 # flag up on entering quoted text, down on leaving it
558 if( $a eq '"') { $flag = ( $flag == 0 ) ? 1:0 }
559 elsif( $a eq ';' && $flag ) { $a = "INSERT_SEMICOLON_HERE"}
560 elsif( $a eq '#' && ! $flag ) { last }
561 push @parsed, $a;
562 }
563 $attribs = join "", @parsed; # rejoin into a single string
564
565 # <End Inefficient Code>
566 # Please feel free to fix this and make it more "perlish"
567
568 my @key_vals = split /;/, $attribs; # attributes are semicolon-delimited
569
570 foreach my $pair ( @key_vals ) {
571 # replace semicolons that were removed from free-text above.
572 $pair =~ s/INSERT_SEMICOLON_HERE/;/g;
573
574 # separate the key from the value
575 my ($blank, $key, $values) = split /^\s*([\w\d]+)\s/, $pair;
576
577 if( defined $values ) {
578 my @values;
579 # free text is quoted, so match each free-text block
580 # and remove it from the $values string
581 while ($values =~ s/"(.*?)"//){
582 # and push it on to the list of values (tags may have
583 # more than one value... and the value may be undef)
584 push @values, $1;
585 }
586
587 # and what is left over should be space-separated
588 # non-free-text values
589
590 my @othervals = split /\s+/, $values;
591 foreach my $othervalue(@othervals){
592 # get rid of any empty strings which might
593 # result from the split
594 if (CORE::length($othervalue) > 0) {push @values, $othervalue}
595 }
596
597 foreach my $value(@values){
598 $feat->add_tag_value($key, $value);
599 }
600 }
601 }
602}
603
604
605sub _from_gff3_string {
606 my ($gff, $feat, $string) = @_;
607 chomp($string);
608
609 # according to the now nearly final GFF3 spec, columns should
610 # be tab separated, allowing unescaped spaces to occur in
611 # column 9
612
613 my ($seqname, $source, $primary, $start, $end,
614 $score, $strand, $frame, $groups) = split(/\t/, $string);
615
616 if ( ! defined $frame ) {
617 $feat->throw("[$string] does not look like GFF3 to me");
618 }
619 $feat->seq_id($seqname);
620 $feat->source_tag($source);
621 $feat->primary_tag($primary);
622 $feat->start($start);
623 $feat->end($end);
624 $feat->frame($frame);
625 if ( $score eq '.' ) {
626 #$feat->score(undef);
627 } else {
628 $feat->score($score);
629 }
630 if ( $strand eq '-' ) { $feat->strand(-1); }
631 if ( $strand eq '+' ) { $feat->strand(1); }
632 if ( $strand eq '.' ) { $feat->strand(0); }
633 my @groups = split(/\s*;\s*/, $groups);
634
635 for my $group (@groups) {
636 my ($tag,$value) = split /=/,$group;
637 $tag = unescape($tag);
638 my @values = map {unescape($_)} split /,/,$value;
639 for my $v ( @values ) { $feat->add_tag_value($tag,$v); }
640 }
641}
642
643# taken from Bio::DB::GFF
644sub unescape {
645 my $v = shift;
646 $v =~ tr/+/ /;
647 $v =~ s/%([0-9a-fA-F]{2})/chr hex($1)/ge;
648 return $v;
649}
650
651
652=head2 write_feature
653
654 Title : write_feature
655 Usage : $gffio->write_feature($feature);
656 Function: Writes the specified SeqFeatureI object in GFF format to the stream
657 associated with this instance.
658 Returns : none
659 Args : An array of Bio::SeqFeatureI implementing objects to be serialized
660
661=cut
662
663sub write_feature {
664 my ($self, @features) = @_;
665 return unless @features;
666 if( $self->{'_first'} && $self->gff_version() == 3 ) {
667 $self->_print("##gff-version 3\n");
668 }
669 $self->{'_first'} = 0;
670 foreach my $feature ( @features ) {
671 $self->_print($self->gff_string($feature)."\n");
672 }
673}
674
675
676=head2 gff_string
677
678 Title : gff_string
679 Usage : $gffstr = $gffio->gff_string($feature);
680 Function: Obtain the GFF-formatted representation of a SeqFeatureI object.
681 The formatting depends on the version specified at initialization.
682
683 This method is used by write_feature(). It actually dispatches to
684 one of the version-specific (private) methods.
685 Example :
686 Returns : A GFF-formatted string representation of the SeqFeature
687 Args : A Bio::SeqFeatureI implementing object to be GFF-stringified
688
689=cut
690
691sub gff_string{
692 my ($self, $feature) = @_;
693
694 if($self->gff_version() == 1) {
695 return $self->_gff1_string($feature);
696 } elsif( $self->gff_version() == 3 ) {
697 return $self->_gff3_string($feature);
698 } elsif( $self->gff_version() == 2.5 ) {
699 return $self->_gff25_string($feature);
700 } else {
701 return $self->_gff2_string($feature);
702 }
703}
704
705
706=head2 _gff1_string
707
708 Title : _gff1_string
709 Usage : $gffstr = $gffio->_gff1_string
710 Function:
711 Example :
712 Returns : A GFF1-formatted string representation of the SeqFeature
713 Args : A Bio::SeqFeatureI implementing object to be GFF-stringified
714
715=cut
716
717sub _gff1_string{
718 my ($gff, $feat) = @_;
719 my ($str,$score,$frame,$name,$strand);
720
721 if( $feat->can('score') ) {
722 $score = $feat->score();
723 }
724 $score = '.' unless defined $score;
725
726 if( $feat->can('frame') ) {
727 $frame = $feat->frame();
728 }
729 $frame = '.' unless defined $frame;
730
731 $strand = $feat->strand();
732 if(! $strand) {
733 $strand = ".";
734 } elsif( $strand == 1 ) {
735 $strand = '+';
736 } elsif ( $feat->strand == -1 ) {
737 $strand = '-';
738 }
739
740 if( $feat->can('seqname') ) {
741 $name = $feat->seq_id();
742 $name ||= 'SEQ';
743 } else {
744 $name = 'SEQ';
745 }
746
747 $str = join("\t",
748 $name,
749 $feat->source_tag,
750 $feat->primary_tag,
751 $feat->start,
752 $feat->end,
753 $score,
754 $strand,
755 $frame);
756
757 foreach my $tag ( $feat->get_all_tags ) {
758 next if exists $SKIPPED_TAGS{$tag};
759 foreach my $value ( $feat->get_tag_values($tag) ) {
760 $str .= " $tag=$value" if $value;
761 }
762 }
763
764 return $str;
765}
766
767
768=head2 _gff2_string
769
770 Title : _gff2_string
771 Usage : $gffstr = $gffio->_gff2_string
772 Function:
773 Example :
774 Returns : A GFF2-formatted string representation of the SeqFeature
775 Args : A Bio::SeqFeatureI implementing object to be GFF2-stringified
776
777=cut
778
779sub _gff2_string{
780 my ($gff, $origfeat) = @_;
781 my $feat;
782 if ($origfeat->isa('Bio::SeqFeature::FeaturePair')){
783 $feat = $origfeat->feature2;
784 } else {
785 $feat = $origfeat;
786 }
787 my ($str1, $str2,$score,$frame,$name,$strand);
788
789 if( $feat->can('score') ) {
790 $score = $feat->score();
791 }
792 $score = '.' unless defined $score;
793
794 if( $feat->can('frame') ) {
795 $frame = $feat->frame();
796 }
797 $frame = '.' unless defined $frame;
798
799 $strand = $feat->strand();
800 if(! $strand) {
801 $strand = ".";
802 } elsif( $strand == 1 ) {
803 $strand = '+';
804 } elsif ( $feat->strand == -1 ) {
805 $strand = '-';
806 }
807
808 if( $feat->can('seqname') ) {
809 $name = $feat->seq_id();
810 }
811 $name ||= 'SEQ';
812
813 $str1 = join("\t",
814 $name,
815 $feat->source_tag(),
816 $feat->primary_tag(),
817 $feat->start(),
818 $feat->end(),
819 $score,
820 $strand,
821 $frame);
822 # the routine below is the only modification I made to the original
823 # ->gff_string routine (above) as on November 17th, 2000, the
824 # Sanger webpage describing GFF2 format reads: "From version 2
825 # onwards, the attribute field must have a tag value structure
826 # following the syntax used within objects in a .ace file,
827 # flattened onto one line by semicolon separators. Tags must be
828 # standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values
829 # must be quoted with double quotes".
830 # MW
831
832 my @group;
833
834 foreach my $tag ( $feat->get_all_tags ) {
835 next if exists $SKIPPED_TAGS{$tag};
836 my @v;
837 foreach my $value ( $feat->get_tag_values($tag) ) {
838 unless( defined $value && length($value) ) {
839 # quote anything other than valid tag/value characters
840 $value = '""';
841 } elsif ($value =~ /[^A-Za-z0-9_]/){
842 # substitute tab and newline chars by their UNIX equivalent
843 $value =~ s/\t/\\t/g;
844 $value =~ s/\n/\\n/g;
845 $value = '"' . $value . '" ';
846 }
847 push @v, $value;
848 # for this tag (allowed in GFF2 and .ace format)
849 }
850 push @group, "$tag ".join(" ", @v);
851 }
852
853 $str2 .= join(' ; ', @group);
854 # Add Target information for Feature Pairs
855 if( ! $feat->has_tag('Target') && # This is a bad hack IMHO
856 ! $feat->has_tag('Group') &&
857 $origfeat->isa('Bio::SeqFeature::FeaturePair') ) {
858 $str2 = sprintf("Target %s %d %d", $origfeat->feature1->seq_id,
859 ( $origfeat->feature1->strand < 0 ?
860 ( $origfeat->feature1->end,
861 $origfeat->feature1->start) :
862 ( $origfeat->feature1->start,
863 $origfeat->feature1->end)
864 )) . ($str2?" ; ".$str2:""); # need to put Target information before other tag/value pairs - mw
865 }
866 return $str1."\t".$str2;
867}
868
869
870=head2 _gff25_string
871
872 Title : _gff25_string
873 Usage : $gffstr = $gffio->_gff2_string
874 Function: To get a format of GFF that is peculiar to Gbrowse/Bio::DB::GFF
875 Example :
876 Returns : A GFF2.5-formatted string representation of the SeqFeature
877 Args : A Bio::SeqFeatureI implementing object to be GFF2.5-stringified
878
879=cut
880
881sub _gff25_string {
882 my ($gff, $origfeat) = @_;
883 my $feat;
884 if ($origfeat->isa('Bio::SeqFeature::FeaturePair')){
885 $feat = $origfeat->feature2;
886 } else {
887 $feat = $origfeat;
888 }
889 my ($str1, $str2,$score,$frame,$name,$strand);
890
891 if( $feat->can('score') ) {
892 $score = $feat->score();
893 }
894 $score = '.' unless defined $score;
895
896 if( $feat->can('frame') ) {
897 $frame = $feat->frame();
898 }
899 $frame = '.' unless defined $frame;
900
901 $strand = $feat->strand();
902 if(! $strand) {
903 $strand = ".";
904 } elsif( $strand == 1 ) {
905 $strand = '+';
906 } elsif ( $feat->strand == -1 ) {
907 $strand = '-';
908 }
909
910 if( $feat->can('seqname') ) {
911 $name = $feat->seq_id();
912 $name ||= 'SEQ';
913 } else {
914 $name = 'SEQ';
915 }
916 $str1 = join("\t",
917 $name,
918 $feat->source_tag(),
919 $feat->primary_tag(),
920 $feat->start(),
921 $feat->end(),
922 $score,
923 $strand,
924 $frame);
925
926 my @all_tags = $feat->all_tags;
927 my @group; my @firstgroup;
928 if (@all_tags) { # only play this game if it is worth playing...
929 foreach my $tag ( @all_tags ) {
930 my @v;
931 foreach my $value ( $feat->get_tag_values($tag) ) {
932 next if exists $SKIPPED_TAGS{$tag};
933 unless( defined $value && length($value) ) {
934 $value = '""';
935 } elsif ($value =~ /[^A-Za-z0-9_]/){
936 $value =~ s/\t/\\t/g; # substitute tab and newline
937 # characters
938 $value =~ s/\n/\\n/g; # to their UNIX equivalents
939 $value = '"' . $value . '" ';
940 } # if the value contains
941 # anything other than valid
942 # tag/value characters, then
943 # quote it
944 push @v, $value;
945 # for this tag (allowed in GFF2 and .ace format)
946 }
947 if (($tag eq 'Group') || ($tag eq 'Target')){ # hopefully we wont get both...
948 push @firstgroup, "$tag ".join(" ", @v);
949 } else {
950 push @group, "$tag ".join(" ", @v);
951 }
952 }
953 }
954 $str2 = join(' ; ', (@firstgroup, @group));
955 # Add Target information for Feature Pairs
956 if( ! $feat->has_tag('Target') && # This is a bad hack IMHO
957 ! $feat->has_tag('Group') &&
958 $origfeat->isa('Bio::SeqFeature::FeaturePair') ) {
959 $str2 = sprintf("Target %s ; tstart %d ; tend %d", $origfeat->feature1->seq_id,
960 ( $origfeat->feature1->strand < 0 ?
961 ( $origfeat->feature1->end,
962 $origfeat->feature1->start) :
963 ( $origfeat->feature1->start,
964 $origfeat->feature1->end)
965 )) . ($str2?" ; ".$str2:""); # need to put the target info before other tag/value pairs - mw
966 }
967 return $str1 . "\t". $str2;
968}
969
970
971=head2 _gff3_string
972
973 Title : _gff3_string
974 Usage : $gffstr = $gffio->_gff3_string
975 Function:
976 Example :
977 Returns : A GFF3-formatted string representation of the SeqFeature
978 Args : A Bio::SeqFeatureI implementing object to be GFF3-stringified
979
980=cut
981
982sub _gff3_string {
983 my ($gff, $origfeat) = @_;
984 my $feat;
985 if ($origfeat->isa('Bio::SeqFeature::FeaturePair')){
986 $feat = $origfeat->feature2;
987 } else {
988 $feat = $origfeat;
989 }
990
991 my $ID = $gff->_incrementGFF3ID();
992
993 my ($score,$frame,$name,$strand);
994
995 if( $feat->can('score') ) {
996 $score = $feat->score();
997 }
998 $score = '.' unless defined $score;
999
1000 if( $feat->can('frame') ) {
1001 $frame = $feat->frame();
1002 }
1003 $frame = '1' unless defined $frame;
1004
1005 $strand = $feat->strand();
1006
1007 if(! $strand) {
1008 $strand = ".";
1009 } elsif( $strand == 1 ) {
1010 $strand = '+';
1011 } elsif ( $feat->strand == -1 ) {
1012 $strand = '-';
1013 }
1014
1015 if( $feat->can('seqname') ) {
1016 $name = $feat->seq_id();
1017 $name ||= 'SEQ';
1018 } else {
1019 $name = 'SEQ';
1020 }
1021
1022 my @groups;
1023
1024 # force leading ID and Parent tags
1025 my @all_tags = grep { ! exists $GFF3_ID_Tags{$_} } $feat->all_tags;
1026 for my $t ( sort { $GFF3_ID_Tags{$b} <=> $GFF3_ID_Tags{$a} }
1027 keys %GFF3_ID_Tags ) {
1028 unshift @all_tags, $t if $feat->has_tag($t);
1029 }
1030
1031 for my $tag ( @all_tags ) {
1032 next if exists $SKIPPED_TAGS{$tag};
1033 # next if $tag eq 'Target';
1034 if ($tag eq 'Target' && ! $origfeat->isa('Bio::SeqFeature::FeaturePair')){
1035 # simple Target,start,stop
1036 my($target_id, $b,$e,$strand) = $feat->get_tag_values($tag);
1037 next unless(defined($e) && defined($b) && $target_id);
1038 ($b,$e)= ($e,$b) if(defined $strand && $strand<0);
1039 $target_id =~ s/([\t\n\r%&\=;,])/sprintf("%%%X",ord($1))/ge;
1040 push @groups, sprintf("Target=%s %d %d", $target_id,$b,$e);
1041 next;
1042 }
1043
1044 my $valuestr;
1045 # a string which will hold one or more values
1046 # for this tag, with quoted free text and
1047 # space-separated individual values.
1048 my @v;
1049 for my $value ( $feat->get_tag_values($tag) ) {
1050 if( defined $value && length($value) ) {
1051 #$value =~ tr/ /+/; #spaces are allowed now
1052 if ( ref $value eq 'Bio::Annotation::Comment') {
1053 $value = $value->text;
1054 }
1055
1056 if ($value =~ /[^a-zA-Z0-9\,\;\=\.:\%\^\*\$\@\!\+\_\?\-]/) {
1057 $value =~ s/\t/\\t/g; # substitute tab and newline
1058 # characters
1059 $value =~ s/\n/\\n/g; # to their UNIX equivalents
1060
1061 # Unescaped quotes are not allowed in GFF3
1062 # $value = '"' . $value . '"';
1063 }
1064 $value =~ s/([\t\n\r%&\=;,])/sprintf("%%%X",ord($1))/ge;
1065 } else {
1066 # if it is completely empty, then just make empty double quotes
1067 $value = '""';
1068 }
1069 push @v, $value;
1070 }
1071 # can we figure out how to improve this?
1072 $tag = lcfirst($tag) unless ( $tag =~
1073 /^(ID|Name|Alias|Parent|Gap|Target|Derives_from|Note|Dbxref|Ontology_term)$/);
1074
1075 push @groups, "$tag=".join(",",@v);
1076 }
1077 # Add Target information for Feature Pairs
1078 if( $feat->has_tag('Target') &&
1079 ! $feat->has_tag('Group') &&
1080 $origfeat->isa('Bio::SeqFeature::FeaturePair') ) {
1081
1082 my $target_id = $origfeat->feature1->seq_id;
1083 $target_id =~ s/([\t\n\r%&\=;,])/sprintf("%%%X",ord($1))/ge;
1084
1085 push @groups, sprintf("Target=%s %d %d",
1086 $target_id,
1087 ( $origfeat->feature1->strand < 0 ?
1088 ( $origfeat->feature1->end,
1089 $origfeat->feature1->start) :
1090 ( $origfeat->feature1->start,
1091 $origfeat->feature1->end)
1092 ));
1093 }
1094
1095 # unshift @groups, "ID=autogenerated$ID" unless ($feat->has_tag('ID'));
1096 if ( $feat->can('name') && defined($feat->name) ) {
1097 # such as might be for Bio::DB::SeqFeature
1098 unshift @groups, 'Name=' . $feat->name;
1099 }
1100
1101 my $gff_string = "";
1102 if ($feat->location->isa("Bio::Location::SplitLocationI")) {
1103 my @locs = $feat->location->each_Location;
1104 foreach my $loc (@locs) {
1105 $gff_string .= join("\t",
1106 $name,
1107 $feat->source_tag() || '.',
1108 $feat->primary_tag(),
1109 $loc->start(),
1110 $loc->end(),
1111 $score,
1112 $strand,
1113 $frame,
1114 join(';', @groups)) . "\n";
1115 }
1116 chop $gff_string;
1117 return $gff_string;
1118 } else {
1119 $gff_string = join("\t",
1120 $name,
1121 $feat->source_tag() || '.',
1122 $feat->primary_tag(),
1123 $feat->start(),
1124 $feat->end(),
1125 $score,
1126 $strand,
1127 $frame,
1128 join(';', @groups));
1129 }
1130 return $gff_string;
1131}
1132
1133
1134=head2 gff_version
1135
1136 Title : _gff_version
1137 Usage : $gffversion = $gffio->gff_version
1138 Function:
1139 Example :
1140 Returns : The GFF version this parser will accept and emit.
1141 Args : none
1142
1143=cut
1144
1145sub gff_version {
1146 my ($self, $value) = @_;
1147 if(defined $value && grep {$value == $_ } ( 1, 2, 2.5, 3)) {
1148 $self->{'GFF_VERSION'} = $value;
1149 }
1150 return $self->{'GFF_VERSION'};
1151}
1152
1153
1154# Make filehandles
1155
1156=head2 newFh
1157
1158 Title : newFh
1159 Usage : $fh = Bio::Tools::GFF->newFh(-file=>$filename,-format=>'Format')
1160 Function: does a new() followed by an fh()
1161 Example : $fh = Bio::Tools::GFF->newFh(-file=>$filename,-format=>'Format')
1162 $feature = <$fh>; # read a feature object
1163 print $fh $feature; # write a feature object
1164 Returns : filehandle tied to the Bio::Tools::GFF class
1165 Args :
1166
1167=cut
1168
1169sub newFh {
1170 my $class = shift;
1171 return unless my $self = $class->new(@_);
1172 return $self->fh;
1173}
1174
1175
1176=head2 fh
1177
1178 Title : fh
1179 Usage : $obj->fh
1180 Function:
1181 Example : $fh = $obj->fh; # make a tied filehandle
1182 $feature = <$fh>; # read a feature object
1183 print $fh $feature; # write a feature object
1184 Returns : filehandle tied to Bio::Tools::GFF class
1185 Args : none
1186
1187=cut
1188
1189
1190sub fh {
1191 my $self = shift;
1192 my $class = ref($self) || $self;
1193 my $s = Symbol::gensym;
1194 tie $$s,$class,$self;
1195 return $s;
1196}
1197
1198# This accessor is used for accessing the Bio::Seq objects from a GFF3
1199# file; if the file you are using has no sequence data you can ignore
1200# this accessor
1201
1202# This accessor returns a hash reference containing Bio::Seq objects,
1203# indexed by Bio::Seq->primary_id
1204
1205sub _seq_by_id_h {
1206 my $self = shift;
1207
1208 return $self->{'_seq_by_id_h'} = shift if @_;
1209 $self->{'_seq_by_id_h'} = {}
1210 unless $self->{'_seq_by_id_h'};
1211 return $self->{'_seq_by_id_h'};
1212}
1213
1214
1215=head2 get_seqs
1216
1217 Title : get_seqs
1218 Usage :
1219 Function: Returns all Bio::Seq objects populated by GFF3 file
1220 Example :
1221 Returns :
1222 Args :
1223
1224=cut
1225
1226sub get_seqs {
1227 my ($self,@args) = @_;
1228 return values %{$self->_seq_by_id_h};
1229}
1230
1231
1232=head2 features_attached_to_seqs
1233
1234 Title : features_attached_to_seqs
1235 Usage : $obj->features_attached_to_seqs(1);
1236 Function: For use with GFF3 containg sequence only
1237
1238 Setting this B<before> parsing ensures that all Bio::Seq object
1239 created will have the appropriate features added to them
1240
1241 defaults to false (off)
1242
1243 Note that this mode will incur higher memory usage because features
1244 will have to be cached until the relevant feature comes along
1245
1246 Example :
1247 Returns : value of features_attached_to_seqs (a boolean)
1248 Args : on set, new value (a boolean, optional)
1249
1250
1251=cut
1252
1253sub features_attached_to_seqs{
1254 my $self = shift;
1255
1256 return $self->{'_features_attached_to_seqs'} = shift if @_;
1257 return $self->{'_features_attached_to_seqs'};
1258}
1259
1260
1261=head2 ignore_sequence
1262
1263 Title : ignore_sequence
1264 Usage : $obj->ignore_sequence(1);
1265 Function: For use with GFF3 containg sequence only
1266
1267 Setting this B<before> parsing means that all sequence data will be
1268 ignored
1269
1270 Example :
1271 Returns : value of ignore_sequence (a boolean)
1272 Args : on set, new value (a boolean, optional)
1273
1274=cut
1275
1276sub ignore_sequence{
1277 my $self = shift;
1278
1279 return $self->{'_ignore_sequence'} = shift if @_;
1280 return $self->{'_ignore_sequence'};
1281}
1282
1283
1284sub DESTROY {
1285 my $self = shift;
1286 $self->close();
1287}
1288
1289sub TIEHANDLE {
1290 my ($class,$val) = @_;
1291 return bless {'gffio' => $val}, $class;
1292}
1293
1294sub READLINE {
1295 my $self = shift;
1296 return $self->{'gffio'}->next_feature() || undef unless wantarray;
1297 my (@list, $obj);
1298 push @list, $obj while $obj = $self->{'gffio'}->next_feature();
1299 return @list;
1300}
1301
1302sub PRINT {
1303 my $self = shift;
1304 $self->{'gffio'}->write_feature(@_);
1305}
1306
1307110µs1;
1308