perl: finding the region with the greatest width from a list -


i have table having following structure

gene    transcript    exon    length       nm_1          1       10       nm_1          2       5       nm_1          3       20       nm_2          1       10       nm_2          2       5       nm_2          3       50 b       nm_5          1       10 ...     ...           ...     ... 

so basically, table consists of column human genes. second column contains transcript name. same gene can have multiple transcripts. third column contains exon number. every gene consists of multiple exons. fourth column contains length of each exon.

now want create new table looking this:

 gene   transcript    length       nm_2          65  b      nm_5          10  ...    ...           ... 

so want find longest transcript each gene. means when there multiple transcripts (column transcript) each gene (column gene), need make sum of values in length column exons of transcript of gene.

so in example there 2 transcripts gene a: nm_1 , nm_2. each has 3 exons. sum of these 3 values nm_1 = 10+5+20 = 35, nm_2 it's 10+5+50 = 65. gene a, nm_2 longest transcript, want put in new table. gene b there 1 transcript, 1 exon of length 10. in new table, want length of transcript reported.

i've worked hashes before, thought of storing 'gene' , 'transcript' 2 different keys:

#! /usr/bin/perl use strict; use warnings;  open(my $test,'<',"test.txt") || die ("could not open file $!"); open(my $output, '+>', "output.txt") || die ("can't write new file: $!");  # skip header of $test # know how  %hash = (); while(<$test>){     chomp;     @cols = split(/\t/);     $keyfield = $cols[0]; #gene name     $keyfield2 = $cols[1]; # transcript name     push @{ $hash{$keyfield} }, $keyfield2; 

...

given you're trying do, i'd thinking this:

use strict; use warnings;  %genes;  $header_line = <data>;  #read data while (<data>) {     ( $gene, $transcript, $exon, $length ) = split;     $genes{$gene}{$transcript} += $length; }  print join( "\t", "gene", "transcript", "length_sum" ), "\n";  foreach $gene ( keys %genes ) {     #sort length_sum, , 'pop' top of list.      ($longest_transcript) =         ( sort { $genes{$gene}{$b} <=> $genes{$gene}{$a} or $a cmp $b }             keys %{ $genes{$gene} } );     print join( "\t",         $gene, $longest_transcript, $genes{$gene}{$longest_transcript} ),         "\n"; }   __data__  gene    transcript    exon    length       nm_1          1       10       nm_1          2       5       nm_1          3       20       nm_2          1       10       nm_2          2       5       nm_2          3       50 b       nm_5          1       10 

output

gene    transcript  length_sum b   nm_5    10   nm_2    65 

Comments

Popular posts from this blog

jquery - How do you format the date used in the popover widget title of FullCalendar? -

Bubble Sort Manually a Linked List in Java -

asp.net mvc - SSO between MVCForum and Umbraco7 -