perl: finding the region with the greatest width from a list -
i have table having following structure
gene transcript exon length nm_1 1 10 nm_1 2 5 nm_1 3 20 nm_2 1 10 nm_2 2 5 nm_2 3 50 b nm_5 1 10 ... ... ... ...
so basically, table consists of column human genes. second column contains transcript name. same gene can have multiple transcripts. third column contains exon number. every gene consists of multiple exons. fourth column contains length of each exon.
now want create new table looking this:
gene transcript length nm_2 65 b nm_5 10 ... ... ...
so want find longest transcript each gene. means when there multiple transcripts (column transcript) each gene (column gene), need make sum of values in length column exons of transcript of gene.
so in example there 2 transcripts gene a: nm_1 , nm_2. each has 3 exons. sum of these 3 values nm_1 = 10+5+20 = 35, nm_2 it's 10+5+50 = 65. gene a, nm_2 longest transcript, want put in new table. gene b there 1 transcript, 1 exon of length 10. in new table, want length of transcript reported.
i've worked hashes before, thought of storing 'gene' , 'transcript' 2 different keys:
#! /usr/bin/perl use strict; use warnings; open(my $test,'<',"test.txt") || die ("could not open file $!"); open(my $output, '+>', "output.txt") || die ("can't write new file: $!"); # skip header of $test # know how %hash = (); while(<$test>){ chomp; @cols = split(/\t/); $keyfield = $cols[0]; #gene name $keyfield2 = $cols[1]; # transcript name push @{ $hash{$keyfield} }, $keyfield2;
...
given you're trying do, i'd thinking this:
use strict; use warnings; %genes; $header_line = <data>; #read data while (<data>) { ( $gene, $transcript, $exon, $length ) = split; $genes{$gene}{$transcript} += $length; } print join( "\t", "gene", "transcript", "length_sum" ), "\n"; foreach $gene ( keys %genes ) { #sort length_sum, , 'pop' top of list. ($longest_transcript) = ( sort { $genes{$gene}{$b} <=> $genes{$gene}{$a} or $a cmp $b } keys %{ $genes{$gene} } ); print join( "\t", $gene, $longest_transcript, $genes{$gene}{$longest_transcript} ), "\n"; } __data__ gene transcript exon length nm_1 1 10 nm_1 2 5 nm_1 3 20 nm_2 1 10 nm_2 2 5 nm_2 3 50 b nm_5 1 10
output
gene transcript length_sum b nm_5 10 nm_2 65
Comments
Post a Comment