python - Counting consecutive alphabets and hyphens and encode them as run length -

May 15, 2011

how encode hyphenated fasta format string group consecutive nucleotide , hyphens , encode them run length.

consider sequence "atgc----cgcta-----g---". string has sequence of nucleotide followed sequence of hyphens. trying group consecutive nucleotide letter m , consecutive hyphens letter d , prefix size of sub sequence.

the final result out of encoding should 4m4d5m5d1m3d.

the following pictorial graphic explains further

atgc----cgcta-----g---  |   |    |    |  |  |  v   v    v    v  v  v 4m   4d  5m    5d 1m 3d

when use counter or list.count(), "m":10 "d":12:

from collections import counter  seq="atgc----cgcta-----g---"  m=0 d=0     cigar=[]  char in seq:         if char.isalpha():         m+=1         cigar.append("m")        else:         d+=1         cigar.append("d")  print counter(cigar)

this problem ideal itertools.groupby

implementation

from itertools import groupby ''.join('{}{}'.format(len(list(g)), 'dm'[k])          k, g in groupby(seq, key = str.isalpha))

output '4m4d5m5d1m3d'

explanation

notably, key function crucial here. group sequence based on being alphabet or not. once done, should straight forward count size of each group , figure out type of group key element.

some explanation of code

'dm'[k]: nifty way of representing "m" if k == true else "d"
len(list(g)): determines size of each group. alternatively, have been written sum(1 e in g)
'{}{}'.format: string formatting create concatenation of consecutive frequency , type
''.join(: join list elements string sequence.

Search This Blog

UV code

python - Counting consecutive alphabets and hyphens and encode them as run length -

Comments

Post a Comment

Popular posts from this blog

jquery - How do you format the date used in the popover widget title of FullCalendar? -

Bubble Sort Manually a Linked List in Java -

asp.net mvc - SSO between MVCForum and Umbraco7 -