python - Counting consecutive letters and hyphens and encoding them as run lengths
How can I encode a hyphenated FASTA-format string so that consecutive nucleotides and consecutive hyphens are grouped and encoded as run lengths?
Consider the sequence "atgc----cgcta-----g---". The string consists of runs of nucleotides alternating with runs of hyphens. I am trying to label each run of consecutive nucleotides with the letter m and each run of consecutive hyphens with the letter d, prefixed by the length of the run.
The final result of the encoding should be 4m4d5m5d1m3d.
The following diagram explains this further:
    atgc  ----  cgcta  -----  g   ---
     |     |      |      |    |    |
     v     v      v      v    v    v
     4m    4d     5m     5d   1m   3d

When I use Counter or list.count(), I only get the overall totals, "m": 10 and "d": 12:
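    from collections import Counter

    seq = "atgc----cgcta-----g---"
    m = 0
    d = 0
    cigar = []
    for char in seq:
        # Label letters as "m" and everything else (here, the hyphens) as "d".
        if char.isalpha():
            m += 1
            cigar.append("m")
        else:
            d += 1
            cigar.append("d")
    # Counter only reports totals per label; the run boundaries are lost.
    print(Counter(cigar))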
This problem is an ideal fit for itertools.groupby.
Implementation
    from itertools import groupby
    ''.join('{}{}'.format(len(list(g)), 'dm'[k]) for k, g in groupby(seq, key=str.isalpha))

Output

    '4m4d5m5d1m3d'
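For reuse, the one-liner can be wrapped in a small function. This is only a sketch: the name encode_runs is made up for illustration, and it assumes that every non-letter character should count as "d", which holds for the hyphen-only gaps in the question.

```python
from itertools import groupby


def encode_runs(seq):
    """Run-length encode seq: runs of letters become 'm', all other runs 'd'.

    encode_runs is a hypothetical helper name, not part of any library.
    """
    return ''.join(
        '{}{}'.format(len(list(group)), 'dm'[is_letter])
        for is_letter, group in groupby(seq, key=str.isalpha)
    )


print(encode_runs("atgc----cgcta-----g---"))  # 4m4d5m5d1m3d
print(encode_runs("---acgt"))                 # 3d4m
print(repr(encode_runs("")))                  # ''
```

A sequence that starts with hyphens or is empty needs no special handling, because groupby simply yields the runs in the order they occur.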
Explanation
Notably, the key function is crucial here: it groups the sequence according to whether each character is a letter or not. Once that is done, it is straightforward to count the size of each group and to work out the type of the group from its key.
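To see what the key function does, it can help to print the groups themselves. The sketch below joins each group back into a string purely for display; the variable names are illustrative.

```python
from itertools import groupby

seq = "atgc----cgcta-----g---"

# groupby yields (key, iterator) pairs; join each iterator into a string
# so that the run can be inspected after the loop moves on.
groups = [(is_letter, ''.join(group))
          for is_letter, group in groupby(seq, key=str.isalpha)]

for is_letter, run in groups:
    print(is_letter, repr(run), len(run))
# True 'atgc' 4
# False '----' 4
# True 'cgcta' 5
# False '-----' 5
# True 'g' 1
# False '---' 3
```

Each key is the boolean returned by str.isalpha, and a new group starts whenever that value changes from one character to the next.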
Some explanation of the code:
'dm'[k]: a nifty way of writing "m" if k == True else "d".
len(list(g)): determines the size of each group. Alternatively, it could have been written sum(1 for e in g).
'{}{}'.format: string formatting that concatenates each run length with its type.
''.join(: joins the list elements into a single string.
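The same pieces can be spelled out step by step. The following is an equivalent expanded sketch, not a different algorithm; it uses the sum(1 for ...) alternative and an explicit conditional in place of the string indexing.

```python
from itertools import groupby

seq = "atgc----cgcta-----g---"

# bool is a subclass of int, so False selects index 0 and True index 1.
assert 'dm'[False] == 'd' and 'dm'[True] == 'm'

parts = []
for is_letter, group in groupby(seq, key=str.isalpha):
    size = sum(1 for _ in group)      # same count as len(list(group)), without building a list
    kind = 'm' if is_letter else 'd'  # same result as 'dm'[is_letter]
    parts.append('{}{}'.format(size, kind))

encoded = ''.join(parts)
print(encoded)  # 4m4d5m5d1m3d
```

The one-liner is more compact, while this form may be easier to step through in a debugger.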