python - Counting consecutive alphabets and hyphens and encode them as run length -
how encode hyphenated fasta format string group consecutive nucleotide , hyphens , encode them run length.
consider sequence "atgc----cgcta-----g---". string has sequence of nucleotide followed sequence of hyphens. trying group consecutive nucleotide letter m
, consecutive hyphens letter d
, prefix size of sub sequence.
the final result out of encoding should 4m4d5m5d1m3d
.
the following pictorial graphic explains further
atgc----cgcta-----g--- | | | | | | v v v v v v 4m 4d 5m 5d 1m 3d
when use counter
or list.count()
, "m":10 "d":12
:
from collections import counter seq="atgc----cgcta-----g---" m=0 d=0 cigar=[] char in seq: if char.isalpha(): m+=1 cigar.append("m") else: d+=1 cigar.append("d") print counter(cigar)
this problem ideal itertools.groupby
implementation
from itertools import groupby ''.join('{}{}'.format(len(list(g)), 'dm'[k]) k, g in groupby(seq, key = str.isalpha))
output '4m4d5m5d1m3d'
explanation
notably, key function crucial here. group sequence based on being alphabet or not. once done, should straight forward count size of each group , figure out type of group key element.
some explanation of code
'dm'[k]
: nifty way of representing"m" if k == true else "d"
len(list(g))
: determines size of each group. alternatively, have been writtensum(1 e in g)
'{}{}'.format
: string formatting create concatenation of consecutive frequency , type''.join(
: join list elements string sequence.
Comments
Post a Comment