python - Counting consecutive alphabets and hyphens and encode them as run length -


how encode hyphenated fasta format string group consecutive nucleotide , hyphens , encode them run length.

consider sequence "atgc----cgcta-----g---". string has sequence of nucleotide followed sequence of hyphens. trying group consecutive nucleotide letter m , consecutive hyphens letter d , prefix size of sub sequence.

the final result out of encoding should 4m4d5m5d1m3d.

the following pictorial graphic explains further

atgc----cgcta-----g---  |   |    |    |  |  |  v   v    v    v  v  v 4m   4d  5m    5d 1m 3d 

when use counter or list.count(), "m":10 "d":12:

from collections import counter  seq="atgc----cgcta-----g---"  m=0 d=0     cigar=[]  char in seq:         if char.isalpha():         m+=1         cigar.append("m")        else:         d+=1         cigar.append("d")  print counter(cigar) 

this problem ideal itertools.groupby

implementation

from itertools import groupby ''.join('{}{}'.format(len(list(g)), 'dm'[k])          k, g in groupby(seq, key = str.isalpha)) 

output '4m4d5m5d1m3d'

explanation

notably, key function crucial here. group sequence based on being alphabet or not. once done, should straight forward count size of each group , figure out type of group key element.

some explanation of code

  • 'dm'[k]: nifty way of representing "m" if k == true else "d"
  • len(list(g)): determines size of each group. alternatively, have been written sum(1 e in g)
  • '{}{}'.format: string formatting create concatenation of consecutive frequency , type
  • ''.join(: join list elements string sequence.

Comments

Popular posts from this blog

asp.net mvc - SSO between MVCForum and Umbraco7 -

Python Tkinter keyboard using bind -

ubuntu - Selenium Node Not Connecting to Hub, Not Opening Port -