regex - Matching Barcodes to sequences python? -


i have sequence files , barcode files. barcode files may have barcodes of length "attg, agct, acgt" example. sequence files "attgcccccccggggg, attgtttttttt, agctaaaaa" example. need match barcodes sequences contain them @ beginning. each set of sequences same barcode have calculations on them rest of program (which written already). dont know how them match. ive went through using print statements , part messed "potential_barcode = line(:len(barcode)" line. also, says #simple fasta should reading in matched sequences. i'm pretty new @ made lot of mistakes. help!

bcodefname = sys.argv[1] infname = sys.argv[2] barcodefile = open(bcodefname, "r") barcode in barcodefile:         barcode = barcode.strip()         print "barcode: %s" % barcode         outfname = "%s.%s" % (bcodefname,barcode) #           print outfname         outf = open("outfname", "w")         handle = open(infname, "r")         line in handle:                 potential_barcode = line[:len(barcode)]                 print potential_barcode                 if potential_barcode == barcode:                         outseq = line[len(barcode):]                         sys.stdout.write(outseq)                         outf.write(outseq)                         fastafname = infname + ".fasta"                         print fastafname                         mafftfname = fastafname + ".mafft"                         stfname = mafftfname + ".stock"                         print stfname #simp fasta# #                       handle = open(infname, "r")                         outf2 = open(fastafname, "w")                         line in handle:                                 linearr = line.split()                                 seqid = linearr[0]                                 seq = linearr[1]                                 outf2.write(">%s\n%s\n" % (seqid,seq)) #                       handle.close() #                       outf.close() #mafft#                         cmd = "mafft %s > %s" % (fastafname,mafftfname)                         sys.stderr.write("command: %s\n" % cmd)                         os.system(cmd)                         sys.stderr.write("command done\n") 

i don't know why code isn't working you. here tips improving code:

you're reading entire sequence file every barcode. if there 100 barcodes, you're reading through sequence file 100 times. should instead read barcode file once , create list of barcodes.

first define function we'll use check matches:

def matches_barcode(sequence, barcodes):   bc in barcodes:     if sequence.startswith(bc):       return true   return false 

(note used startswith instead of constructing new string , comparing it; startswith should faster.)

now read barcodes file:

barcodes = [] open(bcodefname, "r") barcodefile:   barcode in barcodefile:     barcodes.append(barcode.strip()) 

(note used with open...; code leaks open files prevent program working if have lot of barcodes.)

then read through sequence file once, checking whether each sequence matches barcode:

with open(infname, "r") seqfile:   sequence in seqfile:     if matches_barcode(sequence.strip(), barcodes):       # got match, continue processing. 

this lot faster because less i/o: reads 2 files instead of n + 1 files, n number of barcodes. it's still pretty naive algorithm, though: if it's slow, you'll have investigate more sophisticated algorithms checking match.

if still don't matches expect, you'll need debug: print out exact strings being compared can see what's happening. it's idea use repr in these situations see data, including whitespace , everything.


Comments

Popular posts from this blog

jquery - How do you format the date used in the popover widget title of FullCalendar? -

Bubble Sort Manually a Linked List in Java -

asp.net mvc - SSO between MVCForum and Umbraco7 -