regex - Matching Barcodes to sequences python? -
i have sequence files , barcode files. barcode files may have barcodes of length "attg, agct, acgt" example. sequence files "attgcccccccggggg, attgtttttttt, agctaaaaa" example. need match barcodes sequences contain them @ beginning. each set of sequences same barcode have calculations on them rest of program (which written already). dont know how them match. ive went through using print statements , part messed "potential_barcode = line(:len(barcode)" line. also, says #simple fasta should reading in matched sequences. i'm pretty new @ made lot of mistakes. help!
bcodefname = sys.argv[1] infname = sys.argv[2] barcodefile = open(bcodefname, "r") barcode in barcodefile: barcode = barcode.strip() print "barcode: %s" % barcode outfname = "%s.%s" % (bcodefname,barcode) # print outfname outf = open("outfname", "w") handle = open(infname, "r") line in handle: potential_barcode = line[:len(barcode)] print potential_barcode if potential_barcode == barcode: outseq = line[len(barcode):] sys.stdout.write(outseq) outf.write(outseq) fastafname = infname + ".fasta" print fastafname mafftfname = fastafname + ".mafft" stfname = mafftfname + ".stock" print stfname #simp fasta# # handle = open(infname, "r") outf2 = open(fastafname, "w") line in handle: linearr = line.split() seqid = linearr[0] seq = linearr[1] outf2.write(">%s\n%s\n" % (seqid,seq)) # handle.close() # outf.close() #mafft# cmd = "mafft %s > %s" % (fastafname,mafftfname) sys.stderr.write("command: %s\n" % cmd) os.system(cmd) sys.stderr.write("command done\n")
i don't know why code isn't working you. here tips improving code:
you're reading entire sequence file every barcode. if there 100 barcodes, you're reading through sequence file 100 times. should instead read barcode file once , create list of barcodes.
first define function we'll use check matches:
def matches_barcode(sequence, barcodes): bc in barcodes: if sequence.startswith(bc): return true return false
(note used startswith
instead of constructing new string , comparing it; startswith
should faster.)
now read barcodes file:
barcodes = [] open(bcodefname, "r") barcodefile: barcode in barcodefile: barcodes.append(barcode.strip())
(note used with open...
; code leaks open files prevent program working if have lot of barcodes.)
then read through sequence file once, checking whether each sequence matches barcode:
with open(infname, "r") seqfile: sequence in seqfile: if matches_barcode(sequence.strip(), barcodes): # got match, continue processing.
this lot faster because less i/o: reads 2 files instead of n + 1 files, n number of barcodes. it's still pretty naive algorithm, though: if it's slow, you'll have investigate more sophisticated algorithms checking match.
if still don't matches expect, you'll need debug: print out exact strings being compared can see what's happening. it's idea use repr
in these situations see data, including whitespace , everything.
Comments
Post a Comment