python - re.match takes a long time to finish
I am new to Python and have written the following code, which runs slowly.
I have debugged the code and found out that the last re.match() is what makes it slow, even though the previous match is the same kind of match against the same dataframe and finishes quickly.
Here is the code:
    import re
    import pandas as pd

    # Read the table and transpose it so that genes/cell lines become columns.
    my_cells = pd.read_csv('somefile', index_col='gene/cell line(row)').T

    # Empty frame for the columns whose name matches the pattern.
    my_cells_others = pd.DataFrame(
        index=my_cells.index,
        columns=[col for col in my_cells if re.match(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col)])

    # Empty frame for the columns whose name does not match the pattern.
    my_cells_genes = pd.DataFrame(
        index=my_cells.index,
        columns=[col for col in my_cells if re.match(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col) is None])

    # Copy each column into the frame it belongs to.
    for col in my_cells.columns:
        if re.match(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col):
            my_cells_others[col] = pd.DataFrame(my_cells[col])
        if re.match(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col) is None:
            my_cells_genes[col] = pd.DataFrame(my_cells[col])
I do not think the problem is related to regular expressions. The code below still runs slowly.
    for col in my_cells_others.columns:
        if (col in lst) or col.endswith(' cn') or col.endswith(' mut'):
            my_cells_others[col] = my_cells[col]

    for col in my_cells_genes.columns:
        if not ((col in lst) or col.endswith(' cn') or col.endswith(' mut')):
            my_cells_genes[col] = my_cells[col]

"Poorly" designed regular expressions can be unnecessarily slow.
My guess is that .*\scn and .*\smut, combined with a long string that does not match, make it slow, since that forces the script to check all possible combinations.
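One way to check this guess is to time the pattern against a long name that does not match. The following is only a rough sketch: the pattern is shortened to a few of the tissue names, the 200-odd character column name is made up, and the helper names are hypothetical. Timings depend on the machine and Python version, so none are shown.

```python
import re
import timeit

# Shortened version of the pattern from the question (assumption: only a
# few tissue names are kept so the example stays readable).
PATTERN = re.compile(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$')

# Hypothetical long column name that matches none of the alternatives.
long_name = 'gene ' * 40 + 'xyz'

def with_regex(name):
    """Check the name with the compiled regular expression."""
    return PATTERN.match(name) is not None

def with_str_methods(name):
    """Check the name with plain string operations instead.

    Note: \\s in the regex matches any whitespace, while endswith() here
    only accepts a literal space, so the two are not strictly identical.
    """
    return name in ('bladder', 'blood', 'bone') or name.endswith((' cn', ' mut'))

# Both checks agree on this input; only the cost differs.
assert with_regex(long_name) == with_str_methods(long_name)

regex_time = timeit.timeit(lambda: with_regex(long_name), number=10_000)
str_time = timeit.timeit(lambda: with_str_methods(long_name), number=10_000)
print(f'regex: {regex_time:.4f}s  str methods: {str_time:.4f}s')
```

If both timings turn out to be small compared with the total run time, the regular expression is probably not the bottleneck.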
As @jedwards said, you can replace this piece of code

    if re.match(r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col):
        my_cells_others[col] = pd.DataFrame(my_cells[col])

with:

    lst = ['bladder', 'blood', 'bone', 'breast', 'cns', 'gi tract',
           'kidney', 'lung', 'other', 'ovary', 'pancreas', 'skin',
           'soft tissue', 'thyroid', 'upper aerodigestive', 'uterus']

    if (col in lst) or col.endswith(' cn') or col.endswith(' mut'):
        # stuff
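Put together, the split could look like the sketch below. It works on a plain list of column names so that it runs without the CSV file. The sample names are made up, and split_columns is a hypothetical helper, not something from the question. The tissue names are kept in a set so that each membership test is a hash lookup.

```python
# Tissue names from the list above, stored in a set for fast lookups.
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'cns', 'gi tract', 'kidney',
    'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
    'thyroid', 'upper aerodigestive', 'uterus',
}

def split_columns(columns):
    """Return (others, genes): names that satisfy the rule, and the rest."""
    others, genes = [], []
    for col in columns:
        # str.endswith accepts a tuple of suffixes.
        if col in TISSUES or col.endswith((' cn', ' mut')):
            others.append(col)
        else:
            genes.append(col)
    return others, genes

# Made-up column names, for illustration only.
sample = ['tp53', 'tp53 cn', 'kras mut', 'lung', 'egfr']
others, genes = split_columns(sample)
print(others)  # ['tp53 cn', 'kras mut', 'lung']
print(genes)   # ['tp53', 'egfr']
```

With the real data, the two lists could then be used to select the columns in one step (for example my_cells[others]) rather than assigning one column at a time, which may also be worth timing.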
Alternatively, if you want to use re for some reason, moving .*\scn and .*\smut to the end of the regex might help, depending on your data, since it is then not forced to check all those combinations unless necessary.
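A sketch of that reordering, assuming the same alternatives as in the question. Compiling the pattern once with re.compile is an addition of this example, not part of the suggestion above, and the sample names are made up. Whether the reordered pattern is actually faster depends on the data.

```python
import re

# Original order from the question: the two '.*' alternatives come first.
ORIGINAL = re.compile(
    r'.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$'
    r'|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$'
    r'|^thyroid$|^upper aerodigestive$|^uterus$'
)

# Reordered: exact tissue names first, so a name that equals one of them
# matches without the '.*' alternatives being tried at all.
REORDERED = re.compile(
    r'^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$'
    r'|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$'
    r'|^upper aerodigestive$|^uterus$|.*\scn$|.*\smut$'
)

# Made-up column names, for illustration only.
sample = ['tp53', 'tp53 cn', 'kras mut', 'soft tissue', 'lungs']

# Reordering the alternatives does not change which names match.
for name in sample:
    assert bool(ORIGINAL.match(name)) == bool(REORDERED.match(name))

matched = [name for name in sample if REORDERED.match(name)]
print(matched)  # ['tp53 cn', 'kras mut', 'soft tissue']
```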