python - re.match takes a long time to finish -

January 15, 2015

I am new to Python and have written the following code, which runs slowly.

I have debugged the code and found that the last re.match() is causing the code to run slowly. The previous match is the same kind of match against the same DataFrame, yet it comes back quickly.

Here is the code:

    import re
    import pandas as pd

    my_cells = pd.read_csv('somefile', index_col='gene/cell line(row)').T

    # Empty frame holding the columns whose names match the pattern
    my_cells_others = pd.DataFrame(index=my_cells.index, columns=[col for col in my_cells if re.match('.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col)])

    # Empty frame holding the columns whose names do not match
    my_cells_genes = pd.DataFrame(index=my_cells.index, columns=[col for col in my_cells if re.match('.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col) is None])

    for col in my_cells.columns:
        if re.match('.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col):
            my_cells_others[col] = pd.DataFrame(my_cells[col])
        if re.match('.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col) is None:
            my_cells_genes[col] = pd.DataFrame(my_cells[col])

I do not think the problem is related to regular expressions. The code below still runs slowly.

    for col in my_cells_others.columns:
        if (col in lst) or col.endswith(' cn') or col.endswith(' mut'):
            my_cells_others[col] = my_cells[col]

    for col in my_cells_genes.columns:
        if not ((col in lst) or col.endswith(' cn') or col.endswith(' mut')):
            my_cells_genes[col] = my_cells[col]

"Poorly" designed regular expressions can be unnecessarily slow.

My guess is that .*\scn and .*\smut, combined with a big string that does not match, make it slow, since they force the script to check all possible combinations.
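One way to check this guess is to time the pattern against a long name that does not match. The sketch below is only a measurement harness, not a diagnosis: the pattern is shortened to four alternatives, and `long_name`, the function names, and the repeat count are made up for illustration. The string-method version is equivalent only for names without tabs or newlines. Actual timings depend on the data and the Python version.

```python
import re
import timeit

# Shortened version of the pattern from the question (assumption: the
# remaining tissue names behave like the two kept here).
PATTERN = r'.*\scn$|.*\smut$|^bladder$|^blood$'

# Hypothetical long column name that matches none of the alternatives.
long_name = 'gene ' * 2000 + 'x'


def with_regex(name):
    """Return True if the regex matches the name."""
    return re.match(PATTERN, name) is not None


def with_str_methods(name):
    """Roughly the same test using plain string operations."""
    return name in ('bladder', 'blood') or name.endswith((' cn', ' mut'))


if __name__ == '__main__':
    # Time both versions on the same non-matching input.
    for func in (with_regex, with_str_methods):
        seconds = timeit.timeit(lambda: func(long_name), number=1000)
        print(f'{func.__name__}: {seconds:.4f} s for 1000 calls')
```

If the two timings are close for realistic column names, the regex is probably not the bottleneck.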

As @jedwards said, you can replace this piece of code

    if re.match('.*\scn$|.*\smut$|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$|^upper aerodigestive$|^uterus$', col):
        my_cells_others[col] = pd.DataFrame(my_cells[col])

with:

    lst = ['bladder', 'blood', 'bone', 'breast', 'cns', 'gi tract', 'kidney',
           'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
           'thyroid', 'upper aerodigestive', 'uterus']

    if (col in lst) or col.endswith(' cn') or col.endswith(' mut'):
        # stuff
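The same test can be wrapped in a small helper that sorts all column names in one pass. This is a minimal sketch, not the answerer's code: the names `TISSUES`, `SUFFIXES`, `is_other`, and `split_columns` are invented here, and a set is used instead of the list only because membership checks on a set are cheaper.

```python
# Tissue names from the list above, stored as a set for fast lookups.
TISSUES = {
    'bladder', 'blood', 'bone', 'breast', 'cns', 'gi tract', 'kidney',
    'lung', 'other', 'ovary', 'pancreas', 'skin', 'soft tissue',
    'thyroid', 'upper aerodigestive', 'uterus',
}

# str.endswith accepts a tuple and returns True if any suffix matches.
SUFFIXES = (' cn', ' mut')


def is_other(col):
    """Return True for tissue columns and for ' cn' / ' mut' columns."""
    return col in TISSUES or col.endswith(SUFFIXES)


def split_columns(columns):
    """Split column names into (others, genes), preserving their order."""
    others, genes = [], []
    for col in columns:
        (others if is_other(col) else genes).append(col)
    return others, genes


if __name__ == '__main__':
    others, genes = split_columns(['tp53', 'tp53 mut', 'lung', 'egfr cn', 'kras'])
    print(others)  # ['tp53 mut', 'lung', 'egfr cn']
    print(genes)   # ['tp53', 'kras']
```

With the two lists in hand, each group of columns can be selected from the DataFrame in one step (for example `my_cells[others]`) instead of being assigned column by column, which may matter more for speed than the matching itself.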

Alternatively, if you want to use re for some reason, moving .*\scn and .*\smut to the end of the regex might help, depending on your data, since the engine is then not forced to check all those combinations unless necessary.
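A reordered pattern could look like the sketch below. The names `ORIGINAL`, `REORDERED`, and `matches_other` are made up for illustration, and compiling the pattern once with re.compile is an addition not mentioned above. Reordering does not change which names match. A name that matches none of the alternatives still has every alternative tried, so any gain shows up only for names that hit one of the early literal alternatives.

```python
import re

# Pattern as written in the question, kept for comparison.
ORIGINAL = re.compile(
    r'.*\scn$|.*\smut$'
    r'|^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$'
    r'|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$'
    r'|^upper aerodigestive$|^uterus$'
)

# Same alternatives, with the anchored literals first and the two
# ".*" alternatives moved to the end.
REORDERED = re.compile(
    r'^bladder$|^blood$|^bone$|^breast$|^cns$|^gi tract$|^kidney$|^lung$'
    r'|^other$|^ovary$|^pancreas$|^skin$|^soft tissue$|^thyroid$'
    r'|^upper aerodigestive$|^uterus$'
    r'|.*\scn$|.*\smut$'
)


def matches_other(col):
    """Return True if the column name matches the reordered pattern."""
    return REORDERED.match(col) is not None


if __name__ == '__main__':
    for name in ['lung', 'egfr cn', 'tp53 mut', 'tp53', 'soft tissue']:
        print(name, matches_other(name))
```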
