cython - Cythonising Pandas: ctypes for content, index and columns
I am very new to Cython, yet I am already experiencing extraordinary speedups just by copying my .py to .pyx (and adding cimport cython, numpy etc.) and importing it into ipython3 with pyximport. Many tutorials start with this approach, with the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc. But unlike most pandas Cython tutorials or examples, I am not applying functions, so to speak; I am more manipulating data using slices, sums and division (etc.).
So the question is: can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?
How do I define the type of an embedded list, i.e. [[int,int],[int]]?
Here is an example that generates the AIC score for a partitioning of the DF; sorry it is so verbose:
    cimport cython
    import numpy as np
    cimport numpy as np
    import pandas as pd

    offcat = [
        "breakingpeace",
        "damage",
        "deception",
        "kill",
        "miscellaneous",
        "royaloffences",
        "sexual",
        "theft",
        "violenttheft"
    ]

    def partitionaic(empframe, part, offenceestimateframe, returndeathestimate=False):
        """empframe is a DataFrame of ints, part is a nested list of ints, offenceestimateframe is a DF of float"""
        """partof/block is a list of ints"""
        """ll, aic is a series/frame of floats"""
        ## cython cdefs
        cdef int dflen
        cdef int puns
        cdef int deathpun
        cdef int k
        cdef int pid
        cdef int punish

        dflen = empframe.shape[1]
        puns = 2
        deathpun = 0
        partitionmodel = pd.DataFrame(index=empframe.index, columns=empframe.columns)

        for partof in part:
            grouping = [puns*x + y for x in partof for y in list(range(0, puns))]
            partgroupsum = empframe.iloc[:, grouping].sum(axis=1)

            for punish in range(0, puns):
                punishgroup = [x*puns + punish for x in partof]
                punishpunishment = ((empframe.iloc[:, punishgroup].sum(axis=1) + 1/puns).div(partgroupsum + 1)).values[np.newaxis].T
                partitionmodel.iloc[:, punishgroup] = punishpunishment
        partitionmodel = partitionmodel*offenceestimateframe

        if returndeathestimate:
            deathprobframe = pd.DataFrame([[part]], index=empframe.index, columns=['partition'])
            for pid, block in enumerate(part):
                deathprobframe[pid] = partitionmodel.iloc[:, block[::puns]].sum(axis=1)
            deathprobframe = deathprobframe.apply(
                lambda row: sorted(
                    [[format("%6.5f" % row[idx])] + [offcat[i] for i in x]
                     for idx, x in enumerate(row['partition'])],
                    key=lambda x: x[0], reverse=True),
                axis=1)
        ll = (empframe*np.log(partitionmodel.convert_objects(convert_numeric=True))).sum(axis=1)
        k = (len(part))*(puns - 1)
        aic = 2*k - 2*ll

        if returndeathestimate:
            return aic, deathprobframe
        else:
            return aic
My advice is to do as much as possible in pandas. This is kinda the standard advice: "get it working first, then care about performance if it really matters". So let's suppose you've done that (and hopefully you've written some tests too), and it's too slow:
Profile your code. (See this answer, or use %prun in IPython.)
The output of prun should drive which bit you improve next.
- pandas (make your code more pandorable; this can help a lot).
- numpy (not creating intermediary Series/DataFrames, being careful about dtypes).
- cython (the last resort).
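Outside IPython, roughly the same information as %prun can be collected with the standard library's cProfile and pstats modules. The following is a minimal sketch, not a definitive recipe: `slow_partition_score` is a hypothetical stand-in for whatever function you are investigating, and `profile_call` is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def slow_partition_score(n_rows, part):
    """Hypothetical stand-in for the function under investigation."""
    total = 0.0
    for block in part:
        for row in range(n_rows):
            total += sum(x * row for x in block) / (len(block) + 1)
    return total


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


score, report = profile_call(slow_partition_score, 1000, [[0, 1], [2]])
print(report)
```

The lines at the top of the report are the candidates for the pandas, numpy, cython treatment; anything that barely registers is probably not worth touching.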
Now, if the slow line is to do with slicing (it isn't), then put that tiny part in cython; I like moving single Python function calls into a cython function. On that point, the stuff you do with cython should use numpy, not pandas: I don't think pandas code is going to drop down to C (cython can't infer the types).
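To illustrate the numpy step, here is a sketch of the question's inner loop rewritten against a plain float64 array instead of a DataFrame. It is an assumption-laden illustration rather than a drop-in replacement: `partition_model_numpy` is a hypothetical name, it covers only the loop that fills the partition model (not the multiplication by the offence estimates or the AIC), and it expects the frame's values to be passed in, e.g. via `empframe.values`.

```python
import numpy as np


def partition_model_numpy(counts, part, puns=2):
    """Fill the partition model from a 2-D array of counts.

    counts: array of shape (rows, n_offences * puns)
    part:   nested list of ints, e.g. [[0, 1], [2]]
    """
    counts = np.ascontiguousarray(counts, dtype=np.float64)
    # Columns not covered by `part` stay NaN, as in the DataFrame version.
    model = np.full(counts.shape, np.nan)
    for part_of in part:
        base = np.asarray(part_of, dtype=np.intp) * puns
        # All columns belonging to this block, for every punishment.
        grouping = (base[:, None] + np.arange(puns)).ravel()
        group_sum = counts[:, grouping].sum(axis=1)
        for punish in range(puns):
            cols = base + punish
            ratio = (counts[:, cols].sum(axis=1) + 1.0 / puns) / (group_sum + 1.0)
            # Broadcast the per-row ratio across the block's columns.
            model[:, cols] = ratio[:, None]
    return model


counts = np.array([[1, 2, 3, 4], [0, 0, 0, 0]])
model = partition_model_numpy(counts, [[0], [1]])
print(model)
```

Working on a single array with one known dtype is also the form that a later Cython pass can type directly (for example as a `double[:, :]` memoryview), which is not possible with a DataFrame.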
Putting your entire code into cython won't help very much; you want to put in only the specific lines, or function calls, that are performance sensitive. Keeping the cython focussed is the only way to have a good time.
Read the enhancing performance section of the pandas docs*! There this process (prun -> cythonize -> type) is gone through step-by-step with a real-life example.
*Full disclosure: I wrote that section of the docs! :)