Cython - Cythonising Pandas: ctypes for content, index and columns


I am very new to Cython, yet I am already seeing extraordinary speedups just from copying my .py to a .pyx (and adding cimport cython, numpy etc.) and importing it into ipython3 with pyximport. Many tutorials start with this approach, the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc. But unlike most pandas Cython tutorials or examples, I am not applying functions so to speak; I am mostly manipulating data using slices, sums and division (etc.).

So the question is: can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?

How do I define the type of an embedded list, i.e. [[int,int],[int]]?

Here is an example that generates the AIC score for a partitioning of a DataFrame. Sorry it is so verbose:

    cimport cython
    import numpy as np
    cimport numpy as np
    import pandas as pd

    offcat = [
        "breakingpeace",
        "damage",
        "deception",
        "kill",
        "miscellaneous",
        "royaloffences",
        "sexual",
        "theft",
        "violenttheft"
        ]

    def partitionaic(empframe, part, offenceestimateframe, returndeathestimate=False):
        """empframe dataframe of ints, part nested list of ints, offenceestimate frame df of float"""
        """partof/block list of ints"""
        """ll, aic,  series/frame of floats"""
        ##cython cdefs
        cdef int dflen
        cdef int puns
        cdef int deathpun
        cdef int k
        cdef int pid
        cdef int punish

        dflen = empframe.shape[1]
        puns = 2
        deathpun = 0
        partitionmodel = pd.DataFrame(index = empframe.index, columns = empframe.columns)

        for partof in part:
            # columns are laid out as offence*puns + punishment
            grouping = [puns*x + y for x in partof for y in list(range(0,puns))]
            partgroupsum = empframe.iloc[:,grouping].sum(axis=1)

            for punish in range(0,puns):
                punishgroup = [x*puns+punish for x in partof]
                punishpunishment = ((empframe.iloc[:,punishgroup].sum(axis = 1) + 1/puns).div(partgroupsum+1)).values[np.newaxis].T
                partitionmodel.iloc[:,punishgroup] = punishpunishment
        partitionmodel = partitionmodel*offenceestimateframe

        if returndeathestimate:
            deathprobframe = pd.DataFrame([[part]], index=empframe.index, columns=['partition'])
            for pid,block in enumerate(part):
                deathprobframe[pid] = partitionmodel.iloc[:,block[::puns]].sum(axis=1)
            deathprobframe = deathprobframe.apply(lambda row: sorted( [ [format("%6.5f"%row[idx])]+[offcat[j] for j in x ]
                for idx,x in enumerate(row['partition'])],
                key=lambda x: x[0], reverse=True),axis=1)
        # convert_objects was removed in pandas 1.0; newer versions need something like .astype(float) here
        ll = (empframe*np.log(partitionmodel.convert_objects(convert_numeric=True))).sum(axis=1)
        k = (len(part))*(puns-1)
        aic = 2*k-2*ll

        if returndeathestimate:
            return aic, deathprobframe
        else:
            return aic

My advice is to do as much as possible in pandas. This is kinda standard advice: "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:

Profile your code. (See this answer, or use %prun in IPython.)

The output of prun should drive which bit you improve next.
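Outside IPython, the same kind of report can be produced with the standard library's cProfile. The following is only a minimal sketch: `slow_row_sums` and `profile_call` are hypothetical names made up for illustration, and the deliberately slow function stands in for whatever code you actually want to measure.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def slow_row_sums(frame):
    # Deliberately slow stand-in: a Python-level loop over the rows.
    return [row.sum() for _, row in frame.iterrows()]


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


frame = pd.DataFrame(np.arange(2000).reshape(500, 4))
sums, report = profile_call(slow_row_sums, frame)
print(report)
```

The lines at the top of the report are the candidates for the next round of optimisation; anything far down the list is rarely worth touching.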

  1. pandas (make your code more pandorable; this can help a lot).
  2. numpy (avoid creating intermediary Series/DataFrames, and be careful about dtypes).
  3. cython (the last resort).
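As an illustration of step 2, here is a sketch of how the inner loop of the question's partition model might look on a plain array, with no intermediate Series or DataFrame created per block. `partition_model_numpy` is a hypothetical name, and the sketch assumes the column layout from the question (column index = offence * puns + punishment); it is not a drop-in replacement for the full AIC function.

```python
import numpy as np
import pandas as pd


def partition_model_numpy(counts, part, puns=2):
    """Smoothed punishment shares per partition block, on a plain array.

    counts: 2-D array whose column index is offence * puns + punishment.
    part:   nested list of offence indices, e.g. [[0, 1], [2]].
    """
    counts = np.asarray(counts, dtype=np.float64)
    n_rows = counts.shape[0]
    # View the columns as (row, offence, punishment).
    cube = counts.reshape(n_rows, -1, puns)
    # Offences not covered by any block stay NaN.
    model = np.full(cube.shape, np.nan)
    for block in part:
        # Sum over the offences in this block: shape (rows, puns).
        per_punishment = cube[:, block, :].sum(axis=1)
        # Total count for the block: shape (rows, 1).
        group_total = per_punishment.sum(axis=1, keepdims=True)
        shares = (per_punishment + 1.0 / puns) / (group_total + 1.0)
        # Broadcast the same shares to every offence in the block.
        model[:, block, :] = shares[:, np.newaxis, :]
    return model.reshape(n_rows, -1)


frame = pd.DataFrame([[1, 3, 2, 2, 0, 4]])  # one row, 3 offences x 2 punishments
part = [[0, 1], [2]]
model = pd.DataFrame(partition_model_numpy(frame.to_numpy(), part),
                     index=frame.index, columns=frame.columns)
print(model)
```

The DataFrame is only unwrapped once on the way in and rebuilt once on the way out; everything in between is array arithmetic with a known float64 dtype.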

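Now, if the slow line were doing something like slicing (it isn't), put that tiny part in Cython; I like to replace single Python function calls with a Cython function. On that point, the code you move to Cython should use NumPy rather than pandas: I don't think pandas operations are going to be lowered to C, because Cython can't infer the types.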
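One way to prepare data for such a focused Cython function is to hand it fixed-dtype arrays rather than a DataFrame or a nested list. The sketch below is plain Python, and the helper names (`frame_to_typed_array`, `flatten_partition`, `block_at`) are hypothetical. It assumes the Cython side would declare its arguments as typed memoryviews, for example `double[:, ::1]` for the data and `Py_ssize_t[::1]` for the index arrays. A ragged list such as [[int,int],[int]] has no direct C type, so here it is encoded as a flat values array plus an offsets array.

```python
import numpy as np
import pandas as pd


def frame_to_typed_array(frame, dtype=np.float64):
    """Return a C-contiguous 2-D array with one fixed dtype (e.g. double)."""
    return np.ascontiguousarray(frame.to_numpy(dtype=dtype))


def flatten_partition(part):
    """Encode a ragged list such as [[0, 1], [2]] as (values, offsets)."""
    values = np.fromiter((x for block in part for x in block), dtype=np.intp)
    lengths = np.fromiter((len(block) for block in part),
                          dtype=np.intp, count=len(part))
    # offsets[i]:offsets[i + 1] is the slice of `values` holding block i.
    offsets = np.zeros(len(part) + 1, dtype=np.intp)
    np.cumsum(lengths, out=offsets[1:])
    return values, offsets


def block_at(values, offsets, i):
    """Recover block i from the flat encoding."""
    return values[offsets[i]:offsets[i + 1]]


frame = pd.DataFrame([[1, 3, 2, 2, 0, 4]])
data = frame_to_typed_array(frame)
values, offsets = flatten_partition([[0, 1], [2]])
print(data.dtype, values, offsets)
```

Whether typing the inputs like this pays off depends on what the profiler shows; it only helps if the typed loop is where the time is actually spent.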


Putting your entire code in Cython won't really help much; you want to move only the specific lines, or function calls, that are performance sensitive. Keeping the Cython focussed is the only way to have a good time.

Read the enhancing performance section of the pandas docs*! There, this process (prun -> cythonize -> type) is gone over step by step with a real-life example.

*Full disclosure: I wrote that section of the docs! :)

