python - Defining a custom pandas aggregation function using Cython
I have a big DataFrame in pandas with three columns: 'col1' is a string, and 'col2' and 'col3' are numpy.int64. I need to do a groupby, then apply a custom aggregation function using apply, as follows:
    df = pandas.read_csv(...)
    groups = df.groupby('col1').apply(my_custom_function)
Each group can be seen as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, you can think of each row ('col2', 'col3') as a time interval; I am checking whether any intervals intersect. I first sort the array by the first column, then test whether the second column value at index i is smaller than the first column value at index i + 1.
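As a small illustration of that check, here is a minimal sketch on a tiny hand-made array. The interval values and the names `intervals` and `overlaps` are invented for the example and are not part of the real data.

```python
import numpy as np

# Hypothetical (start, end) intervals; values invented for illustration.
intervals = np.array([[5, 8], [1, 4], [3, 6]], dtype=np.int64)

# Sort the rows by the start column.
order = np.argsort(intervals[:, 0], kind="stable")
sorted_intervals = intervals[order]

# An interval overlaps its successor if its end is greater than the next start.
overlaps = [
    (int(sorted_intervals[i, 0]), int(sorted_intervals[i, 1]))
    for i in range(len(sorted_intervals) - 1)
    if sorted_intervals[i, 1] > sorted_intervals[i + 1, 0]
]
print(overlaps)  # [(1, 4), (3, 6)]
```

After sorting, (1, 4) runs into (3, 6), and (3, 6) runs into (5, 8), so a group with these rows would be flagged.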
First question: my idea is to use Cython to define the custom aggregate function. Is this a good idea?
I tried the following definition in a .pyx file:
    cimport numpy as c_np

    def c_my_custom_function(my_group_df):
        cdef Py_ssize_t i
        cdef Py_ssize_t l = len(my_group_df.index)
        cdef c_np.int64_t[:, :] temp_array

        # Fewer than two intervals cannot intersect.
        if l < 2:
            return False

        # Sort by interval start (sort_values replaces the older DataFrame.sort).
        temp_array = my_group_df[['col2', 'col3']].sort_values('col2').values

        for i in range(l - 1):
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False
I also defined a version in pure Python/pandas:
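    def my_custom_function(my_group_df):
        l = len(my_group_df.index)
        # Fewer than two intervals cannot intersect.
        if l < 2:
            return False

        # Sort by interval start (sort_values replaces the older DataFrame.sort).
        temp_array = my_group_df[['col2', 'col3']].sort_values('col2').values

        for i in range(l - 1):
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False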
Second question: I timed the two versions, and both take about the same time. The Cython version does not seem to speed anything up. What is happening?
Bonus question: do you see a better way to implement this algorithm?
A vectorized numpy test would be:

    np.any(temp_array[:-1, 1] > temp_array[1:, 0])
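To show how that one-liner could slot into the per-group function, here is a minimal sketch. The function name `has_overlap_vectorized` and the sample DataFrame are hypothetical, and `sort_values` is assumed in place of the older `sort(columns=...)` call used in the question.

```python
import numpy as np
import pandas as pd

def has_overlap_vectorized(my_group_df):
    """Return True if any ('col2', 'col3') intervals in the group intersect."""
    if len(my_group_df.index) < 2:
        return False
    # Sort by interval start, then compare every end with the next start at once.
    temp_array = my_group_df[['col2', 'col3']].sort_values('col2').values
    return bool(np.any(temp_array[:-1, 1] > temp_array[1:, 0]))

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'c'],
    'col2': [1, 3, 1, 5, 2],
    'col3': [4, 6, 2, 9, 7],
})

# Iterate over the groups explicitly; groupby(...).apply(...) would also work.
result = {key: has_overlap_vectorized(group) for key, group in df.groupby('col1')}
print(result)  # {'a': True, 'b': False, 'c': False}
```

Group 'a' is flagged because (1, 4) runs into (3, 6); group 'b' has disjoint intervals, and group 'c' has only one row.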
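Whether this is better than the Python or Cython iteration depends on where the True occurs, if at all. If the return happens at an early step in the iteration, the iteration is clearly better, and the Cython version won't have much of an advantage. Also, the test step will be faster than the sort step.

But if the iteration usually steps all the way through, then the vector test will be faster than the Python iteration, and faster than the sort. It may, though, be slower than a properly coded Cython iteration.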
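One way to check these trade-offs on your own machine is a rough timing sketch like the one below. The array size, the synthetic intervals, and the helper names `loop_test` and `vector_test` are all assumptions made for illustration; actual timings will depend on the data and the hardware, so no numbers are claimed here.

```python
import timeit
import numpy as np

def loop_test(arr):
    # Pure Python iteration with an early return on the first overlap.
    for i in range(len(arr) - 1):
        if arr[i, 1] > arr[i + 1, 0]:
            return True
    return False

def vector_test(arr):
    # Vectorized comparison; always scans the whole array.
    return bool(np.any(arr[:-1, 1] > arr[1:, 0]))

# Synthetic, already sorted intervals (invented for illustration).
n = 20_000
starts = np.arange(0, 2 * n, 2, dtype=np.int64)
no_overlap = np.column_stack([starts, starts + 1])  # every end < next start
early_overlap = no_overlap.copy()
early_overlap[0, 1] = 10  # the first interval now runs into the second

for name, arr in [("early overlap", early_overlap), ("no overlap", no_overlap)]:
    t_loop = timeit.timeit(lambda: loop_test(arr), number=5)
    t_vec = timeit.timeit(lambda: vector_test(arr), number=5)
    t_sort = timeit.timeit(lambda: arr[np.argsort(arr[:, 0])], number=5)
    print(f"{name}: loop={t_loop:.4f}s vector={t_vec:.4f}s sort={t_sort:.4f}s")
```

The two cases correspond to the scenarios described above: a True found at the first step, and an iteration that has to run all the way through.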