python - Defining a custom pandas aggregation function using Cython
I have a big DataFrame in pandas with three columns: 'col1' is a string, and 'col2' and 'col3' are numpy.int64. I need to do a groupby, then apply a custom aggregation function using apply, as follows:

    pd = pandas.read_csv(...)
    groups = pd.groupby('col1').apply(my_custom_function)

Each group can be seen as a numpy array with two integer columns, 'col2' and 'col3'. To understand what I am doing, you can think of each row ('col2', 'col3') as a time interval; I am checking whether any intervals intersect. I first sort the array by the first column, then test whether the second column value at index i is smaller than the first column value at index i + 1.
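As a rough illustration of the check just described, here is a minimal pure-Python sketch on a list of (start, end) tuples. The name `has_overlap` and the sample intervals are made up for this example and are not part of the code in the question. It relies on the fact that, once the intervals are sorted by start, any intersection shows up between two neighbours.

```python
def has_overlap(intervals):
    """Return True if any two (start, end) intervals intersect."""
    # Sort by start so that only neighbouring intervals need comparing.
    ordered = sorted(intervals)
    # An overlap exists when an interval ends after the next one starts.
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


print(has_overlap([(1, 3), (5, 8), (10, 12)]))  # False
print(has_overlap([(5, 8), (1, 6)]))            # True
```

The strict comparison mirrors the one in the question, so intervals that merely touch, such as (1, 3) and (3, 5), are not reported as intersecting.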
First question: my idea is to use Cython to define the custom aggregate function. Is this a good idea?

I tried the following definition in a .pyx file:
    cimport numpy as c_np

    def c_my_custom_function(my_group_df):
        cdef Py_ssize_t l = len(my_group_df.index)
        if l < 2:
            return False
        cdef c_np.int64_t[:, :] temp_array
        # Note: DataFrame.sort(columns=...) is the older pandas API;
        # newer releases spell this sort_values(by='col2').
        temp_array = my_group_df[['col2', 'col3']].sort(columns='col2').values
        cdef Py_ssize_t i
        for i in range(l - 1):
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False

I also defined a version in pure Python/pandas:
    def my_custom_function(my_group_df):
        l = len(my_group_df.index)
        if l < 2:
            return False
        # Note: DataFrame.sort(columns=...) is the older pandas API;
        # newer releases spell this sort_values(by='col2').
        temp_array = my_group_df[['col2', 'col3']].sort(columns='col2').values
        for i in range(l - 1):
            if temp_array[i, 1] > temp_array[i + 1, 0]:
                return True
        return False

Second question: I timed the two versions, and both take the same time. The Cython version does not seem to speed anything up. What is happening?

Bonus question: do you see a better way to implement this algorithm?
A vectorized numpy test would be:

    np.any(temp_array[:-1, 1] > temp_array[1:, 0])

Whether this is better than the Python or Cython iteration depends on where the True occurs, if at all. If the function returns at an early step, the iteration is better, and the Cython version won't have much of an advantage. Either way, the test step will be faster than the sort step.

But if the iteration steps all the way through, the vector test will be faster than the Python iteration, and faster than the sort. It may, though, be slower than a properly coded Cython iteration.
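To make the comparison concrete, here is a small sketch that puts the early-exit loop and the vectorized test side by side on an already sorted array. The function names and the sample data are assumptions made for illustration. The sort is done with `np.argsort` on a plain array rather than through pandas, so this is not a drop-in replacement for the code in the question.

```python
import numpy as np


def overlaps_loop(sorted_pairs):
    """Early-exit iteration, in the style of the question's Python version."""
    for i in range(len(sorted_pairs) - 1):
        if sorted_pairs[i, 1] > sorted_pairs[i + 1, 0]:
            return True
    return False


def overlaps_vectorized(sorted_pairs):
    """Whole-array comparison: no early exit, but no Python-level loop."""
    return bool(np.any(sorted_pairs[:-1, 1] > sorted_pairs[1:, 0]))


# Hypothetical sample data: each row is (start, end), not yet sorted.
pairs = np.array([[10, 12], [1, 3], [5, 11]], dtype=np.int64)
# Sort the rows by the first column, as both versions in the question do.
sorted_pairs = pairs[np.argsort(pairs[:, 0], kind="stable")]

print(overlaps_loop(sorted_pairs), overlaps_vectorized(sorted_pairs))  # True True
```

Both functions give the same answer; they differ only in how much work they do before reaching it. Timing each of them separately from the sort (for example with `timeit`) on realistic group sizes is probably the quickest way to see which step dominates.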