Build a timeseries dataframe in pandas using a start date / end date -


i attempting complete cohort analysis in pandas. able @ activity of user @ month 1, month 2, month n , group when user signed up. use groupby accomplish this, if user inactive during particular month, not show in data month.

ideally, here dataframe trying build:

| userid | signupdate | visitmonth | visits | |--------|------------|------------|--------| | 1      | 2014-03    | 2014-03    | 2      | | 1      | 2014-03    | 2014-04    | 1      | | 1      | 2014-03    | 2014-05    | 0      | | 1      | 2014-03    | 2014-06    | 0      | | 1      | 2014-03    | 2014-07    | 0      | | 1      | 2014-03    | 2014-08    | 0      | | 1      | 2014-03    | 2014-09    | 0      | | 1      | 2014-03    | 2014-10    | 1      | | 1      | 2014-03    | 2014-11    | 0      | | 1      | 2014-03    | 2014-12    | 0      | | 1      | 2014-03    | 2015-01    | 2      | | 1      | 2014-03    | 2015-02    | 1      | | 1      | 2014-03    | 2015-03    | 0      | | 2      | 2015-01    | 2015-01    | 2      | | 2      | 2015-01    | 2015-02    | 0      | | 2      | 2015-01    | 2015-03    | 2      | 

here sample data:

#sample of users users = [{'userid': 'u1', 'signupdate': '2014-03-15'}, {'userid': 'u2', 'signupdate': '2015-01-10'}]  #create dataframe users usersdf = pd.dataframe(data=users)  #convert signupdate object date usersdf['signupdate'] = pd.to_datetime(usersdf['signupdate'])  #add new column build month-only  usersdf['signupmonth'] = usersdf['signupdate'].dt.to_period('m')  #sample of visits visits =[{'userid': 'u1', 'visitdate': '2014-03-15'}, {'userid': 'u1', 'visitdate': '2014-03-28'}, {'userid': 'u1', 'visitdate': '2014-04-10'}, {'userid': 'u1', 'visitdate': '2014-10-28'}, {'userid': 'u1', 'visitdate': '2015-01-02'}, {'userid': 'u2', 'visitdate': '2015-01-10'}, {'userid': 'u2', 'visitdate': '2015-01-11'}, {'userid': 'u1', 'visitdate': '2014-01-20'}, {'userid': 'u1', 'visitdate': '2014-02-15'}, {'userid': 'u2', 'visitdate': '2015-03-01'}, {'userid': 'u2', 'visitdate': '2015-03-18'}]  #create dataframe of visits visitdf = pd.dataframe(data=visits)  #convert visitdate object date visitdf['visitdate'] = pd.to_datetime(visitdf['visitdate'])  #add new column build month-only  visitdf['visitmonth'] = visitdf['visitdate'].dt.to_period('m')  #group visits user month visits_grouped = visitdf.groupby(['userid', 'visitmonth']).agg(['count'])  #remove index on userid can `merge` visits_grouped.reset_index(inplace=true)  #grouped dataframe cohortdf = visits_grouped.merge(usersdf[['userid', 'signupmonth']], on='userid', how='left') 

this groupby/merge builds dataframe there visits, want able aggregate other fields regardless of if there visit or not.

grouped df output:

  userid visitmonth visitdate                         count 0     u1    2014-01         1 1     u1    2014-02         1 2     u1    2014-03         2 3     u1    2014-04         1 4     u1    2014-10         1 5     u1    2015-01         1 6     u2    2015-01         2 7     u2    2015-03         2 

how tried solve this

i figured for loop work this, having trouble breaking apart dfs / datatypes of df. here rough (not working @ all) code of how thought this:

note: code not work

for user in range(0, len(usersdf)):     theuserid = usersdf['userid'][user]     thesignupmonth = usersdf['signupmonth'][user]     analysismonth = pd.to_datetime(datetime.datetime(2015,03,31)).to_period('m')     numperiods = analysismonth - thesignupmonth     months = pd.date_range(thesignupmonth, periods=numperiods, freq='m')     month in months:         df.append([theuserid, thesignupmonth, month]) 

there few issues approach.

  1. using loop doesn't seem 'panda-ish' me.
  2. the numperiods part doesn't work. know there ways datetime determine number of days/months/etc between 2 dates, trying calculate difference between month series not work.
  3. incrementally building dataframe (per userid, per month) has given me bit of problem since we're dealing strings , panda timeseries objects.

how build dataframe looking for?

here way using example dataframe:

add helper column aggregate (needed other columns used in indices of pivot table)

visitdf['counter'] = 1 

then create pivot table visitdate column in index itself, , other dimensions in other.

df =pd.pivot_table(visitdf,index = 'visitdate', columns=['visitmonth','userid'],\                values='counter', aggfunc='sum'). 

then resample index day. creates nan values days no values

df =df.resample('d',how='sum') 

then unstack frame, reset index , fill nan 0 frame need.

df.unstack().reset_index().fillna(0) 

Comments

Popular posts from this blog

shopping cart - Page redirect not working PHP -

php - How to modify a menu to show sub-menus -

python - Installing PyDev in eclipse is failed -