Build a timeseries dataframe in pandas using a start date / end date -
i attempting complete cohort analysis in pandas. able @ activity of user @ month 1, month 2, month n , group when user signed up. use groupby accomplish this, if user inactive during particular month, not show in data month.
ideally, here dataframe trying build:
| userid | signupdate | visitmonth | visits | |--------|------------|------------|--------| | 1 | 2014-03 | 2014-03 | 2 | | 1 | 2014-03 | 2014-04 | 1 | | 1 | 2014-03 | 2014-05 | 0 | | 1 | 2014-03 | 2014-06 | 0 | | 1 | 2014-03 | 2014-07 | 0 | | 1 | 2014-03 | 2014-08 | 0 | | 1 | 2014-03 | 2014-09 | 0 | | 1 | 2014-03 | 2014-10 | 1 | | 1 | 2014-03 | 2014-11 | 0 | | 1 | 2014-03 | 2014-12 | 0 | | 1 | 2014-03 | 2015-01 | 2 | | 1 | 2014-03 | 2015-02 | 1 | | 1 | 2014-03 | 2015-03 | 0 | | 2 | 2015-01 | 2015-01 | 2 | | 2 | 2015-01 | 2015-02 | 0 | | 2 | 2015-01 | 2015-03 | 2 | here sample data:
#sample of users users = [{'userid': 'u1', 'signupdate': '2014-03-15'}, {'userid': 'u2', 'signupdate': '2015-01-10'}] #create dataframe users usersdf = pd.dataframe(data=users) #convert signupdate object date usersdf['signupdate'] = pd.to_datetime(usersdf['signupdate']) #add new column build month-only usersdf['signupmonth'] = usersdf['signupdate'].dt.to_period('m') #sample of visits visits =[{'userid': 'u1', 'visitdate': '2014-03-15'}, {'userid': 'u1', 'visitdate': '2014-03-28'}, {'userid': 'u1', 'visitdate': '2014-04-10'}, {'userid': 'u1', 'visitdate': '2014-10-28'}, {'userid': 'u1', 'visitdate': '2015-01-02'}, {'userid': 'u2', 'visitdate': '2015-01-10'}, {'userid': 'u2', 'visitdate': '2015-01-11'}, {'userid': 'u1', 'visitdate': '2014-01-20'}, {'userid': 'u1', 'visitdate': '2014-02-15'}, {'userid': 'u2', 'visitdate': '2015-03-01'}, {'userid': 'u2', 'visitdate': '2015-03-18'}] #create dataframe of visits visitdf = pd.dataframe(data=visits) #convert visitdate object date visitdf['visitdate'] = pd.to_datetime(visitdf['visitdate']) #add new column build month-only visitdf['visitmonth'] = visitdf['visitdate'].dt.to_period('m') #group visits user month visits_grouped = visitdf.groupby(['userid', 'visitmonth']).agg(['count']) #remove index on userid can `merge` visits_grouped.reset_index(inplace=true) #grouped dataframe cohortdf = visits_grouped.merge(usersdf[['userid', 'signupmonth']], on='userid', how='left') this groupby/merge builds dataframe there visits, want able aggregate other fields regardless of if there visit or not.
grouped df output:
userid visitmonth visitdate count 0 u1 2014-01 1 1 u1 2014-02 1 2 u1 2014-03 2 3 u1 2014-04 1 4 u1 2014-10 1 5 u1 2015-01 1 6 u2 2015-01 2 7 u2 2015-03 2 how tried solve this
i figured for loop work this, having trouble breaking apart dfs / datatypes of df. here rough (not working @ all) code of how thought this:
note: code not work
for user in range(0, len(usersdf)): theuserid = usersdf['userid'][user] thesignupmonth = usersdf['signupmonth'][user] analysismonth = pd.to_datetime(datetime.datetime(2015,03,31)).to_period('m') numperiods = analysismonth - thesignupmonth months = pd.date_range(thesignupmonth, periods=numperiods, freq='m') month in months: df.append([theuserid, thesignupmonth, month]) there few issues approach.
- using loop doesn't seem 'panda-ish' me.
- the numperiods part doesn't work. know there ways
datetimedetermine number of days/months/etc between 2 dates, trying calculate difference between month series not work. - incrementally building dataframe (per userid, per month) has given me bit of problem since we're dealing strings , panda timeseries objects.
how build dataframe looking for?
here way using example dataframe:
add helper column aggregate (needed other columns used in indices of pivot table)
visitdf['counter'] = 1 then create pivot table visitdate column in index itself, , other dimensions in other.
df =pd.pivot_table(visitdf,index = 'visitdate', columns=['visitmonth','userid'],\ values='counter', aggfunc='sum'). then resample index day. creates nan values days no values
df =df.resample('d',how='sum') then unstack frame, reset index , fill nan 0 frame need.
df.unstack().reset_index().fillna(0)
Comments
Post a Comment