python - How to reindex csv data efficiently?
I have a file of tick data that I downloaded from the internet. It looks like this. The file is relatively "large".
time,bid,bid_depth,bid_depth_total,offer,offer_depth,offer_depth_total
20150423T014501,81.79,400,400,81.89,100,100
20150423T100001,81.,100,100,84.36,100,100
20150423T100017,81.,100,100,83.52,500,500
20150423T115258,81.01,500,500,83.52,500,500
...
I want to reindex the data so that I can access it through time-based queries:
from pylab import *
from pandas import *
import pandas.io.date_converters as conv

xle = read_csv('xle.csv')
# chunking seems kludgy
# xle = pd.read_csv('xle.csv', chunksize=4)
# preferred, but it can't handle the time format?
# xle = pd.read_csv('xle.csv', index_col=0, parse_dates=True)
xle = xle.drop_duplicates(cols='time')
for i in xle.index:
    xle['time'][i] = datetime.strptime(xle['time'][i], '%Y%m%dT%H%M%S')
xle.index = xle['time']; del xle['time']
print xle[['bid','offer']].ix[1000:1015].to_string()
# the goal: to be able to manipulate the data through a time index.
My questions are:
- When I run this in the shell, it takes quite a bit of time for just one file. I must be doing something wrong in my approach, since the goal is to read many files and merge them into one pandas DataFrame/time series.
- pandas appears to take an in-memory approach. What do people do when the file(s) are too big to fit in memory? Is there a pandas interface that hides where the data resides, so that a file is loaded from and unloaded to disk as needed while the computation progresses?
- It seems more logical to apply a filter to the time column as it is being read in, rather than operating on it later. Is there a way of telling the read function to call a function on a column as it reads it, before storing the object in memory?
I'm a little too lazy to figure out exactly what's going on here, but it's going to be super slow because you're explicitly looping rather than using pandas' built-in vectorized methods. (Basically, avoid 'for' when using pandas if possible, and it's usually possible.)
for i in xle.index:
    xle['time'][i] = datetime.strptime(xle['time'][i], '%Y%m%dT%H%M%S')
xle.index = xle['time']; del xle['time']
You can convert time to a pandas datetime pretty easily, like this:
xle['time'] = pd.to_datetime(xle.time)
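If the format is known, it can be passed explicitly so pandas doesn't have to guess it. Here's a minimal sketch, assuming the timestamps look like 20150423T014501 (uppercase T) and using a small inline sample in place of xle.csv:

```python
import io

import pandas as pd

# Inline sample standing in for xle.csv (assumption: same 'time' layout).
sample = io.StringIO(
    "time,bid,offer\n"
    "20150423T014501,81.79,81.89\n"
    "20150423T100001,81.00,84.36\n"
    "20150423T100017,81.00,83.52\n"
)
xle = pd.read_csv(sample)

# One vectorized call converts the whole column; no Python-level loop.
xle["time"] = pd.to_datetime(xle["time"], format="%Y%m%dT%H%M%S")
print(xle.dtypes)
```

An explicit format also fails loudly on rows that don't match, which is usually what you want with downloaded data.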
I'm not sure why parse_dates in read_csv didn't work there, but you can use date_parser and specify a specific format that way.
Then, if you want to make it the index:
xle = xle.set_index('time')
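With a datetime index in place, time-based queries become label slices. A rough sketch on inline sample data (the rows and the chosen time window are only for illustration; sorting the index first is my assumption so that slicing behaves predictably):

```python
import io

import pandas as pd

# Inline sample standing in for xle.csv (illustrative rows only).
sample = io.StringIO(
    "time,bid,offer\n"
    "20150423T014501,81.79,81.89\n"
    "20150423T100001,81.00,84.36\n"
    "20150423T100017,81.00,83.52\n"
    "20150423T115258,81.01,83.52\n"
)
xle = pd.read_csv(sample)
xle["time"] = pd.to_datetime(xle["time"], format="%Y%m%dT%H%M%S")

# Drop repeated timestamps, index by time, and sort so slices are well defined.
xle = xle.drop_duplicates(subset="time").set_index("time").sort_index()

# Select every tick between 10:00 and 11:00 on that day by label.
morning = xle.loc["2015-04-23 10:00":"2015-04-23 11:00", ["bid", "offer"]]
print(morning)
```

Note that drop_duplicates takes subset= in current pandas; the cols= keyword from the question is from older versions.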
That should get you started. Once 'time' is a pandas datetime you can do all sorts of things (just see the docs). These things ought to be pretty fast if everything fits in memory. If not, there are a number of existing answers that can help with that, although buying more memory is the simplest solution, if feasible.
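For the many-files and memory questions, one option is to read each file in chunks and keep only the columns you need. This is only a sketch: load_ticks is a hypothetical helper name, the two in-memory "files" stand in for real CSV paths, and the chunk size is arbitrary.

```python
import io

import pandas as pd

HEADER = "time,bid,bid_depth,bid_depth_total,offer,offer_depth,offer_depth_total\n"


def load_ticks(source, chunksize=100_000):
    """Hypothetical helper: read a tick CSV in chunks, keep bid/offer by time."""
    reader = pd.read_csv(
        source, usecols=["time", "bid", "offer"], chunksize=chunksize
    )
    pieces = []
    for chunk in reader:  # loops over chunks, not over individual rows
        chunk["time"] = pd.to_datetime(chunk["time"], format="%Y%m%dT%H%M%S")
        pieces.append(chunk.set_index("time"))
    return pd.concat(pieces)


# Two small in-memory "files" standing in for real paths such as 'xle.csv'.
file_a = io.StringIO(
    HEADER
    + "20150423T014501,81.79,400,400,81.89,100,100\n"
    + "20150423T100001,81.00,100,100,84.36,100,100\n"
)
file_b = io.StringIO(
    HEADER
    + "20150423T100017,81.00,100,100,83.52,500,500\n"
    + "20150423T115258,81.01,500,500,83.52,500,500\n"
)

# Merge all files into one time-indexed frame.
ticks = pd.concat(
    [load_ticks(f, chunksize=2) for f in (file_a, file_b)]
).sort_index()
print(ticks)
```

Chunking only lowers peak memory if each chunk is filtered or aggregated before it is kept; the concatenated result still has to fit in memory.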