python - Resampling in scikit-learn and/or pandas

September 15, 2012

Is there a built-in function in either pandas or scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.

For example, if my data has 75% men and 25% women, I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50.)

What I need is something that resamples my data according to specified proportions.

Stratified sampling means that the class distribution is preserved. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit, as long as you have a categorical variable whose distribution you want to keep the same in each fold. Just use that variable instead of the target variable. For example, if you have a categorical variable in column i:

skf = cross_validation.StratifiedKFold(X[:, i])
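The line above uses the older sklearn.cross_validation module, which later scikit-learn releases replaced with sklearn.model_selection. A minimal sketch of the same idea with the newer API might look like this; the toy data, the 0/1 encoding of the categorical column and the choice of 5 folds are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Toy data (assumed): column 0 is the categorical feature, 0 = men, 1 = women.
gender = np.array([0] * 75 + [1] * 25)
X = np.column_stack([gender, rng.normal(size=100)])
y = np.tile([0, 1], 50)

i = 0  # index of the categorical column to stratify on
strata = X[:, i].astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Pass the categorical column where the target would normally go,
# so every fold keeps the 75/25 split of that column.
fold_shares = []
for train_idx, test_idx in skf.split(X, strata):
    fold_shares.append(strata[test_idx].mean())  # share of women in the test fold
```

Each test fold here should contain the same share of women as the full dataset, which is exactly what stratification promises and nothing more: it preserves the 75/25 split rather than changing it.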

However, if I understand you correctly, you want to resample to a certain target distribution (e.g. 50/50) of one of the categorical features. I guess you would have to come up with your own method to draw such a sample (split the dataset by the variable's value, then take the same number of random samples from each split). If your main motivation is to balance the training set for a classifier, one trick is to adjust the sample_weights instead. You can set the weights so that they balance the training set according to the desired variable:

sample_weights = sklearn.preprocessing.balance_weights(X[:, i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)
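As far as I know, balance_weights is no longer available in current scikit-learn releases. A rough equivalent, assuming a recent version, is compute_sample_weight from sklearn.utils.class_weight with the "balanced" setting. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Toy data (assumed): column 0 is the categorical feature, 0 = men, 1 = women.
gender = np.array([0] * 75 + [1] * 25)
X = np.column_stack([gender, rng.normal(size=100)])
y = np.tile([0, 1], 50)
i = 0

# "balanced" assigns each row n_samples / (n_groups * size_of_its_group),
# so both groups end up with the same total weight.
sample_weights = compute_sample_weight("balanced", X[:, i].astype(int))

clf = SVC()
clf.fit(X, y, sample_weight=sample_weights)

men_total = sample_weights[gender == 0].sum()
women_total = sample_weights[gender == 1].sum()
```

Note that the weights are computed from the categorical column, not from y, which mirrors what the snippet above does with balance_weights.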

For a non-uniform target distribution, you would have to adjust the sample_weights accordingly.
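Both routes mentioned above (reweighting, or splitting by value and sampling from each split) extend to arbitrary target proportions. Here is a rough sketch with pandas; proportion_weights and resample_to_proportions are hypothetical helper names invented for this example, not functions from pandas or scikit-learn.

```python
import numpy as np
import pandas as pd


def proportion_weights(groups, target):
    """Per-row weights so that the weighted group shares match `target`."""
    groups = pd.Series(groups)
    observed = groups.value_counts(normalize=True)
    # A row in group g gets weight target[g] / observed[g].
    return groups.map(lambda g: target[g] / observed[g]).to_numpy()


def resample_to_proportions(df, column, target, n, seed=0):
    """Draw about n rows so that df[column] follows the `target` shares."""
    parts = []
    for k, (value, share) in enumerate(target.items()):
        subset = df[df[column] == value]
        size = int(round(n * share))
        # Fall back to sampling with replacement only when the group is too small.
        parts.append(
            subset.sample(n=size, replace=size > len(subset), random_state=seed + k)
        )
    # Shuffle so the groups are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Toy data (assumed): 75% men, 25% women.
df = pd.DataFrame({"gender": ["m"] * 75 + ["f"] * 25, "x": np.arange(100)})

balanced = resample_to_proportions(df, "gender", {"m": 0.5, "f": 0.5}, n=50)
skewed = resample_to_proportions(df, "gender", {"m": 0.3, "f": 0.7}, n=100)
weights = proportion_weights(df["gender"], {"m": 0.3, "f": 0.7})
```

Resampling with replacement duplicates rows from the smaller group, so reweighting is usually the gentler option when the estimator accepts sample_weight.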
