python - Resampling in scikit-learn and/or pandas
Is there a built-in function in either pandas or scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.
For example, if my data has 75% men and 25% women, I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50.)
What I need is something that resamples my data according to specified proportions.
Stratified sampling means that the class distribution is preserved. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit, as long as you have a categorical variable for which you want to ensure the same distribution in each fold. Just use that variable instead of the target variable. For example, if you have a categorical variable in column i:
from sklearn import cross_validation
skf = cross_validation.StratifiedKFold(X[:, i])
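In more recent scikit-learn releases the cross-validation splitters live in sklearn.model_selection, so the same idea might look roughly like the sketch below. The toy data (a 75/25 column of men and women), the column index i and the number of splits are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Hypothetical data: column i holds the categorical feature (1 = men, 0 = women).
i = 0
X = np.column_stack([
    np.repeat([1, 0], [75, 25]),  # 75% men, 25% women
    rng.normal(size=100),         # some unrelated numeric feature
])

# The labels we want to stratify on come from the feature column, not the target.
strata = X[:, i].astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Passing `strata` in the position of y makes every fold keep the 75/25 ratio.
folds = list(skf.split(X, strata))

# Share of men in each test fold.
test_shares = [strata[test_idx].mean() for _, test_idx in folds]
```

Note that this only preserves the existing proportions in each fold; it does not change them.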
However, if I understand correctly, you want to resample to a certain target distribution (e.g. 50/50) of one of the categorical features. I guess you would have to come up with your own method for such a sample (split the dataset by the variable's value, then take the same number of random samples from each split). If your main motivation is to balance the training set for a classifier, a trick could be to adjust the sample_weights. You can set the weights so that they balance the training set according to the desired variable:
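import sklearn.preprocessing
from sklearn import svm

# Weight each sample so that the values of column i are balanced.
sample_weights = sklearn.preprocessing.balance_weights(X[:, i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)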
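If balance_weights is not available in your scikit-learn version, compute_sample_weight from sklearn.utils.class_weight with the "balanced" option should give a similar effect, as far as I can tell. The following is only a sketch; the toy data and the column index i are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Hypothetical data: column i is the categorical feature (1 = men, 0 = women).
i = 0
group = np.repeat([1, 0], [75, 25])
X = np.column_stack([group, rng.normal(size=(100, 2))])
y = np.tile([0, 1], 50)  # arbitrary binary target, for illustration only

# "balanced" weights are inversely proportional to each group's frequency,
# so both groups end up with the same total weight.
sample_weights = compute_sample_weight("balanced", X[:, i].astype(int))

clf = SVC()
clf.fit(X, y, sample_weight=sample_weights)

# Total weight carried by each group.
weight_men = sample_weights[group == 1].sum()
weight_women = sample_weights[group == 0].sum()
```

The dataset itself is left unchanged here; only the influence of each row on the fit is rescaled.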
For a non-uniform target distribution, you would have to adjust the sample_weights accordingly.
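If you would rather resample the rows than reweight them, the split-and-sample approach mentioned above could be sketched with pandas along these lines. The helper resample_to_proportions, its parameters, and the toy DataFrame are hypothetical names made up for this example; they are not part of pandas or scikit-learn.

```python
import pandas as pd


def resample_to_proportions(df, column, proportions, n_samples, random_state=None):
    """Draw about n_samples rows so that `column` follows `proportions`.

    Hypothetical helper: `proportions` maps each category value to its
    desired fraction, e.g. {"m": 0.5, "f": 0.5}.
    """
    parts = []
    for value, fraction in proportions.items():
        group = df[df[column] == value]
        n = int(round(fraction * n_samples))
        # Sample with replacement only if the group has too few rows.
        replace = n > len(group)
        parts.append(group.sample(n=n, replace=replace, random_state=random_state))
    # Shuffle so the categories are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy data: 75% men, 25% women.
df = pd.DataFrame({"sex": ["m"] * 75 + ["f"] * 25, "value": range(100)})

balanced = resample_to_proportions(
    df, "sex", {"m": 0.5, "f": 0.5}, n_samples=50, random_state=0
)
counts = balanced["sex"].value_counts().to_dict()
```

Other target splits (say 60/40) only need a different proportions dict. When a requested group size exceeds the rows available, the sketch falls back to sampling with replacement, which duplicates rows.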