Apache Spark - MLlib missing values handling
I'm using the basic corr interface in MLlib:

    val a: RDD[Double] = sc.makeRDD(Seq(1.0, 1.0, 0.0))
    val b: RDD[Double] = sc.makeRDD(Seq(1.0, -1.0, 0.0))
    val r = Statistics.corr(a, b)
    println(r)

Is it possible to have casewise or pairwise removal of NaN and Infinity values?
By default, MLlib's corr returns NaN when the input contains Infinity or NaN values.
To my knowledge, there is no built-in function for this, so you need to filter the values out yourself. One approach is to use the java.lang.Double (http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html) functionality:
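    import java.lang.Double.isNaN
    import java.lang.Double.isInfinite

    // Keep only ordinary finite values. Note that filtering each RDD on its
    // own is only safe when the bad values sit at the same positions in both:
    // corr needs the same number of elements per partition in each RDD.
    val filtered1 = data1.filter(x => !isNaN(x) && !isInfinite(x))
    val filtered2 = data2.filter(x => !isNaN(x) && !isInfinite(x))
    val r = Statistics.corr(filtered1, filtered2)
    println(r)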
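If the NaN or Infinity values can occur at different positions in the two RDDs, one way to get casewise removal is to zip the RDDs first and drop a whole pair whenever either side is not finite. This is only a minimal sketch under some assumptions: the names CasewiseCorr, bothFinite and casewiseCorr are made up for illustration and are not part of MLlib, the sample data is invented, the local[2] master is just for trying it out, and both input RDDs are assumed to have the same number of partitions and elements per partition (which zip requires).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names come from MLlib.
object CasewiseCorr {

  // True only when both values are ordinary finite numbers.
  def bothFinite(x: Double, y: Double): Boolean =
    !(java.lang.Double.isNaN(x) || java.lang.Double.isInfinite(x) ||
      java.lang.Double.isNaN(y) || java.lang.Double.isInfinite(y))

  // Casewise deletion: pair the values up, drop every pair with a bad
  // member, then split the surviving pairs back into two aligned RDDs.
  def casewiseCorr(a: RDD[Double], b: RDD[Double]): Double = {
    val clean = a.zip(b).filter { case (x, y) => bothFinite(x, y) }.cache()
    Statistics.corr(clean.map(_._1), clean.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Local master used here purely for experimentation.
    val conf = new SparkConf().setAppName("casewise-corr").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Invented sample data with bad values at different positions.
      val a = sc.makeRDD(Seq(1.0, 2.0, Double.NaN, 4.0, 5.0))
      val b = sc.makeRDD(Seq(2.0, 4.0, 6.0, Double.PositiveInfinity, 10.0))
      println(casewiseCorr(a, b))
    } finally {
      sc.stop()
    }
  }
}
```

Because both cleaned RDDs are derived from the same zipped RDD, they stay aligned element by element, which is what Statistics.corr expects. With only two variables, casewise and pairwise removal give the same result; the distinction only matters once you compute a correlation matrix over more than two columns.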