r - "fuzzy key matching" for a data.table merge -
i'm trying match workers year year using name strings , measure of experience. experience can increase @ 1 year year, i'd use matching when other metrics fail.
for example:
dt1<-data.table(name=c("jane doe","jane doe", "john doe","jane smith"), exp=c(0.,5,1,2),id=1:4,key="name") dt2<-data.table(name=c("jane doe","jane doe", "john doe","jane smith"), exp=c(0,30,1.5,2),key="name")
i want match first "jane doe" in dt1
first "jane doe" in dt2
. latter "jane doe"s don't match, because they're different people (based on vastly different experience levels).
i'd add flags know matched these people in way later on down line. here's first pass:
dt2[dt1,`:=`(id=ifelse(exp<=i.exp+1,i.id,na), flag=ifelse(exp<=i.exp+1,i.id,na))]
but not working--here's gives me:
> dt2 name exp id flag 1: jane doe 0.0 2 2 2: jane doe 30.0 na na 3: jane smith 2.0 4 4 4: john doe 1.5 3 3
it seems have missed matching latter "jane doe", appears have matched first "jane doe" wrong prior "jane doe". i'm not quite sure why is; anyway, seems preferable have way incorporate matching on exp
before instead of after joining--this clean ifelse
mess in defining new variables. suggestions?
for clarity, here's desired output:
> dt2 name exp id flag 1: jane doe 1.0 1 1 2: jane doe 30.0 na na 3: jane smith 2.0 4 1 4: john doe 1.5 3 1
in case join isn't "fuzzy". trying join name
exp
while allowing 1 one year distance per match. use rolling join -1l
specification.
first correctly key data sets
setkey(dt1, name, exp) setkey(dt2, name, exp)
then, perform rolling join while passing -1l
value
dt2[dt1, `:=`(id = i.id, flag = 1l), roll = -1l] df2 # name exp id flag # 1: jane doe 0.0 1 1 # 2: jane doe 30.0 na na # 3: jane smith 2.0 4 1 # 4: john doe 1.5 3 1
in future, if you''ll need conduct interval join such c(1l, -1l)
can take here examples of foverlaps
function.
Comments
Post a Comment