sql server - T-SQL Get Rows With Similar Company Name Using Levenshtein -
i'm using levenshtein function t-sql works (i'm not worried performance). want write query returns rows levenshtein distance less x (where x might 5 example) using company name field comparison.
i've tried following, returns thousands of duplicate rows.
select * contacts c1, contacts c2 dbo.ufncomparestring(c1.company, c2.company) < 5 , c1.id <> c2.id
i show list this:
1 apple experts 20 apple experts inc. 240 h&k paving 21 h , k paving 98 hk paving 189 h.k. paving 5 j.m. lawn care 105 jm lawn care
is possible this? doing wrong?
edit
i ended query looks this. found there "invalid" entries causing problems having:
select c1.contactid, c1.company, c1.lastname, c1.firstname, c2.contactid, c2.company, c2.lastname, c2.firstname contacts c1, contacts c2 cast(c1.contactid int) < cast(c2.contactid int) , c1.company not null , replace(c1.company, ' ', '') <> '' , c2.company not null , replace(c2.company, ' ', '') <> '' , len(c1.company) > 6 , len(c2.company) > 6 , dbo.ufncomparestring(c1.company, c2.company) < 5
note query pretty slow running (on 12,000 records) , have different query more effective. goal find duplicate companies had been entered using different company names , query returned many false positives. query used, it's complicated show here , outside scope of question.
to reduce duplicates, use instead:
select * contacts c1, contacts c2 dbo.ufncomparestring(c1.company, c2.company) < 5 , c1.id < c2.id
it returns unique pairs of contacts, distance less 5.
Comments
Post a Comment