Matching entities in SQL Server to replace iterative Python script

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
2
down vote

favorite

I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:

+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+

Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.

+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+

I wrote a script in Python that does the following for each of the companies in my sample:

Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

Calculate levenshtein with all possible matches (using fuzzywuzzy library)

Choose the match with the highest similarity

This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.

I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:

SELECT distinct input.name, 
 target_db.CompanyName, 
 input.Country, 
 input.ReFcode, 
 [dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on 
 (input.ReFcode = target_db.Refcode and 
 LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
 SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc

Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?

edited Feb 11 at 8:14

Mast

7,33663484

asked Feb 10 at 22:18

pawelty

1112

1

Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â€“Â dnoeth
Feb 11 at 11:30

add a commentÂ |Â

up vote
2
down vote

favorite

I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:

+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+

+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+

I wrote a script in Python that does the following for each of the companies in my sample:

Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

Calculate levenshtein with all possible matches (using fuzzywuzzy library)

Choose the match with the highest similarity

I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:

SELECT distinct input.name, 
 target_db.CompanyName, 
 input.Country, 
 input.ReFcode, 
 [dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on 
 (input.ReFcode = target_db.Refcode and 
 LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
 SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc

edited Feb 11 at 8:14

Mast

7,33663484

asked Feb 10 at 22:18

pawelty

1112

1

Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â€“Â dnoeth
Feb 11 at 11:30

add a commentÂ |Â

up vote
2
down vote

favorite

I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:

+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+

+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+

I wrote a script in Python that does the following for each of the companies in my sample:

Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

Calculate levenshtein with all possible matches (using fuzzywuzzy library)

Choose the match with the highest similarity

I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:

SELECT distinct input.name, 
 target_db.CompanyName, 
 input.Country, 
 input.ReFcode, 
 [dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on 
 (input.ReFcode = target_db.Refcode and 
 LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
 SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc

edited Feb 11 at 8:14

Mast

7,33663484

asked Feb 10 at 22:18

pawelty

1112

I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:

+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+

+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+

I wrote a script in Python that does the following for each of the companies in my sample:

Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

Calculate levenshtein with all possible matches (using fuzzywuzzy library)

Choose the match with the highest similarity

I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:

SELECT distinct input.name, 
 target_db.CompanyName, 
 input.Country, 
 input.ReFcode, 
 [dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on 
 (input.ReFcode = target_db.Refcode and 
 LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
 SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc

edited Feb 11 at 8:14

Mast

7,33663484

asked Feb 10 at 22:18

pawelty

1112

edited Feb 11 at 8:14

Mast

7,33663484

edited Feb 11 at 8:14

Mast

7,33663484

edited Feb 11 at 8:14

Mast

7,33663484

asked Feb 10 at 22:18

pawelty

1112

asked Feb 10 at 22:18

pawelty

1112

asked Feb 10 at 22:18

pawelty

1112

1

Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â€“Â dnoeth
Feb 11 at 11:30

add a commentÂ |Â

1

Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â€“Â dnoeth
Feb 11 at 11:30

Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â€“Â dnoeth
Feb 11 at 11:30

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187281%2fmatching-entities-in-sql-server-to-replace-iterative-python-script%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr