Matching entities in SQL Server to replace iterative Python script

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:



+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+


Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.



+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+


I wrote a script in Python that does the following for each of the companies in my sample:



  1. Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

  2. Calculate levenshtein with all possible matches (using fuzzywuzzy library)

  3. Choose the match with the highest similarity

This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.



I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:



SELECT distinct input.name, 
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc


Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?







share|improve this question

















  • 1




    Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
    – dnoeth
    Feb 11 at 11:30
















up vote
2
down vote

favorite












I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:



+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+


Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.



+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+


I wrote a script in Python that does the following for each of the companies in my sample:



  1. Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

  2. Calculate levenshtein with all possible matches (using fuzzywuzzy library)

  3. Choose the match with the highest similarity

This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.



I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:



SELECT distinct input.name, 
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc


Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?







share|improve this question

















  • 1




    Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
    – dnoeth
    Feb 11 at 11:30












up vote
2
down vote

favorite









up vote
2
down vote

favorite











I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:



+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+


Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.



+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+


I wrote a script in Python that does the following for each of the companies in my sample:



  1. Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

  2. Calculate levenshtein with all possible matches (using fuzzywuzzy library)

  3. Choose the match with the highest similarity

This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.



I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:



SELECT distinct input.name, 
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc


Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?







share|improve this question













I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:



+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+


Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.



+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+


I wrote a script in Python that does the following for each of the companies in my sample:



  1. Connect to the database and filter the target_db to only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4

  2. Calculate levenshtein with all possible matches (using fuzzywuzzy library)

  3. Choose the match with the highest similarity

This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.



I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:



SELECT distinct input.name, 
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7

ORDER BY levenshtein asc


Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?









share|improve this question












share|improve this question




share|improve this question








edited Feb 11 at 8:14









Mast

7,33663484




7,33663484









asked Feb 10 at 22:18









pawelty

1112




1112







  • 1




    Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
    – dnoeth
    Feb 11 at 11:30












  • 1




    Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
    – dnoeth
    Feb 11 at 11:30







1




1




Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
– dnoeth
Feb 11 at 11:30




Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
– dnoeth
Feb 11 at 11:30















active

oldest

votes











Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187281%2fmatching-entities-in-sql-server-to-replace-iterative-python-script%23new-answer', 'question_page');

);

Post as a guest



































active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes










 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187281%2fmatching-entities-in-sql-server-to-replace-iterative-python-script%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods