Matching entities in SQL Server to replace iterative Python script

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.
+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:
- Connect to the database and filter the
target_dbto only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4 - Calculate levenshtein with all possible matches (using fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT distinct input.name,
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein asc
Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
performance sql sql-server edit-distance
add a comment |Â
up vote
2
down vote
favorite
I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.
+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:
- Connect to the database and filter the
target_dbto only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4 - Calculate levenshtein with all possible matches (using fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT distinct input.name,
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein asc
Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
performance sql sql-server edit-distance
1
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.
+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:
- Connect to the database and filter the
target_dbto only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4 - Calculate levenshtein with all possible matches (using fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT distinct input.name,
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein asc
Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
performance sql sql-server edit-distance
I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.
+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:
- Connect to the database and filter the
target_dbto only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4 - Calculate levenshtein with all possible matches (using fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT distinct input.name,
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein asc
Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
performance sql sql-server edit-distance
edited Feb 11 at 8:14
Mast
7,33663484
7,33663484
asked Feb 10 at 22:18
pawelty
1112
1112
1
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30
add a comment |Â
1
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30
1
1
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30
add a comment |Â
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187281%2fmatching-entities-in-sql-server-to-replace-iterative-python-script%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
1
Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the Select. And poor Aunt Mary Butcher will not be found, using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316
â dnoeth
Feb 11 at 11:30