Scrape a webpage with Beautiful Soup and export relevant data to CSV
I am looking for feedback on my code: how I can simplify some operations, use more efficient methods, or apply other best practices. This is my first web-scraping project.
The goal here is to scrape a dentist's page for reviews and export them to CSV. I have not solved the issue with infinite scrolling, but ignore that for now. For the existing code, what can I improve?
I used IPython as I had to do a lot of trial and error to get Beautiful Soup to produce the right output. The 4th review is in a different format than every other review on the site, and the resulting NoneTypes broke most of my loops. I found workarounds, but I'm not sure they're ideal either. Thanks in advance.
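One hedged way to tame the NoneType problem (a sketch of mine, not taken from the notebook) is a small helper that returns a placeholder whenever a CSS selector matches nothing, so every field extraction goes through the same guarded path. The markup below is a toy stand-in for one review block, not the real page structure:

```python
from bs4 import BeautifulSoup

def select_text(tag, selector, index=0, default="NaN"):
    """Return the stripped text of the index-th match for a CSS selector,
    or `default` when the selector matches too few elements."""
    matches = tag.select(selector)
    if len(matches) > index:
        return matches[index].text.strip()
    return default

# Toy markup standing in for one review block (hypothetical structure).
html = '<div><b>1</b><span itemprop="datePublished">Jan 2018</span></div>'
block = BeautifulSoup(html, "html.parser")

print(select_text(block, "b"))                               # "1"
print(select_text(block, 'span[itemprop="datePublished"]'))  # "Jan 2018"
print(select_text(block, 'span[itemprop="reviewBody"]'))     # "NaN" -- tag missing
```

With a helper like this, the oddly formatted 4th review simply yields `"NaN"` fields instead of raising mid-loop.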
The full notebook is here:
https://github.com/joepope44/dentist_reviews/blob/master/upwork-web-scrape-2nd-approach-v2.ipynb
The meat of the code is here though:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.toothssenger.com/118276-ofallon-dentist-dr-edward-logan#read_review'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")

blocks = soup.find_all("div", attrs={"class": "jsReviewItem"}, limit=20)

records = []

# TODO: handle infinite scrolling on webpage
for block in blocks:
    # review number
    review_num = block.select("div b")
    if review_num:
        review_num = review_num[0].text.strip()
    else:
        review_num = "NaN"

    # reviewer name
    name = block.select("span")
    if len(name) > 1:
        name = name[1].text.strip()
    else:
        name = "NaN"

    # date published
    date = block.select('span[itemprop="datePublished"]')
    if date:
        date = date[0].text.strip()
    else:
        date = "NaN"

    # review text
    review = block.find("span", attrs={"itemprop": "reviewBody"})
    if review is not None:
        review = review.text.strip()
    else:
        review = "NaN"

    # select the ratings tags and count the number of icons for each rating type
    ratings = block.select('div[class="mb-0-20"]')
    fac_rating = serv_rating = painless_rating = results_rating = cost_rating = "NaN"
    if len(ratings) >= 5:
        fac_rating = len(ratings[0].find_all("i"))       # facilities rating
        serv_rating = len(ratings[1].find_all("i"))      # service rating
        painless_rating = len(ratings[2].find_all("i"))  # painless rating
        results_rating = len(ratings[3].find_all("i"))   # results rating
        cost_rating = len(ratings[4].find_all("i"))      # cost rating

    records.append((review_num, name, date, review, fac_rating,
                    serv_rating, painless_rating, results_rating, cost_rating))

print(records[0])
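Since the stated goal is a CSV export, the `records` list can be written out as below. This is a sketch of mine, not code from the notebook: the sample rows and column names are my guesses at the tuple shape, and the filename `reviews.csv` is arbitrary:

```python
import csv

# Hypothetical sample rows in the same shape as the `records` tuples above:
# (review_num, name, date, review, facilities, service, painless, results, cost)
records = [
    ("1", "Jane D.", "Jan 2018", "Great visit.", 5, 5, 4, 5, 3),
    ("2", "John S.", "Feb 2018", "NaN", 4, 4, 4, 4, 4),
]

columns = ["review_num", "name", "date", "review",
           "facilities", "service", "painless", "results", "cost"]

# The stdlib csv module is enough here; with pandas the equivalent would be
# pd.DataFrame(records, columns=columns).to_csv("reviews.csv", index=False)
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)   # header row
    writer.writerows(records)  # one line per review tuple
```

Either route works; pandas only pays off if you also want to clean or analyze the data before exporting.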
python python-3.x csv beautifulsoup
asked Jan 24 at 17:16
Jope
111