Scrape a webpage with Beautiful Soup and export relevant data to CSV
I am looking for feedback on my code: how I can simplify some operations, use more efficient methods, or apply other best practices. This is my first web-scraping project.
The goal here is to scrape a dentist's page for reviews and export them to CSV. I have not solved the issue with infinite scrolling, but ignore that for now. For the existing code, what can I improve?
I used IPython as I had to do a lot of trial and error to get Beautiful Soup to produce the right output. The 4th review is in a different format than every other review on the site, and the resulting NoneTypes broke most of my loops. I found workarounds, but I'm not sure they're ideal either. Thanks in advance.
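One hedged way to tame the NoneType problem (a sketch of mine, not taken from the notebook) is a small helper that returns a placeholder whenever a CSS selector matches nothing, so every field extraction goes through the same guarded path. The markup below is a toy stand-in for one review block, not the real page structure:

```python
from bs4 import BeautifulSoup

def select_text(tag, selector, index=0, default="NaN"):
    """Return the stripped text of the index-th match for a CSS selector,
    or `default` when the selector matches too few elements."""
    matches = tag.select(selector)
    if len(matches) > index:
        return matches[index].text.strip()
    return default

# Toy markup standing in for one review block (hypothetical structure).
html = '<div><b>1</b><span itemprop="datePublished">Jan 2018</span></div>'
block = BeautifulSoup(html, "html.parser")

print(select_text(block, "b"))                               # "1"
print(select_text(block, 'span[itemprop="datePublished"]'))  # "Jan 2018"
print(select_text(block, 'span[itemprop="reviewBody"]'))     # "NaN" -- tag missing
```

With a helper like this, the oddly formatted 4th review simply yields `"NaN"` fields instead of raising mid-loop.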
The full notebook is here:
https://github.com/joepope44/dentist_reviews/blob/master/upwork-web-scrape-2nd-approach-v2.ipynb
The meat of the code is here though:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.toothssenger.com/118276-ofallon-dentist-dr-edward-logan#read_review'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")

blocks = soup.find_all("div", attrs={"class": "jsReviewItem"}, limit=20)

records = []

# TODO: handle infinite scrolling on webpage
for block in blocks:
    # review number
    review_num = block.select("div b")
    if review_num:
        review_num = review_num[0].text.strip()
    else:
        review_num = "NaN"

    # reviewer name
    name = block.select("span")
    if len(name) > 1:
        name = name[1].text.strip()
    else:
        name = "NaN"

    # date published
    date = block.select('span[itemprop="datePublished"]')
    if date:
        date = date[0].text.strip()
    else:
        date = "NaN"

    # review text
    review = block.find("span", attrs={"itemprop": "reviewBody"})
    if review is not None:
        review = review.text.strip()
    else:
        review = "NaN"

    # select the ratings tags and count the number of icons for each rating type
    ratings = block.select('div[class="mb-0-20"]')
    fac_rating = serv_rating = painless_rating = results_rating = cost_rating = "NaN"
    if len(ratings) >= 5:
        fac_rating = len(ratings[0].find_all("i"))       # facilities rating
        serv_rating = len(ratings[1].find_all("i"))      # service rating
        painless_rating = len(ratings[2].find_all("i"))  # painless rating
        results_rating = len(ratings[3].find_all("i"))   # results rating
        cost_rating = len(ratings[4].find_all("i"))      # cost rating

    records.append((review_num, name, date, review, fac_rating,
                    serv_rating, painless_rating, results_rating, cost_rating))

print(records[0])
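Since the stated goal is a CSV export, the `records` list can be written out as below. This is a sketch of mine, not code from the notebook: the sample rows and column names are my guesses at the tuple shape, and the filename `reviews.csv` is arbitrary:

```python
import csv

# Hypothetical sample rows in the same shape as the `records` tuples above:
# (review_num, name, date, review, facilities, service, painless, results, cost)
records = [
    ("1", "Jane D.", "Jan 2018", "Great visit.", 5, 5, 4, 5, 3),
    ("2", "John S.", "Feb 2018", "NaN", 4, 4, 4, 4, 4),
]

columns = ["review_num", "name", "date", "review",
           "facilities", "service", "painless", "results", "cost"]

# The stdlib csv module is enough here; with pandas the equivalent would be
# pd.DataFrame(records, columns=columns).to_csv("reviews.csv", index=False)
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)   # header row
    writer.writerows(records)  # one line per review tuple
```

Either route works; pandas only pays off if you also want to clean or analyze the data before exporting.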
python python-3.x csv beautifulsoup
asked Jan 24 at 17:16
Jope
111