Scrape webpage with Beautifulsoup and export relevant data to csv


I am looking for feedback on my code: how I can simplify some operations, use more efficient methods, or apply other best practices. This is my first web scraping project.



The goal here is to scrape a dentist's page for reviews and export them to CSV. I have not solved the issue with infinite scrolling yet, but ignore that for now. What can I improve in the existing code?



I used IPython, as I had to do a lot of trial and error to get BeautifulSoup to produce the right output. The 4th review is in a different format from every other review on the site, and the resulting NoneTypes broke most of my loops. I found workarounds for this, but I am not sure they are ideal either. Thanks in advance.
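One way to tame the NoneType problem is to centralize the "tag or None" check in a single small helper, so every field extraction goes through the same guarded path. This is a sketch of the idea, not code from the notebook; the helper name `text_or_nan` is my own:

```python
from bs4 import BeautifulSoup

def text_or_nan(tag):
    """Return a tag's stripped text, or "NaN" when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else "NaN"

html = '<div><span itemprop="datePublished"> Jan 24, 2018 </span></div>'
soup = BeautifulSoup(html, "html.parser")

# Present tag: stripped text comes back.
date = text_or_nan(soup.find("span", attrs={"itemprop": "datePublished"}))
# Missing tag: find() returns None, and we get the sentinel instead of a crash.
body = text_or_nan(soup.find("span", attrs={"itemprop": "reviewBody"}))
```

With this in place, the oddly formatted 4th review just yields "NaN" fields instead of derailing the loop.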



The full notebook is here:
https://github.com/joepope44/dentist_reviews/blob/master/upwork-web-scrape-2nd-approach-v2.ipynb



The meat of the code is here though:



from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.toothssenger.com/118276-ofallon-dentist-dr-edward-logan#read_review'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

blocks = soup.find_all("div", attrs={"class": "jsReviewItem"}, limit=20)

records = []

# TODO: handle infinite scrolling on webpage

for block in blocks:
    # review number
    review_num = block.select("div b")
    review_num = review_num[0].text.strip() if review_num else "NaN"

    # reviewer name
    names = block.select("span")
    name = names[1].text.strip() if len(names) > 1 else "NaN"

    # date published
    date = block.select('span[itemprop="datePublished"]')
    date = date[0].text.strip() if date else "NaN"

    # review text
    review = block.find("span", attrs={"itemprop": "reviewBody"})
    review = review.text.strip() if review is not None else "NaN"

    # select the ratings tags and count the number of icons for each rating type
    ratings = block.select('div[class="mb-0-20"]')

    if len(ratings) >= 5:
        # facilities, service, painless, results, cost: one <i> icon per star
        fac_rating = len(ratings[0].find_all("i"))
        serv_rating = len(ratings[1].find_all("i"))
        painless_rating = len(ratings[2].find_all("i"))
        results_rating = len(ratings[3].find_all("i"))
        cost_rating = len(ratings[4].find_all("i"))
    else:
        fac_rating = serv_rating = painless_rating = results_rating = cost_rating = "NaN"

    records.append((review_num, name, date, review, fac_rating,
                    serv_rating, painless_rating, results_rating, cost_rating))

print(records[0])
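The title promises a CSV export, but the snippet stops at a print. Since pandas is already imported, the export step could look like the sketch below; the column names are my own guesses mirroring the order of the tuples collected above, and `sample_records` stands in for the scraped list:

```python
import pandas as pd

# Illustrative column names, matching the tuple order built in the loop.
columns = ["review_num", "name", "date", "review",
           "facilities", "service", "painless", "results", "cost"]

# A stand-in for the scraped `records` list.
sample_records = [
    ("Review 1", "Jane D.", "Jan 24, 2018", "Great visit.", 5, 5, 4, 5, 3),
]

df = pd.DataFrame(sample_records, columns=columns)
df.to_csv("reviews.csv", index=False)
```

`index=False` keeps pandas from writing its row index as an extra unnamed column in the CSV.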






asked Jan 24 at 17:16 by Jope