Scrape webpage with BeautifulSoup and export relevant data to CSV

I am looking for feedback on my code: how I can simplify some operations, use more efficient methods, or apply other best practices. This is my first web scraping project.



The goal here is to scrape a dentist's page for reviews and export them to CSV. I have not solved the issue with infinite scrolling, but ignore that for now. For the existing code, what can I improve?



I used IPython, as I had to do a lot of trial and error to get BeautifulSoup to produce the right output. The 4th review is in a different format than every other review on the site, and the resulting NoneType values broke most of my loops. I found workarounds for this, but I'm not sure they are ideal either. Thanks in advance.
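(Editor's note: one way to tame those NoneType crashes is a small fallback helper. This is a sketch added here for illustration, not code from the original notebook; the name `safe_text` is invented.)

```python
from bs4 import BeautifulSoup

def safe_text(tag, default="NaN"):
    """Return the stripped text of a bs4 tag, or a default if the tag is missing."""
    return tag.text.strip() if tag is not None else default

# Works whether or not the tag exists, so a differently formatted
# review no longer raises AttributeError on .text:
soup = BeautifulSoup('<div><span class="a">hi</span></div>', "html.parser")
print(safe_text(soup.find("span", class_="a")))  # hi
print(safe_text(soup.find("span", class_="b")))  # NaN
```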



The full notebook is here:
https://github.com/joepope44/dentist_reviews/blob/master/upwork-web-scrape-2nd-approach-v2.ipynb



The meat of the code is here though:



from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

url = 'https://www.toothssenger.com/118276-ofallon-dentist-dr-edward-logan#read_review'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")

blocks = soup.find_all("div", attrs={"class": "jsReviewItem"}, limit=20)
one_block = soup.find("div", attrs={"class": "jsReviewItem"})

records = []

# TO DO: handle infinite scrolling on webpage

for block in blocks:

    # review number
    review_num = block.select("div b")[0]
    if len(review_num) > 0:
        review_num = review_num.text.strip()
    else:
        review_num = "NaN"

    # reviewer name
    name = block.select("span")[1]
    if len(name) > 0:
        name = name.text.strip()
    else:
        name = "NaN"

    # date published
    date = block.select('span[itemprop="datePublished"]')
    if len(date) > 0:
        date = date[0].text.strip()
    else:
        date = "NaN"

    # review text
    review = block.find("span", {"itemprop": "reviewBody"})
    if review is not None:
        review = review.text.strip()
    else:
        review = "NaN"

    # select ratings tag and count the number of icons for each rating type
    ratings = block.select('div[class="mb-0-20"]')

    if len(ratings) > 0:

        # facilities rating
        fac = ratings[0]
        fac_rating = len(fac.find_all("i"))

        # service rating
        serv = ratings[1]
        serv_rating = len(serv.find_all("i"))

        # painless rating
        painless = ratings[2]
        painless_rating = len(painless.find_all("i"))

        # results rating
        results = ratings[3]
        results_rating = len(results.find_all("i"))

        # cost rating
        cost = ratings[4]
        cost_rating = len(cost.find_all("i"))

    else:
        ratings = "NaN"

    records.append((review_num, name, date, review, fac_rating, serv_rating, painless_rating, cost_rating))

print(records[0])
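(Editor's note: the snippet stops at building `records` and never reaches the stated CSV-export goal. Below is a minimal sketch of that last step using the `pandas` import already in the code; the column names and the `reviews.csv` filename are assumptions, and the sample row is invented for illustration.)

```python
import pandas as pd

# Hypothetical rows shaped like the 8-field tuples the loop above appends.
records = [
    ("1", "Jane D.", "2018-01-10", "Great visit.", 5, 5, 4, 5),
]
columns = ["review_num", "name", "date", "review",
           "facilities", "service", "painless", "cost"]

# Build a DataFrame from the tuples and write it out without the index column.
df = pd.DataFrame(records, columns=columns)
df.to_csv("reviews.csv", index=False)
print(df.shape)  # (1, 8)
```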






asked Jan 24 at 17:16 by Jope



