Python yelp scraper

I've written a script in Python to parse the names and phone numbers of restaurants from yelp.com. The scraper is doing its job just fine. Its most important feature is that it handles pagination on the fly (if there is any), no matter how many pages it has to traverse. I tried to follow OOP guidelines when writing it. However, I suppose it can still be improved, for example by isolating the while True loop in a separate function.



This is the script:



    import requests
    from urllib.parse import quote_plus
    from bs4 import BeautifulSoup

    class YelpScraper:
        link = 'https://www.yelp.com/search?find_desc={}&find_loc={}&start={}'

        def __init__(self, name, location, num="0"):
            self.name = quote_plus(name)
            self.location = quote_plus(location)
            self.num = quote_plus(num)
            self.base_url = self.link.format(self.name,self.location,self.num)
            self.session = requests.Session()

        def get_info(self):
            s = self.session
            s.headers = {'User-Agent': 'Mozilla/5.0'}

            while True:
                res = s.get(self.base_url)
                soup = BeautifulSoup(res.text, "lxml")
                for items in soup.select("div.biz-listing-large"):
                    name = items.select_one(".biz-name span").get_text(strip=True)
                    try:
                        phone = items.select_one("span.biz-phone").get_text(strip=True)
                    except AttributeError: phone = ""
                    print("Name: {}\nPhone: {}\n".format(name,phone))

                link = soup.select_one(".pagination-links .next")
                if not link:break
                self.base_url = "https://www.yelp.com" + link.get("href")

    if __name__ == '__main__':
        scrape = YelpScraper("Restaurants","San Francisco, CA")
        scrape.get_info()






asked Jun 10 at 8:21 by Topto; edited Jun 10 at 8:57 by Daniel




















          1 Answer













1. You don't need to quote the parameters yourself; requests can do it for you.

2. You don't need a class for this; a simple function will suffice. I'd extract retrieving content from a URL into another function, though.

3. Separate logic from presentation: have your function return a list of name/phone pairs and make the calling code responsible for printing them. Better yet, turn the function into a generator and yield the pairs as you go.

4. There is no need to decode the content before parsing it: the lxml parser works best with a sequence of bytes, as it can inspect the <head> to pick the appropriate encoding.
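As an illustration of the first point, here is a minimal sketch that uses only the standard library so it runs without network access. It assumes that requests encodes a `params` dict in much the same way as `urllib.parse.urlencode`; the `BASE`, `manual`, and `automatic` names are made up for the example.

```python
from urllib.parse import quote_plus, urlencode

BASE = 'https://www.yelp.com/search'

# Manual approach, as in the question: quote each value, then format.
manual = '{}?find_desc={}&find_loc={}&start={}'.format(
    BASE,
    quote_plus('Restaurants'),
    quote_plus('San Francisco, CA'),
    quote_plus('0'),
)

# Dict-based approach: one call encodes every key/value pair.
# Non-string values such as the integer 0 are converted for us.
params = {'find_desc': 'Restaurants', 'find_loc': 'San Francisco, CA', 'start': 0}
automatic = BASE + '?' + urlencode(params)

print(automatic)
print(manual == automatic)
```

Both approaches should produce the same URL, which is why passing a dict and letting the library do the quoting is usually the less error-prone choice.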

          Proposed improvements:



    import requests
    from bs4 import BeautifulSoup


    def url_fetcher(session, route, base_url='https://www.yelp.com', **kwargs):
        params = kwargs if kwargs else None

        return session.get(base_url + route, params=params)


    def yelp_scraper(name, location, num=0):
        session = requests.Session()
        session.headers = {'User-Agent': 'Mozilla/5.0'}

        response = url_fetcher(session, '/search', find_desc=name, find_loc=location, start=num)
        while True:
            soup = BeautifulSoup(response.content, 'lxml')
            for items in soup.select('div.biz-listing-large'):
                name = items.select_one('.biz-name span').get_text(strip=True)
                try:
                    phone = items.select_one('span.biz-phone').get_text(strip=True)
                except AttributeError:
                    phone = ''
                yield name, phone

            link = soup.select_one('.pagination-links .next')
            if not link:
                break
            response = url_fetcher(session, link.get('href'))


    if __name__ == '__main__':
        for name, phone in yelp_scraper('Restaurants', 'San Francisco, CA'):
            print('Name:', name)
            print('Phone:', phone)
            print()
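One consequence of the generator design is that the caller decides how much to consume. The sketch below is an offline stand-in, not the scraper itself: `paged_results`, `pages`, and `fetched` are hypothetical names, and appending to `fetched` stands in for an HTTP request. It suggests that stopping early with `itertools.islice` leaves later pages unrequested.

```python
from itertools import islice


def paged_results(pages, fetched):
    """Stand-in for a paginated scraper: yield items page by page.

    `pages` is a list of lists of (name, phone) pairs (made-up data);
    `fetched` records which page indexes were actually "requested".
    """
    for index, page in enumerate(pages):
        fetched.append(index)  # stands in for the HTTP request
        for item in page:
            yield item


pages = [
    [('A', '111'), ('B', '')],
    [('C', '333'), ('D', '444')],
    [('E', '555')],
]
fetched = []

# Take only the first three results; the third page is never touched.
first_three = list(islice(paged_results(pages, fetched), 3))

print(first_three)
print(fetched)
```

With a function that builds and returns a full list, every page would be downloaded before the caller sees a single result.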





answered Jun 11 at 9:38 by Mathias Ettinger; edited Jun 11 at 9:44











• I did not know that the lxml parser can directly take the response.content and does not need response.text!
  – Graipher
  Jun 11 at 10:05

• @Graipher Reading from the docs, there might even be a possibility that the raw response object can be used instead: lxml.de/tutorial.html#the-parse-function
  – Mathias Ettinger
  Jun 11 at 17:01

• Thanks @Mathias Ettinger for showing a new way of doing things. My intention was to do the same using a class, as I'm new to OOP. If it were not for the class, I could have done the same using a single function. Provided +1. Thanks.
  – Topto
  Jun 12 at 17:24
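For readers who, like the asker, would rather keep a class, the while True loop can still be isolated in its own method. The following is only a sketch under stated assumptions: `PagedScraper`, `iter_pages`, and `FAKE_SITE` are hypothetical names, and the injected `fetch` callable stands in for the combination of `session.get` and BeautifulSoup parsing so that the example runs offline.

```python
class PagedScraper:
    """Class-based layout with the pagination loop in its own method."""

    def __init__(self, fetch, start_url):
        self.fetch = fetch  # hypothetical callable: url -> parsed page (a dict here)
        self.start_url = start_url

    def iter_pages(self):
        """Yield each page in turn, following 'next' links until none is left."""
        url = self.start_url
        while url:
            page = self.fetch(url)
            yield page
            url = page.get('next')

    def get_info(self):
        """Yield (name, phone) pairs from every page."""
        for page in self.iter_pages():
            yield from page['items']


# Made-up data standing in for two result pages.
FAKE_SITE = {
    '/search?start=0': {'items': [('Cafe A', '555-0100')], 'next': '/search?start=10'},
    '/search?start=10': {'items': [('Diner B', '')], 'next': None},
}

scraper = PagedScraper(FAKE_SITE.__getitem__, '/search?start=0')
results = list(scraper.get_info())
print(results)
```

Keeping the loop in `iter_pages` means `get_info` only deals with extracting items, and the fetching strategy can be swapped without touching either method.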















