Python yelp scraper

I've written a script in Python to parse the names and phone numbers of restaurants from yelp.com. The scraper does its job just fine. Its most important feature is that it handles pagination on the fly (if there is any), no matter how many pages it has to traverse. I tried to follow the guidelines of OOP. However, I suppose there is still room for improvement, such as isolating the while True loop by moving it into another function.
This is the script:
import requests
from urllib.parse import quote_plus
from bs4 import BeautifulSoup


class YelpScraper:

    link = 'https://www.yelp.com/search?find_desc={}&find_loc={}&start={}'

    def __init__(self, name, location, num="0"):
        self.name = quote_plus(name)
        self.location = quote_plus(location)
        self.num = quote_plus(num)
        self.base_url = self.link.format(self.name,self.location,self.num)
        self.session = requests.Session()

    def get_info(self):
        s = self.session
        s.headers = {'User-Agent': 'Mozilla/5.0'}

        while True:
            res = s.get(self.base_url)
            soup = BeautifulSoup(res.text, "lxml")
            for items in soup.select("div.biz-listing-large"):
                name = items.select_one(".biz-name span").get_text(strip=True)
                try:
                    phone = items.select_one("span.biz-phone").get_text(strip=True)
                except AttributeError: phone = ""

                print("Name: {}\nPhone: {}\n".format(name,phone))

            link = soup.select_one(".pagination-links .next")
            if not link:break
            self.base_url = "https://www.yelp.com" + link.get("href")


if __name__ == '__main__':
    scrape = YelpScraper("Restaurants","San Francisco, CA")
    scrape.get_info()
python object-oriented python-3.x web-scraping
asked Jun 10 at 8:21 by Topto
edited Jun 10 at 8:57 by Daniel
1 Answer
- You don't need to quote parameters yourself, requests can do it for you;
- You don't need a class for that, a simple function will suffice; I'd extract retrieving content from a URL as another function though;
- Separate logic from presentation: have your function return a list of name/phone pairs and make the calling code responsible for printing it. Better, turn the function into a generator and yield the pairs as you go;
- There is no need to decode the content before parsing it: the lxml parser works best with a sequence of bytes, as it can inspect the <head> to use the appropriate encoding.
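To illustrate the first point, here is a minimal sketch (not part of the original answer) that builds the search URL without sending anything over the network. The helper name build_search_url is hypothetical; requests.Request and its prepare() method are used here only to inspect the URL that requests would generate from a params dictionary.

```python
import requests


def build_search_url(name, location, num=0):
    """Return the URL requests would send for a Yelp search (no network call)."""
    request = requests.Request(
        'GET',
        'https://www.yelp.com/search',
        # Plain, unquoted values: requests percent-encodes them itself.
        params={'find_desc': name, 'find_loc': location, 'start': num},
    )
    return request.prepare().url


if __name__ == '__main__':
    print(build_search_url('Restaurants', 'San Francisco, CA'))
```

The space and the comma in the location come out encoded in the query string, which is the job quote_plus was doing by hand in the question.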
Proposed improvements:
import requests
from bs4 import BeautifulSoup


def url_fetcher(session, route, base_url='https://www.yelp.com', **kwargs):
    params = kwargs if kwargs else None
    return session.get(base_url + route, params=params)


def yelp_scraper(name, location, num=0):
    session = requests.Session()
    session.headers = {'User-Agent': 'Mozilla/5.0'}
    response = url_fetcher(session, '/search', find_desc=name, find_loc=location, start=num)

    while True:
        soup = BeautifulSoup(response.content, 'lxml')
        for items in soup.select('div.biz-listing-large'):
            name = items.select_one('.biz-name span').get_text(strip=True)
            try:
                phone = items.select_one('span.biz-phone').get_text(strip=True)
            except AttributeError:
                phone = ''
            yield name, phone

        link = soup.select_one('.pagination-links .next')
        if not link:
            break
        response = url_fetcher(session, link.get('href'))


if __name__ == '__main__':
    for name, phone in yelp_scraper('Restaurants', 'San Francisco, CA'):
        print('Name:', name)
        print('Phone:', phone)
        print()
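One consequence of the generator approach is that pages are fetched only as results are consumed, so the caller can stop early. The sketch below is an offline stand-in, not the answer's code: FAKE_PAGES, fetch_page and scrape are hypothetical names that replace the HTTP request and the HTML parsing with an in-memory dictionary, so the behaviour can be observed without contacting yelp.com.

```python
from itertools import islice

# Hypothetical stand-in for the site: route -> (results on that page, next route).
FAKE_PAGES = {
    '/search': ([('Alpha', '111'), ('Beta', '')], '/search?start=2'),
    '/search?start=2': ([('Gamma', '333'), ('Delta', '444')], '/search?start=4'),
    '/search?start=4': ([('Epsilon', '555')], None),
}

fetched = []  # Records which routes were actually requested.


def fetch_page(route):
    """Pretend to download and parse one page."""
    fetched.append(route)
    return FAKE_PAGES[route]


def scrape(route='/search'):
    """Yield (name, phone) pairs, following 'next' routes page by page."""
    while route is not None:
        results, route = fetch_page(route)
        yield from results


if __name__ == '__main__':
    # Take only the first three results; the third page is never fetched.
    for name, phone in islice(scrape(), 3):
        print('Name:', name)
        print('Phone:', phone)
        print()
    print('Pages fetched:', fetched)
```

With a function that returns a complete list instead, all three pages would have to be fetched before the caller saw the first pair.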
I did not know that the lxml parser can directly take the response.content and does not need response.text!
- Graipher, Jun 11 at 10:05
@Graipher reading from the docs, there might even be a possibility that the raw response object can be used instead: lxml.de/tutorial.html#the-parse-function
- Mathias Ettinger, Jun 11 at 17:01
Thanks @Mathias Ettinger for showing a new way of doing things. My intention was to do the same using a class, as I'm new to OOP. If it were not for the class, I could have done the same using a single function. Provided +1. Thanks.
- Topto, Jun 12 at 17:24
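For readers who, like the asker, still want a class, one possible layout isolates the while loop in its own generator method, as the question suggested. This is only a sketch under assumptions: PagedScraper, fetch_page and the dictionary-based FAKE_SITE are hypothetical and stand in for the requests/BeautifulSoup calls, which would be supplied as the fetch_page callable.

```python
class PagedScraper:
    """Hypothetical class-based layout with the pagination loop isolated."""

    def __init__(self, fetch_page, start_route='/search'):
        # fetch_page(route) must return a dict with 'results' and 'next' keys.
        self.fetch_page = fetch_page
        self.start_route = start_route

    def pages(self):
        """The only place that loops over pages."""
        route = self.start_route
        while route is not None:
            page = self.fetch_page(route)
            yield page
            route = page['next']

    def results(self):
        """Flatten every page into (name, phone) pairs."""
        for page in self.pages():
            yield from page['results']


# In-memory pages standing in for real HTTP responses (an assumption).
FAKE_SITE = {
    '/search': {'results': [('Alpha', '111')], 'next': '/search?start=1'},
    '/search?start=1': {'results': [('Beta', '')], 'next': None},
}

if __name__ == '__main__':
    scraper = PagedScraper(FAKE_SITE.__getitem__)
    for name, phone in scraper.results():
        print('Name: {}\nPhone: {}\n'.format(name, phone))
```

Passing the fetching step in as a callable keeps the class free of network code, so the pagination logic can be exercised without any HTTP traffic.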
answered Jun 11 at 9:38 by Mathias Ettinger (edited Jun 11 at 9:44)