Web-scraping through a rotating proxy script
I've created a script in Python which scrapes proxies (supposed to support "https") from a website. The script then uses those proxies at random to parse the titles of different coffee shops from another website. With every new request, the script is supposed to use a new proxy. I've tried my best to make it flawless, and the scraper is working fine at the moment.
I'd be happy to shake off any redundancy within my script (keeping it DRY) or to hear about any change that would make it better.
This is the complete approach:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice

links = ['https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'.format(page) for page in range(1, 6)]

def get_proxies():
    link = 'https://www.sslproxies.org/'
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies  # producing a list of proxies that support "https"

def check_proxy(session, proxy_list=get_proxies(), validated=False):
    proxy = choice(proxy_list)
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        print(session.get('https://httpbin.org/ip').json())
        validated = True  # try to make sure it is a working proxy
        return
    except Exception:
        pass
    while True:
        proxy = choice(proxy_list)
        session.proxies = {'https': 'https://{}'.format(proxy)}
        if not validated:  # otherwise get back to ensure it does fetch a working proxy
            print("-------go validate--------")
            return

def parse_content(url):
    ua = UserAgent()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    check_proxy(session)  # collect a working proxy to be used to fetch a valid response
    while True:
        try:
            response = session.get(url)
            break  # as soon as it fetches a valid response, break out of the while loop to continue with the rest
        except Exception as e:
            session.headers = {'User-Agent': ua.random}
            check_proxy(session)  # if an exception is raised, start over again
            parse_content(url)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    for link in links:
        parse_content(link)
python python-3.x web-scraping beautifulsoup proxy
except/pass is usually a bad idea. You'll want to at least know which exceptions to swallow and which ones to print.
– Reinderien, Jun 5 at 1:19
Why are you reassigning session.proxies?
– Reinderien, Jun 5 at 1:20
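To illustrate the point about except/pass (a sketch only, not code from the question or the answer; fetch_via_proxy is a hypothetical helper name), the retry logic could catch just the request-level errors a dead proxy is expected to raise and let everything else propagate:
import requests

def fetch_via_proxy(session, url):
    # Hypothetical helper: swallow only the errors a bad proxy typically causes;
    # programming errors and other unexpected failures still surface.
    try:
        return session.get(url, timeout=10)
    except (requests.exceptions.ProxyError,
            requests.exceptions.SSLError,
            requests.exceptions.ConnectTimeout,
            requests.exceptions.ReadTimeout) as error:
        print("Proxy request failed:", error)
        return None  # caller can rotate to the next proxy
Returning None instead of silently passing also lets the caller decide whether to rotate the proxy or re-raise.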
1 Answer
From your description you want your code to perform these tasks:
- Get a list of proxies
- That support https
- That are actually working
And you want that list to be randomized (your description suggests without repetition, though your code allows repetitions).
I would use a couple of generators for that:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import shuffle

def get_proxies(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    https_proxies = filter(lambda item: "yes" in item.text,
                           soup.select("table.table tr"))
    for item in https_proxies:
        yield "{}:{}".format(item.select_one("td").text,
                             item.select_one("td:nth-of-type(2)").text)

def get_random_proxies_iter():
    proxies = list(get_proxies('https://www.sslproxies.org/'))
    shuffle(proxies)
    return iter(proxies)  # iter so we can call next on it to get the next proxy

def get_proxy(session, proxies, validated=False):
    session.proxies = {'https': 'https://{}'.format(next(proxies))}
    if validated:
        while True:
            try:
                return session.get('https://httpbin.org/ip').json()
            except Exception:
                session.proxies = {'https': 'https://{}'.format(next(proxies))}

def get_response(url):
    session = requests.Session()
    ua = UserAgent()
    proxies = get_random_proxies_iter()
    while True:
        try:
            session.headers = {'User-Agent': ua.random}
            print(get_proxy(session, proxies, validated=True))  # collect a working proxy to be used to fetch a valid response
            return session.get(url)  # as soon as it fetches a valid response, it will break out of the while loop
        except StopIteration:
            raise  # No more proxies left to try
        except Exception:
            pass  # Other errors: try again

def parse_content(url):
    response = get_response(url)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
    links = [url.format(page) for page in range(1, 6)]
    for link in links:
        parse_content(link)
This actually makes sure that no proxy is reused for each site, and the order in which the proxies are tried is different for each site. If you are not fine with trying the same proxies again for a new site, just call get_random_proxies_iter outside of parse_content and feed it all the way down to get_proxy.
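A minimal sketch of that last variant, reusing get_random_proxies_iter and get_proxy from the code above (the changed signatures of get_response and parse_content below are assumptions for illustration, not part of the original answer): the proxy iterator is created once and passed down, so a proxy burned on one page is never retried on a later one.
def get_response(url, proxies):
    session = requests.Session()
    ua = UserAgent()
    while True:
        try:
            session.headers = {'User-Agent': ua.random}
            print(get_proxy(session, proxies, validated=True))
            return session.get(url)
        except StopIteration:
            raise  # the shared proxy pool is exhausted for good
        except Exception:
            pass  # any other error: try the next proxy

def parse_content(url, proxies):
    response = get_response(url, proxies)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
    proxies = get_random_proxies_iter()  # created once, shared across every page
    for page in range(1, 6):
        parse_content(url.format(page), proxies)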
This is so damn perfect. Thanks @Graipher for such a refined script.
– Topto, Jun 5 at 11:58