Using rotation of proxies within a Python script
I've written a script in Python that rotates proxies while fetching titles from a website across several pages. The scraper picks a proxy at random and, if a request fails, keeps cycling through the proxy list until it gets a valid response.
The proxies and the site address used in my scraper are just placeholders to show how I'm trying to do this.
As I do not have much experience with rotating proxies in a scraper, there may be flaws in the design; it does run without errors, though.
I would be very glad to get suggestions on how to make this existing script more robust.
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice

ua = UserAgent()
search_urls = ['https://stackoverflow.com/questions?page={}&sort=newest'.format(page) for page in range(1, 3)]

def get_proxies():
    proxies = ['128.199.254.244:3128', '95.85.79.54:53281', '128.199.125.54:2468', '178.45.8.113:53281', '206.189.225.30:3128']
    return proxies

def check_proxy(session, proxy):
    session.headers = {'User-Agent': ua.random}
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        response = session.get('https://httpbin.org/ip')
        item = response.json()
        print(item)
        return 0  # if the proxy is a working one, break out of the function
    except Exception:
        proxy = random_proxy()
        check_proxy(session, proxy)  # if the earlier one is not working, move on and try to fetch a working one

def random_proxy():
    return choice(get_proxies())

def scrape_page(url):
    proxy = random_proxy()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    session.proxies = {'https': 'https://{}'.format(proxy)}
    check_proxy(session, proxy)  # try validating the proxy before any further attempt
    try:
        response = session.get(url)
    except Exception:
        response = None  # preventing "UnboundLocalError"
        check_proxy(session, proxy)  # if the try block failed to get a response, rotate the proxy

    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".question-hyperlink"):
        print(items.text)

if __name__ == '__main__':
    for link in search_urls:
        scrape_page(link)
python python-3.x error-handling web-scraping proxy
edited May 22 at 23:47 by Jamal ♦
asked May 22 at 22:01 by MITHU
1 Answer
Every time you call scrape_page() with some URL, you end up making at least two requests: at least one to verify that the randomly chosen proxy works, and then one to make the main request. Isn't that overhead a bit excessive? Why not optimistically assume that a proxy works, and verify the proxy only if the main request fails?
If many requests fail (for example, if your network is down), then your program would get stuck in a tight, infinite retry loop. Even a 0.1-second delay in the exception handler would go a long way toward preventing the CPU from going haywire.
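The delay could even grow after consecutive failures. A rough, standalone sketch of that idea, using a hypothetical get_with_backoff() helper that is not part of the revised code below:

import time
import requests

def get_with_backoff(session, url, max_delay=10.0):
    """Keep retrying a GET, sleeping a bit longer after each failure (capped)."""
    delay = 0.1
    while True:
        try:
            return session.get(url, timeout=10)
        except requests.RequestException:
            time.sleep(delay)                  # give the network (and the CPU) a breather
            delay = min(delay * 2, max_delay)  # exponential backoff, capped at max_delay seconds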
The code in check_proxy() is a bit redundant with the code in scrape_page(). Also, check_proxy() is inappropriately recursive. I would create a set_proxy() function with a more comprehensive mission.
Instead of assuming that each proxy is HTTPS, I would write the URL of each proxy with the explicit protocol, then infer the protocol by parsing the URL.
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice
import requests
from time import sleep
from urllib.parse import urlparse

PROXIES = [
    'https://128.199.254.244:3128',
    'https://95.85.79.54:53281',
    'https://128.199.125.54:2468',
    'https://178.45.8.113:53281',
    'https://206.189.225.30:3128',
]

def set_proxy(session, proxy_candidates=PROXIES, verify=False):
    """
    Configure the session to use one of the proxy_candidates. If verify is
    True, then the proxy will have been verified to work.
    """
    while True:
        proxy = choice(proxy_candidates)
        session.proxies = {urlparse(proxy).scheme: proxy}
        if not verify:
            return
        try:
            print(session.get('https://httpbin.org/ip').json())
            return
        except Exception:
            pass

def scrape_page(url):
    ua = UserAgent()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    set_proxy(session)
    while True:
        try:
            response = session.get(url)
            break
        except Exception:
            session.headers = {'User-Agent': ua.random}
            set_proxy(session, verify=True)
            sleep(0.1)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".question-hyperlink"):
        print(items.text)
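For completeness, this revised scrape_page() can be driven the same way as the original script. A minimal sketch, reusing the placeholder search_urls from the question (the {} is the page-number slot in the URL template):

search_urls = ['https://stackoverflow.com/questions?page={}&sort=newest'.format(page)
               for page in range(1, 3)]

if __name__ == '__main__':
    for link in search_urls:
        scrape_page(link)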
answered May 22 at 23:45 by 200_success (accepted, +3) · edited May 24 at 16:32 by Daniel

It's a very nice review @200_success. Thanks a lot. I just wish to know which I should go with: session.headers = {'User-Agent', ua.random} or session.headers = {'User-Agent': ua.random}? Notice the , and the :. Gonna accept it in a while. – MITHU, May 24 at 15:40
@MITHU The comma would make it a set rather than a dictionary. Bug fixed in Rev 2. – 200_success, May 24 at 16:20
@200_success, one last thing to know for clarity: it seems your set_proxy() function is self-dependent, ain't it? I got confused by the docstring, as it starts with the line "Configure the session". Do I need to define anything over there? Thanks a zillion once again. – MITHU, May 24 at 21:20
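To make the comma-versus-colon point in those comments concrete, here is a small illustrative snippet (not from the original thread; the user-agent string is made up):

headers_set = {'User-Agent', 'my-agent/1.0'}   # comma: a set containing two strings
headers_dict = {'User-Agent': 'my-agent/1.0'}  # colon: a dict mapping the header name to its value

print(type(headers_set))   # <class 'set'>
print(type(headers_dict))  # <class 'dict'>
# requests expects a mapping for session.headers, so only the dict form behaves as intended.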