Using rotation of proxies within a Python script
I've written a script in Python that rotates proxies while fetching titles from a website across several pages. The scraper picks a proxy at random and, if a request fails, keeps cycling through the proxy list until it gets a valid response.
The proxies and the site address used in my scraper are just placeholders to show how I'm trying to do this.
As I do not have much experience with rotating proxies in a scraper, there may be flaws in the design; it does run without errors, though.
I would be very glad to get suggestions on how to make this existing script more robust.
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice

ua = UserAgent()
search_urls = ['https://stackoverflow.com/questions?page={}&sort=newest'.format(page) for page in range(1, 3)]

def get_proxies():
    proxies = ['128.199.254.244:3128', '95.85.79.54:53281', '128.199.125.54:2468', '178.45.8.113:53281', '206.189.225.30:3128']
    return proxies

def check_proxy(session, proxy):
    session.headers = {'User-Agent': ua.random}
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        response = session.get('https://httpbin.org/ip')
        item = response.json()
        print(item)
        return 0  # if the proxy is a working one, break out of the function
    except Exception:
        proxy = random_proxy()
        check_proxy(session, proxy)  # if the earlier one is not working, move on and try to fetch a working one

def random_proxy():
    return choice(get_proxies())

def scrape_page(url):
    proxy = random_proxy()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    session.proxies = {'https': 'https://{}'.format(proxy)}
    check_proxy(session, proxy)  # try validating the proxy before any further attempt
    try:
        response = session.get(url)
    except Exception:
        response = None  # preventing "UnboundLocalError"
        check_proxy(session, proxy)  # if the try block failed to get a response, rotate the proxy

    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".question-hyperlink"):
        print(items.text)

if __name__ == '__main__':
    for link in search_urls:
        scrape_page(link)
python python-3.x error-handling web-scraping proxy
edited May 22 at 23:47 by Jamal ♦
asked May 22 at 22:01 by MITHU
1 Answer
Every time you call scrape_page() with some URL, you end up making at least two requests: at least one to verify that the randomly chosen proxy works, and then one to make the main request. Isn't that overhead a bit excessive? Why not optimistically assume that a proxy works, and verify the proxy only if the main request fails?
If many requests fail (for example, if your network is down), then your program would get stuck in a tight, infinite retry loop. Even a 0.1-second delay in the exception handler would go a long way toward preventing the CPU from going haywire.
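The delay could even grow after consecutive failures. A rough, standalone sketch of that idea, using a hypothetical get_with_backoff() helper that is not part of the revised code below:

import time
import requests

def get_with_backoff(session, url, max_delay=10.0):
    """Keep retrying a GET, sleeping a bit longer after each failure (capped)."""
    delay = 0.1
    while True:
        try:
            return session.get(url, timeout=10)
        except requests.RequestException:
            time.sleep(delay)                  # give the network (and the CPU) a breather
            delay = min(delay * 2, max_delay)  # exponential backoff, capped at max_delay seconds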
The code in check_proxy() is a bit redundant with the code in scrape_page(). Also, check_proxy() is inappropriately recursive. I would create a set_proxy() function with a more comprehensive mission.
Instead of assuming that each proxy is HTTPS, I would write the URL of each proxy with the explicit protocol, then infer the protocol by parsing the URL.
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice
import requests
from time import sleep
from urllib.parse import urlparse

PROXIES = [
    'https://128.199.254.244:3128',
    'https://95.85.79.54:53281',
    'https://128.199.125.54:2468',
    'https://178.45.8.113:53281',
    'https://206.189.225.30:3128',
]

def set_proxy(session, proxy_candidates=PROXIES, verify=False):
    """
    Configure the session to use one of the proxy_candidates. If verify is
    True, then the proxy will have been verified to work.
    """
    while True:
        proxy = choice(proxy_candidates)
        session.proxies = {urlparse(proxy).scheme: proxy}
        if not verify:
            return
        try:
            print(session.get('https://httpbin.org/ip').json())
            return
        except Exception:
            pass

def scrape_page(url):
    ua = UserAgent()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    set_proxy(session)
    while True:
        try:
            response = session.get(url)
            break
        except Exception:
            session.headers = {'User-Agent': ua.random}
            set_proxy(session, verify=True)
            sleep(0.1)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".question-hyperlink"):
        print(items.text)
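For completeness, this revised scrape_page() can be driven the same way as the original script. A minimal sketch, reusing the placeholder search_urls from the question (the {} is the page-number slot in the URL template):

search_urls = ['https://stackoverflow.com/questions?page={}&sort=newest'.format(page)
               for page in range(1, 3)]

if __name__ == '__main__':
    for link in search_urls:
        scrape_page(link)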
answered May 22 at 23:45 by 200_success (accepted, +3) · edited May 24 at 16:32 by Daniel

It's a very nice review @200_success. Thanks a lot. I just wish to know which I should go with: session.headers = {'User-Agent', ua.random} or session.headers = {'User-Agent': ua.random}? Notice the , and the :. Gonna accept it in a while. – MITHU, May 24 at 15:40
@MITHU The comma would make it a set rather than a dictionary. Bug fixed in Rev 2. – 200_success, May 24 at 16:20
@200_success, one last thing to know for clarity: it seems your set_proxy() function is self-dependent, ain't it? I got confused by the docstring, as it starts with the line "Configure the session". Do I need to define anything over there? Thanks a zillion once again. – MITHU, May 24 at 21:20
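To make the comma-versus-colon point in those comments concrete, here is a small illustrative snippet (not from the original thread; the user-agent string is made up):

headers_set = {'User-Agent', 'my-agent/1.0'}   # comma: a set containing two strings
headers_dict = {'User-Agent': 'my-agent/1.0'}  # colon: a dict mapping the header name to its value

print(type(headers_set))   # <class 'set'>
print(type(headers_dict))  # <class 'dict'>
# requests expects a mapping for session.headers, so only the dict form behaves as intended.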