Web-scraping through a rotating proxy script

I've created a script in Python that scrapes proxies (ones that are supposed to support "https") from a website. The script then uses those proxies at random to parse the names of different coffee shops from another site. With every new request it is supposed to switch to a new proxy. I've tried my best to make it flawless, and the scraper is doing fine at the moment.



I'd be happy to shake off any redundancy in the script (i.e. keep it DRY) or to make any other change that improves it.



This is the complete approach:



import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice

links = ['https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'.format(page) for page in range(1, 6)]

def get_proxies():
    link = 'https://www.sslproxies.org/'
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies  # produces a list of proxies that support "https"

def check_proxy(session, proxy_list=get_proxies(), validated=False):
    proxy = choice(proxy_list)
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        print(session.get('https://httpbin.org/ip').json())
        validated = True  # try to make sure it is a working proxy
        return
    except Exception: pass

    while True:
        proxy = choice(proxy_list)
        session.proxies = {'https': 'https://{}'.format(proxy)}
        if not validated:  # otherwise go back and make sure it does fetch a working proxy
            print("-------go validate--------")
            return

def parse_content(url):
    ua = UserAgent()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    check_proxy(session)  # collect a working proxy to be used to fetch a valid response

    while True:
        try:
            response = session.get(url)
            break  # as soon as it fetches a valid response, break out of the while loop and continue with the rest
        except Exception as e:
            session.headers = {'User-Agent': ua.random}
            check_proxy(session)  # if an exception is raised, start over again
            parse_content(url)

    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    for link in links:
        parse_content(link)






asked Jun 4 at 21:11 by Topto
  • except/pass is usually a bad idea. You'll want to at least know which exceptions to swallow and which ones to print.
    – Reinderien
    Jun 5 at 1:19






  • Why are you reassigning session.proxies?
    – Reinderien
    Jun 5 at 1:20
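
To illustrate what these two comments are getting at, here is a minimal sketch (not part of the original script; the helper name and timeout value are illustrative only) that swallows only the network-related exceptions raised by requests and passes the proxy mapping per request instead of reassigning session.proxies on every attempt:

    import requests

    def fetch_via_proxy(session, url, proxy):
        """Try one request through the given proxy; swallow only network errors."""
        proxies = {'https': 'https://{}'.format(proxy)}  # per-request proxy mapping
        try:
            # requests signals network problems (ProxyError, ConnectTimeout, ...)
            # via subclasses of RequestException; anything else still propagates.
            return session.get(url, proxies=proxies, timeout=10)
        except requests.exceptions.RequestException as exc:
            print("proxy {} failed: {}".format(proxy, exc))
            return None

Catching RequestException instead of a bare Exception means genuine bugs still surface, and passing proxies= to get() keeps the session itself free of proxy state.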
















1 Answer
From your description you want your code to perform these tasks:



  1. Get a list of proxies

  2. That support https

  3. That are actually working

And you want that list to be randomized (from your description without repetition, from your code repetitions are fine).
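
A quick illustration of that difference, with made-up proxy addresses: random.choice can hand back the same proxy twice, while shuffling once and then iterating guarantees each proxy is used at most once.

    from random import choice, shuffle

    proxies = ['1.2.3.4:80', '5.6.7.8:3128', '9.9.9.9:8080']  # dummy values

    # choice: repetitions are possible across calls
    picks = [choice(proxies) for _ in range(3)]  # e.g. two of these may be identical

    # shuffle + iter: random order, but each proxy appears exactly once
    shuffle(proxies)
    rotation = iter(proxies)
    next(rotation)  # take the next proxy; raises StopIteration once the list is exhausted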



I would use a couple of generators for that:



import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import shuffle


def get_proxies(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    https_proxies = filter(lambda item: "yes" in item.text,
                           soup.select("table.table tr"))
    for item in https_proxies:
        yield "{}:{}".format(item.select_one("td").text,
                             item.select_one("td:nth-of-type(2)").text)


def get_random_proxies_iter():
    proxies = list(get_proxies('https://www.sslproxies.org/'))
    shuffle(proxies)
    return iter(proxies)  # iter so we can call next on it to get the next proxy


def get_proxy(session, proxies, validated=False):
    session.proxies = {'https': 'https://{}'.format(next(proxies))}
    if validated:
        while True:
            try:
                return session.get('https://httpbin.org/ip').json()
            except Exception:
                session.proxies = {'https': 'https://{}'.format(next(proxies))}


def get_response(url):
    session = requests.Session()
    ua = UserAgent()
    proxies = get_random_proxies_iter()
    while True:
        try:
            session.headers = {'User-Agent': ua.random}
            print(get_proxy(session, proxies, validated=True))  # collect a working proxy to be used to fetch a valid response
            return session.get(url)  # as soon as it fetches a valid response, it will break out of the while loop
        except StopIteration:
            raise  # No more proxies left to try
        except Exception:
            pass  # Other errors: try again


def parse_content(url):
    response = get_response(url)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)


if __name__ == '__main__':
    url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
    links = [url.format(page) for page in range(1, 6)]
    for link in links:
        parse_content(link)


This actually makes sure that no proxy is reused within a single page, and the order in which the proxies are tried is different for each page. If you don't want the same proxies to be tried again for a new page, just call get_random_proxies_iter outside of parse_content and pass it all the way down to get_proxy.
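
That variant could look roughly like this (a sketch only, reusing the function names from the code above; get_response would then take the iterator as a parameter instead of building its own):

    def get_response(url, proxies):
        session = requests.Session()
        ua = UserAgent()
        while True:
            try:
                session.headers = {'User-Agent': ua.random}
                print(get_proxy(session, proxies, validated=True))  # keep consuming the shared iterator
                return session.get(url)
            except StopIteration:
                raise  # the shared proxy pool is exhausted for the whole run
            except Exception:
                pass  # other errors: try the next proxy


    def parse_content(url, proxies):
        response = get_response(url, proxies)
        soup = BeautifulSoup(response.text, 'lxml')
        for items in soup.select(".info span[itemprop='name']"):
            print(items.text)


    if __name__ == '__main__':
        url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
        proxies = get_random_proxies_iter()  # built once, shared by every page
        for link in (url.format(page) for page in range(1, 6)):
            parse_content(link, proxies)

The trade-off is that once the pool runs dry, StopIteration ends the whole run rather than just the current page.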






answered Jun 5 at 9:32 by Graipher, edited Jun 5 at 12:01 (accepted)
  • This is so damn perfect. Thanks @Graipher for such a refined script.
    – Topto
    Jun 5 at 11:58









