Recursively scrape links from web pages and check them
I'm new to programming and especially new to object-oriented programming. I have built a web scraper using functional programming and am trying to build another using OOP principles.
The overall idea for this scraper is: I give it a URL, it retrieves all the URLs on that page and stores them in a set; as it checks each URL, it removes it from the set. As for the "check", I want to see if a page contains a set of identifiers. If it does, I'll download it. If it doesn't, I'll pass.
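To make the "check" concrete, it will look something like this rough sketch; the identifiers dict here is a placeholder I've made up, and the real set isn't written yet:

# Rough sketch of the planned check -- the identifiers dict is a
# made-up placeholder, not the real set.
import io
import requests
from bs4 import BeautifulSoup

identifiers = {"table": {"class": "infobox"}}  # placeholder identifiers

html = requests.get("https://en.wikipedia.org/wiki/Web_scraping").text
page_soup = BeautifulSoup(html, "html.parser")

# Download (save) the page only if every identifier is found on it.
if all(page_soup.find(tag, attrs=attrs) for tag, attrs in identifiers.items()):
    with io.open("page.html", "w", encoding="utf-8") as f:
        f.write(html)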
I've got 3 objects:
- domainObject - should hold identifiers and init_link against the domain
- linkManager - holds two pools of links - "links to check" and "links to skip". Passes a link from the first pool to pageObject, then moves the link to the second pool.
- pageObject - takes a url, looks for identifiers, downloads if it meets the criteria
I am mostly confused by how 2 and 3 interact, and would like feedback on a better way to return a link that pageObject has checked to linkManager's "links to skip" pool.
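To show the hand-off I'm unsure about, here is a stripped-down sketch of the interaction I have in mind; the class and method names (LinkPools, next_link, mark_checked) are hypothetical and much simpler than the real classes below:

# Stripped-down sketch of the two-pool hand-off. Names here are
# hypothetical; the real classes follow below.
class LinkPools(object):
    def __init__(self, seed):
        self.to_check = set([seed])  # pool 1: links to check
        self.to_skip = set()         # pool 2: links already checked

    def next_link(self):
        # hand out a link from pool 1, or None when the pool is empty
        return next(iter(self.to_check)) if self.to_check else None

    def mark_checked(self, url):
        # move a checked link from pool 1 to pool 2
        self.to_check.discard(url)
        self.to_skip.add(url)

pools = LinkPools("https://en.wikipedia.org/wiki/Web_scraping")
while True:
    url = pools.next_link()
    if url is None:
        break
    # a pageObject would check `url` and download it here
    pools.mark_checked(url)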
# -*- coding: utf-8 -*-
import re
import time
import random
import sqlite3
import requests
from bs4 import BeautifulSoup as bs


def soup(url):
    try:
        return bs((requests.get(url)).text, "html.parser")
    except:
        alt_url = "https://" + url
        return bs((requests.get(alt_url)).text, "html.parser")


class domainObject:
    """Stores "domain level" info - init_link, "page of interest identifiers", domain name, object name"""
    def __init__(self, include_exclude):
        self.name = "en.wikipedia.org"
        self.include_exclude = include_exclude[0].append(str(self.name + ".*"))


class linkManager:
    """Manages a pool of visited and non-visited links as the scraper runs"""
    def __init__(self, domObj):
        self.inclusions = include_exclude[0]
        self.inc_links = set()
        self.exc_links = set()

    def link_pool_manage(self, url):
        print url
        if url not in self.exc_links:  # add url to inc_links if not in self.exc_links
            self.inc_links.add(url)

        ### collect the hrefs in the given url and add them to the pool ###
        for link in soup(url).find_all('a', href=True):  # find all hrefs in url
            for inclusion in self.inclusions:  # those that match the "include list" (currently just the domain)
                for link in re.findall(inclusion, link['href']):
                    if link not in self.exc_links:  # if not in the pool of self.exc_links (links to be excluded)
                        self.inc_links.add(link)  # add link to the inc_links pool

        ### pass each link to pageObject ###
        for next_link in self.inc_links:  # select a link from inc_links
            self.exc_links.add(next_link)  # add it to the self.exc_links pool
        self.inc_links = self.inc_links.difference(self.exc_links)  # set inc_links as the difference between pools
        time.sleep(random.uniform(0.1, 1.0))  # rest for a bit
        for next_link in self.inc_links:
            #print next_link
            return next_link


class pageObject:
    """pageObject generated for every link provided. Checks its own link and determines if it's a product page. If it is, it's downloaded."""
    def __init__(self, url):
        self.url = url

    def check_page_type(self, ManObj):
        # if this page matches a dict of identifiers, i.e. "if soup.find(x, y:z)", then download it.
        # Else, pass.
        # Then return the checked link back to linkManager.
        ManObj.link_pool_manage(self.url)


if __name__ == '__main__':
    init_link = "https://en.wikipedia.org/wiki/Web_scraping"
    include_exclude = [[], []]  # will later have a list of items to include and exclude
    domObj = domainObject(include_exclude)
    linkManObj = linkManager(domObj)
    pageObject(init_link).check_page_type(linkManObj)
My particular concern is the last bit: whether this is a sensible way to implement the approach, and/or whether it can be improved:
domObj = domainObject(include_exclude)
linkManObj = linkManager(domObj)
pageObject(init_link).check_page_type(linkManObj)
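One alternative shape I can picture, again sketched with hypothetical names, is letting linkManager drive the loop and create the pageObjects itself, so nothing ever has to be fed back:

# Hypothetical alternative wiring -- the manager drives the loop and
# moves each link to the "skip" pool itself, so the page object never
# has to hand its link back. Names are illustrative only.
class AltLinkManager(object):
    def __init__(self, seed):
        self.to_check = set([seed])
        self.to_skip = set()

    def run(self):
        while self.to_check:
            url = self.to_check.pop()  # take a link from pool 1...
            self.to_skip.add(url)      # ...and move it to pool 2 up front
            AltPage(url).check()       # the page object only checks/downloads

class AltPage(object):
    def __init__(self, url):
        self.url = url

    def check(self):
        pass  # look for identifiers and download if they match

AltLinkManager("https://en.wikipedia.org/wiki/Web_scraping").run()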
Advice and constructive criticism would be appreciated.
python object-oriented python-2.7 web-scraping beautifulsoup
asked May 2 at 11:01 by ron g, edited May 2 at 12:13
Welcome to Code Review! You seem to have omitted the code for actually checking the links. This is fine (if you don't want that code to be reviewed) but it would help us if you could edit the post to explain exactly what you've omitted.
– Gareth Rees, May 2 at 11:59
@GarethRees done. Basically all it does is check if soup.find() returns something for a set of identifiers I will pass to the check_page_type method. If it does, it'll save the HTML page to a directory. If it doesn't match, it'll pass it.
– ron g, May 2 at 12:16