Recursively scrape links from web pages and check them

I'm new to programming and especially new to object-oriented programming. I have built a web scraper using functional programming, and am trying to build another using OOP principles.



The overall idea for this scraper is that I give it a URL; it retrieves all the URLs on that page, stores them in a set, and removes each URL from the set as it is checked. As for the "check": I want to see whether a page contains a set of identifiers. If it does, I'll download the page; if it doesn't, I'll skip it.
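
To make that concrete, here is a minimal standalone sketch of the loop I have in mind (the identifier check is just a placeholder, not my actual criteria, and this is not the code below):

# Minimal sketch of the intended crawl loop. The identifier check
# (page.find("table")) is a placeholder; the real criteria will differ.
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    to_check = set([start_url])  # frontier: links still to check
    checked = set()              # links already checked
    while to_check and len(checked) < max_pages:
        url = to_check.pop()
        checked.add(url)         # remove from the frontier as soon as it is checked
        page = BeautifulSoup(requests.get(url).text, "html.parser")
        if page.find("table") is not None:  # placeholder for the identifier check
            pass                            # a real version would download/save the page here
        for a in page.find_all("a", href=True):
            link = a["href"]
            if link.startswith("https://") and link not in checked:
                to_check.add(link)          # queue links we have not seen yet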



I've got 3 objects:



  1. domainObject - holds the identifiers and init_link for a domain.

  2. linkManager - holds two pools of links: "links to check" and "links to skip". Passes a link from the first pool to pageObject, then moves the link to the second pool.

  3. pageObject - takes a URL, looks for the identifiers, and downloads the page if it meets the criteria.

I am mostly confused by how 2 and 3 interact, and would like some feedback on how to improve the way a link checked by pageObject is fed back into linkManager's "links to skip" pool.



# -*- coding: utf-8 -*-
import re
import time
import random
import sqlite3
import requests
from bs4 import BeautifulSoup as bs


def soup(url):
    try:
        return bs(requests.get(url).text, "html.parser")
    except requests.exceptions.RequestException:
        alt_url = "https://" + url  # retry with an explicit scheme
        return bs(requests.get(alt_url).text, "html.parser")


class domainObject:
    """Stores "domain level" info - init_link, "page of interest" identifiers, domain name."""

    def __init__(self, include_exclude):
        self.name = "en.wikipedia.org"
        # list.append() returns None, so mutate the list first, then store it
        include_exclude[0].append(self.name + ".*")
        self.include_exclude = include_exclude


class linkManager:
    """Manages the pools of checked and not-yet-checked links."""

    def __init__(self, domObj):
        self.inclusions = domObj.include_exclude[0]  # take the include patterns from the domain object
        self.inc_links = set()  # links still to check
        self.exc_links = set()  # links to skip (already checked)

    def link_pool_manage(self, url):

        print url

        if url not in self.exc_links:  # add url to inc_links if it has not been checked yet
            self.inc_links.add(url)

        ### collect the hrefs in the given url and add them to the pool ###

        for link in soup(url).find_all('a', href=True):  # find all hrefs in url
            for inclusion in self.inclusions:  # keep those that match the "include list" (currently just the domain)
                for match in re.findall(inclusion, link['href']):
                    if match not in self.exc_links:  # skip links already in the exclusion pool
                        self.inc_links.add(match)  # add the link to the inc_links pool

        ### pass each link to pageObject ###

        for next_link in self.inc_links:  # select a link from inc_links
            self.exc_links.add(next_link)  # add it to the exc_links pool
        self.inc_links = self.inc_links.difference(self.exc_links)  # keep only links not yet checked

        time.sleep(random.uniform(0.1, 1.0))  # rest for a bit
        for next_link in self.inc_links:
            #print next_link
            return next_link


class pageObject:
    """Created for every link provided. Checks its own link and determines whether it is a product page; if it is, the page is downloaded."""

    def __init__(self, url):
        self.url = url

    def check_page_type(self, ManObj):
        # if this page matches a dict of identifiers, i.e. "if soup.find(x, y:z)", then download it.
        # Else, pass.

        # then return the checked link back to linkManager
        ManObj.link_pool_manage(self.url)


if __name__ == '__main__':

    init_link = "https://en.wikipedia.org/wiki/Web_scraping"
    include_exclude = [[], []]  # will later hold the lists of items to include and exclude

    domObj = domainObject(include_exclude)
    linkManObj = linkManager(domObj)
    pageObject(init_link).check_page_type(linkManObj)


My particular concern is the last bit: whether this is the correct way to implement the approach, and whether it can be improved:



domObj = domainObject(include_exclude)
linkManObj = linkManager(domObj)
pageObject(init_link).check_page_type(linkManObj)
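
For comparison, here is a rough sketch of an alternative wiring I've been wondering about (the class and method names are hypothetical, not part of my code above): the manager owns both pools and drives the loop, so the page object never has to call back into it.

# Rough sketch of an alternative wiring (hypothetical names): the manager
# owns both pools and drives the loop; the page object only reports results.
import requests
from bs4 import BeautifulSoup

class Page(object):
    """Fetches its own URL, checks it, and lists the links it contains."""

    def __init__(self, url):
        self.url = url
        self.soup = BeautifulSoup(requests.get(url).text, "html.parser")

    def matches(self):
        # Placeholder identifier check; a real version would save the page on a match.
        return self.soup.find("table") is not None

    def links(self):
        return [a["href"] for a in self.soup.find_all("a", href=True)
                if a["href"].startswith("https://")]

class Manager(object):
    def __init__(self, start_url):
        self.to_check = set([start_url])  # "links to check" pool
        self.to_skip = set()              # "links to skip" pool

    def run(self, max_pages=5):
        while self.to_check and len(self.to_skip) < max_pages:
            url = self.to_check.pop()
            self.to_skip.add(url)      # a checked link moves straight to the skip pool
            page = Page(url)           # the page object never touches the pools itself
            if page.matches():
                pass                   # the download would happen here
            for link in page.links():  # only the manager decides what enters the frontier
                if link not in self.to_skip:
                    self.to_check.add(link)

This keeps the "feeding back" in a single place (run()), but I'm not sure whether that division of responsibilities is better than my version above.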


Advice and constructive criticism would be appreciated.
























asked May 2 at 11:01 by ron g, edited May 2 at 12:13

  • Welcome to Code Review! You seem to have omitted the code for actually checking the links. This is fine (if you don't want that code to be reviewed) but it would help us if you could edit the post to explain exactly what you've omitted. – Gareth Rees, May 2 at 11:59










  • @GarethRees done. Basically all it does is check whether soup.find() returns something for a set of identifiers I will pass to the check_page_type method. If it does, it'll save the HTML page to a directory. If it doesn't match, it'll pass. – ron g, May 2 at 12:16















