Recursively scrape links from web pages and check them

I'm new to programming and especially new to object-oriented programming. I have built a web scraper using functional programming, and am trying to build another using OOP principles.



The overall idea for this scraper is that I give it a URL; it retrieves all the URLs on that page, stores them in a set, and removes each URL from the set as it is checked. As for the "check": I want to see whether a page contains a set of identifiers. If it does, I'll download the page; if it doesn't, I'll skip it.
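
To make that concrete, here is a minimal standalone sketch of the loop I have in mind (the identifier check is just a placeholder, not my actual criteria, and this is not the code below):

# Minimal sketch of the intended crawl loop. The identifier check
# (page.find("table")) is a placeholder; the real criteria will differ.
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    to_check = set([start_url])  # frontier: links still to check
    checked = set()              # links already checked
    while to_check and len(checked) < max_pages:
        url = to_check.pop()
        checked.add(url)         # remove from the frontier as soon as it is checked
        page = BeautifulSoup(requests.get(url).text, "html.parser")
        if page.find("table") is not None:  # placeholder for the identifier check
            pass                            # a real version would download/save the page here
        for a in page.find_all("a", href=True):
            link = a["href"]
            if link.startswith("https://") and link not in checked:
                to_check.add(link)          # queue links we have not seen yet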



I've got 3 objects:



  1. domainObject - holds the identifiers and init_link for a domain.

  2. linkManager - holds two pools of links: "links to check" and "links to skip". Passes a link from the first pool to pageObject, then moves the link to the second pool.

  3. pageObject - takes a URL, looks for the identifiers, and downloads the page if it meets the criteria.

I am mostly confused by how 2 and 3 interact, and would like some feedback on how to improve the way a link checked by pageObject is fed back into linkManager's "links to skip" pool.



# -*- coding: utf-8 -*-
import re
import time
import random
import sqlite3
import requests
from bs4 import BeautifulSoup as bs


def soup(url):
    try:
        return bs(requests.get(url).text, "html.parser")
    except requests.exceptions.RequestException:
        alt_url = "https://" + url  # retry with an explicit scheme
        return bs(requests.get(alt_url).text, "html.parser")


class domainObject:
    """Stores "domain level" info - init_link, "page of interest" identifiers, domain name."""

    def __init__(self, include_exclude):
        self.name = "en.wikipedia.org"
        # list.append() returns None, so mutate the list first, then store it
        include_exclude[0].append(self.name + ".*")
        self.include_exclude = include_exclude


class linkManager:
    """Manages the pools of checked and not-yet-checked links."""

    def __init__(self, domObj):
        self.inclusions = domObj.include_exclude[0]  # take the include patterns from the domain object
        self.inc_links = set()  # links still to check
        self.exc_links = set()  # links to skip (already checked)

    def link_pool_manage(self, url):

        print url

        if url not in self.exc_links:  # add url to inc_links if it has not been checked yet
            self.inc_links.add(url)

        ### collect the hrefs in the given url and add them to the pool ###

        for link in soup(url).find_all('a', href=True):  # find all hrefs in url
            for inclusion in self.inclusions:  # keep those that match the "include list" (currently just the domain)
                for match in re.findall(inclusion, link['href']):
                    if match not in self.exc_links:  # skip links already in the exclusion pool
                        self.inc_links.add(match)  # add the link to the inc_links pool

        ### pass each link to pageObject ###

        for next_link in self.inc_links:  # select a link from inc_links
            self.exc_links.add(next_link)  # add it to the exc_links pool
        self.inc_links = self.inc_links.difference(self.exc_links)  # keep only links not yet checked

        time.sleep(random.uniform(0.1, 1.0))  # rest for a bit
        for next_link in self.inc_links:
            #print next_link
            return next_link


class pageObject:
    """Created for every link provided. Checks its own link and determines whether it is a product page; if it is, the page is downloaded."""

    def __init__(self, url):
        self.url = url

    def check_page_type(self, ManObj):
        # if this page matches a dict of identifiers, i.e. "if soup.find(x, y:z)", then download it.
        # Else, pass.

        # then return the checked link back to linkManager
        ManObj.link_pool_manage(self.url)


if __name__ == '__main__':

    init_link = "https://en.wikipedia.org/wiki/Web_scraping"
    include_exclude = [[], []]  # will later hold the lists of items to include and exclude

    domObj = domainObject(include_exclude)
    linkManObj = linkManager(domObj)
    pageObject(init_link).check_page_type(linkManObj)


My particular concern is the last bit: whether this is the correct way to implement the approach, and whether it can be improved:



domObj = domainObject(include_exclude)
linkManObj = linkManager(domObj)
pageObject(init_link).check_page_type(linkManObj)
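
For comparison, here is a rough sketch of an alternative wiring I've been wondering about (the class and method names are hypothetical, not part of my code above): the manager owns both pools and drives the loop, so the page object never has to call back into it.

# Rough sketch of an alternative wiring (hypothetical names): the manager
# owns both pools and drives the loop; the page object only reports results.
import requests
from bs4 import BeautifulSoup

class Page(object):
    """Fetches its own URL, checks it, and lists the links it contains."""

    def __init__(self, url):
        self.url = url
        self.soup = BeautifulSoup(requests.get(url).text, "html.parser")

    def matches(self):
        # Placeholder identifier check; a real version would save the page on a match.
        return self.soup.find("table") is not None

    def links(self):
        return [a["href"] for a in self.soup.find_all("a", href=True)
                if a["href"].startswith("https://")]

class Manager(object):
    def __init__(self, start_url):
        self.to_check = set([start_url])  # "links to check" pool
        self.to_skip = set()              # "links to skip" pool

    def run(self, max_pages=5):
        while self.to_check and len(self.to_skip) < max_pages:
            url = self.to_check.pop()
            self.to_skip.add(url)      # a checked link moves straight to the skip pool
            page = Page(url)           # the page object never touches the pools itself
            if page.matches():
                pass                   # the download would happen here
            for link in page.links():  # only the manager decides what enters the frontier
                if link not in self.to_skip:
                    self.to_check.add(link)

This keeps the "feeding back" in a single place (run()), but I'm not sure whether that division of responsibilities is better than my version above.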


Advice and constructive criticism would be appreciated.
























asked May 2 at 11:01 by ron g, edited May 2 at 12:13

  • Welcome to Code Review! You seem to have omitted the code for actually checking the links. This is fine (if you don't want that code to be reviewed) but it would help us if you could edit the post to explain exactly what you've omitted. – Gareth Rees, May 2 at 11:59










  • @GarethRees done. Basically all it does is check whether soup.find() returns something for a set of identifiers I will pass to the check_page_type method. If it does, it'll save the HTML page to a directory. If it doesn't match, it'll pass. – ron g, May 2 at 12:16















