Parsing different categories using scrapy from a webpage


I've written a Python scrapy script to parse the "model", "country" and "year" of various bikes from a webpage. Several layers of subcategories have to be traversed to reach the target pages with the required info. The scraper starts from the main page, follows each link within class art-indexhmenu, then (one layer deeper) the links within class niveau2, then the links within class niveau3, and finally, via the links within class art-indexbutton-wrapper, reaches the target pages, where it scrapes the "model", "country" and "year" of each product. The scraper does its job without errors. However, the way I've written it looks very repetitive. As there is always room for improvement, I suppose there should be some way to make it more robust by getting rid of the repetition. Thanks in advance.



This is the spider (website included):



import scrapy
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class BikePartsSpider(scrapy.Spider):
    name = 'honda'

    def start_requests(self):
        yield Request(url="https://www.bike-parts-honda.com/", callback=self.parse_links)

    def parse_links(self, response):
        # Going one layer deep from the landing page
        for link in response.css('.art-indexhmenu a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_inner_links)

    def parse_inner_links(self, response):
        # Digging into the next layer
        for link in response.css('.niveau2 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_cat_links)

    def parse_cat_links(self, response):
        # One more layer down
        for link in response.css('.niveau3 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_target_links)

    def parse_target_links(self, response):
        # Following the links that lead to the target pages
        for link in response.css('.art-indexbutton-wrapper .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_docs)

    def parse_docs(self, response):
        # This is where the scraper extracts the info
        yield {"categories": response.css('.titre_12_red::text').extract()}

# Using CrawlerProcess to be able to run the spider from the IDE
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(BikePartsSpider)
c.start()


  • "My scraper is doing it's job errorlessly." In how much time, usually? And what's the expected output? – Mast, Apr 11 at 15:20
  • Do you pipe the output into something useful with a secondary program by chance? – Mast, Apr 11 at 16:00
  • In case of output, the categories defined within my scraper is enough @Mast. I'm not worried about output. The thing is I wish to know any better way other than what i did above cause it looks so repetitive. – SIM, Apr 11 at 17:21
  • But you don't appear to do anything with those categories. Is that correct? – Mast, Apr 11 at 17:25
  • Yep, right you are. I can see the results in the IDE and I checked whether the output I'm having is accurate. Btw, why the output is so important here as I've stated in the first place that i would like to go for any better design. Thanks. – SIM, Apr 11 at 17:29
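On the output question raised in the comments: the yielded items currently only appear in the log. Scrapy's built-in feed exports can persist them by extending the settings dict already passed to CrawlerProcess; a minimal sketch (the file name honda_categories.jl is my own choice, not from the original code):

```python
# Settings dict to pass to CrawlerProcess(...) in place of the current one.
settings = {
    'USER_AGENT': 'Mozilla/5.0',
    # Built-in feed export: write every yielded item as one JSON object per line.
    'FEED_URI': 'honda_categories.jl',   # hypothetical output file name
    'FEED_FORMAT': 'jsonlines',
}
```

The rest of the script (`c.crawl(...)`, `c.start()`) stays unchanged; each `{"categories": [...]}` item then lands in the file as a JSON-lines record.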
asked Apr 3 at 12:08 by SIM
edited Apr 6 at 8:04 by Peilonrayz