Parsing different categories using scrapy from a webpage


I've written a Python scrapy script to parse the "model", "country" and "year" of various bikes from a webpage. Several layers of subcategories have to be traversed to reach the target pages with the required info. The scraper starts from the main page, follows each link within class art-indexhmenu, then (one layer deeper) the links within class niveau2, then the links within class niveau3, and finally, via the links within class art-indexbutton-wrapper, reaches the target pages, where it scrapes the "model", "country" and "year" of each product. The scraper does its job without errors. However, the way I've written it looks very repetitive. As there is always room for improvement, I suppose there should be some way to make it more robust by getting rid of the repetition. Thanks in advance.



This is the spider (website included):



import scrapy
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class BikePartsSpider(scrapy.Spider):
    name = 'honda'

    def start_requests(self):
        yield Request(url="https://www.bike-parts-honda.com/", callback=self.parse_links)

    def parse_links(self, response):
        # Going one layer deep from the landing page
        for link in response.css('.art-indexhmenu a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_inner_links)

    def parse_inner_links(self, response):
        # Digging into the next layer
        for link in response.css('.niveau2 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_cat_links)

    def parse_cat_links(self, response):
        # One more layer down
        for link in response.css('.niveau3 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_target_links)

    def parse_target_links(self, response):
        # Following the links that lead to the target pages
        for link in response.css('.art-indexbutton-wrapper .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_docs)

    def parse_docs(self, response):
        # This is where the scraper extracts the info
        yield {"categories": response.css('.titre_12_red::text').extract()}

# Using CrawlerProcess to be able to run the spider from the IDE
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(BikePartsSpider)
c.start()


  • "My scraper is doing it's job errorlessly." In how much time, usually? And what's the expected output? – Mast, Apr 11 at 15:20
  • Do you pipe the output into something useful with a secondary program by chance? – Mast, Apr 11 at 16:00
  • In case of output, the categories defined within my scraper is enough @Mast. I'm not worried about output. The thing is I wish to know any better way other than what i did above cause it looks so repetitive. – SIM, Apr 11 at 17:21
  • But you don't appear to do anything with those categories. Is that correct? – Mast, Apr 11 at 17:25
  • Yep, right you are. I can see the results in the IDE and I checked whether the output I'm having is accurate. Btw, why the output is so important here as I've stated in the first place that i would like to go for any better design. Thanks. – SIM, Apr 11 at 17:29
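On the output question raised in the comments: the yielded items currently only appear in the log. Scrapy's built-in feed exports can persist them by extending the settings dict already passed to CrawlerProcess; a minimal sketch (the file name honda_categories.jl is my own choice, not from the original code):

```python
# Settings dict to pass to CrawlerProcess(...) in place of the current one.
settings = {
    'USER_AGENT': 'Mozilla/5.0',
    # Built-in feed export: write every yielded item as one JSON object per line.
    'FEED_URI': 'honda_categories.jl',   # hypothetical output file name
    'FEED_FORMAT': 'jsonlines',
}
```

The rest of the script (`c.crawl(...)`, `c.start()`) stays unchanged; each `{"categories": [...]}` item then lands in the file as a JSON-lines record.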
asked Apr 3 at 12:08 by SIM
edited Apr 6 at 8:04 by Peilonrayz