Parsing different categories using scrapy from a webpage
I've written a script in Python with Scrapy to parse the "model", "country" and "year" of various bikes from a webpage. There are several subcategories to traverse before reaching the target page that holds the required info. The scraper below:

- starts from the main page and follows each link within class art-indexhmenu,
- then, one layer deeper, follows the links within class niveau2,
- then follows the links within class niveau3,
- then follows the links within class art-indexbutton-wrapper,

which takes it to the target page, where it scrapes the "model", "country" and "year" of each product. The scraper does its job without errors. However, the way I've written it looks very repetitive. As there is always room for improvement, I suppose there must be some way to make it more robust by getting rid of the repetition. Thanks in advance.
This is the spider (website included):
import scrapy
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class BikePartsSpider(scrapy.Spider):
    name = 'honda'

    def start_requests(self):
        yield Request(url="https://www.bike-parts-honda.com/", callback=self.parse_links)

    def parse_links(self, response):
        for link in response.css('.art-indexhmenu a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_inner_links)  # going one layer deep from the landing page

    def parse_inner_links(self, response):
        for link in response.css('.niveau2 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_cat_links)  # digging another layer deeper

    def parse_cat_links(self, response):
        for link in response.css('.niveau3 .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_target_links)  # going inside one more layer

    def parse_target_links(self, response):
        for link in response.css('.art-indexbutton-wrapper .art-indexbutton::attr(href)').extract():
            yield response.follow(link, callback=self.parse_docs)  # tracking the links leading to the target page

    def parse_docs(self, response):
        items = response.css('.titre_12_red::text').extract()
        yield {"categories": items}  # this is where the scraper collects the info

c = CrawlerProcess({  # using CrawlerProcess to be able to run the spider from the IDE
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(BikePartsSpider)
c.start()
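Since the four link-following callbacks differ only in their CSS selector, one way to collapse the repetition is to drive a single callback from a list of selectors and carry the current depth along in the request's meta. The sketch below is only an illustration of that idea, not part of the original post; the selector list and the 'level' meta key are assumptions:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class BikePartsSpider(scrapy.Spider):
    name = 'honda'

    # The chain of selectors that leads from the landing page to the target page.
    selectors = [
        '.art-indexhmenu a::attr(href)',
        '.niveau2 .art-indexbutton::attr(href)',
        '.niveau3 .art-indexbutton::attr(href)',
        '.art-indexbutton-wrapper .art-indexbutton::attr(href)',
    ]

    def start_requests(self):
        yield Request("https://www.bike-parts-honda.com/",
                      callback=self.parse_level, meta={'level': 0})

    def parse_level(self, response):
        level = response.meta['level']
        if level == len(self.selectors):
            # Past the last selector we are on the target page: scrape the info.
            yield {"categories": response.css('.titre_12_red::text').extract()}
            return
        # Otherwise follow every link for the current level, one layer deeper.
        for link in response.css(self.selectors[level]).extract():
            yield response.follow(link, callback=self.parse_level,
                                  meta={'level': level + 1})

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(BikePartsSpider)
c.start()

Each request remembers how deep it is; once the depth passes the last selector, the response is the target page and the categories are yielded. Adding a new intermediate layer then means adding one selector to the list rather than writing another callback.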
python python-3.x web-scraping scrapy
edited Apr 6 at 8:04
Peilonrayz

asked Apr 3 at 12:08
SIM
"My scraper is doing it's job errorlessly." In how much time, usually? And what's the expected output?
â Mast
Apr 11 at 15:20
Do you pipe the output into something useful with a secondary program by chance?
â Mast
Apr 11 at 16:00
In case of output, thecategories
defined within my scraper is enough @Mast. I'm not worried about output. The thing is I wish to know any better way other than what i did above cause it looks so repetitive.
â SIM
Apr 11 at 17:21
But you don't appear to do anything with those categories. Is that correct?
â Mast
Apr 11 at 17:25
Yep, right you are. I can see the results in the IDE and I checked whether the output I'm having is accurate. Btw, why the output is so important here as I've stated in the first place that i would like to go for any better design. Thanks.
â SIM
Apr 11 at 17:29
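As the comments note, the spider currently only prints the yielded dicts to the console. For reference, the CrawlerProcess settings can also enable Scrapy's feed export so the items land in a file; a minimal sketch using the FEED_FORMAT/FEED_URI settings (the output file name here is a hypothetical choice):

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'json',          # serialize every yielded item as JSON
    'FEED_URI': 'categories.json',  # hypothetical output file name
})
c.crawl(BikePartsSpider)
c.start()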