Finding the comic ID of the last XKCD comic published

I decided to sidetrack and create an XKCD viewer. For certain functionality, I needed to be able to find the ID of the last comic published. This was my attempt. I'm using Enlive here to parse the page itself.
I struggled to find a CSS selector that would get the text node, then finally gave up and decided to do some manual parsing. It got long and ugly, but it works! The problem is that the only place I can concretely find page IDs is in a note at the bottom of the page:
Permanent link to this comic: https://xkcd.com/1988/
To parse out the ID at the end of that link, I need to find the text node, then parse the string. The latter was easy. The former took me a little under an hour, mostly due to my inexperience with CSS selectors.
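The string-parsing half can probably be collapsed into a single regex match. The following is a minimal sketch, assuming the text node always contains a URL of the form `xkcd.com/<digits>`; the namespace and the `permalink-id` name are hypothetical and only for illustration.

```clojure
(ns xkcd-viewer.permalink-sketch)

(defn permalink-id
  "Returns the comic ID found in a permalink text node as a long,
  or nil if the text contains no xkcd.com/<digits> URL.
  Hypothetical helper, not part of the original code."
  [text]
  ;; (str text) turns a nil input into "", so re-find simply finds nothing.
  (when-let [[_ digits] (re-find #"xkcd\.com/(\d+)" (str text))]
    (Long/parseLong digits)))

(permalink-id "\nPermanent link to this comic: https://xkcd.com/1988/")
;; => 1988
```

Anchoring the match on `xkcd.com/` rather than on "the first run of digits" keeps the parse from picking up unrelated numbers if the wording of the note ever changes.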
What I'm looking for:
- Is there a way to get the text node directly via Enlive CSS-like selectors?
- Anything else that may simplify this. It's quite a series of transformations. I could obviously split it into a few functions, but I can't see ever needing the functionality anywhere else, and it's fairly simple to test as is. Any recommendations here?
Usage as of posting this:

    (find-last-id)
    => 1988

    (ns xkcd-viewer.mcve
      (:require [net.cgrand.enlive-html :as e])
      (:import (java.net URL)))

    (def base-url "https://xkcd.com/")

    ; I actually use this a couple of times in the real code. It doesn't seem as useful here, though.
    (defn parse-id?
      "Returns the str-n parsed as a long, or nil if it's unparsable."
      [str-n]
      (try
        (Long/parseLong str-n)
        (catch NumberFormatException _
          nil)))

    (defn find-last-id []
      (let [digit? #(Character/isDigit ^Character %)
            id-container (-> (e/html-resource (URL. base-url))
                             (e/select [:#middleContainer])
                             (first)
                             (:content))
            raw-id (->> id-container
                        ; The text node to find is surrounded by <br>s, so
                        (drop-while #(not= (:tag %) :br)) ; get rid of everything before the first br,
                        (drop 1) ; then the br itself,
                        (first) ; then get the text node, then
                        (drop-while (comp not digit?)) ; skip ahead to the first digit,
                        (take-while digit?) ; keep the run of digits,
                        (apply str))] ; then turn the digits into a string to be parsed.
        (if-let [parsed (parse-id? raw-id)]
          parsed
          (throw (RuntimeException.
                   (str "Parser broken! Did XKCD change their site?\nFound ID: " raw-id))))))
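On the first question, Enlive's selector vocabulary does appear to include steps for text nodes (`text-node` and `text-pred`), which would let the selector pick out the permalink note directly instead of walking the `<br>`s by hand. The sketch below is a guess at how that could look, not a tested replacement: the sample markup is a simplified stand-in for the real page, and `sample-html` and `permalink-text` are hypothetical names.

```clojure
(ns xkcd-viewer.text-node-sketch
  (:require [net.cgrand.enlive-html :as e]))

;; Simplified stand-in for the real page; the actual markup may differ.
(def sample-html
  (str "<div id=\"middleContainer\">"
       "<div id=\"comic\"><img src=\"comic.png\"></div>"
       "<br>Permanent link to this comic: https://xkcd.com/1988/<br>"
       "Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/comic.png"
       "</div>"))

(defn permalink-text
  "Returns the first text node under #middleContainer that mentions the
  permanent link, or nil if there is none."
  [nodes]
  (first (e/select nodes
                   [:#middleContainer
                    ;; text-pred turns a predicate on strings into a selector step.
                    (e/text-pred #(re-find #"Permanent link" %))])))

;; Parse the inline sample instead of fetching the live site.
(permalink-text (e/html-snippet sample-html))
```

Against the live page, the same selector would be applied to the result of `(e/html-resource (URL. base-url))`, and the returned string could then go through the existing digit extraction.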
parsing web-scraping clojure
I do not know anything about Clojure... but I have a feeling it would be simpler to grab the link for the previous page and add one to the ID.
– Gerrit0, May 3 at 4:04
@Gerrit0 LOL. Probably. But who has time to think about logic before spending a couple of hours hacking stuff together?
– Carcigenicate, May 3 at 4:06
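The previous-link idea from the comment could be sketched roughly as follows. It assumes the front page contains an anchor shaped like `<a rel="prev" href="/1987/" ...>`; that attribute layout is an assumption about the site's markup, and `last-id-from-prev` is a hypothetical name.

```clojure
(ns xkcd-viewer.prev-link-sketch)

(defn last-id-from-prev
  "Given the front page HTML as a string, returns (previous comic ID + 1),
  or nil if no rel=\"prev\" link with a numeric href is found.
  Assumes the attributes appear in the order rel, then href."
  [html]
  (when-let [[_ digits] (re-find #"rel=\"prev\"\s+href=\"/(\d+)/\"" html)]
    (inc (Long/parseLong digits))))

(last-id-from-prev "<a rel=\"prev\" href=\"/1987/\" accesskey=\"p\">&lt; Prev</a>")
;; => 1988
```

A regex over raw HTML is brittle if the attribute order changes, so selecting the anchor with Enlive and reading its `:href` attribute would likely be the sturdier variant.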
edited May 3 at 2:57 by 200_success
asked May 2 at 23:56 by Carcigenicate
1 Answer
I'm not sure it's much shorter than what you wrote, but finding stuff in any tree-like data structure is what I created the tupelo.forest library for.
Here is a solution for your problem:
    (dotest
      (when false ; manually enable to grab a new copy of the webpage
        (spit "xkcd-sample.html"
              (slurp "https://xkcd.com")))
      (with-forest (new-forest)
        (let [doc         (it-> (xkcd) ; `xkcd` is a helper whose definition is not shown here
                                (drop-if #(= :dtd (:type %)) it)
                                (only it))
              root-hid    (add-tree-enlive doc)
              >>          (remove-whitespace-leaves)
              ;>>         (spyx-pretty (hid->bush root-hid))
              hid-keep-fn (fn [hid]
                            (let [node       (hid->node hid)
                                  value      (when (contains? node :value) (grab :value node))
                                  perm-link? (when (string? value)
                                               (re-find #"Permanent link to this comic" value))]
                              perm-link?))
              found-hids  (find-hids-with root-hid [:** :*] hid-keep-fn)
              link-node   (hid->node (only found-hids)) ; assume there is only 1 link node
              value-str   (grab :value link-node) ; "\nPermanent link to this comic: https://xkcd.com/1988/"
              result      (re-find #"http.*$" value-str)]
          ;(spyx-pretty link-node) ;=> {:tupelo.forest/khids [],
          ;                             :tag :tupelo.forest/raw,
          ;                             :value "\nPermanent link to this comic: https://xkcd.com/1988/"}
          ;(spyx result) ; => "https://xkcd.com/1988/"
          )))
Documentation is ongoing, but you can see a lightning talk from the Clojure Conj 2017.
Oh, I see I forgot to parse out just the integer ID. Oh well.
– Alan Thompson, May 3 at 1:28
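The missing last step, going from the URL string to the integer ID, might look something like the sketch below. It assumes the ID is the final path segment of the URL; `url->comic-id` is a hypothetical name and not part of tupelo.forest.

```clojure
(ns xkcd-viewer.url-id-sketch
  (:require [clojure.string :as str])
  (:import (java.net URI)))

(defn url->comic-id
  "Returns the last path segment of url as a long, or nil when the URL
  has no path segments or the last one is not purely numeric."
  [^String url]
  (let [segments (remove str/blank? (str/split (.getPath (URI. url)) #"/"))]
    (some->> (last segments)
             (re-matches #"\d+")   ; nil unless the segment is all digits
             Long/parseLong)))

(url->comic-id "https://xkcd.com/1988/")
;; => 1988
```

Going through `java.net.URI` instead of a regex on the whole string means a port number or query string would not be mistaken for the ID.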
answered May 3 at 1:23 by Alan Thompson