Crawl and gather all the URLs recursively in a domain
This is the script:

#!/bin/bash
set -euo pipefail
SOFT="$(basename $0)"
random=$(( ( RANDOM % 100000 ) ))
tempfolder="/tmp/$SOFT/$random"
mkdir -p "$tempfolder"
pushd "$tempfolder" > /dev/null

cleanup () {
    cat wget.txt | sed 's/:$//' | url_matcher
    popd > /dev/null
    rm -rf "$tempfolder"
}
trap cleanup EXIT

site="$1"
wget -nd --spider --recursive --no-verbose --output-file=wget.txt "$site"
The script essentially creates a subfolder under /tmp/crawl and puts the result of wget into it, then feeds that to sed and url_matcher.
Initially I thought of putting the cat instruction just after the wget one, but for long wget runs it didn't work.
The sed instruction removes the colon at the end of a line, which wget adds when the URL in question isn't valid (something along the lines of "http://...: Nothing there").
url_matcher is a C++ scanner made with flex, which recognises URLs in text fed to standard input and prints them to standard output, separated by newline characters.
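For readers who don't have that flex scanner to hand, its behaviour can be roughly approximated with grep; this is my own stand-in sketch, not the actual url_matcher:

# Hypothetical substitute for url_matcher: print http(s) URLs, one per line
grep -Eo 'https?://[^[:space:]]+'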
The script allows me to fetch all the URLs in a website and use them directly as a stream.
Example:
$ crawl www.example.com | sed 's/some_treatments//' > super_file
Tags: bash, stream
Asked Jul 12 at 8:05 by Pierre Antoine Guillaume
2 Answers
Accepted answer (score 2), answered Jul 12 at 8:56 by Toby Speight (edited Jul 13 at 8:41)
There's no need to use cat here:

cat wget.txt | sed 's/:$//' | url_matcher

Instead, simply have sed take its input from the file:

sed 's/:$//' wget.txt | url_matcher
Instead of using $RANDOM and mkdir, we can use the mktemp command (GNU coreutils) to create a new directory for us (without risk of conflict with an existing one):

tempfolder=$(mktemp -d)

I think that now removes the only Bashism in the script.
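If you'd like to keep the script name visible in the path, mktemp also accepts a template whose trailing XXXXXX is replaced with random characters. This variant is my suggestion, not part of the original answer:

# Assumes $SOFT is set as in the original script
tempfolder=$(mktemp -d "/tmp/$SOFT.XXXXXX")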
Avoid pushd and popd in scripts (having to redirect the unwanted output should be a sign that these are intended for interactive shells). Instead, we can use cd "$tempfolder" and something like cd .. to leave it (it doesn't matter where we end up, as we only use absolute paths after that). Alternatively, don't change directory at all, but specify paths in full (I'll use a shorter name for the temporary directory):
dir=$(mktemp -d)

cleanup() {
    sed 's/:$//' "$dir/wget.txt" | url_matcher
    rm -rf -- "$dir"
}
trap cleanup EXIT

wget -nd --spider --recursive --no-verbose --output-file="$dir/wget.txt" "$1"
Other alternatives to pushd/popd include saving $PWD into a variable before changing directory, or using cd - (which internally does exactly that, as cd saves $PWD into $OLDPWD for that purpose).
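A minimal sketch of the save-and-restore approach (my illustration, reusing the variable names from the question):

olddir=$PWD                 # remember where we started
cd "$tempfolder"
# ... work inside the temporary directory ...
cd "$olddir"                # return to the original directory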
We can avoid the temporary file altogether, to simplify the code and to allow us to start producing output without waiting for the crawl to finish. We can do this in a couple of ways, by being creative with --output-file. If we're on Linux, we can write to /dev/stdout instead of a file:

wget -nd --spider --recursive --no-verbose --output-file=/dev/stdout "$1" \
    | sed -e 's/:$//' | url_matcher
If we're happy to accept a Bash dependency again, we can use process substitution:

wget -nd --spider --recursive --no-verbose \
    --output-file=>(sed -e 's/:$//' | url_matcher) "$1"
However, neither of those dependencies is necessary, as wget follows the common convention of treating the filename - specially, to mean "standard output stream":

wget -nd --spider --recursive --no-verbose --output-file=- "$1" \
    | sed -e 's/:$//' | url_matcher
Modified code

#!/bin/bash
set -uo pipefail
exec wget -nd --spider --recursive --no-verbose --output-file=- "$1" \
    | sed -e 's/:$//' | url_matcher
Neat! Why can't I put the sed ... after the wget, though?
– Pierre Antoine Guillaume, Jul 12 at 9:42
I'm not sure what you mean by "put the sed after the wget" - do you mean that you'd like to pipe the wget output into sed? That's certainly possible, and I'll edit to show how.
– Toby Speight, Jul 12 at 9:44
That'd be good, sure. At first, I used to put the sed outside of the cleanup, after the wget. And for "long" crawls (with wget running for a few minutes), the cat (now sed) would not take place at all.
– Pierre Antoine Guillaume, Jul 12 at 9:46
Instead of --output-file, why not use -O-?
– janos, Jul 13 at 5:00
@janos, why not, indeed? I just forgot that wget observes the - convention. I'll edit that in.
– Toby Speight, Jul 13 at 7:56
Answer (score 1), answered Jul 13 at 5:11 by janos
A few points on top of Toby's excellent review.
This is not correctly protected from word splitting, and will fail when the script's path contains spaces:
SOFT="$(basename $0)"
The only thing you need to double quote there is the $0:
SOFT=$(basename "$0")
Since RANDOM generates a random integer in the range 0 - 32767, the % 100000 is unnecessary.
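A quick demonstration (my addition, not part of the answer):

# RANDOM is always in 0..32767, so % 100000 can never change the value
echo "$RANDOM"
echo $(( RANDOM % 100000 ))   # drawn from exactly the same range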
The popd is unlikely to ever be useful in a cleanup script. Directory changes within a script have no effect outside an executed (as opposed to sourced) script. In any case, it's best not to change directories within scripts at all.
I think the popd was probably there to make the temporary directory not be the working directory when it's removed (possible confusion with attempting to umount a filesystem while a process is using it as its working directory, perhaps?)
– Toby Speight, Jul 13 at 8:15
I really appreciate it. I am going to change my boilerplate script ASAP! The popd was there for the bad reasons ;) thank you! The cd (or pushd) was there for a "good" reason: when I started the script, I didn't know of the wget -nd option. If you don't use it, you create a real mess of subfolders. Hence: create a new tempfolder, cd into it, and delete it afterward.
– Pierre Antoine Guillaume, Jul 13 at 12:39