Crawl and gather all the URLs recursively in a domain

This is the script:



#!/bin/bash

set -euo pipefail

SOFT="$(basename $0)"
random=$(( ( RANDOM % 100000 ) ))
tempfolder="/tmp/$SOFT/$random"

mkdir -p "$tempfolder"
pushd "$tempfolder" > /dev/null
cleanup () {
    cat wget.txt | sed 's/:$//' | url_matcher
    popd > /dev/null
    rm -rf "$tempfolder"
}
trap cleanup EXIT
site="$1"

wget -nd --spider --recursive --no-verbose --output-file=wget.txt "$site"


The script essentially creates a subfolder in /tmp/crawl, puts the result of wget into it, then feeds that result to sed and url_matcher.



Initially I thought of putting the cat instruction just after the wget one, but for long wget runs it didn't work.



The sed instruction removes the colon at the end of a line, which wget adds when the URL on that line isn't valid
(something along the lines of "http://...: Nothing there").
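
For example, a minimal sketch of that treatment (the input line is a made-up sample of such a log line):

# Strip the trailing colon that marks a dead link in the wget log:
printf 'http://example.com/missing:\n' | sed 's/:$//'
# prints: http://example.com/missing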



url_matcher is a C++ scanner made with flex, which recognises URLs in text fed to its standard input and prints them to standard output, separated by newline characters.
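
For reference, a rough shell stand-in for url_matcher (the pattern is an approximation; the real flex grammar is not shown here):

# Extract anything URL-shaped from stdin, one match per line:
url_matcher_approx() {
    grep -Eo 'https?://[^[:space:]"<>]+'
}

echo 'see http://example.com/a and https://example.com/b today' | url_matcher_approx
# prints:
# http://example.com/a
# https://example.com/b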



The script allows me to fetch all URLs in a website and use them as a stream directly.



Example:



$ crawl www.example.com | sed 's/some_treatments//' > super_file






asked Jul 12 at 8:05 by Pierre Antoine Guillaume
2 Answers
Accepted answer, by Toby Speight (answered Jul 12 at 8:56, edited Jul 13 at 8:41)
There's no need to use cat here:

cat wget.txt | sed 's/:$//' | url_matcher

Instead, simply have sed take its input from the file:

sed 's/:$//' wget.txt | url_matcher

Instead of using $RANDOM and mkdir, we can use the mktemp command (GNU coreutils) to create a new directory for us (without risk of conflict with an existing one):

tempfolder=$(mktemp -d)
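
If we want to keep the script's name in the path, mktemp also accepts a template whose trailing Xs are replaced with random characters; a minimal sketch (the template is illustrative):

# Same idea, but keeping the script name in the directory name;
# the template must end in a run of X characters:
tempfolder=$(mktemp -d "/tmp/${SOFT}.XXXXXXXX")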


I think that now removes the only Bashism in the script.

Avoid pushd and popd in scripts (having to redirect the unwanted output should be a sign that these are intended for interactive shells). Instead, we can use cd "$tempfolder" and something like cd .. to leave it (it doesn't matter where we end up, as we only use absolute paths after that). Alternatively, don't change directory at all, but specify paths in full (I'll use a shorter name for the temporary directory):

dir=$(mktemp -d)

cleanup() {
    sed 's/:$//' "$dir/wget.txt" | url_matcher
    rm -rf -- "$dir"
}
trap cleanup EXIT

wget -nd --spider --recursive --no-verbose --output-file="$dir/wget.txt" "$1"

Other alternatives to pushd/popd include saving $PWD into a variable before changing directory, or using cd - (which internally does exactly that, as cd saves $PWD into $OLDPWD for that purpose).
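
A minimal sketch of the save-and-restore alternative (names are illustrative):

# Save the current directory, work in the temporary one, then return:
oldpwd=$PWD
cd "$tempfolder"
# ... run wget here ...
cd "$oldpwd"    # same effect as `cd -`, without echoing the directory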




We can avoid the temporary file altogether, to simplify the code and to allow us to start producing output without waiting for the crawl to finish. We can do this in a couple of ways, by being creative with --output-file. If we're on Linux, we can write to /dev/stdout instead of a file:

wget -nd --spider --recursive --no-verbose --output-file=/dev/stdout "$1" \
    | sed -e 's/:$//' | url_matcher


If we're happy to accept a Bash dependency again, we can use process substitution:

wget -nd --spider --recursive --no-verbose \
    --output-file=>(sed -e 's/:$//' | url_matcher) "$1"


However, neither of those dependencies is necessary, as wget follows the common convention of treating the filename - specially, to mean "standard output stream":

wget -nd --spider --recursive --no-verbose --output-file=- "$1" \
    | sed -e 's/:$//' | url_matcher



Modified code

#!/bin/bash

set -uo pipefail

exec wget -nd --spider --recursive --no-verbose --output-file=- "$1" \
    | sed -e 's/:$//' | url_matcher





• Neat! Why can't I put the sed ... after the wget though?
  – Pierre Antoine Guillaume, Jul 12 at 9:42

• I'm not sure what you mean by "put the sed after the wget" - do you mean that you'd like to pipe the wget output into sed? That's certainly possible, and I'll edit to show how.
  – Toby Speight, Jul 12 at 9:44

• That'd be good, sure. At first, I used to put the sed outside of the cleanup, after the wget. And for "long" crawls (with wget running for a few minutes), the cat - now sed - would not take place at all.
  – Pierre Antoine Guillaume, Jul 12 at 9:46

• Instead of --output-file, why not use -O-?
  – janos, Jul 13 at 5:00

• @janos, why not, indeed? I just forgot that wget observes the - convention. I'll edit that in.
  – Toby Speight, Jul 13 at 7:56







Answer, by janos (answered Jul 13 at 5:11)
A few points on top of Toby's excellent review.

This is not correctly protected from word splitting, and will fail when the script's path contains spaces:

SOFT="$(basename $0)"

The only thing you need to double-quote there is the $0:

SOFT=$(basename "$0")
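
A quick demonstration of the failure mode (the path is hypothetical); unquoted, $0 word-splits and basename treats the second word as a suffix to strip:

script='/tmp/my dir/crawl'   # hypothetical path containing a space
basename $script             # runs: basename '/tmp/my' 'dir/crawl' -> prints "my"
basename "$script"           # quoted: prints "crawl"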


Since RANDOM generates a random integer in the range 0 to 32767, the % 100000 is unnecessary.
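
In other words, the modulo can never change the value:

echo $(( 32767 % 100000 ))   # prints 32767: even RANDOM's maximum is below the modulus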



The popd is unlikely to ever be useful in a cleanup script. Directory changes within a script have no effect outside an executed (as opposed to sourced) script. In any case, it's best to not change directories within scripts at all.






• I think the popd was probably there to make the temporary directory not be the working directory when it's removed (possible confusion with attempting to umount a filesystem when a process is using it as its working directory, perhaps?)
  – Toby Speight, Jul 13 at 8:15

• I really appreciate it. I am going to change my boilerplate script ASAP! The popd was there for the bad reasons ;) thank you! The cd (or pushd) was there for a "good" reason: when I started the script, I didn't know of the wget -nd option. If you don't use it, you create a real mess of subfolders. Hence: create a new tempfolder, cd into it, and delete it afterward.
  – Pierre Antoine Guillaume, Jul 13 at 12:39









