Finding (and counting) duplicate JS/Java files

I have the following script, which takes minutes to produce its output.



printf "nDuplicate JS Filenames...n"
(
find . -name '*.js' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 ";
echo "$(find . -type f -name '*.js' | wc -l) JS files in search directory";
echo "$(find . -name '*.js' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found";
)

printf "nDuplicate Java Filenames...n"
(
find . -name '*.java' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 ";
echo "$(find . -type f -name '*.java' | wc -l) Java files in search directory";
echo "$(find . -name '*.java' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found";
)


I know that I run the same command, or very similar ones, several times.



How could I optimize this, and perhaps the base command itself? I'm surprised that find takes so long. Or is the time spent in sort, uniq, and grep?
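
One way to check where the time actually goes is to time the stages separately. A rough sketch, comparing find on its own with find plus the per-file basename calls:

# Rough sketch: if the second command is much slower than the first,
# the per-file basename processes, not find itself, are the bottleneck.
time find . -name '*.js' -type f > /dev/null
time find . -name '*.js' -type f -exec basename {} \; > /dev/null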







asked Mar 22 at 12:38 by user3341592
1 Answer



























Aside from running essentially the same find command three times, the main issue is that you run a separate basename process for every single file found.



          If you are using GNU find (verify with find --version), you can get find to print the basenames directly:



find . -name '*.js' -type f -printf '%f\n'


          On my system this is about 900 times faster than calling basename when run on a directory with about 200,000 files in it.



If your system does not come with GNU find (e.g. macOS, OpenBSD, FreeBSD) and you do not want to install it (the package is usually called findutils), you can use sed to do the same as basename, but for all found files at once:



          find . -name '*.js' -type f | sed 's@.*/@@'


          On my system this is only slightly slower than using -printf.




If you want to reduce the number of times you run find, you can simply save the output in a variable:



          filelist="$(find . -name '*.js' -type f -printf '%fn' | sort)"
          echo "$filelist" | uniq -c | grep -v "^[ t]*1 ";
          echo "$(echo "$filelist" | wc -l) JS files in search directory";
          echo "$(echo "$filelist" | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found"


Note that in bash you need to put double quotes around $filelist so that the newlines are not squashed.
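
Putting the pieces together, the two nearly identical blocks in the original script can also be folded into one function that is called once per extension. This is only a sketch combining the suggestions above; it assumes GNU find (for -printf), and the function name report_duplicates is my own:

#!/bin/bash
# Sketch: one parameterized function instead of two copy-pasted blocks.
report_duplicates() {
    local ext="$1"    # file extension, e.g. js
    local label="$2"  # display name, e.g. JS
    printf '\nDuplicate %s Filenames...\n' "$label"
    # Run find only once; every later step reuses its sorted output.
    local filelist dupes
    filelist="$(find . -name "*.$ext" -type f -printf '%f\n' | sort)"
    dupes="$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 ")"
    [ -n "$dupes" ] && echo "$dupes"
    echo "$(echo "$filelist" | wc -l) $label files in search directory"
    echo "$(printf '%s' "$dupes" | grep -c .) duplicates found"
}

report_duplicates js JS
report_duplicates java Java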






answered Mar 22 at 14:10 by Adaephon (accepted, +4)
          • Alternatively, check whether your basename accepts --multiple arguments.
            – Toby Speight
            Mar 22 at 15:02
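
A sketch of that idea, assuming GNU coreutils: basename -a (--multiple) strips the directory from many paths per call, and find's -exec ... {} + batches files into as few invocations as possible:

# Assumes GNU coreutils basename; "{} +" passes many files per basename call.
find . -name '*.js' -type f -exec basename -a {} + | sort | uniq -c | grep -v "^[ \t]*1 "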










• On my system, the answer now comes back in 1.2 sec instead of 134.9 sec. Thanks a lot! And thanks for the explanations, which let me learn at the same time...
            – user3341592
            Mar 22 at 15:27











• A question: let's say I'm never interested in seeing certain duplicate files (whose names would be hard-coded, such as package-info.java, AllTests.java and Constants.java). How could I remove those lines from the output? I guess chaining grep -v commands one after the other is not the right solution...
            – user3341592
            Mar 22 at 15:29
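
One possibility, sketched against the $filelist variable from the answer: a single grep -vE with an alternation drops all the hard-coded names in one pass (the pattern itself is only an illustration):

# The leading space and trailing $ anchor the match to the whole filename,
# so e.g. MyConstants.java is not accidentally dropped as well.
echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " \
    | grep -vE ' (package-info|AllTests|Constants)\.java$'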










          • Question posted as codereview.stackexchange.com/questions/190215/…
            – user3341592
            Mar 22 at 15:48









