Finding (and counting) duplicate JS/Java files
I have the following script which takes minutes to give output.
    printf "\nDuplicate JS Filenames...\n"
    (
    find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
    echo "$(find . -type f -name '*.js' | wc -l) JS files in search directory";
    echo "$(find . -name '*.js' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
    )
    printf "\nDuplicate Java Filenames...\n"
    (
    find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 ";
    echo "$(find . -type f -name '*.java' | wc -l) Java files in search directory";
    echo "$(find . -name '*.java' -type f -exec basename {} \; | sort | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found";
    )
I know that I run the same command, or very similar ones, several times.
How could I optimize this, perhaps starting with the base command? I'm surprised that find takes so long, or is the slowness due to sort, uniq, and grep?
bash shell sh zsh
asked Mar 22 at 12:38
user3341592
1 Answer
accepted
Aside from running essentially the same find command three times, the main issue is that you run a separate basename instance for every single found file.
If you are using GNU find (verify with find --version), you can get find to print the basenames directly:
    find . -name '*.js' -type f -printf '%f\n'
On my system this is about 900 times faster than calling basename when run on a directory with about 200,000 files in it.
If your system does not come with GNU find (e.g. macOS, OpenBSD, FreeBSD) and you do not want to install it (the package is usually called findutils), you can use sed to do the same as basename but for all found files at once:
    find . -name '*.js' -type f | sed 's@.*/@@'
On my system this is only slightly slower than using -printf.
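As a rough illustration of the two variants above, the sketch below wraps each one in a small function so they can be swapped easily. The function names basenames_gnu and basenames_sed, and the directory and extension parameters, are made up for this example and are not part of the answer; the -printf variant assumes GNU find.

```bash
# Hypothetical helpers (names and parameters are illustrative only).
# Usage: basenames_gnu DIR EXT   or   basenames_sed DIR EXT

# GNU find only: find prints each basename itself, no extra process per file.
basenames_gnu() {
    find "$1" -name "*.$2" -type f -printf '%f\n'
}

# Portable variant: a single sed process strips everything up to the last slash.
basenames_sed() {
    find "$1" -name "*.$2" -type f | sed 's@.*/@@'
}
```

For example, `basenames_sed . js | sort | uniq -c` would feed the rest of the original pipeline. Both variants assume file names without embedded newlines.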
If you want to reduce the number of times you run find, you can just save the output in a variable:
    filelist="$(find . -name '*.js' -type f -printf '%f\n' | sort)"
    echo "$filelist" | uniq -c | grep -v "^[ \t]*1 ";
    echo "$(echo "$filelist" | wc -l) JS files in search directory";
    echo "$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " | wc -l) duplicates found"
Note that in bash you need to put double quotes around $filelist so that the newlines are not squashed.
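Since the JS and Java halves of the script differ only in the extension and the label, the same idea can be folded into one function. This is only a sketch under some assumptions: the name report_duplicates and its three parameters are invented here, it uses the portable sed variant rather than -printf, it swaps the grep -v filter for awk '$1 > 1' (which keeps only counts above one), and it assumes file names contain no newlines.

```bash
# Hypothetical helper: report_duplicates DIR EXT LABEL
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
report_duplicates() (
    dir=$1
    ext=$2
    label=$3

    # Run find once; sed strips the directory part of every path.
    names=$(find "$dir" -name "*.$ext" -type f | sed 's@.*/@@' | sort)

    # Count each name and keep only those that occur more than once.
    dups=$(printf '%s\n' "$names" | uniq -c | awk '$1 > 1')

    printf '\nDuplicate %s Filenames...\n' "$label"
    if [ -n "$dups" ]; then
        printf '%s\n' "$dups"
    fi

    # grep -c . counts non-empty lines, so an empty list is reported as 0.
    printf '%s %s files in search directory\n' \
        "$(printf '%s\n' "$names" | grep -c .)" "$label"
    printf '%s duplicates found\n' "$(printf '%s\n' "$dups" | grep -c .)"
)
```

With this in place the whole script would shrink to two calls such as `report_duplicates . js JS` and `report_duplicates . java Java`, each of which runs find only once.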
answered Mar 22 at 14:10
Adaephon
Alternatively, check whether your basename accepts --multiple arguments.
– Toby Speight
Mar 22 at 15:02
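A minimal sketch of what this comment suggests, assuming a basename that supports -a (the short form of --multiple in GNU coreutils; the BSD basename also has -a). The function name basenames_batched is hypothetical. The key change is `{} +` instead of `{} \;`, which lets find hand many paths to each basename invocation.

```bash
# Hypothetical helper: basenames_batched DIR EXT
# Assumes basename accepts -a (treat every argument as a path to strip).
basenames_batched() {
    # "{} +" batches as many paths as fit into one command line,
    # so basename is started a handful of times rather than once per file.
    find "$1" -name "*.$2" -type f -exec basename -a {} +
}
```

This keeps basename in the pipeline while avoiding the per-file process start that made the original script slow.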
On my system, the result now appears in 1.2 sec instead of 134.9 sec. Thanks a lot! And thanks for the explanations, which let me learn at the same time...
– user3341592
Mar 22 at 15:27
A question: let's say I'm never interested in seeing certain duplicate files (whose names would be hard-coded, such as package-info.java, AllTests.java and Constants.java). How could I remove those lines from the output? I guess chaining grep -v commands one after the other is not the right solution...
– user3341592
Mar 22 at 15:29
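One possible way to do this without chaining several grep -v commands is a single grep with several -e patterns. This is only a sketch, not an answer given in this thread: the function name drop_ignored is made up, and the three file names are the ones mentioned in the comment. -F treats the patterns as fixed strings, -x requires the whole line to match, and -v inverts the match.

```bash
# Hypothetical filter: reads basenames on stdin, drops the ignored ones.
drop_ignored() {
    # -F fixed strings, -x whole-line match, -v print non-matching lines.
    # Because of -x, a name like MyConstants.java is not dropped.
    grep -vxF \
        -e 'package-info.java' \
        -e 'AllTests.java' \
        -e 'Constants.java'
}
```

It would be applied to the list of basenames before counting, for instance `echo "$filelist" | drop_ignored | uniq -c`. If the list of names grows, the patterns could instead live in a file read with grep's -f option.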
Question posted as codereview.stackexchange.com/questions/190215/…
– user3341592
Mar 22 at 15:48