Finding (and counting) duplicate JS/Java files

I have the following script, which takes minutes to produce its output.



printf "nDuplicate JS Filenames...n"
(
find . -name '*.js' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 ";
echo "$(find . -type f -name '*.js' | wc -l) JS files in search directory";
echo "$(find . -name '*.js' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found";
)

printf "nDuplicate Java Filenames...n"
(
find . -name '*.java' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 ";
echo "$(find . -type f -name '*.java' | wc -l) Java files in search directory";
echo "$(find . -name '*.java' -type f -exec basename ; | sort | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found";
)


I know that I run the same command, or very similar ones, several times.



How could I optimize this, and perhaps the base command itself? I'm surprised that find takes so long. Or is the time spent in sort, uniq, and grep?
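
One way to check where the time actually goes is to time the stages separately. A rough sketch, comparing find on its own with find plus the per-file basename calls:

# Rough sketch: if the second command is much slower than the first,
# the per-file basename processes, not find itself, are the bottleneck.
time find . -name '*.js' -type f > /dev/null
time find . -name '*.js' -type f -exec basename {} \; > /dev/null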







asked Mar 22 at 12:38 by user3341592
1 Answer



























Aside from running essentially the same find command three times, the main issue is that you run a separate basename process for every single file found.



          If you are using GNU find (verify with find --version), you can get find to print the basenames directly:



find . -name '*.js' -type f -printf '%f\n'


          On my system this is about 900 times faster than calling basename when run on a directory with about 200,000 files in it.



If your system does not come with GNU find (e.g. macOS, OpenBSD, FreeBSD) and you do not want to install it (the package is usually called findutils), you can use sed to do the same as basename, but for all found files at once:



          find . -name '*.js' -type f | sed 's@.*/@@'


          On my system this is only slightly slower than using -printf.




If you want to reduce the number of times you run find, you can simply save the output in a variable:



          filelist="$(find . -name '*.js' -type f -printf '%fn' | sort)"
          echo "$filelist" | uniq -c | grep -v "^[ t]*1 ";
          echo "$(echo "$filelist" | wc -l) JS files in search directory";
          echo "$(echo "$filelist" | uniq -c | grep -v "^[ t]*1 " | wc -l) duplicates found"


Note that in bash you need to put double quotes around $filelist so that the newlines are not squashed.
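
Putting the pieces together, the two nearly identical blocks in the original script can also be folded into one function that is called once per extension. This is only a sketch combining the suggestions above; it assumes GNU find (for -printf), and the function name report_duplicates is my own:

#!/bin/bash
# Sketch: one parameterized function instead of two copy-pasted blocks.
report_duplicates() {
    local ext="$1"    # file extension, e.g. js
    local label="$2"  # display name, e.g. JS
    printf '\nDuplicate %s Filenames...\n' "$label"
    # Run find only once; every later step reuses its sorted output.
    local filelist dupes
    filelist="$(find . -name "*.$ext" -type f -printf '%f\n' | sort)"
    dupes="$(echo "$filelist" | uniq -c | grep -v "^[ \t]*1 ")"
    [ -n "$dupes" ] && echo "$dupes"
    echo "$(echo "$filelist" | wc -l) $label files in search directory"
    echo "$(printf '%s' "$dupes" | grep -c .) duplicates found"
}

report_duplicates js JS
report_duplicates java Java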






answered Mar 22 at 14:10 by Adaephon (accepted, +4)
          • Alternatively, check whether your basename accepts --multiple arguments.
            – Toby Speight
            Mar 22 at 15:02
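
A sketch of that idea, assuming GNU coreutils: basename -a (--multiple) strips the directory from many paths per call, and find's -exec ... {} + batches files into as few invocations as possible:

# Assumes GNU coreutils basename; "{} +" passes many files per basename call.
find . -name '*.js' -type f -exec basename -a {} + | sort | uniq -c | grep -v "^[ \t]*1 "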










• On my system, the answer now comes back in 1.2 sec instead of 134.9 sec. Thanks a lot! And thanks for the explanations, which let me learn at the same time...
            – user3341592
            Mar 22 at 15:27











• A question: let's say I'm never interested in seeing certain duplicate files (whose names would be hard-coded, such as package-info.java, AllTests.java and Constants.java). How could I remove those lines from the output? I guess chaining grep -v commands one after the other is not the right solution...
            – user3341592
            Mar 22 at 15:29
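
One possibility, sketched against the $filelist variable from the answer: a single grep -vE with an alternation drops all the hard-coded names in one pass (the pattern itself is only an illustration):

# The leading space and trailing $ anchor the match to the whole filename,
# so e.g. MyConstants.java is not accidentally dropped as well.
echo "$filelist" | uniq -c | grep -v "^[ \t]*1 " \
    | grep -vE ' (package-info|AllTests|Constants)\.java$'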










          • Question posted as codereview.stackexchange.com/questions/190215/…
            – user3341592
            Mar 22 at 15:48









