Go program to find duplicated files in a directory (recursively)

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.



It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.



Here is how it works:



  • n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named input in my code.

  • 1 goroutine is spawned to recursively search for files in the direvtory, and populate the input channel with file names.

  • The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].

Finally we just display the duplicates.



Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.



EDIT: Improved my initial code with flags and error management.



package main

import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)

var dir string
var workers int

type Result struct
file string
sha256 [32]byte


func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,


wg.Done()


func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func main()

flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()

fmt.Printf("Searching in %s using %d workers...n", dir, workers)

input := make(chan string)
results := make(chan *Result)

wg := sync.WaitGroup
wg.Add(workers)

for i := 0; i < workers; i++
go worker(input, results, &wg)


go search(input)
go func()
wg.Wait()
close(results)
()

counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)











share|improve this question





















  • hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
    – Randy Howard
    Feb 10 at 3:39










  • Thanks, I did it and added flags to set directory and number of workers :-)
    – Thibaut D.
    Feb 10 at 9:24
















up vote
2
down vote

favorite












This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.



It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.



Here is how it works:



  • n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named input in my code.

  • 1 goroutine is spawned to recursively search for files in the direvtory, and populate the input channel with file names.

  • The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].

Finally we just display the duplicates.



Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.



EDIT: Improved my initial code with flags and error management.



package main

import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)

var dir string
var workers int

type Result struct
file string
sha256 [32]byte


func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,


wg.Done()


func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func main()

flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()

fmt.Printf("Searching in %s using %d workers...n", dir, workers)

input := make(chan string)
results := make(chan *Result)

wg := sync.WaitGroup
wg.Add(workers)

for i := 0; i < workers; i++
go worker(input, results, &wg)


go search(input)
go func()
wg.Wait()
close(results)
()

counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)











share|improve this question





















  • hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
    – Randy Howard
    Feb 10 at 3:39










  • Thanks, I did it and added flags to set directory and number of workers :-)
    – Thibaut D.
    Feb 10 at 9:24












up vote
2
down vote

favorite









up vote
2
down vote

favorite











This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.



It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.



Here is how it works:



  • n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named input in my code.

  • 1 goroutine is spawned to recursively search for files in the direvtory, and populate the input channel with file names.

  • The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].

Finally we just display the duplicates.



Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.



EDIT: Improved my initial code with flags and error management.



package main

import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)

var dir string
var workers int

type Result struct
file string
sha256 [32]byte


func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,


wg.Done()


func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func main()

flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()

fmt.Printf("Searching in %s using %d workers...n", dir, workers)

input := make(chan string)
results := make(chan *Result)

wg := sync.WaitGroup
wg.Add(workers)

for i := 0; i < workers; i++
go worker(input, results, &wg)


go search(input)
go func()
wg.Wait()
close(results)
()

counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)











share|improve this question













This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.



It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.



Here is how it works:



  • n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named input in my code.

  • 1 goroutine is spawned to recursively search for files in the direvtory, and populate the input channel with file names.

  • The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].

Finally we just display the duplicates.



Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.



EDIT: Improved my initial code with flags and error management.



package main

import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)

var dir string
var workers int

type Result struct
file string
sha256 [32]byte


func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,


wg.Done()


func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func main()

flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()

fmt.Printf("Searching in %s using %d workers...n", dir, workers)

input := make(chan string)
results := make(chan *Result)

wg := sync.WaitGroup
wg.Add(workers)

for i := 0; i < workers; i++
go worker(input, results, &wg)


go search(input)
go func()
wg.Wait()
close(results)
()

counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)













share|improve this question












share|improve this question




share|improve this question








edited Feb 27 at 15:31









200_success

123k14143401




123k14143401









asked Feb 10 at 0:52









Thibaut D.

757




757











  • hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
    – Randy Howard
    Feb 10 at 3:39










  • Thanks, I did it and added flags to set directory and number of workers :-)
    – Thibaut D.
    Feb 10 at 9:24
















  • hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
    – Randy Howard
    Feb 10 at 3:39










  • Thanks, I did it and added flags to set directory and number of workers :-)
    – Thibaut D.
    Feb 10 at 9:24















hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
– Randy Howard
Feb 10 at 3:39




hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
– Randy Howard
Feb 10 at 3:39












Thanks, I did it and added flags to set directory and number of workers :-)
– Thibaut D.
Feb 10 at 9:24




Thanks, I did it and added flags to set directory and number of workers :-)
– Thibaut D.
Feb 10 at 9:24










1 Answer
1






active

oldest

votes

















up vote
5
down vote



accepted
+50










1. declare all 'var' at once



instead of



var dir string
var workers int


you can do



var (
dir string
workers int
)


or even better, use local var instead of global var irectly in your main() function



dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")


2. Make sure that arguments are valid



if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:



if workers <= 0 
fmt.Printf("workers has to be > 0, was %d", workers)



3. Improve hash computing:



First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:



h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()



Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:



results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),



4. Always specify the channel direction when you can



From the golang specifications:




A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.



ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .



The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.




Specify the channel direction helps understand what a method is doing




Here is the new version of the code:



package main

import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)

type Result struct
file string
sha256 string


func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),

h.Reset()

wg.Done()


func run(dir string, workers int) (map[string]string, error)

input := make(chan string)
go search(dir, input)

counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)

()

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)

wg.Wait()
close(results)

return counter, nil


func main()

dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()

if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)

fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)

counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)







Possible improvements:



Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)






share|improve this answer























  • Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
    – oliverpool
    Feb 22 at 17:59










  • @oliverpool you're right, thanks for noticing this! I've updated the code
    – felix
    Feb 27 at 13:06










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187231%2fgo-program-to-find-duplicated-files-in-a-directory-recursively%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
5
down vote



accepted
+50










1. declare all 'var' at once



instead of



var dir string
var workers int


you can do



var (
dir string
workers int
)


or even better, use local var instead of global var irectly in your main() function



dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")


2. Make sure that arguments are valid



if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:



if workers <= 0 
fmt.Printf("workers has to be > 0, was %d", workers)



3. Improve hash computing:



First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:



h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()



Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:



results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),



4. Always specify the channel direction when you can



From the golang specifications:




A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.



ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .



The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.




Specify the channel direction helps understand what a method is doing




Here is the new version of the code:



package main

import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)

type Result struct
file string
sha256 string


func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),

h.Reset()

wg.Done()


func run(dir string, workers int) (map[string]string, error)

input := make(chan string)
go search(dir, input)

counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)

()

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)

wg.Wait()
close(results)

return counter, nil


func main()

dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()

if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)

fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)

counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)







Possible improvements:



Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)






share|improve this answer























  • Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
    – oliverpool
    Feb 22 at 17:59










  • @oliverpool you're right, thanks for noticing this! I've updated the code
    – felix
    Feb 27 at 13:06














up vote
5
down vote



accepted
+50










1. declare all 'var' at once



instead of



var dir string
var workers int


you can do



var (
dir string
workers int
)


or even better, use local var instead of global var irectly in your main() function



dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")


2. Make sure that arguments are valid



if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:



if workers <= 0 
fmt.Printf("workers has to be > 0, was %d", workers)



3. Improve hash computing:



First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:



h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()



Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:



results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),



4. Always specify the channel direction when you can



From the golang specifications:




A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.



ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .



The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.




Specify the channel direction helps understand what a method is doing




Here is the new version of the code:



package main

import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)

type Result struct
file string
sha256 string


func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),

h.Reset()

wg.Done()


func run(dir string, workers int) (map[string]string, error)

input := make(chan string)
go search(dir, input)

counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)

()

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)

wg.Wait()
close(results)

return counter, nil


func main()

dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()

if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)

fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)

counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)







Possible improvements:



Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)






share|improve this answer























  • Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
    – oliverpool
    Feb 22 at 17:59










  • @oliverpool you're right, thanks for noticing this! I've updated the code
    – felix
    Feb 27 at 13:06












up vote
5
down vote



accepted
+50







up vote
5
down vote



accepted
+50




+50




1. declare all 'var' at once



instead of



var dir string
var workers int


you can do



var (
dir string
workers int
)


or even better, use local var instead of global var irectly in your main() function



dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")


2. Make sure that arguments are valid



if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:



if workers <= 0 
fmt.Printf("workers has to be > 0, was %d", workers)



3. Improve hash computing:



First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:



h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()



Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:



results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),



4. Always specify the channel direction when you can



From the golang specifications:




A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.



ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .



The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.




Specify the channel direction helps understand what a method is doing




Here is the new version of the code:



package main

import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)

type Result struct
file string
sha256 string


func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),

h.Reset()

wg.Done()


func run(dir string, workers int) (map[string]string, error)

input := make(chan string)
go search(dir, input)

counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)

()

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)

wg.Wait()
close(results)

return counter, nil


func main()

dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()

if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)

fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)

counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)







Possible improvements:



Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)






share|improve this answer















1. declare all 'var' at once



instead of



var dir string
var workers int


you can do



var (
dir string
workers int
)


or even better, use local var instead of global var irectly in your main() function



dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")


2. Make sure that arguments are valid



if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:



if workers <= 0 
fmt.Printf("workers has to be > 0, was %d", workers)



3. Improve hash computing:



First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:



h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()



Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:



results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),



4. Always specify the channel direction when you can



From the golang specifications:




A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.



ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .



The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.




Specify the channel direction helps understand what a method is doing




Here is the new version of the code:



package main

import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)

type Result struct
file string
sha256 string


func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path

return nil
)
close(input)


func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue

if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue

f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),

h.Reset()

wg.Done()


func run(dir string, workers int) (map[string]string, error)

input := make(chan string)
go search(dir, input)

counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)

()

var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)

wg.Wait()
close(results)

return counter, nil


func main()

dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()

if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)

fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)

counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)


for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)







Possible improvements:



Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)







share|improve this answer















share|improve this answer



share|improve this answer








edited Feb 27 at 13:05


























answered Feb 14 at 12:10









felix

67829




67829











  • Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
    – oliverpool
    Feb 22 at 17:59










  • @oliverpool you're right, thanks for noticing this! I've updated the code
    – felix
    Feb 27 at 13:06
















  • Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
    – oliverpool
    Feb 22 at 17:59










  • @oliverpool you're right, thanks for noticing this! I've updated the code
    – felix
    Feb 27 at 13:06















Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
– oliverpool
Feb 22 at 17:59




Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!
– oliverpool
Feb 22 at 17:59












@oliverpool you're right, thanks for noticing this! I've updated the code
– felix
Feb 27 at 13:06




@oliverpool you're right, thanks for noticing this! I've updated the code
– felix
Feb 27 at 13:06












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187231%2fgo-program-to-find-duplicated-files-in-a-directory-recursively%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods