Go program to find duplicated files in a directory (recursively)

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.
It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.
Here is how it works:
- n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named
inputin my code. - 1 goroutine is spawned to recursively search for files in the direvtory, and populate the
inputchannel with file names. - The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].
Finally we just display the duplicates.
Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.
EDIT: Improved my initial code with flags and error management.
package main
import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)
var dir string
var workers int
type Result struct
file string
sha256 [32]byte
func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,
wg.Done()
func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func main()
flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()
fmt.Printf("Searching in %s using %d workers...n", dir, workers)
input := make(chan string)
results := make(chan *Result)
wg := sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go worker(input, results, &wg)
go search(input)
go func()
wg.Wait()
close(results)
()
counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)
beginner file-system go
add a comment |Â
up vote
2
down vote
favorite
This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.
It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.
Here is how it works:
- n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named
inputin my code. - 1 goroutine is spawned to recursively search for files in the direvtory, and populate the
inputchannel with file names. - The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].
Finally we just display the duplicates.
Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.
EDIT: Improved my initial code with flags and error management.
package main
import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)
var dir string
var workers int
type Result struct
file string
sha256 [32]byte
func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,
wg.Done()
func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func main()
flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()
fmt.Printf("Searching in %s using %d workers...n", dir, workers)
input := make(chan string)
results := make(chan *Result)
wg := sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go worker(input, results, &wg)
go search(input)
go func()
wg.Wait()
close(results)
()
counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)
beginner file-system go
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.
It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.
Here is how it works:
- n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named
inputin my code. - 1 goroutine is spawned to recursively search for files in the direvtory, and populate the
inputchannel with file names. - The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].
Finally we just display the duplicates.
Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.
EDIT: Improved my initial code with flags and error management.
package main
import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)
var dir string
var workers int
type Result struct
file string
sha256 [32]byte
func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,
wg.Done()
func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func main()
flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()
fmt.Printf("Searching in %s using %d workers...n", dir, workers)
input := make(chan string)
results := make(chan *Result)
wg := sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go worker(input, results, &wg)
go search(input)
go func()
wg.Wait()
close(results)
()
counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)
beginner file-system go
This is my first Go program. I'm learning the language but it's a bit difficult to understand all the concepts so in order to practice I wrote this. It's a simple program which recursively check for duplicated files in a directory.
It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.
Here is how it works:
- n workers (goroutine) are spawned, each of them waiting for file paths to process on the same channel, named
inputin my code. - 1 goroutine is spawned to recursively search for files in the direvtory, and populate the
inputchannel with file names. - The main goroutine process the results as soon as they are available and add them to a map of sha256->[file, file, ...].
Finally we just display the duplicates.
Please feel to comment on anything, I really want to progress in Go, and especially "idiomatic" Go.
EDIT: Improved my initial code with flags and error management.
package main
import (
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"sync"
"flag"
"runtime"
"io"
)
var dir string
var workers int
type Result struct
file string
sha256 [32]byte
func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup)
for file := range input
var h = sha256.New()
var sum [32]byte
f, err := os.Open(file)
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err = io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
copy(sum[:], h.Sum(nil))
results <- &Result
file: file,
sha256: sum,
wg.Done()
func search(input chan string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func main()
flag.StringVar(&dir, "dir", ".", "directory to search")
flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
flag.Parse()
fmt.Printf("Searching in %s using %d workers...n", dir, workers)
input := make(chan string)
results := make(chan *Result)
wg := sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go worker(input, results, &wg)
go search(input)
go func()
wg.Wait()
close(results)
()
counter := make(map[[32]byte]string)
for result := range results
counter[result.sha256] = append(counter[result.sha256], result.file)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %s: n", len(files), hex.EncodeToString(sha[:]))
for _, f := range files
fmt.Println("-> ", f)
beginner file-system go
edited Feb 27 at 15:31
200_success
123k14143401
123k14143401
asked Feb 10 at 0:52
Thibaut D.
757
757
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24
add a comment |Â
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
5
down vote
accepted
1. declare all 'var' at once
instead of
var dir string
var workers int
you can do
var (
dir string
workers int
)
or even better, use local var instead of global var irectly in your main() function
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:
if workers <= 0
fmt.Printf("workers has to be > 0, was %d", workers)
3. Improve hash computing:
First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:
h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()
Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
4. Always specify the channel direction when you can
From the golang specifications:
A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.
Specify the channel direction helps understand what a method is doing
Here is the new version of the code:
package main
import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)
type Result struct
file string
sha256 string
func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
h.Reset()
wg.Done()
func run(dir string, workers int) (map[string]string, error)
input := make(chan string)
go search(dir, input)
counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)
()
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)
wg.Wait()
close(results)
return counter, nil
func main()
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()
if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)
fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)
counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)
Possible improvements:
Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)
Instead ofioutil.ReadFile, you could follow the example (file) of golang.org and useOpenfollowed byio.Copy. Using this, the whole file does not need to be loaded in memory!
â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
accepted
1. declare all 'var' at once
instead of
var dir string
var workers int
you can do
var (
dir string
workers int
)
or even better, use local var instead of global var irectly in your main() function
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:
if workers <= 0
fmt.Printf("workers has to be > 0, was %d", workers)
3. Improve hash computing:
First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:
h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()
Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
4. Always specify the channel direction when you can
From the golang specifications:
A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.
Specify the channel direction helps understand what a method is doing
Here is the new version of the code:
package main
import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)
type Result struct
file string
sha256 string
func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
h.Reset()
wg.Done()
func run(dir string, workers int) (map[string]string, error)
input := make(chan string)
go search(dir, input)
counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)
()
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)
wg.Wait()
close(results)
return counter, nil
func main()
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()
if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)
fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)
counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)
Possible improvements:
Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)
Instead ofioutil.ReadFile, you could follow the example (file) of golang.org and useOpenfollowed byio.Copy. Using this, the whole file does not need to be loaded in memory!
â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
add a comment |Â
up vote
5
down vote
accepted
1. declare all 'var' at once
instead of
var dir string
var workers int
you can do
var (
dir string
workers int
)
or even better, use local var instead of global var irectly in your main() function
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:
if workers <= 0
fmt.Printf("workers has to be > 0, was %d", workers)
3. Improve hash computing:
First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:
h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()
Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
4. Always specify the channel direction when you can
From the golang specifications:
A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.
Specify the channel direction helps understand what a method is doing
Here is the new version of the code:
package main
import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)
type Result struct
file string
sha256 string
func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
h.Reset()
wg.Done()
func run(dir string, workers int) (map[string]string, error)
input := make(chan string)
go search(dir, input)
counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)
()
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)
wg.Wait()
close(results)
return counter, nil
func main()
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()
if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)
fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)
counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)
Possible improvements:
Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)
Instead ofioutil.ReadFile, you could follow the example (file) of golang.org and useOpenfollowed byio.Copy. Using this, the whole file does not need to be loaded in memory!
â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
add a comment |Â
up vote
5
down vote
accepted
up vote
5
down vote
accepted
1. declare all 'var' at once
instead of
var dir string
var workers int
you can do
var (
dir string
workers int
)
or even better, use local var instead of global var irectly in your main() function
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:
if workers <= 0
fmt.Printf("workers has to be > 0, was %d", workers)
3. Improve hash computing:
First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:
h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()
Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
4. Always specify the channel direction when you can
From the golang specifications:
A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.
Specify the channel direction helps understand what a method is doing
Here is the new version of the code:
package main
import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)
type Result struct
file string
sha256 string
func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
h.Reset()
wg.Done()
func run(dir string, workers int) (map[string]string, error)
input := make(chan string)
go search(dir, input)
counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)
()
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)
wg.Wait()
close(results)
return counter, nil
func main()
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()
if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)
fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)
counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)
Possible improvements:
Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)
1. declare all 'var' at once
instead of
var dir string
var workers int
you can do
var (
dir string
workers int
)
or even better, use local var instead of global var irectly in your main() function
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
if worker is <= 0, the program will panic. A little check after flag.Parse() could prevent this:
if workers <= 0
fmt.Printf("workers has to be > 0, was %d", workers)
3. Improve hash computing:
First, each worker only need a single instance of hash.Hash, as you can call Reset() on it after each file:
h := sha256.New()
for file := range input
...
results <- &Result...
h.Reset()
Also, the hash of each file could be stored as a string instead of a [32]byte to avoid some operations:
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
4. Always specify the channel direction when you can
From the golang specifications:
A channel provides a mechanism for concurrently executing functions
to communicate by sending and receiving values of a specified element
type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or
receive. If no direction is given, the channel is bidirectional. A
channel may be constrained only to send or only to receive by
conversion or assignment.
Specify the channel direction helps understand what a method is doing
Here is the new version of the code:
package main
import (
"crypto/sha256"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"sync"
)
type Result struct
file string
sha256 string
func search(dir string, input chan<- string)
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error
if err != nil
fmt.Fprintln(os.Stderr, err)
else if info.Mode().IsRegular()
input <- path
return nil
)
close(input)
func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup)
h := sha256.New()
for file := range input
f, err := os.Open("file.txt")
if err != nil
fmt.Fprintln(os.Stderr, err)
continue
if _, err := io.Copy(h, f); err != nil
fmt.Fprintln(os.Stderr, err)
f.Close()
continue
f.Close()
results <- &Result
file: file,
sha256: fmt.Sprintf("%x", h.Sum(nil)),
h.Reset()
wg.Done()
func run(dir string, workers int) (map[string]string, error)
input := make(chan string)
go search(dir, input)
counter := make(map[string]string)
results := make(chan *Result)
go func()
for r := range results
counter[r.sha256] = append(counter[r.sha256], r.file)
()
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++
go startWorker(input, results, &wg)
wg.Wait()
close(results)
return counter, nil
func main()
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
flag.Parse()
if *workers <= 0
fmt.Printf("workers has to be > 0, was %d n", workers)
os.Exit(1)
fmt.Printf("Searching in %s using %d workers...n", *dir, *workers)
counter, err := run(*dir, *workers)
if err != nil
fmt.Printf("failed! %vn", err)
os.Exit(1)
for sha, files := range counter
if len(files) > 1
fmt.Printf("Found %d duplicates for %v: n", len(files), sha)
for _, f := range files
fmt.Println("-> ", f)
Possible improvements:
Currently, if an error is thrown somewhere in the code, the program does not stop but just write the error to os.Stderr. It may be better to return this error and then call os.Exit(1)
edited Feb 27 at 13:05
answered Feb 14 at 12:10
felix
67829
67829
Instead ofioutil.ReadFile, you could follow the example (file) of golang.org and useOpenfollowed byio.Copy. Using this, the whole file does not need to be loaded in memory!
â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
add a comment |Â
Instead ofioutil.ReadFile, you could follow the example (file) of golang.org and useOpenfollowed byio.Copy. Using this, the whole file does not need to be loaded in memory!
â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
Instead of
ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!â oliverpool
Feb 22 at 17:59
Instead of
ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory!â oliverpool
Feb 22 at 17:59
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
@oliverpool you're right, thanks for noticing this! I've updated the code
â felix
Feb 27 at 13:06
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f187231%2fgo-program-to-find-duplicated-files-in-a-directory-recursively%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately.
â Randy Howard
Feb 10 at 3:39
Thanks, I did it and added flags to set directory and number of workers :-)
â Thibaut D.
Feb 10 at 9:24