Improvement in file management system based on the file names

Context:

I have a program which stores data to disk. The data is then reprocessed during some of the iterations, so the program needs to store, search for, and load sets of data.

Let's consider the following class Signal which defines a multiphasic signal:

class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases

# List of signal objects:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]

Based on the list of signals, a file_name is computed:

def file_name(signals):
    amplitudes = tuple([S.amp for S in signals])
    frequencies = tuple([S.fq for S in signals])
    phases = tuple([S.phases for S in signals])

    return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)

For the example above, it would return:

 "A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"

As you can see, I'm pickling the files (with _pickle). Let's now assume that hundreds of files have been stored in the folder named folder. To check whether a specific combination of signals has already been computed, I'm using:

import itertools

def is_computed(files, signals):
    """
    Check if the signals are already computed
    """
    return any(file_name(elt) in files for elt in itertools.permutations(signals))

I'm using itertools.permutations because any ordering of the same signals counts as the same combination, i.e.:

signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
# IS THE SAME AS:
signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]

Issue:

To get the list of files passed to is_computed(), I'm using: files = os.listdir(folder), which becomes fairly inefficient as the number of files grows.

# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ± 842 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Question:

How could I build a similar system that stays efficient?

Thanks for the help!

edited Jun 6 at 11:28

asked Jun 6 at 8:18

Mathieu

1 Answer

accepted

It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:

def canonical_filename(signals):
    "Return canonical filename for a collection of signals."
    return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
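As a quick check of the idea, here is a minimal self-contained sketch. It repeats the Signal class from the question and assumes the format string is "A{}_F{}_P{}.pkl"; two orderings of the same signals should then map to one name:

```python
class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def file_name(signals):
    """Naming scheme from the question (placeholders in the format string assumed)."""
    amplitudes = tuple(s.amp for s in signals)
    frequencies = tuple(s.fq for s in signals)
    phases = tuple(s.phases for s in signals)
    return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)


def canonical_filename(signals):
    """Return canonical filename for a collection of signals."""
    return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))


a = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
b = [a[1], a[2], a[0]]  # same signals, different order

name_a = canonical_filename(a)
name_b = canonical_filename(b)
print(name_a == name_b)  # True
print(name_a)
```

Sorting costs O(n log n) per lookup, whereas generating every permutation grows factorially with the number of signals.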

Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:

def is_computed(signals):
    "Return True if the file for signals exists, False otherwise."
    return os.path.isfile(canonical_filename(signals))
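Note that this checks the path relative to the current working directory. If the pickles live in a separate folder, as in the question, a variant along the following lines might be used. The folder parameter is my own addition for illustration, and canonical_filename here is a compact stand-in that assumes the question's "A{}_F{}_P{}.pkl" format:

```python
import os
import tempfile


class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def canonical_filename(signals):
    """Order-independent name using the question's scheme (format assumed)."""
    ordered = sorted(signals, key=lambda s: (s.amp, s.fq, s.phases))
    amplitudes = tuple(s.amp for s in ordered)
    frequencies = tuple(s.fq for s in ordered)
    phases = tuple(s.phases for s in ordered)
    return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)


def is_computed(folder, signals):
    """Return True if the file for signals exists in folder (hypothetical parameter)."""
    return os.path.isfile(os.path.join(folder, canonical_filename(signals)))


signals = [Signal(10, 200, [20, 30]), Signal(0.2, 50, [20, 30])]

with tempfile.TemporaryDirectory() as folder:
    before = is_computed(folder, signals)
    # Create an empty placeholder file under the canonical name.
    open(os.path.join(folder, canonical_filename(signals)), "wb").close()
    after = is_computed(folder, signals[::-1])  # reversed order, same file

print(before, after)  # False True
```

Each lookup is a single file-existence check, so its cost no longer depends on reading the whole directory listing.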

I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:

def file_name(signals):
    "Return filename for a list of signals."
    amplitudes = ','.join(str(s.amp) for s in signals)
    frequencies = ','.join(str(s.fq) for s in signals)
    phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
    return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'
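For the signals in the question, this scheme can be tried out as follows. This is only a sketch: the f-string placeholders are assumed to be {amplitudes}, {frequencies} and {phases}, and shlex.quote from the standard library is used merely as a rough indicator of whether a POSIX shell would need the name quoted:

```python
import shlex


class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def file_name(signals):
    "Return filename for a list of signals."
    amplitudes = ','.join(str(s.amp) for s in signals)
    frequencies = ','.join(str(s.fq) for s in signals)
    phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
    return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'


signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
name = file_name(signals)
print(name)  # A0.2,10,20_F50,200,20_P20,30:20,30:20,90.pkl

# Name produced by the original scheme, copied from the question.
old_name = "A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"

# shlex.quote returns its argument unchanged when no quoting is needed.
print(shlex.quote(name) == name)          # True
print(shlex.quote(old_name) == old_name)  # False
```

One caveat: ':' is not permitted in Windows file names, so a different separator between phase groups may be needed there.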

answered Jun 6 at 11:03

Gareth Rees

Well, that's a great way to do it... Thanks. I'll see how well it works!
– Mathieu
Jun 6 at 11:30
