Improvement in file management system based on the file names
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
Context:
I have a program which stores data to the disk. The data is then reprocessed during some of the iterations. Thus, it needs to store, search and load set of data.
Let's consider the following class Signal
which defines a multiphasic signal:
class Signal:
def __init__(self, amp, fq, phases):
self.amp = amp
self.fq = fq
self.phases = phases
# List of signal objects:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
Based on the list of signals, a file_name is computed:
def file_name(signals):
amplitudes = tuple([S.amp for S in signals])
frequencies = tuple([S.fq for S in signals])
phases = tuple([S.phases for S in signals])
return "A_F_P.pkl".format(amplitudes, frequencies, phases)
For the example above, it would return:
"A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"
As you can see, I'm pickling the files (with _pickle
). Let's now believe that hundreds of files have been stored to the folder: folder
. To check if a specific combination of signals has been computed I'm using:
import itertools
def is_computed(files, signals):
"""
Check if the signals are already computed
"""
return any(file_name(elt) in files for elt in itertools.permutations(signals))
I'm using itertools
since the permutations are relevant, i.e.:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
# IS THE SAME AS:
signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]
Issue:
To get the list of files past to is_computed()
, I'm using: files = os.listdir(folder)
which becomes fairly inefficient as the number of files grows up.
# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ñ 842 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
Question:
How could I make a similar system but efficient?
Thanks for the help!
python performance strings database
add a comment |Â
up vote
2
down vote
favorite
Context:
I have a program which stores data to the disk. The data is then reprocessed during some of the iterations. Thus, it needs to store, search and load set of data.
Let's consider the following class Signal
which defines a multiphasic signal:
class Signal:
def __init__(self, amp, fq, phases):
self.amp = amp
self.fq = fq
self.phases = phases
# List of signal objects:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
Based on the list of signals, a file_name is computed:
def file_name(signals):
amplitudes = tuple([S.amp for S in signals])
frequencies = tuple([S.fq for S in signals])
phases = tuple([S.phases for S in signals])
return "A_F_P.pkl".format(amplitudes, frequencies, phases)
For the example above, it would return:
"A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"
As you can see, I'm pickling the files (with _pickle
). Let's now believe that hundreds of files have been stored to the folder: folder
. To check if a specific combination of signals has been computed I'm using:
import itertools
def is_computed(files, signals):
"""
Check if the signals are already computed
"""
return any(file_name(elt) in files for elt in itertools.permutations(signals))
I'm using itertools
since the permutations are relevant, i.e.:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
# IS THE SAME AS:
signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]
Issue:
To get the list of files past to is_computed()
, I'm using: files = os.listdir(folder)
which becomes fairly inefficient as the number of files grows up.
# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ñ 842 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
Question:
How could I make a similar system but efficient?
Thanks for the help!
python performance strings database
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
Context:
I have a program which stores data to the disk. The data is then reprocessed during some of the iterations. Thus, it needs to store, search and load set of data.
Let's consider the following class Signal
which defines a multiphasic signal:
class Signal:
def __init__(self, amp, fq, phases):
self.amp = amp
self.fq = fq
self.phases = phases
# List of signal objects:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
Based on the list of signals, a file_name is computed:
def file_name(signals):
amplitudes = tuple([S.amp for S in signals])
frequencies = tuple([S.fq for S in signals])
phases = tuple([S.phases for S in signals])
return "A_F_P.pkl".format(amplitudes, frequencies, phases)
For the example above, it would return:
"A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"
As you can see, I'm pickling the files (with _pickle
). Let's now believe that hundreds of files have been stored to the folder: folder
. To check if a specific combination of signals has been computed I'm using:
import itertools
def is_computed(files, signals):
"""
Check if the signals are already computed
"""
return any(file_name(elt) in files for elt in itertools.permutations(signals))
I'm using itertools
since the permutations are relevant, i.e.:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
# IS THE SAME AS:
signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]
Issue:
To get the list of files past to is_computed()
, I'm using: files = os.listdir(folder)
which becomes fairly inefficient as the number of files grows up.
# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ñ 842 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
Question:
How could I make a similar system but efficient?
Thanks for the help!
python performance strings database
Context:
I have a program which stores data to the disk. The data is then reprocessed during some of the iterations. Thus, it needs to store, search and load set of data.
Let's consider the following class Signal
which defines a multiphasic signal:
class Signal:
def __init__(self, amp, fq, phases):
self.amp = amp
self.fq = fq
self.phases = phases
# List of signal objects:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
Based on the list of signals, a file_name is computed:
def file_name(signals):
amplitudes = tuple([S.amp for S in signals])
frequencies = tuple([S.fq for S in signals])
phases = tuple([S.phases for S in signals])
return "A_F_P.pkl".format(amplitudes, frequencies, phases)
For the example above, it would return:
"A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"
As you can see, I'm pickling the files (with _pickle
). Let's now believe that hundreds of files have been stored to the folder: folder
. To check if a specific combination of signals has been computed I'm using:
import itertools
def is_computed(files, signals):
"""
Check if the signals are already computed
"""
return any(file_name(elt) in files for elt in itertools.permutations(signals))
I'm using itertools
since the permutations are relevant, i.e.:
signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
# IS THE SAME AS:
signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]
Issue:
To get the list of files past to is_computed()
, I'm using: files = os.listdir(folder)
which becomes fairly inefficient as the number of files grows up.
# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ñ 842 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
Question:
How could I make a similar system but efficient?
Thanks for the help!
python performance strings database
edited Jun 6 at 11:28
asked Jun 6 at 8:18
Mathieu
1357
1357
add a comment |Â
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
7
down vote
accepted
It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:
def canonical_filename(signals):
"Return canonical filename for a collection of signals."
return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:
def is_computed(signals):
"Return True if the file for signals exists, False otherwise."
return os.path.isfile(canonical_filename(signals))
I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:
def file_name(signals):
"Return filename for a list of signals."
amplitudes = ','.join(str(s.amp) for s in signals)
frequencies = ','.join(str(s.fq) for s in signals)
phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
return f'Aamplitudes_Ffrequencies_Pphases.pkl'
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
7
down vote
accepted
It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:
def canonical_filename(signals):
"Return canonical filename for a collection of signals."
return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:
def is_computed(signals):
"Return True if the file for signals exists, False otherwise."
return os.path.isfile(canonical_filename(signals))
I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:
def file_name(signals):
"Return filename for a list of signals."
amplitudes = ','.join(str(s.amp) for s in signals)
frequencies = ','.join(str(s.fq) for s in signals)
phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
return f'Aamplitudes_Ffrequencies_Pphases.pkl'
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
add a comment |Â
up vote
7
down vote
accepted
It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:
def canonical_filename(signals):
"Return canonical filename for a collection of signals."
return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:
def is_computed(signals):
"Return True if the file for signals exists, False otherwise."
return os.path.isfile(canonical_filename(signals))
I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:
def file_name(signals):
"Return filename for a list of signals."
amplitudes = ','.join(str(s.amp) for s in signals)
frequencies = ','.join(str(s.fq) for s in signals)
phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
return f'Aamplitudes_Ffrequencies_Pphases.pkl'
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
add a comment |Â
up vote
7
down vote
accepted
up vote
7
down vote
accepted
It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:
def canonical_filename(signals):
"Return canonical filename for a collection of signals."
return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:
def is_computed(signals):
"Return True if the file for signals exists, False otherwise."
return os.path.isfile(canonical_filename(signals))
I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:
def file_name(signals):
"Return filename for a list of signals."
amplitudes = ','.join(str(s.amp) for s in signals)
frequencies = ','.join(str(s.fq) for s in signals)
phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
return f'Aamplitudes_Ffrequencies_Pphases.pkl'
It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:
def canonical_filename(signals):
"Return canonical filename for a collection of signals."
return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:
def is_computed(signals):
"Return True if the file for signals exists, False otherwise."
return os.path.isfile(canonical_filename(signals))
I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience that means we don't need to quote the filenames when manipulating them via the shell. For example:
def file_name(signals):
"Return filename for a list of signals."
amplitudes = ','.join(str(s.amp) for s in signals)
frequencies = ','.join(str(s.fq) for s in signals)
phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
return f'Aamplitudes_Ffrequencies_Pphases.pkl'
answered Jun 6 at 11:03
Gareth Rees
41.1k394166
41.1k394166
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
add a comment |Â
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
Well, that's a great way to do it... Thanks. I'll see how well it works!
â Mathieu
Jun 6 at 11:30
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195938%2fimprovement-in-file-management-system-based-on-the-file-names%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password