Improvement in file management system based on the file names

Context:



I have a program that stores data on disk. The data is then reprocessed during some of the iterations, so the program needs to store, search for, and load sets of data.



Let's consider the following class Signal which defines a multiphasic signal:



    class Signal:
        def __init__(self, amp, fq, phases):
            self.amp = amp
            self.fq = fq
            self.phases = phases

    # List of signal objects:
    signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]


Based on the list of signals, a file_name is computed:



    def file_name(signals):
        amplitudes = tuple([S.amp for S in signals])
        frequencies = tuple([S.fq for S in signals])
        phases = tuple([S.phases for S in signals])

        return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)


For the example above, it would return:



 "A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"


As you can see, I'm pickling the files (with _pickle). Let's now assume that hundreds of files have been stored in a folder named folder. To check whether a specific combination of signals has already been computed, I'm using:



    import itertools

    def is_computed(files, signals):
        """
        Check if the signals are already computed
        """
        return any(file_name(elt) in files for elt in itertools.permutations(signals))


I'm using itertools.permutations because the order of the signals does not matter, i.e.:



    signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
    # IS THE SAME AS:
    signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]


Issue:



To get the list of files passed to is_computed(), I'm using files = os.listdir(folder), which becomes fairly inefficient as the number of files grows.



# Folder of 26K files with the size from 1 kB to hundreds of MBs
In: %timeit os.listdir(folder)
Out: 3.75 s ± 842 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Question:



How could I build a similar system that stays efficient?



Thanks for the help!







asked Jun 6 at 8:18 (edited Jun 6 at 11:28)
Mathieu




















1 Answer (accepted)

It would be better to design the system so that each collection of signals has a canonical filename, regardless of the order of the signals in the collection. This is most easily done by sorting the signals in the collection:

    def canonical_filename(signals):
        "Return canonical filename for a collection of signals."
        return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))
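
As a quick check, here is a minimal, self-contained sketch of this idea. It reuses the Signal class from the question and assumes the "A{}_F{}_P{}.pkl" template implied by the example output shown there. Two different orderings of the same signals should yield the same canonical filename:

```python
class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def file_name(signals):
    # Template assumed from the example output in the question.
    amplitudes = tuple(s.amp for s in signals)
    frequencies = tuple(s.fq for s in signals)
    phases = tuple(s.phases for s in signals)
    return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)


def canonical_filename(signals):
    "Return canonical filename for a collection of signals."
    return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))


first = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
second = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]

name_first = canonical_filename(first)
name_second = canonical_filename(second)
print(name_first == name_second)  # True
```

The sort key is a tuple, so ties on amplitude are broken by frequency and then by the list of phases, which Python compares element by element.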


Since there is now only one filename for each collection of signals, there is no need to list the directory or generate the permutations:

    import os

    def is_computed(signals):
        "Return True if the file for signals exists, False otherwise."
        return os.path.isfile(canonical_filename(signals))
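
Note that os.path.isfile resolves a bare filename against the current working directory. If the pickles live in a separate folder, as in the question, the path presumably needs to include it. The sketch below illustrates this with a hypothetical two-argument variant of is_computed (the signature and the "example.pkl" name are assumptions for illustration, standing in for canonical_filename(signals)):

```python
import os
import tempfile


def is_computed(folder, filename):
    "Return True if filename exists as a regular file inside folder."
    return os.path.isfile(os.path.join(folder, filename))


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    name = "example.pkl"  # hypothetical stand-in for canonical_filename(signals)
    before = is_computed(folder, name)
    # Create an empty file so that the second check succeeds.
    with open(os.path.join(folder, name), "wb") as f:
        f.write(b"")
    after = is_computed(folder, name)

print(before, after)  # False True
```

Each check is a single lookup for one path, so its cost should not depend on listing all the files in the folder.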


I recommend designing the filename so that it does not contain shell meta-characters like spaces, parentheses, and brackets. This is a convenience: it means we don't need to quote the filenames when manipulating them via the shell. For example:

    def file_name(signals):
        "Return filename for a list of signals."
        amplitudes = ','.join(str(s.amp) for s in signals)
        frequencies = ','.join(str(s.fq) for s in signals)
        phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
        return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'
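
To see what such a filename looks like, here is a small self-contained sketch that applies this function to the three signals from the question (the Signal class is copied from there):

```python
class Signal:
    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def file_name(signals):
    "Return filename for a list of signals."
    amplitudes = ','.join(str(s.amp) for s in signals)
    frequencies = ','.join(str(s.fq) for s in signals)
    # Phases of one signal are comma-separated; signals are colon-separated.
    phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
    return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'


signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
name = file_name(signals)
print(name)  # A0.2,10,20_F50,200,20_P20,30:20,30:20,90.pkl
```

One caveat: the colon is not allowed in filenames on Windows, so a different separator would be needed if the files have to be stored there.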





answered Jun 6 at 11:03
Gareth Rees











          • Well, that's a great way to do it... Thanks. I'll see how well it works!
            – Mathieu
            Jun 6 at 11:30















