Splitting URLs into hierarchy-level directories

The goal is to split a URL like http://q.com/a/b/c into:



['http://q.com/a', 'http://q.com/a/b', 'http://q.com/a/b/c']


My code:



import urlparse


def get_domain_with_protocol(url):
    url_parts = urlparse.urlparse(url)
    return "{scheme}://{netloc}/".format(
        scheme=url_parts.scheme,
        netloc=url_parts.netloc
    )


def get_url_directories_combinations(url):
    url = url.rstrip('/')
    path = urlparse.urlparse(url).path[1:]
    parts = path.split('/')
    domain_with_protocol = get_domain_with_protocol(url)

    url_combinations = [domain_with_protocol + '/'.join(parts[:index + 1])
                        for index in range(len(parts))]
    return url_combinations


print get_url_directories_combinations('http://example.com/a/b/c/')


I think this code is ugly and a more Pythonic approach might be possible. Libraries like hyperlink and posixpath can be used for path manipulation.



How would you improve this code? I'm open to using well-tested, popular libraries if that means less code and more stability.







asked Jun 7 at 10:13 by UnderpoweredNinja


1 Answer






          All in all, this is not bad. A few remarks:



          Python 2



          Why choose Python 2? Python 3 has a lot of advantages, and will get more support in the future. Even if you need to code for Python 2, you can make your code compatible with both versions:



from __future__ import print_function
try:
    import urlparse                    # Python 2
except ImportError:                    # ImportError also covers ModuleNotFoundError
    import urllib.parse as urlparse    # Python 3


          docstring



          Adding a docstring can better explain what the different methods do.
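For example, a one-line docstring on each function already helps; the wording below is only an illustration:

def get_domain_with_protocol(url):
    """Return the scheme and netloc of url with a trailing slash, e.g. 'http://q.com/'."""
    url_parts = urlparse.urlparse(url)
    return "{scheme}://{netloc}/".format(
        scheme=url_parts.scheme,
        netloc=url_parts.netloc
    )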



          range



You don't need the range as a list, so you might as well use xrange.



          To make the code Python 3 compatible, you'd have to add xrange = range to the except clause on the import, so this might not be worth it in this case, but in general, use the iterable version as much as possible.
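A minimal sketch of that shim, folded into the import fallback above (only an illustration; the alias just makes the name xrange available on Python 3):

try:
    import urlparse                    # Python 2, where xrange is a builtin
except ImportError:
    import urllib.parse as urlparse    # Python 3
    xrange = range                     # Python 3: range is already lazy, so alias it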



          '/'



          You do a lot of operations with '/'.



          1. you remove a trailing / if it exists

          2. you remove the starting / from urlparse.urlparse(url).path with [1:]

          3. you add a trailing / in get_domain_with_protocol to the first part of the url

You can combine 1 and 2 by doing path.strip('/'). Or you can drop both 2 and 3, and iterate over range(1, len(parts)).
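A rough sketch of the first option, keeping the original function names and assuming the same imports as above:

def get_url_directories_combinations(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')    # handles both the leading and the trailing '/'
    domain_with_protocol = get_domain_with_protocol(url)
    return [domain_with_protocol + '/'.join(parts[:index + 1])
            for index in range(len(parts))]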



          generator



          Instead of returning a list, you can also make a generator:



for index in range(len(parts)):
    yield domain_with_protocol + '/'.join(parts[:index + 1])
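If the whole function is turned into a generator, callers materialise the results only when they need a list; a small usage sketch:

urls = list(get_url_directories_combinations('http://example.com/a/b/c/'))
print(urls)  # ['http://example.com/a', 'http://example.com/a/b', 'http://example.com/a/b/c']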


          iteration 1



In general I try not to do things like range(len(parts)), but to use enumerate instead. Here you could write for index, _ in enumerate(parts).
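Applied to the list comprehension above, that would look roughly like this (the underscore signals that the element itself is unused):

url_combinations = [domain_with_protocol + '/'.join(parts[:index + 1])
                    for index, _ in enumerate(parts)]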



          iteration 2



I try to avoid iterating over the index, and try to use generators as intermediate products instead of lists. If parts were an arbitrary iterable instead of a list, your approach would not work.
In Python 3, you could use itertools.accumulate, but in Python 2, you'd have to write your own accumulator:



def accumulate_parts(parts, sep='/'):
    parts_iter = iter(parts)
    substring = next(parts_iter)
    yield substring
    for part in parts_iter:
        substring += sep + part
        yield substring


def get_url_directories_accumulate(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    for substring in accumulate_parts(parts):
        yield domain_with_protocol + substring
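For comparison, a sketch of what the Python 3 version with itertools.accumulate could look like; the function name get_url_directories_accumulate_py3 is made up for this example:

from itertools import accumulate


def get_url_directories_accumulate_py3(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    # accumulate yields 'a', then 'a/b', then 'a/b/c', ...
    for substring in accumulate(parts, lambda acc, part: acc + '/' + part):
        yield domain_with_protocol + substring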


          timings



I've timed these variations in both Python 2 and Python 3, and all of them are within a few percent of each other, so pick the one that suits you best and that you'll still understand in a few months or years.
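A minimal sketch of how such a comparison could be run with the standard timeit module (the URL and the repeat count are arbitrary choices, and it assumes the print_function import from above):

import timeit

url = 'http://example.com/a/b/c/'
for func in (get_url_directories_combinations, get_url_directories_accumulate):
    # wrap in list() so generator variants are fully consumed during timing
    seconds = timeit.timeit(lambda: list(func(url)), number=100000)
    print(func.__name__, seconds)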



          code



          Full code and timings can be found here.






answered Jun 8 at 9:23 by Maarten Fabré, edited Jul 8 at 12:48 by Daniel






















                     
