Splitting URLs into hierarchy-level directories

The goal is to split a URL like http://q.com/a/b/c into:



['http://q.com/a', 'http://q.com/a/b', 'http://q.com/a/b/c']


My code:



import urlparse


def get_domain_with_protocol(url):
    url_parts = urlparse.urlparse(url)
    return "{scheme}://{netloc}/".format(
        scheme=url_parts.scheme,
        netloc=url_parts.netloc
    )


def get_url_directories_combinations(url):
    url = url.rstrip('/')
    path = urlparse.urlparse(url).path[1:]
    parts = path.split('/')
    domain_with_protocol = get_domain_with_protocol(url)

    url_combinations = [domain_with_protocol + '/'.join(parts[:index + 1])
                        for index in range(len(parts))]
    return url_combinations


print get_url_directories_combinations('http://example.com/a/b/c/')


I think this code is ugly and a more Pythonic approach might be possible. Libraries like hyperlink and posixpath can be used for path manipulation.



How would you improve this code? I'm open to using well-tested, popular libraries if that means less code and more stability.







asked Jun 7 at 10:13 by UnderpoweredNinja


1 Answer






          All in all, this is not bad. A few remarks:



          Python 2



          Why choose Python 2? Python 3 has a lot of advantages, and will get more support in the future. Even if you need to code for Python 2, you can make your code compatible with both versions:



from __future__ import print_function
try:
    import urlparse                    # Python 2
except ImportError:                    # ImportError also covers ModuleNotFoundError
    import urllib.parse as urlparse    # Python 3


          docstring



          Adding a docstring can better explain what the different methods do.
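For example, a one-line docstring on each function already helps; the wording below is only an illustration:

def get_domain_with_protocol(url):
    """Return the scheme and netloc of url with a trailing slash, e.g. 'http://q.com/'."""
    url_parts = urlparse.urlparse(url)
    return "{scheme}://{netloc}/".format(
        scheme=url_parts.scheme,
        netloc=url_parts.netloc
    )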



          range



You don't need the range as a list, so you might as well use xrange.



          To make the code Python 3 compatible, you'd have to add xrange = range to the except clause on the import, so this might not be worth it in this case, but in general, use the iterable version as much as possible.
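A minimal sketch of that shim, folded into the import fallback above (only an illustration; the alias just makes the name xrange available on Python 3):

try:
    import urlparse                    # Python 2, where xrange is a builtin
except ImportError:
    import urllib.parse as urlparse    # Python 3
    xrange = range                     # Python 3: range is already lazy, so alias it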



          '/'



          You do a lot of operations with '/'.



          1. you remove a trailing / if it exists

          2. you remove the starting / from urlparse.urlparse(url).path with [1:]

          3. you add a trailing / in get_domain_with_protocol to the first part of the url

You can combine 1 and 2 by doing path.strip('/'). Or you can drop both 2 and 3, and iterate over range(1, len(parts)).
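A rough sketch of the first option, keeping the original function names and assuming the same imports as above:

def get_url_directories_combinations(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')    # handles both the leading and the trailing '/'
    domain_with_protocol = get_domain_with_protocol(url)
    return [domain_with_protocol + '/'.join(parts[:index + 1])
            for index in range(len(parts))]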



          generator



          Instead of returning a list, you can also make a generator:



for index in range(len(parts)):
    yield domain_with_protocol + '/'.join(parts[:index + 1])
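If the whole function is turned into a generator, callers materialise the results only when they need a list; a small usage sketch:

urls = list(get_url_directories_combinations('http://example.com/a/b/c/'))
print(urls)  # ['http://example.com/a', 'http://example.com/a/b', 'http://example.com/a/b/c']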


          iteration 1



In general I try not to do things like range(len(parts)), but to use enumerate instead. Here you could write for index, _ in enumerate(parts).
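Applied to the list comprehension above, that would look roughly like this (the underscore signals that the element itself is unused):

url_combinations = [domain_with_protocol + '/'.join(parts[:index + 1])
                    for index, _ in enumerate(parts)]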



          iteration 2



I try to avoid iterating over the index, and try to use generators as intermediate products instead of lists. If parts were an arbitrary iterable instead of a list, your approach would not work.
In Python 3, you could use itertools.accumulate, but in Python 2, you'd have to write your own accumulator:



def accumulate_parts(parts, sep='/'):
    parts_iter = iter(parts)
    substring = next(parts_iter)
    yield substring
    for part in parts_iter:
        substring += sep + part
        yield substring


def get_url_directories_accumulate(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    for substring in accumulate_parts(parts):
        yield domain_with_protocol + substring
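For comparison, a sketch of what the Python 3 version with itertools.accumulate could look like; the function name get_url_directories_accumulate_py3 is made up for this example:

from itertools import accumulate


def get_url_directories_accumulate_py3(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    # accumulate yields 'a', then 'a/b', then 'a/b/c', ...
    for substring in accumulate(parts, lambda acc, part: acc + '/' + part):
        yield domain_with_protocol + substring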


          timings



I've timed these variations in both Python 2 and Python 3, and all of them are within a few percent of each other, so pick the one that suits you best and that you'll still understand in a few months or years.
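A minimal sketch of how such a comparison could be run with the standard timeit module (the URL and the repeat count are arbitrary choices, and it assumes the print_function import from above):

import timeit

url = 'http://example.com/a/b/c/'
for func in (get_url_directories_combinations, get_url_directories_accumulate):
    # wrap in list() so generator variants are fully consumed during timing
    seconds = timeit.timeit(lambda: list(func(url)), number=100000)
    print(func.__name__, seconds)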



          code



          Full code and timings can be found here.






answered Jun 8 at 9:23 by Maarten Fabré, edited Jul 8 at 12:48 by Daniel






















                     
