Splitting URLs into hierarchy-level directories
The goal is to split a URL like http://q.com/a/b/c
to:
['http://q.com/a', 'http://q.com/a/b', 'http://q.com/a/b/c']
My code:

import urlparse


def get_domain_with_protocol(url):
    url_parts = urlparse.urlparse(url)
    return "{scheme}://{netloc}/".format(
        scheme=url_parts.scheme,
        netloc=url_parts.netloc
    )


def get_url_directories_combinations(url):
    url = url.rstrip('/')
    path = urlparse.urlparse(url).path[1:]
    parts = path.split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    url_combinations = [domain_with_protocol + '/'.join(parts[:index + 1])
                        for index in range(len(parts))]
    return url_combinations


print get_url_directories_combinations('http://example.com/a/b/c/')
I think this code is ugly, and a more Pythonic approach might be possible. There are libraries like hyperlink and posixpath that can be used for path manipulation.
How would you improve this code? I'm open to using well-tested, popular libraries if that means less code and more stability.
python strings python-2.7 combinatorics url
asked Jun 7 at 10:13
UnderpoweredNinja
412
1 Answer
All in all, this is not bad. A few remarks:
Python 2
Why choose Python 2? Python 3 has a lot of advantages, and will get more support in the future. Even if you need to code for Python 2, you can make your code compatible with both versions:
from __future__ import print_function
try:
    import urlparse
except ModuleNotFoundError:
    import urllib.parse as urlparse
docstring
Adding a docstring can better explain what the different methods do.
range
You don't need the range as a list, so you might as well use xrange.
To make the code Python 3 compatible, you'd have to add xrange = range to the except clause of the import, so this might not be worth it in this case, but in general, use the iterable version as much as possible.
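Combined, the compatibility shim might be sketched like this (note that ImportError also catches the failure on Python 3 versions before 3.6, where ModuleNotFoundError does not exist yet):

```python
from __future__ import print_function

try:
    import urlparse                      # Python 2
except ImportError:                      # broader than ModuleNotFoundError,
    import urllib.parse as urlparse      # so it also works on Python < 3.6
    xrange = range                       # lazy range on both versions

# xrange is now lazy on Python 2 and 3 alike
print(list(xrange(3)))  # [0, 1, 2]
```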
'/'
You do a lot of operations with '/':

1. you remove a trailing '/' if it exists
2. you remove the starting '/' from urlparse.urlparse(url).path with [1:]
3. you add a trailing '/' in get_domain_with_protocol to the first part of the URL

You can combine 1 and 2 by doing path.strip('/'). Or you can drop both 2 and 3, and iterate over range(1, len(parts)).
generator
Instead of returning a list, you can also make a generator:

    for index in range(len(parts)):
        yield domain_with_protocol + '/'.join(parts[:index + 1])
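Put into a complete function, the generator version might look like this (a sketch; iter_url_directories is a hypothetical name, and it uses the path.strip('/') simplification):

```python
try:
    import urlparse                    # Python 2
except ImportError:
    import urllib.parse as urlparse    # Python 3


def iter_url_directories(url):
    """Yield each hierarchy level of the URL's path, shortest first."""
    parsed = urlparse.urlparse(url)
    base = '{0}://{1}/'.format(parsed.scheme, parsed.netloc)
    parts = parsed.path.strip('/').split('/')
    for index in range(len(parts)):
        yield base + '/'.join(parts[:index + 1])


print(list(iter_url_directories('http://example.com/a/b/c/')))
```

Callers that need a list can simply wrap it in list(), while callers that only iterate pay no cost for building one.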
iteration 1
In general I try not to do things like range(len(parts)), but use enumerate. Here you could do for index, _ in enumerate(parts).
iteration 2
I try to avoid iterating over the index, and try to use generators as intermediate products instead of lists. Imagine parts were an iterable instead of a list: your approach would not work.
In Python 3, you could use itertools.accumulate, but in Python 2, you'd have to write your own accumulator:
def accumulate_parts(parts, sep='/'):
    parts_iter = iter(parts)
    substring = next(parts_iter)
    yield substring
    for part in parts_iter:
        substring += sep + part
        yield substring


def get_url_directories_accumulate(url):
    path = urlparse.urlparse(url).path
    parts = path.strip('/').split('/')
    domain_with_protocol = get_domain_with_protocol(url)
    for substring in accumulate_parts(parts):
        yield domain_with_protocol + substring
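For comparison, the Python 3-only variant using itertools.accumulate mentioned above might be sketched as follows (get_url_directories is a hypothetical name):

```python
from itertools import accumulate
from urllib.parse import urlparse


def get_url_directories(url):
    parsed = urlparse(url)
    base = '{0}://{1}/'.format(parsed.scheme, parsed.netloc)
    parts = parsed.path.strip('/').split('/')
    # accumulate builds 'a', 'a/b', 'a/b/c' from ['a', 'b', 'c']
    return [base + sub for sub in accumulate(parts, lambda a, b: a + '/' + b)]


print(get_url_directories('http://q.com/a/b/c'))
```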
timings
I've timed these variations in both Python 2 and Python 3, and all of them are within a few percent of each other, so pick whichever suits you best and will still be readable to you in a few months or years.
code
Full code and timings can be found here.
edited Jul 8 at 12:48
Daniel
4,1132836
answered Jun 8 at 9:23
Maarten Fabré
3,204214