Search for subsets within a 250k line text dataset
The code below works pretty well, but it takes about 4 seconds (which seems too slow to me) to find a group located near the tail of the text file I'm searching. The text file has ≈ 250,000 lines with 20 elements per line. I cut my teeth on other programming languages and picked up Python just for my current project, so I really am a neophyte when it comes to Python efficiency.
import csv
import itertools

with open(file) as infile:
    datadictionary = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    # Group consecutive rows by whether they match the target patient,
    # keep the first matching group, and stop reading the file.
    for key, group in itertools.groupby(datadictionary, key=lambda x: x[patient_number_field_header] == patient_id):
        if key:
            super_list = group
            break
- 'patient_id' is a string of digits
- file is a text file
I'm wondering what you think - how can I make this more efficient? Am I "doing it wrong"?
python performance csv
edited Apr 27 at 19:10 by 200_success
asked Apr 27 at 16:52 by SynchronizeYourDogma
The code is probably I/O bound, so it is difficult to speed up. I would not have used DictReader but a standard reader and the index of the column, to speed up the grouping by providing a faster key. But that's not going to save much.
– Jean-François Fabre
Apr 27 at 20:16
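A rough sketch of what that suggestion might look like, assuming the same variables as the question (file, patient_number_field_header, patient_id); the column position is looked up from the header row rather than hard-coded:

import csv
import itertools

with open(file) as infile:
    reader = csv.reader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    header = next(reader)  # DictReader consumed this row implicitly
    col = header.index(patient_number_field_header)
    # Plain lists and an integer index make the grouping key cheaper
    # than building a dict for every row.
    for key, group in itertools.groupby(reader, key=lambda row: row[col] == patient_id):
        if key:
            super_list = list(group)
            break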
I/O bound seems very plausible to me. That makes it a case of thinking about the environment that the program is running in. For example, moving the file to a RAM disk would both help confirm the suspicion, and possibly help solve it. (Although moving it to the RAM disk is of course not instantaneous)
– Josiah
Apr 27 at 21:50
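A cheap way to test the I/O-bound theory before moving anything to a RAM disk is to time a raw pass over the file and compare it with the full parsing loop; if the raw pass already accounts for most of the 4 seconds, the bottleneck is the disk (or a cold file cache), not the Python code. A sketch, reusing the question's file variable:

import time

start = time.perf_counter()
with open(file) as infile:
    for _ in infile:  # read every line, no parsing at all
        pass
print("raw read:", time.perf_counter() - start, "seconds")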
Might look at Pandas – it is heavily optimized. You can read from CSV or Excel directly, then manipulate your data in declarative ways. You will likely be hard pressed to beat anything it is already doing. pandas.pydata.org
– SteveJ
May 10 at 3:47
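A hedged sketch of the pandas route, assuming the file is tab-separated and keeping the patient-number column as text so it compares cleanly with patient_id (a string of digits):

import csv
import pandas as pd

df = pd.read_csv(file, sep='\t', quoting=csv.QUOTE_NONE,
                 dtype={patient_number_field_header: str})
# Filter rather than group: keep every row belonging to the patient.
patient_rows = df[df[patient_number_field_header] == patient_id]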
1 Answer
I'm assuming your code does some background processing, because otherwise the patient data would be stored in a database, not in a text file. In that scenario, 4 seconds is probably ok.
Instead of grouping the records, you could simply filter them. That saves two lines of code, but will probably not be much faster.
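For illustration, a minimal sketch of that filtering approach using the question's DictReader setup (not code from the answer itself):

import csv

with open(file) as infile:
    datadictionary = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    # Collect every row for the patient, wherever it appears in the file.
    super_list = [row for row in datadictionary
                  if row[patient_number_field_header] == patient_id]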
answered Apr 28 at 6:28 by Roland Illig
I'm transferring all the data from one medical system that's being taken offline to a new one. No real processing is needed, just generation of a text file with a header. It's ten years' worth of data, and I'm just ballparking how long it will take. It has to be grouped because the files are slightly unordered. It looks like I'll probably be using this method. Thanks for your input.
– SynchronizeYourDogma
Apr 30 at 21:05