Search for subsets within a 250k line text dataset

The code below works pretty well - it takes about 4 seconds (which seems too slow to me) to find a group that's located in the tail area of the text file I'm searching. The text file has ≃ 250,000 lines with 20 elements per line. I cut my teeth on other programming languages and picked up Python just for the project I'm currently working on, so I really am a neophyte when it comes to Python efficiency.



import csv
import itertools

# file, patient_id and patient_number_field_header are defined earlier in the program
with open(file) as infile:
    datadictionary = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    for key, group in itertools.groupby(datadictionary, key=lambda x: x[patient_number_field_header] == patient_id):
        if key:
            super_list = group
            break


  1. 'patient_id' is a string of digits

  2. 'file' is the path to a tab-delimited text file

I'm wondering what you think - how can I make this more efficient? Am I "doing it wrong"?







asked Apr 27 at 16:52 by SynchronizeYourDogma, edited Apr 27 at 19:10 by 200_success






  • The code is probably I/O bound, so it's difficult to speed up. I would not have used DictReader but a standard reader and the index of the column, to speed up the grouping by providing a faster key. But that's not going to save much.
    – Jean-François Fabre
    Apr 27 at 20:16
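A rough sketch of that suggestion, reusing the names from the question (it assumes the first row of the file holds the column names, so the column index can be looked up once):

import csv
import itertools

with open(file) as infile:
    reader = csv.reader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    header = next(reader)                             # first row holds the column names
    col = header.index(patient_number_field_header)   # position of the patient-number column

    # The grouping key is now a plain list index instead of a per-row dict lookup.
    for key, group in itertools.groupby(reader, key=lambda row: row[col] == patient_id):
        if key:
            super_list = list(group)                  # materialise the matching rows
            break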










  • I/O bound seems very plausible to me. That makes it a case of thinking about the environment that the program is running in. For example, moving the file to a RAM disk would both help confirm the suspicion, and possibly help solve it. (Although moving it to the RAM disk is of course not instantaneous)
    – Josiah
    Apr 27 at 21:50
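One cheap way to test that suspicion without moving the file is to time the raw read separately from the CSV parsing; a sketch:

import csv
import time

# Time just pulling the lines off disk.
start = time.perf_counter()
with open(file) as infile:
    for _ in infile:
        pass
print(f"raw read: {time.perf_counter() - start:.2f}s")

# Time reading plus parsing each line into a dict.
start = time.perf_counter()
with open(file) as infile:
    for _ in csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE):
        pass
print(f"read + parse: {time.perf_counter() - start:.2f}s")

If the two numbers are close, the job really is I/O bound; if the second is much larger, the parsing (and the dict-based grouping key) is the part worth attacking.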










  • Might look at Pandas -- it is super optimized. You can read from CSV or Excel directly, then manipulate your data in declarative ways. You will likely be hard-pressed to beat anything they are already doing. pandas.pydata.org
    – SteveJ
    May 10 at 3:47
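If pandas is available, the lookup might look roughly like this (the tab separator and the string dtype are assumptions based on the excel-tab dialect and the digits-as-text patient ID in the question):

import csv
import pandas as pd

# Read the tab-delimited file; dtype=str keeps the patient IDs as strings of digits.
df = pd.read_csv(file, sep='\t', dtype=str, quoting=csv.QUOTE_NONE)

# All rows for one patient, wherever they sit in the file.
patient_rows = df[df[patient_number_field_header] == patient_id]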
















1 Answer
I'm assuming your code does some background processing, because otherwise the patient data would be stored in a database, not in a text file. In that scenario, 4 seconds is probably ok.



Instead of grouping the records, you could simply filter them. That saves two lines of code, but will probably not be much faster.
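A sketch of what that filtering could look like, again reusing the names from the question:

import csv

with open(file) as infile:
    reader = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    # Keep only this patient's rows; one pass, no grouping step.
    super_list = [row for row in reader
                  if row[patient_number_field_header] == patient_id]

Unlike the groupby-and-break version this scans the whole file, but it also works when a patient's rows are not contiguous.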






answered Apr 28 at 6:28 by Roland Illig











  • I'm transferring all the data from one medical system that's being taken offline to a new one. No real processing needs doing, just generation of a text file with a header. It's ten years' worth of data; I'm just ballparking how long it will take. It has to be grouped because the files are slightly unordered. It looks like I'll probably be using this method. Thanks for your input.
    – SynchronizeYourDogma
    Apr 30 at 21:05
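If the migration ends up looking up many patients from the same file, a single pass that groups every patient at once might pay off more than rescanning the file per patient; a rough sketch of that idea:

import csv
from collections import defaultdict

rows_by_patient = defaultdict(list)

with open(file) as infile:
    reader = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    for row in reader:
        # One pass over the file; works even if a patient's rows are scattered.
        rows_by_patient[row[patient_number_field_header]].append(row)

# Every patient's records are now available without re-reading the file.
super_list = rows_by_patient[patient_id]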
















