Search for subsets within a 250k line text dataset

The code below works pretty well - it takes about 4 seconds (which seems too slow to me) to find a group that's located in the tail area of the text file I'm searching. The text file has ≃ 250,000 lines with 20 elements per line. I cut my teeth on other programming languages and picked up Python just for the project I'm currently working on, so I really am a neophyte when it comes to Python efficiency.



import csv
import itertools

# file, patient_id and patient_number_field_header are defined earlier in the program
with open(file) as infile:
    datadictionary = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    for key, group in itertools.groupby(datadictionary, key=lambda x: x[patient_number_field_header] == patient_id):
        if key:
            super_list = group
            break


  1. 'patient_id' is a string of digits

  2. 'file' is the path to a tab-delimited text file

I'm wondering what you think - how can I make this more efficient? Am I "doing it wrong"?







asked Apr 27 at 16:52 by SynchronizeYourDogma, edited Apr 27 at 19:10 by 200_success






  • The code is probably I/O bound, so it's difficult to speed up. I would not have used DictReader but a standard reader and the index of the column, to speed up the grouping by providing a faster key. But that's not going to save much.
    – Jean-François Fabre
    Apr 27 at 20:16
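A rough sketch of that suggestion, reusing the names from the question (it assumes the first row of the file holds the column names, so the column index can be looked up once):

import csv
import itertools

with open(file) as infile:
    reader = csv.reader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    header = next(reader)                             # first row holds the column names
    col = header.index(patient_number_field_header)   # position of the patient-number column

    # The grouping key is now a plain list index instead of a per-row dict lookup.
    for key, group in itertools.groupby(reader, key=lambda row: row[col] == patient_id):
        if key:
            super_list = list(group)                  # materialise the matching rows
            break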










  • I/O bound seems very plausible to me. That makes it a case of thinking about the environment that the program is running in. For example, moving the file to a RAM disk would both help confirm the suspicion, and possibly help solve it. (Although moving it to the RAM disk is of course not instantaneous)
    – Josiah
    Apr 27 at 21:50
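One cheap way to test that suspicion without moving the file is to time the raw read separately from the CSV parsing; a sketch:

import csv
import time

# Time just pulling the lines off disk.
start = time.perf_counter()
with open(file) as infile:
    for _ in infile:
        pass
print(f"raw read: {time.perf_counter() - start:.2f}s")

# Time reading plus parsing each line into a dict.
start = time.perf_counter()
with open(file) as infile:
    for _ in csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE):
        pass
print(f"read + parse: {time.perf_counter() - start:.2f}s")

If the two numbers are close, the job really is I/O bound; if the second is much larger, the parsing (and the dict-based grouping key) is the part worth attacking.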










  • Might look at Pandas -- it is super optimized. You can read from CSV or Excel directly, then manipulate your data in declarative ways. You will likely be hard-pressed to beat anything they are already doing. pandas.pydata.org
    – SteveJ
    May 10 at 3:47
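If pandas is available, the lookup might look roughly like this (the tab separator and the string dtype are assumptions based on the excel-tab dialect and the digits-as-text patient ID in the question):

import csv
import pandas as pd

# Read the tab-delimited file; dtype=str keeps the patient IDs as strings of digits.
df = pd.read_csv(file, sep='\t', dtype=str, quoting=csv.QUOTE_NONE)

# All rows for one patient, wherever they sit in the file.
patient_rows = df[df[patient_number_field_header] == patient_id]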
















1 Answer
I'm assuming your code does some background processing, because otherwise the patient data would be stored in a database, not in a text file. In that scenario, 4 seconds is probably ok.



Instead of grouping the records, you could simply filter them. That saves two lines of code, but will probably not be much faster.
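A sketch of what that filtering could look like, again reusing the names from the question:

import csv

with open(file) as infile:
    reader = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    # Keep only this patient's rows; one pass, no grouping step.
    super_list = [row for row in reader
                  if row[patient_number_field_header] == patient_id]

Unlike the groupby-and-break version this scans the whole file, but it also works when a patient's rows are not contiguous.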






answered Apr 28 at 6:28 by Roland Illig











  • I'm transferring all the data from one medical system that's being taken offline to a new one. No real processing needs doing, just generation of a text file with a header. It's ten years' worth of data; I'm just ballparking how long it will take. It has to be grouped because the files are slightly unordered. It looks like I'll probably be using this method. Thanks for your input.
    – SynchronizeYourDogma
    Apr 30 at 21:05
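If the migration ends up looking up many patients from the same file, a single pass that groups every patient at once might pay off more than rescanning the file per patient; a rough sketch of that idea:

import csv
from collections import defaultdict

rows_by_patient = defaultdict(list)

with open(file) as infile:
    reader = csv.DictReader(infile, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    for row in reader:
        # One pass over the file; works even if a patient's rows are scattered.
        rows_by_patient[row[patient_number_field_header]].append(row)

# Every patient's records are now available without re-reading the file.
super_list = rows_by_patient[patient_id]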
















