Highly nested bioinformatics processing

I have a script for parsing BAM files. Its input is lines like:



E00489:44:HNNVYCCXX:3:1101:24890:5616 99 22 16052014 150M
E00489:44:HNNVYCCXX:1:1110:21704:27345 99 22 16052044 150M
E00489:44:HNNVYCCXX:1:1217:2372:69519 163 22 16052044 150M
E00489:44:HNNVYCCXX:3:2123:8948:16779 99 22 16052044 150M
E00489:44:HNNVYCCXX:2:2213:2920:25534 147 22 16052054 146M4S
E00489:44:HNNVYCCXX:2:2206:5020:71717 83 22 16052055 145M5S
E00489:44:HNNVYCCXX:4:2206:12642:40829 99 22 16052056 144M6S


(The BAM file is actually run through cut -f1-4,6 before being piped to this script, so only a subset of fields is present.)



The first column is a kind of unique ID. The second is a bitwise flag. The third and fourth describe a chromosome and position in the human genome. The fifth is a CIGAR string, which is mostly what the script parses.
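To make the field layout concrete, here is a small Python sketch (field names are my own, inferred from the description above) that splits one of the sample lines and decomposes its CIGAR string into (length, operation) pairs:

```python
import re

# One cut-down input line as described above (tab-separated in the real data).
line = "E00489:44:HNNVYCCXX:2:2213:2920:25534\t147\t22\t16052054\t146M4S"
read_id, flag, chrom, pos, cigar = line.split("\t")

# Each CIGAR operation is a run length followed by a single-letter op code.
ops = re.findall(r"(\d+)([A-Z])", cigar)
print(ops)  # [('146', 'M'), ('4', 'S')]
```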




I rarely use Perl, and this script seems highly inefficient. It works as intended, but it's slow. I'd like to speed it up and, if possible, make it more readable and easier to follow.




#!/bin/perl

#initialize hashes
my %softhash;
my %IDhash;
my $file1Name = $ARGV[0] . 'all_sc_positions.txt';
my $file2Name = $ARGV[0] . 'edge_sc_positions.txt';
open(my $fh_all, '>', $file1Name);
open(my $fh_edge, '>', $file2Name);

#for each line
while (my $line = <STDIN>) {

    #read in line and parse
    chomp($line);
    my @a = split("\t", $line);
    my $start = $a[3];
    my $cigar = $a[4];
    my @b = split(/[A-Z]/, $cigar);  #keeps track of numbers
    my @c = split(/[0-9]*/, $cigar); #keeps track of letters
    my $loc = $start; #distance from start of read
    my $var_start;
    my $var_end;

    #loop over type of alignment in cigar string, build hashes of candidate locations
    for (my $i = 0; $i <= $#c; $i++) {
        # (loop body mangled in the scraped post; it tests $c[$i] eq "I" and the
        # other op codes to populate %IDhash and %softhash -- see the linked gist)
    }
}

#write out edge features
foreach my $key_chr (sort(keys %IDhash)) {
    foreach my $key_start (sort { $a <=> $b } (keys %{ $IDhash{$key_chr} })) {
        foreach my $key_end (sort { $a <=> $b } (keys %{ $IDhash{$key_chr}{$key_start} })) {
            print $fh_edge "$key_chr\t$key_start\t$key_end\t";
            my $sum = $IDhash{$key_chr}{$key_start}{$key_end}{I}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{D}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{S};
            print $fh_edge "$sum,";

            for my $l ('S', 'I', 'D') {
                if (defined($IDhash{$key_chr}{$key_start}{$key_end}{$l})) {
                    print $fh_edge "$IDhash{$key_chr}{$key_start}{$key_end}{$l},";
                } else {
                    print $fh_edge "0,";
                }
            }
            print $fh_edge "\n";
        }
    }
}

#write out "all" features
foreach my $key_chr (sort(keys %softhash)) {
    foreach my $key_pos (sort { $a <=> $b } (keys %{ $softhash{$key_chr} })) {
        print $fh_all "$key_chr\t$key_pos\t";
        print $fh_all "$softhash{$key_chr}{$key_pos}\n";
    }
}




Edit: Gist!







  • 2




    Welcome to Code Review! You said you rarely use Perl, and you inherited this code; do you at least understand how it works?
    – Phrancis
    Jun 21 at 1:22






  • 1




    There are circumstances where reviewing inherited code is allowable: "I inherited this code, and I made this modification — what do you think?" However, if you are posting your predecessor's code, with no significant contribution of your own, and are asking for performance enhancements and behaviour changes, that's squarely off-topic for Code Review.
    – 200_success
    Jun 21 at 2:58






  • 1




    @randoms Good, I hope you get great answers!
    – Phrancis
    Jun 21 at 5:47






  • 1




    StackExchange converts tabs to spaces, so the example input is no use for testing. Perhaps you could link a gist, pastebin, etc.
    – Peter Taylor
    Jun 21 at 15:27






  • 2




    Also, if you can link a couple of large files (say one which takes about one minute and one which takes about 10 minutes) that would be helpful.
    – Peter Taylor
    Jun 21 at 15:33
















edited Jun 21 at 17:26
























asked Jun 21 at 0:51









Randoms

1334
























1 Answer























accepted









Non-code / very-high-level considerations.



The first thing you could do to speed things up would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the code too.)




My experience in bioinformatics is very limited, and I've never worked with this particular format before; I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and from there to https://samtools.github.io/hts-specs/SAMv1.pdf, I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example runs too fast for me to get much useful profiling information out of it, this is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) {
     $softhash{$a[2]}{$pos}++;
 }




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash{$a[2]}{$var_start}++;
 $softhash{$a[2]}{$var_end}--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } (keys %{ $softhash{$key_chr} })) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}



The main processing should be much faster than before, and the output loop should be slightly faster because (keys %{ $softhash{$key_chr} }) has fewer items to sort.
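The same difference-array idea can be sketched in Python (the range values here are hypothetical, standing in for the script's $var_start/$var_end pairs): instead of incrementing every position in a range, record +1 at the range start and -1 at the end, then recover per-position counts with a running sum over the sorted boundaries.

```python
from collections import defaultdict

# Difference array: O(1) work per range instead of O(range length).
diff = defaultdict(int)
for start, end in [(5, 9), (7, 12)]:  # hypothetical (var_start, var_end) ranges
    diff[start] += 1
    diff[end] -= 1

# Sweep the boundaries in order, carrying a running total of open ranges.
accum, prev = 0, None
coverage = {}
for pos in sorted(diff):
    if accum > 0:
        for i in range(prev, pos):
            coverage[i] = accum
    accum += diff[pos]
    prev = pos

print(coverage)  # {5: 1, 6: 1, 7: 2, 8: 2, 9: 1, 10: 1, 11: 1}
```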




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information: using $chromosome instead of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).
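The regex-driven iteration suggested above looks like this in Python (a hedged sketch, not tested against the original script's full semantics; the helper name is mine). Per the SAM spec, the op codes that consume the reference are M, D, N, =, and X, so only those advance the position:

```python
import re

def ref_end(start, cigar):
    """Walk (length, op) pairs of a CIGAR string; ops that consume the
    reference (M, D, N, =, X per the SAM spec) advance the position."""
    loc = start
    for m in re.finditer(r"(\d+)([A-Z=])", cigar):
        length, op = int(m.group(1)), m.group(2)
        if op in "MDN=X":
            loc += length
    return loc

print(ref_end(16052054, "146M4S"))  # 16052200: the 4S soft clip is skipped
```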






  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f196933%2fhighly-nested-bioinformatics-processing%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
4
down vote



accepted










Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example too fast for me to get much useful profiling information out, this is based on common sense. Speed optimisation often defies common sense. I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) 
$softhash$a[2]$pos++;




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash$a[2]$var_start++;
$softhash$a[2]$var_end--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
my $accum = 0;
my $prev_pos = -1;
foreach my $key_pos (sort $a <=> $b (keys % $softhash$key_chr ))
if ($accum > 0)
for (my $i = $prev_pos; $i < $key_pos; $i++)
print $fh_all "$key_chrt$it$accumn";


$accum += $softhash$key_chr$key_pos;
$prev_pos = $key_pos;



The main processing should be much faster than before, and the output loop should be slightly faster because (keys % $softhash$key_chr ) has fewer items to sort.




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information. Reusing $chromosome instead of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).






share|improve this answer























  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54














up vote
4
down vote



accepted










Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




up vote
4
down vote



accepted







Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the software too.)




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).
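To illustrate what such a port might start from, here is a hypothetical Python sketch of the per-line parsing step (the function name and tuple layout are my own choices, not taken from the original script):

```python
def parse_line(line):
    """Split one cut-reduced BAM line into (chromosome, start, cigar).

    Assumes the tab-separated field layout from the question:
    read ID, flag, chromosome, position, CIGAR.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[2], int(fields[3]), fields[4]

# Example using the first line of the sample input:
print(parse_line("E00489:44:HNNVYCCXX:3:1101:24890:5616\t99\t22\t16052014\t150M"))
# → ('22', 16052014, '150M')
```

From there the main loop is just `for line in sys.stdin: ...`, mirroring the Perl `while (my $line = <STDIN>)`.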




Code considerations



Possible bugs?



 my $loc = $start; # distance from start of read
 ...
 for (my $i = 0; $i <= $#c; $i++) { ... }


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example runs too fast for me to get much useful profiling information out of it, this is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) {
     $softhash{$a[2]}{$pos}++;
 }




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash{$a[2]}{$var_start}++;
 $softhash{$a[2]}{$var_end}--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } keys %{ $softhash{$key_chr} }) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}


The main processing should be much faster than before, and the output loop should be slightly faster because keys %{ $softhash{$key_chr} } has fewer items to sort.
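The idea behind this optimisation is a classic difference array: record +1 at each range start and -1 at each range end, then recover per-position counts with a running sum over the sorted positions. A small Python sketch (function names are illustrative):

```python
from collections import defaultdict

def add_range(diff, start, end):
    """Mark a half-open range [start, end) in O(1) instead of O(end - start)."""
    diff[start] += 1
    diff[end] -= 1

def coverage(diff):
    """Expand the difference map back into {position: count}."""
    out = {}
    accum = 0
    prev = None
    for pos in sorted(diff):
        if accum > 0:
            for i in range(prev, pos):
                out[i] = accum
        accum += diff[pos]
        prev = pos
    return out

diff = defaultdict(int)
add_range(diff, 5, 8)   # covers positions 5, 6, 7
add_range(diff, 6, 9)   # covers positions 6, 7, 8
print(coverage(diff))   # → {5: 1, 6: 2, 7: 2, 8: 1}
```

The per-read cost drops from one hash update per covered base to two updates per read; the expansion cost is deferred to the single output pass at the end.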




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information: $chromosome could be reused in place of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).
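The same regex-driven approach translates directly to Python, pairing each count with its operation in a single pass and avoiding the off-by-one that the two parallel split() arrays invite (the function name is my own):

```python
import re

def tokenize_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Z])", cigar)]

print(tokenize_cigar("146M4S"))  # → [(146, 'M'), (4, 'S')]
```

Each pair can then update the range bounds directly, with no indexing into separate number and letter arrays.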






edited Jun 21 at 18:58


























answered Jun 21 at 18:50









Peter Taylor

14k2454











  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54




























 
