Highly nested bioinformatics processing
I have a script for parsing BAM files. The script's input is thus lines like:
E00489:44:HNNVYCCXX:3:1101:24890:5616 99 22 16052014 150M
E00489:44:HNNVYCCXX:1:1110:21704:27345 99 22 16052044 150M
E00489:44:HNNVYCCXX:1:1217:2372:69519 163 22 16052044 150M
E00489:44:HNNVYCCXX:3:2123:8948:16779 99 22 16052044 150M
E00489:44:HNNVYCCXX:2:2213:2920:25534 147 22 16052054 146M4S
E00489:44:HNNVYCCXX:2:2206:5020:71717 83 22 16052055 145M5S
E00489:44:HNNVYCCXX:4:2206:12642:40829 99 22 16052056 144M6S
(The BAM file is actually run through cut -f1-4,6 before input to this script, so this is only a subset of the fields.)
The first column is a kind of unique ID. The second is a bitwise flag. The third and fourth describe a chromosome and position in the human genome. The fifth is a CIGAR string, which is mostly what the script parses.
I rarely use Perl, but it seems highly inefficient. It does work as intended, but it's slow. I'd like to speed it up and also make it more readable and easy-to-follow, if possible.
#!/bin/perl
#initialize hashes
my %softhash;
my %IDhash;
my $file1Name = $ARGV[0] . 'all_sc_positions.txt';
my $file2Name = $ARGV[0] . 'edge_sc_positions.txt';
open(my $fh_all, '>', $file1Name);
open(my $fh_edge, '>', $file2Name);
#for each line
while (my $line = <STDIN>) {
    #read in line and parse
    chomp($line);
    my @a = split("\t", $line);
    my $start = $a[3];
    my $cigar = $a[4];
    my @b = split(/[A-Z]/, $cigar);  # keeps track of numbers
    my @c = split(/[0-9]*/, $cigar); # keeps track of letters
    my $loc = $start; #distance from start of read
    my $var_start;
    my $var_end;
    #loop over type of alignment in cigar string, build hashes of candidate locations
    for (my $i = 0; $i <= $#c; $i++) {
        # ... ($c[$i] eq "I" ...) body lost in the paste; see the gist linked below
    }
}
#write out edge features
foreach my $key_chr (sort(keys %IDhash)) {
    foreach my $key_start (sort { $a <=> $b } keys %{ $IDhash{$key_chr} }) {
        foreach my $key_end (sort { $a <=> $b } keys %{ $IDhash{$key_chr}{$key_start} }) {
            print $fh_edge "$key_chr\t$key_start\t$key_end\t";
            my $sum = $IDhash{$key_chr}{$key_start}{$key_end}{'I'}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{'D'}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{'S'};
            print $fh_edge "$sum,";
            for my $l ('S','I','D') {
                if (defined($IDhash{$key_chr}{$key_start}{$key_end}{$l})) {
                    print $fh_edge "$IDhash{$key_chr}{$key_start}{$key_end}{$l},";
                } else {
                    print $fh_edge "0,";
                }
            }
            print $fh_edge "\n";
        }
    }
}
#write out "all" features
foreach my $key_chr (sort(keys %softhash)) {
    foreach my $key_pos (sort { $a <=> $b } keys %{ $softhash{$key_chr} }) {
        print $fh_all "$key_chr\t$key_pos\t";
        print $fh_all "$softhash{$key_chr}{$key_pos}\n";
    }
}
Edit: Gist!
performance beginner algorithm perl bioinformatics
Welcome to Code Review! You said you rarely use Perl, and you inherited this code; do you at least understand how it works? – Phrancis, Jun 21 at 1:22
There are circumstances where reviewing inherited code is allowable: "I inherited this code, and I made this modification – what do you think?" However, if you are posting your predecessor's code, with no significant contribution of your own, and are asking for performance enhancements and behaviour changes, that's squarely off-topic for Code Review. – 200_success, Jun 21 at 2:58
@randoms Good, I hope you get great answers! – Phrancis, Jun 21 at 5:47
StackExchange converts tabs to spaces, so the example input is no use for testing. Perhaps you could link a gist, pastebin, etc. – Peter Taylor, Jun 21 at 15:27
Also, if you can link a couple of large files (say one which takes about one minute and one which takes about 10 minutes) that would be helpful. – Peter Taylor, Jun 21 at 15:33
edited Jun 21 at 17:26
asked Jun 21 at 0:51
Randoms
1 Answer
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the code too.)
My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement
S may only have H operations between them and the ends of the CIGAR string
However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.
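As a quick way to count such violations yourself, a small Python check of that spec rule could look like this (a sketch; the function name and the exact clipping logic are mine, not from the posted script):

```python
import re

def softclips_only_at_ends(cigar):
    """Check the SAM rule that S operations may only have H operations
    between them and the ends of the CIGAR string."""
    ops = re.findall(r'\d+([A-Z=])', cigar)
    while ops and ops[0] == 'H':   # hard clips may sit outside everything
        ops.pop(0)
    while ops and ops[-1] == 'H':
        ops.pop()
    if ops and ops[0] == 'S':      # at most one soft clip at each end
        ops.pop(0)
    if ops and ops[-1] == 'S':
        ops.pop()
    # anything that remains must be a non-clipping operation
    return 'S' not in ops and 'H' not in ops
```

Running this over the fifth column and counting the `False` results would reproduce the violation tally.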
And I think it is worth pointing out with respect to your observation that
I rarely use Perl, but it seems highly inefficient.
There's not much code here. Perhaps you should consider porting it to a language which you know better and which is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you.)
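To make the porting suggestion concrete, a minimal Python skeleton for reading the same cut -f1-4,6 stream might look like this (names are illustrative, not from the original script):

```python
def parse_records(lines):
    """Split each tab-separated cut -f1-4,6 record into typed fields:
    (read_id, flag, chromosome, position, cigar)."""
    for line in lines:
        read_id, flag, chrom, pos, cigar = line.rstrip('\n').split('\t')
        yield read_id, int(flag), chrom, int(pos), cigar
```

Feeding it `sys.stdin` reproduces the `while (<STDIN>)` loop.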
Code considerations
Possible bugs?
my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)
If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S operations will be double-counting some of the $b[$j].
In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?
Note: since the example runs too fast for me to get much useful profiling information out of it, this is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.
The most obvious optimisation relates to %softhash, and in particular the loop
for (my $pos = $var_start; $pos < $var_end; $pos++) {
    $softhash{$a[2]}{$pos}++;
}
If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with
$softhash{$a[2]}{$var_start}++;
$softhash{$a[2]}{$var_end}--;
and then modify the output loop at the end to
foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } keys %{ $softhash{$key_chr} }) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}
The main processing should be much faster than before, and the output loop should be slightly faster because keys %{ $softhash{$key_chr} } has fewer items to sort.
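The same difference-array idea, stripped to its essentials in Python (a self-contained sketch with hypothetical names, to make the trick easy to verify in isolation): instead of incrementing every position in a range, record +1 at the start and -1 at the end, then recover per-position counts with a running sum over the sorted delta positions.

```python
from collections import defaultdict

def add_range(diff, start, end):
    """Record a half-open range [start, end) as a +1/-1 pair of deltas."""
    diff[start] += 1
    diff[end] -= 1

def expand(diff):
    """Recover per-position counts by prefix-summing the sorted deltas."""
    counts = {}
    accum = 0
    prev = None
    for pos in sorted(diff):
        if accum > 0:
            # every position since the previous delta carries the same count
            for i in range(prev, pos):
                counts[i] = accum
        accum += diff[pos]
        prev = pos
    return counts
```

Each range now costs two hash updates regardless of its width; the expensive per-position work happens only once, at output time.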
If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.
The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information: reusing $chromosome instead of $key_chr, for example.
One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally, I think you cease to need to reference old values of $b, and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested.)
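That regex-driven scan translates naturally to other languages too; for instance, in Python a generator over re.finditer yields the (length, operation) pairs directly, with no off-by-one split arrays (a sketch; the operation class follows the SAM spec's letters):

```python
import re

# one CIGAR operation: a run length followed by an operation letter
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')

def cigar_ops(cigar):
    """Yield (length, operation) pairs from a CIGAR string such as '146M4S'."""
    for m in CIGAR_OP.finditer(cigar):
        yield int(m.group(1)), m.group(2)
```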
These are all great tips! I've implemented pretty much all of them, and in Python, as you suggested. – Randoms, Jun 27 at 16:54
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
accepted
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).
My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement
S may only have H operations between them and the ends of the CIGAR string
However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.
And I think it is worth pointing out with respect to your observation that
I rarely use Perl, but it seems highly inefficient.
There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).
Code considerations
Possible bugs?
my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)
If I'm correctly interpreting things, shouldn't $loc = $start
be just before the loop over $j
? Otherwise a CIGAR string with several I/D/S
will be double-counting some of the $b[$j]
.
In addition, I really can't understand how I
and D
can be treated identically. Shouldn't one of them cause $var_start
to go down?
Note: since the example too fast for me to get much useful profiling information out, this is based on common sense. Speed optimisation often defies common sense. I advise testing the suggestions one by one to see which work and which don't on large datasets.
The most obvious optimisation relates to softhash
, and in particular the loop
for (my $pos = $var_start; $pos < $var_end; $pos++)
$softhash$a[2]$pos++;
If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with
$softhash$a[2]$var_start++;
$softhash$a[2]$var_end--;
and then modify the output loop at the end to
foreach my $key_chr (sort(keys %softhash)) {
my $accum = 0;
my $prev_pos = -1;
foreach my $key_pos (sort $a <=> $b (keys % $softhash$key_chr ))
if ($accum > 0)
for (my $i = $prev_pos; $i < $key_pos; $i++)
print $fh_all "$key_chrt$it$accumn";
$accum += $softhash$key_chr$key_pos;
$prev_pos = $key_pos;
The main processing should be much faster than before, and the output loop should be slightly faster because (keys % $softhash$key_chr )
has fewer items to sort.
If I'm right about $loc
above, it might be worth eliminating the loop over $j
and replacing it with an unconditional update of $var_start
and $var_end
. This probably won't help performance, but it might make the code more readable.
The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]
; I wonder whether $var_start
and $var_end
might be better as $range_start
and $range_end
; and I'm pretty sure that the key_
prefix doesn't convey much useful information. Reusing $chromosome
instead of $key_chr
, for example.
One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++)
because it clearly runs one time too many. Either start the iteration at 1
or find some way to make the indexing start at 0
. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b
and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g)
or something similar. (Not tested).
These are all great tips! I've implemented pretty much all of themâÂÂand in Python, as you suggested.
â Randoms
Jun 27 at 16:54
add a comment |Â
up vote
4
down vote
accepted
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).
My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement
S may only have H operations between them and the ends of the CIGAR string
However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.
And I think it is worth pointing out with respect to your observation that
I rarely use Perl, but it seems highly inefficient.
There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).
Code considerations
Possible bugs?
my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)
If I'm correctly interpreting things, shouldn't $loc = $start
be just before the loop over $j
? Otherwise a CIGAR string with several I/D/S
will be double-counting some of the $b[$j]
.
In addition, I really can't understand how I
and D
can be treated identically. Shouldn't one of them cause $var_start
to go down?
Note: since the example too fast for me to get much useful profiling information out, this is based on common sense. Speed optimisation often defies common sense. I advise testing the suggestions one by one to see which work and which don't on large datasets.
The most obvious optimisation relates to softhash
, and in particular the loop
for (my $pos = $var_start; $pos < $var_end; $pos++)
$softhash$a[2]$pos++;
If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with
$softhash$a[2]$var_start++;
$softhash$a[2]$var_end--;
and then modify the output loop at the end to
foreach my $key_chr (sort(keys %softhash)) {
my $accum = 0;
my $prev_pos = -1;
foreach my $key_pos (sort $a <=> $b (keys % $softhash$key_chr ))
if ($accum > 0)
for (my $i = $prev_pos; $i < $key_pos; $i++)
print $fh_all "$key_chrt$it$accumn";
$accum += $softhash$key_chr$key_pos;
$prev_pos = $key_pos;
The main processing should be much faster than before, and the output loop should be slightly faster because (keys % $softhash$key_chr )
has fewer items to sort.
If I'm right about $loc
above, it might be worth eliminating the loop over $j
and replacing it with an unconditional update of $var_start
and $var_end
. This probably won't help performance, but it might make the code more readable.
The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]
; I wonder whether $var_start
and $var_end
might be better as $range_start
and $range_end
; and I'm pretty sure that the key_
prefix doesn't convey much useful information. Reusing $chromosome
instead of $key_chr
, for example.
One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++)
because it clearly runs one time too many. Either start the iteration at 1
or find some way to make the indexing start at 0
. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b
and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g)
or something similar. (Not tested).
These are all great tips! I've implemented pretty much all of themâÂÂand in Python, as you suggested.
â Randoms
Jun 27 at 16:54
add a comment |Â
up vote
4
down vote
accepted
up vote
4
down vote
accepted
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).
My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement
S may only have H operations between them and the ends of the CIGAR string
However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.
And I think it is worth pointing out with respect to your observation that
I rarely use Perl, but it seems highly inefficient.
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the software too).
My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement
S may only have H operations between them and the ends of the CIGAR string
However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.
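That restriction can be checked mechanically. As a rough illustration in Python (the language this answer suggests porting to), a small validator for the S/H rule might look like this — the function name is mine, not part of the original script:

```python
import re

def soft_clips_are_edge_only(cigar):
    """True if every S op is separated from its end of the CIGAR
    string only by H ops, as the SAM spec requires."""
    ops = [m.group(2) for m in re.finditer(r"(\d+)([A-Z=])", cigar)]
    for i, op in enumerate(ops):
        if op == "S":
            # at least one side of the S must be all-H (or empty)
            if not (all(o == "H" for o in ops[:i])
                    or all(o == "H" for o in ops[i + 1:])):
                return False
    return True
```

Counting the input lines for which this returns False would confirm (or refute) the violation count mentioned above.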
And I think it is worth pointing out with respect to your observation that
I rarely use Perl, but it seems highly inefficient.
There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).
Code considerations
Possible bugs?
my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)
If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S operations will be double-counting some of the $b[$j].
In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?
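For what it's worth, the SAM spec settles this: M, D, N, = and X consume reference positions, while I and S advance only along the read. A short Python sketch of the distinction (function name is mine):

```python
import re

# Ops that consume reference positions per the SAM spec;
# I (insertion) and S (soft clip) advance only along the read.
_CONSUMES_REF = set("MDN=X")

def reference_span(cigar):
    """Number of reference bases an alignment covers."""
    return sum(int(n)
               for n, op in re.findall(r"(\d+)([A-Z=])", cigar)
               if op in _CONSUMES_REF)
```

So 10M2I10M spans 20 reference bases while 10M2D10M spans 22, which is why I and D can't be interchangeable when computing positions.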
Note: since the example runs too fast for me to get much useful profiling information out of it, what follows is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.
The most obvious optimisation relates to %softhash, and in particular the loop

for (my $pos = $var_start; $pos < $var_end; $pos++) {
    $softhash{$a[2]}{$pos}++;
}

If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with

$softhash{$a[2]}{$var_start}++;
$softhash{$a[2]}{$var_end}--;
and then modify the output loop at the end to

foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } keys %{ $softhash{$key_chr} }) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}
The main processing should be much faster than before, and the output loop should be slightly faster because keys %{ $softhash{$key_chr} } has fewer items to sort.
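The trick above is a difference array: record +1 where a range opens and -1 where it closes, then recover per-position depth with a running sum. A standalone Python sketch of the same idea (names hypothetical):

```python
from collections import defaultdict

def coverage(ranges):
    """ranges: iterable of (start, end) half-open intervals.
    Returns {position: depth} for positions with depth > 0."""
    diff = defaultdict(int)
    for start, end in ranges:
        diff[start] += 1   # depth rises on entering the range
        diff[end] -= 1     # and falls just past its end
    depth, prev, out = 0, None, {}
    for pos in sorted(diff):
        if depth > 0:
            for i in range(prev, pos):
                out[i] = depth
        depth += diff[pos]
        prev = pos
    return out
```

Each input range now costs two hash updates instead of one per covered position, which is where the speed-up comes from.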
If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.
The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information. Reusing $chromosome instead of $key_chr would work, for example.
One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally, I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested.)
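That regex-driven parse is straightforward to demonstrate; an equivalent in Python might look like this (function name is mine):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Z=])", cigar)]
```

This removes the need for the two parallel split() arrays entirely, since each match carries both the count and the operation.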
edited Jun 21 at 18:58
answered Jun 21 at 18:50
Peter Taylor
14k2454
These are all great tips! I've implemented pretty much all of them, and in Python, as you suggested.
– Randoms
Jun 27 at 16:54
Welcome to Code Review! You said you rarely use Perl, and you inherited this code; do you at least understand how it works?
– Phrancis
Jun 21 at 1:22
There are circumstances where reviewing inherited code is allowable: "I inherited this code, and I made this modification – what do you think?" However, if you are posting your predecessor's code, with no significant contribution of your own, and are asking for performance enhancements and behaviour changes, that's squarely off-topic for Code Review.
– 200_success
Jun 21 at 2:58
@randoms Good, I hope you get great answers!
– Phrancis
Jun 21 at 5:47
StackExchange converts tabs to spaces, so the example input is no use for testing. Perhaps you could link a gist, pastebin, etc.
– Peter Taylor
Jun 21 at 15:27
Also, if you can link a couple of large files (say one which takes about one minute and one which takes about 10 minutes) that would be helpful.
– Peter Taylor
Jun 21 at 15:33