Highly nested bioinformatics processing

I have a script for parsing BAM files. Its input is lines like:



E00489:44:HNNVYCCXX:3:1101:24890:5616 99 22 16052014 150M
E00489:44:HNNVYCCXX:1:1110:21704:27345 99 22 16052044 150M
E00489:44:HNNVYCCXX:1:1217:2372:69519 163 22 16052044 150M
E00489:44:HNNVYCCXX:3:2123:8948:16779 99 22 16052044 150M
E00489:44:HNNVYCCXX:2:2213:2920:25534 147 22 16052054 146M4S
E00489:44:HNNVYCCXX:2:2206:5020:71717 83 22 16052055 145M5S
E00489:44:HNNVYCCXX:4:2206:12642:40829 99 22 16052056 144M6S


(The BAM file is actually run through cut -f1-4,6 before being piped to this script, so only a subset of fields is present.)



The first column is a kind of unique ID. The second is a bitwise flag. The third and fourth describe a chromosome and position in the human genome. The fifth is a CIGAR string, which is mostly what the script parses.
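To make the field layout concrete, here is a small Python sketch (field names are my own, inferred from the description above) that splits one of the sample lines and decomposes its CIGAR string into (length, operation) pairs:

```python
import re

# One cut-down input line as described above (tab-separated in the real data).
line = "E00489:44:HNNVYCCXX:2:2213:2920:25534\t147\t22\t16052054\t146M4S"
read_id, flag, chrom, pos, cigar = line.split("\t")

# Each CIGAR operation is a run length followed by a single-letter op code.
ops = re.findall(r"(\d+)([A-Z])", cigar)
print(ops)  # [('146', 'M'), ('4', 'S')]
```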




I rarely use Perl, and this script seems highly inefficient. It works as intended, but it's slow. I'd like to speed it up and, if possible, make it more readable and easier to follow.




#!/bin/perl

#initialize hashes
my %softhash;
my %IDhash;
my $file1Name = $ARGV[0] . 'all_sc_positions.txt';
my $file2Name = $ARGV[0] . 'edge_sc_positions.txt';
open(my $fh_all, '>', $file1Name);
open(my $fh_edge, '>', $file2Name);

#for each line
while (my $line = <STDIN>) {

    #read in line and parse
    chomp($line);
    my @a = split("\t", $line);
    my $start = $a[3];
    my $cigar = $a[4];
    my @b = split(/[A-Z]/, $cigar);  #keeps track of numbers
    my @c = split(/[0-9]*/, $cigar); #keeps track of letters
    my $loc = $start; #distance from start of read
    my $var_start;
    my $var_end;

    #loop over type of alignment in cigar string, build hashes of candidate locations
    for (my $i = 0; $i <= $#c; $i++) {
        # (loop body mangled in the scraped post; it tests $c[$i] eq "I" and the
        # other op codes to populate %IDhash and %softhash -- see the linked gist)
    }
}

#write out edge features
foreach my $key_chr (sort(keys %IDhash)) {
    foreach my $key_start (sort { $a <=> $b } (keys %{ $IDhash{$key_chr} })) {
        foreach my $key_end (sort { $a <=> $b } (keys %{ $IDhash{$key_chr}{$key_start} })) {
            print $fh_edge "$key_chr\t$key_start\t$key_end\t";
            my $sum = $IDhash{$key_chr}{$key_start}{$key_end}{I}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{D}
                    + $IDhash{$key_chr}{$key_start}{$key_end}{S};
            print $fh_edge "$sum,";

            for my $l ('S', 'I', 'D') {
                if (defined($IDhash{$key_chr}{$key_start}{$key_end}{$l})) {
                    print $fh_edge "$IDhash{$key_chr}{$key_start}{$key_end}{$l},";
                } else {
                    print $fh_edge "0,";
                }
            }
            print $fh_edge "\n";
        }
    }
}

#write out "all" features
foreach my $key_chr (sort(keys %softhash)) {
    foreach my $key_pos (sort { $a <=> $b } (keys %{ $softhash{$key_chr} })) {
        print $fh_all "$key_chr\t$key_pos\t";
        print $fh_all "$softhash{$key_chr}{$key_pos}\n";
    }
}




Edit: Gist!







  • 2




    Welcome to Code Review! You said you rarely use Perl, and you inherited this code; do you at least understand how it works?
    – Phrancis
    Jun 21 at 1:22






  • 1




    There are circumstances where reviewing inherited code is allowable: "I inherited this code, and I made this modification — what do you think?" However, if you are posting your predecessor's code, with no significant contribution of your own, and are asking for performance enhancements and behaviour changes, that's squarely off-topic for Code Review.
    – 200_success
    Jun 21 at 2:58






  • 1




    @randoms Good, I hope you get great answers!
    – Phrancis
    Jun 21 at 5:47






  • 1




    StackExchange converts tabs to spaces, so the example input is no use for testing. Perhaps you could link a gist, pastebin, etc.
    – Peter Taylor
    Jun 21 at 15:27






  • 2




    Also, if you can link a couple of large files (say one which takes about one minute and one which takes about 10 minutes) that would be helpful.
    – Peter Taylor
    Jun 21 at 15:33
















edited Jun 21 at 17:26
























asked Jun 21 at 0:51









Randoms

1334
























1 Answer























accepted









Non-code / very-high-level considerations.



The first thing you could do to speed things up would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the code too.)




My experience in bioinformatics is very limited, and I've never worked with this particular format before; I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and from there to https://samtools.github.io/hts-specs/SAMv1.pdf, I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example runs too fast for me to get much useful profiling information out of it, this is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) {
     $softhash{$a[2]}{$pos}++;
 }




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash{$a[2]}{$var_start}++;
 $softhash{$a[2]}{$var_end}--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } (keys %{ $softhash{$key_chr} })) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}



The main processing should be much faster than before, and the output loop should be slightly faster because (keys %{ $softhash{$key_chr} }) has fewer items to sort.
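The same difference-array idea can be sketched in Python (the range values here are hypothetical, standing in for the script's $var_start/$var_end pairs): instead of incrementing every position in a range, record +1 at the range start and -1 at the end, then recover per-position counts with a running sum over the sorted boundaries.

```python
from collections import defaultdict

# Difference array: O(1) work per range instead of O(range length).
diff = defaultdict(int)
for start, end in [(5, 9), (7, 12)]:  # hypothetical (var_start, var_end) ranges
    diff[start] += 1
    diff[end] -= 1

# Sweep the boundaries in order, carrying a running total of open ranges.
accum, prev = 0, None
coverage = {}
for pos in sorted(diff):
    if accum > 0:
        for i in range(prev, pos):
            coverage[i] = accum
    accum += diff[pos]
    prev = pos

print(coverage)  # {5: 1, 6: 1, 7: 2, 8: 2, 9: 1, 10: 1, 11: 1}
```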




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information: using $chromosome instead of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).
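The regex-driven iteration suggested above looks like this in Python (a hedged sketch, not tested against the original script's full semantics; the helper name is mine). Per the SAM spec, the op codes that consume the reference are M, D, N, =, and X, so only those advance the position:

```python
import re

def ref_end(start, cigar):
    """Walk (length, op) pairs of a CIGAR string; ops that consume the
    reference (M, D, N, =, X per the SAM spec) advance the position."""
    loc = start
    for m in re.finditer(r"(\d+)([A-Z=])", cigar):
        length, op = int(m.group(1)), m.group(2)
        if op in "MDN=X":
            loc += length
    return loc

print(ref_end(16052054, "146M4S"))  # 16052200: the 4S soft clip is skipped
```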






  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f196933%2fhighly-nested-bioinformatics-processing%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
4
down vote



accepted










Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example too fast for me to get much useful profiling information out, this is based on common sense. Speed optimisation often defies common sense. I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) 
$softhash$a[2]$pos++;




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash$a[2]$var_start++;
$softhash$a[2]$var_end--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
my $accum = 0;
my $prev_pos = -1;
foreach my $key_pos (sort $a <=> $b (keys % $softhash$key_chr ))
if ($accum > 0)
for (my $i = $prev_pos; $i < $key_pos; $i++)
print $fh_all "$key_chrt$it$accumn";


$accum += $softhash$key_chr$key_pos;
$prev_pos = $key_pos;



The main processing should be much faster than before, and the output loop should be slightly faster because (keys % $softhash$key_chr ) has fewer items to sort.




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information. Reusing $chromosome instead of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).






share|improve this answer























  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54














up vote
4
down vote



accepted










Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the hardware too).




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).




Code considerations



Possible bugs?



 my $loc = $start; #distance from start of read
...
for (my $i = 0; $i <= $#c; $i++)


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




up vote
4
down vote



accepted







Non-code / very-high-level considerations.



The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code on the 100000-line example in less than one second, whereas you say it takes you a minute. (Of course, it's still worth remembering that better algorithms amplify the performance benefit of better hardware, so I will look at the software too.)




My experience in bioinformatics is very limited, and I've never worked with this particular format before. I have only looked briefly at the docs. However, following your link to a description of the CIGAR string, and then another link from that page to https://samtools.github.io/hts-specs/SAMv1.pdf , I observe the statement




S may only have H operations between them and the ends of the CIGAR string




However, 1368 lines of the 100000-line example violate that restriction. I suggest that you review the code to check that it doesn't rely on it, since you're the domain expert and have better understanding of what the codes mean than most if not all of us.




And I think it is worth pointing out with respect to your observation that




I rarely use Perl, but it seems highly inefficient.




There's not much code here. Perhaps you should consider porting it to a language which you know better and is more efficient? (I would guess that Python run with PyPy or Cython would be faster, and Python knowledge seems to be common enough in your field that you wouldn't be creating problems for whoever later inherits it from you).
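To illustrate what such a port might start from, here is a hypothetical Python sketch of the per-line parsing step (the function name and tuple layout are my own choices, not taken from the original script):

```python
def parse_line(line):
    """Split one cut-reduced BAM line into (chromosome, start, cigar).

    Assumes the tab-separated field layout from the question:
    read ID, flag, chromosome, position, CIGAR.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[2], int(fields[3]), fields[4]

# Example using the first line of the sample input:
print(parse_line("E00489:44:HNNVYCCXX:3:1101:24890:5616\t99\t22\t16052014\t150M"))
# → ('22', 16052014, '150M')
```

From there the main loop is just `for line in sys.stdin: ...`, mirroring the Perl `while (my $line = <STDIN>)`.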




Code considerations



Possible bugs?



 my $loc = $start; # distance from start of read
 ...
 for (my $i = 0; $i <= $#c; $i++) { ... }


If I'm correctly interpreting things, shouldn't $loc = $start be just before the loop over $j? Otherwise a CIGAR string with several I/D/S will be double-counting some of the $b[$j].



In addition, I really can't understand how I and D can be treated identically. Shouldn't one of them cause $var_start to go down?




Note: since the example runs too fast for me to get much useful profiling information out of it, this is based on common sense. Speed optimisation often defies common sense, so I advise testing the suggestions one by one to see which work and which don't on large datasets.



The most obvious optimisation relates to softhash, and in particular the loop




 for (my $pos = $var_start; $pos < $var_end; $pos++) {
     $softhash{$a[2]}{$pos}++;
 }




If the ranges tend to be moderately wide then this is doing a lot of work and is ripe for optimisation. Specifically, you can replace that loop with



 $softhash{$a[2]}{$var_start}++;
 $softhash{$a[2]}{$var_end}--;


and then modify the output loop at the end to



foreach my $key_chr (sort(keys %softhash)) {
    my $accum = 0;
    my $prev_pos = -1;
    foreach my $key_pos (sort { $a <=> $b } keys %{ $softhash{$key_chr} }) {
        if ($accum > 0) {
            for (my $i = $prev_pos; $i < $key_pos; $i++) {
                print $fh_all "$key_chr\t$i\t$accum\n";
            }
        }
        $accum += $softhash{$key_chr}{$key_pos};
        $prev_pos = $key_pos;
    }
}


The main processing should be much faster than before, and the output loop should be slightly faster because keys %{ $softhash{$key_chr} } has fewer items to sort.
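The idea behind this optimisation is a classic difference array: record +1 at each range start and -1 at each range end, then recover per-position counts with a running sum over the sorted positions. A small Python sketch (function names are illustrative):

```python
from collections import defaultdict

def add_range(diff, start, end):
    """Mark a half-open range [start, end) in O(1) instead of O(end - start)."""
    diff[start] += 1
    diff[end] -= 1

def coverage(diff):
    """Expand the difference map back into {position: count}."""
    out = {}
    accum = 0
    prev = None
    for pos in sorted(diff):
        if accum > 0:
            for i in range(prev, pos):
                out[i] = accum
        accum += diff[pos]
        prev = pos
    return out

diff = defaultdict(int)
add_range(diff, 5, 8)   # covers positions 5, 6, 7
add_range(diff, 6, 9)   # covers positions 6, 7, 8
print(coverage(diff))   # → {5: 1, 6: 2, 7: 2, 8: 1}
```

The per-read cost drops from one hash update per covered base to two updates per read; the expansion cost is deferred to the single output pass at the end.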




If I'm right about $loc above, it might be worth eliminating the loop over $j and replacing it with an unconditional update of $var_start and $var_end. This probably won't help performance, but it might make the code more readable.



The other things I'd do for readability mainly relate to names. I think it would be helpful to pull out $chromosome = $a[2]; I wonder whether $var_start and $var_end might be better as $range_start and $range_end; and I'm pretty sure that the key_ prefix doesn't convey much useful information: $chromosome could be reused in place of $key_chr, for example.



One other readability issue: I was puzzled by the loop for (my $i = 0; $i <= $#c; $i++) because it clearly runs one time too many. Either start the iteration at 1 or find some way to make the indexing start at 0. If you implement the suggestion about updating the range unconditionally I think you cease to need to reference old values of $b and might be able to replace the loop with while ($cigar =~ /([0-9]+)([A-Z])/g) or something similar. (Not tested).
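The same regex-driven approach translates directly to Python, pairing each count with its operation in a single pass and avoiding the off-by-one that the two parallel split() arrays invite (the function name is my own):

```python
import re

def tokenize_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Z])", cigar)]

print(tokenize_cigar("146M4S"))  # → [(146, 'M'), (4, 'S')]
```

Each pair can then update the range bounds directly, with no indexing into separate number and letter arrays.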






edited Jun 21 at 18:58


























answered Jun 21 at 18:50









Peter Taylor

14k2454











  • These are all great tips! I've implemented pretty much all of them—and in Python, as you suggested.
    – Randoms
    Jun 27 at 16:54




























 
