Smith-Waterman algorithm for local sequence alignment with affine gap penalties

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question





















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22
















up vote
2
down vote

favorite












The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question





















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22












up vote
2
down vote

favorite









up vote
2
down vote

favorite











The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question













The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?









share|improve this question












share|improve this question




share|improve this question








edited May 24 at 9:21









Mast

7,32663484




7,32663484









asked May 23 at 13:09









ngmir

112




112











  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22
















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22















Why does the scoring function prefer the A before the gap and not after the gap?
– Vogel612♦
May 23 at 13:16




Why does the scoring function prefer the A before the gap and not after the gap?
– Vogel612♦
May 23 at 13:16




2




2




Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
– Dannnno
May 23 at 13:48




Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
– Dannnno
May 23 at 13:48












I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
– Mast
May 24 at 9:22




I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
– Mast
May 24 at 9:22















active

oldest

votes











Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195020%2fsmith-waterman-algorithm-for-local-sequence-alignment-with-affine-gap-penalties%23new-answer', 'question_page');

);

Post as a guest



































active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes










 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195020%2fsmith-waterman-algorithm-for-local-sequence-alignment-with-affine-gap-penalties%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Greedy Best First Search implementation in Rust

Function to Return a JSON Like Objects Using VBA Collections and Arrays

C++11 CLH Lock Implementation