Smith-Waterman algorithm for local sequence alignment with affine gap penalties

Multi tool use
Multi tool use

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question





















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22
















up vote
2
down vote

favorite












The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question





















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22












up vote
2
down vote

favorite









up vote
2
down vote

favorite











The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?







share|improve this question













The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.



This is about determining the (score / edit distance of the) most similar substrings of two given strings.



For example, given the two sequences



TGTTACGG and
GGTTGACTA


the optimal (w.r.t. to a certain scoring function) local alignment would be



GTTA-C
GTTGAC


I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.



Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:



recurrence



With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.



This can be solved using Dynamic Programming.



On the basis of this blog article I wrote the following Haskell code:



import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
where (m, n) = (length a, length b)

-- use input as arrays for fast direct indexing
v = Array.listArray (1, m) a
w = Array.listArray (1, n) b

-- declare scoring fns for each "edit graph level"
scmid i 0 = 0
scmid 0 j = 0
scmid i j = maximum [ 0 -- start here ("free" edges from (0,0))
,(mid ! (i-1, j-1)) + δ (v!i) (w!j) -- match or mismatch
, low ! (i , j) -- end deletion / gap
, upp ! (i , j) -- end insertion / gap
]

scupp i 0 = 0
scupp 0 j = 0
scupp i j = maximum [
(upp ! (i, j-1)) - σ -- cont gap in v
, (mid ! (i, j-1)) - ρ -- start gap in v
]

sclow i 0 = 0
sclow 0 j = 0
sclow i j = maximum [
(low ! (i-1, j)) - σ -- cont gap in w
, (mid ! (i-1, j)) - ρ -- start gap in w
]

-- declare content of "edit graph levels"
mid = Array.listArray bounds
[scmid i j | (i, j) <- Array.range bounds]
low = Array.listArray bounds
[sclow i j | (i, j) <- Array.range bounds]
upp = Array.listArray bounds
[scupp i j | (i, j) <- Array.range bounds]

bounds = ((0,0), (m,n))

δ v w | v == w = 1 -- match
| otherwise = -2 -- mismatch

ρ = 5
σ = 0.5


Question



I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?









share|improve this question












share|improve this question




share|improve this question








edited May 24 at 9:21









Mast

7,32663484




7,32663484









asked May 23 at 13:09









ngmir

112




112











  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22
















  • Why does the scoring function prefer the A before the gap and not after the gap?
    – Vogel612♦
    May 23 at 13:16






  • 2




    Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
    – Dannnno
    May 23 at 13:48










  • I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
    – Mast
    May 24 at 9:22















Why does the scoring function prefer the A before the gap and not after the gap?
– Vogel612♦
May 23 at 13:16




Why does the scoring function prefer the A before the gap and not after the gap?
– Vogel612♦
May 23 at 13:16




2




2




Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
– Dannnno
May 23 at 13:48




Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
– Dannnno
May 23 at 13:48












I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
– Mast
May 24 at 9:22




I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
– Mast
May 24 at 9:22















active

oldest

votes











Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195020%2fsmith-waterman-algorithm-for-local-sequence-alignment-with-affine-gap-penalties%23new-answer', 'question_page');

);

Post as a guest



































active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes










 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195020%2fsmith-waterman-algorithm-for-local-sequence-alignment-with-affine-gap-penalties%23new-answer', 'question_page');

);

Post as a guest













































































BFMOeUEAMzyv3VzGu2gHVjrhoIZQXj
ybm9,ZYnJ5uPzA6,yfu0hw kwYnNMyJ,YsP9fP7Bg HBh9MO

Popular posts from this blog

Chat program with C++ and SFML

Function to Return a JSON Like Objects Using VBA Collections and Arrays

Read an image with ADNS2610 optical sensor and Arduino Uno