Smith-Waterman algorithm for local sequence alignment with affine gap penalties

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
2
down vote

favorite

The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.

This is about determining the (score / edit distance of the) most similar substrings of two given strings.

For example, given the two sequences

TGTTACGG and
GGTTGACTA

the optimal (w.r.t. to a certain scoring function) local alignment would be

GTTA-C
GTTGAC

I am using a slightly different way to score introduction and continuation of gaps than in the Wikipedia article. Technically, in addition to one scoring matrix / edit graph, you have two more - each accounting for gaps in one of the input words.

Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:

recurrence

With rho penalty for introducing a gap; sigma penalty for continuing a gap; and delta a function that assigns a score to two matched characters (e.g. 1 if equal, -1 if unequal). The higher the score the better.

This can be solved using Dynamic Programming.

On the basis of this blog article I wrote the following Haskell code:

import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
 where (m, n) = (length a, length b)

 -- use input as arrays for fast direct indexing
 v = Array.listArray (1, m) a
 w = Array.listArray (1, n) b

 -- declare scoring fns for each "edit graph level"
 scmid i 0 = 0
 scmid 0 j = 0
 scmid i j = maximum [ 0 -- start here ("free" edges from (0,0)) 
 ,(mid ! (i-1, j-1)) + ÃŽÂ´ (v!i) (w!j) -- match or mismatch
 , low ! (i , j) -- end deletion / gap
 , upp ! (i , j) -- end insertion / gap
 ]

 scupp i 0 = 0
 scupp 0 j = 0
 scupp i j = maximum [
 (upp ! (i, j-1)) - ÃÂƒ -- cont gap in v
 , (mid ! (i, j-1)) - ÃÂ -- start gap in v
 ]

 sclow i 0 = 0
 sclow 0 j = 0
 sclow i j = maximum [
 (low ! (i-1, j)) - ÃÂƒ -- cont gap in w
 , (mid ! (i-1, j)) - ÃÂ -- start gap in w
 ]

 -- declare content of "edit graph levels"
 mid = Array.listArray bounds
 [scmid i j | (i, j) <- Array.range bounds]
 low = Array.listArray bounds
 [sclow i j | (i, j) <- Array.range bounds]
 upp = Array.listArray bounds
 [scupp i j | (i, j) <- Array.range bounds]

 bounds = ((0,0), (m,n))

 ÃŽÂ´ v w | v == w = 1 -- match
 | otherwise = -2 -- mismatch

 ÃÂ = 5
 ÃÂƒ = 0.5

Question

I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?

edited May 24 at 9:21

Mast

7,32663484

asked May 23 at 13:09

ngmir

112

Why does the scoring function prefer the A before the gap and not after the gap?
â€“Â Vogel612â™¦
May 23 at 13:16

2

Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
â€“Â Dannnno
May 23 at 13:48

I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
â€“Â Mast
May 24 at 9:22

add a commentÂ |Â

up vote
2
down vote

favorite

The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.

This is about determining the (score / edit distance of the) most similar substrings of two given strings.

For example, given the two sequences

TGTTACGG and
GGTTGACTA

the optimal (w.r.t. to a certain scoring function) local alignment would be

GTTA-C
GTTGAC

Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:

recurrence

This can be solved using Dynamic Programming.

On the basis of this blog article I wrote the following Haskell code:

import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
 where (m, n) = (length a, length b)

 -- use input as arrays for fast direct indexing
 v = Array.listArray (1, m) a
 w = Array.listArray (1, n) b

 -- declare scoring fns for each "edit graph level"
 scmid i 0 = 0
 scmid 0 j = 0
 scmid i j = maximum [ 0 -- start here ("free" edges from (0,0)) 
 ,(mid ! (i-1, j-1)) + ÃŽÂ´ (v!i) (w!j) -- match or mismatch
 , low ! (i , j) -- end deletion / gap
 , upp ! (i , j) -- end insertion / gap
 ]

 scupp i 0 = 0
 scupp 0 j = 0
 scupp i j = maximum [
 (upp ! (i, j-1)) - ÃÂƒ -- cont gap in v
 , (mid ! (i, j-1)) - ÃÂ -- start gap in v
 ]

 sclow i 0 = 0
 sclow 0 j = 0
 sclow i j = maximum [
 (low ! (i-1, j)) - ÃÂƒ -- cont gap in w
 , (mid ! (i-1, j)) - ÃÂ -- start gap in w
 ]

 -- declare content of "edit graph levels"
 mid = Array.listArray bounds
 [scmid i j | (i, j) <- Array.range bounds]
 low = Array.listArray bounds
 [sclow i j | (i, j) <- Array.range bounds]
 upp = Array.listArray bounds
 [scupp i j | (i, j) <- Array.range bounds]

 bounds = ((0,0), (m,n))

 ÃŽÂ´ v w | v == w = 1 -- match
 | otherwise = -2 -- mismatch

 ÃÂ = 5
 ÃÂƒ = 0.5

Question

I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?

edited May 24 at 9:21

Mast

7,32663484

asked May 23 at 13:09

ngmir

112

Why does the scoring function prefer the A before the gap and not after the gap?
â€“Â Vogel612â™¦
May 23 at 13:16

2

Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
â€“Â Dannnno
May 23 at 13:48

I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
â€“Â Mast
May 24 at 9:22

add a commentÂ |Â

up vote
2
down vote

favorite

The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.

This is about determining the (score / edit distance of the) most similar substrings of two given strings.

For example, given the two sequences

TGTTACGG and
GGTTGACTA

the optimal (w.r.t. to a certain scoring function) local alignment would be

GTTA-C
GTTGAC

Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:

recurrence

This can be solved using Dynamic Programming.

On the basis of this blog article I wrote the following Haskell code:

import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
 where (m, n) = (length a, length b)

 -- use input as arrays for fast direct indexing
 v = Array.listArray (1, m) a
 w = Array.listArray (1, n) b

 -- declare scoring fns for each "edit graph level"
 scmid i 0 = 0
 scmid 0 j = 0
 scmid i j = maximum [ 0 -- start here ("free" edges from (0,0)) 
 ,(mid ! (i-1, j-1)) + ÃŽÂ´ (v!i) (w!j) -- match or mismatch
 , low ! (i , j) -- end deletion / gap
 , upp ! (i , j) -- end insertion / gap
 ]

 scupp i 0 = 0
 scupp 0 j = 0
 scupp i j = maximum [
 (upp ! (i, j-1)) - ÃÂƒ -- cont gap in v
 , (mid ! (i, j-1)) - ÃÂ -- start gap in v
 ]

 sclow i 0 = 0
 sclow 0 j = 0
 sclow i j = maximum [
 (low ! (i-1, j)) - ÃÂƒ -- cont gap in w
 , (mid ! (i-1, j)) - ÃÂ -- start gap in w
 ]

 -- declare content of "edit graph levels"
 mid = Array.listArray bounds
 [scmid i j | (i, j) <- Array.range bounds]
 low = Array.listArray bounds
 [sclow i j | (i, j) <- Array.range bounds]
 upp = Array.listArray bounds
 [scupp i j | (i, j) <- Array.range bounds]

 bounds = ((0,0), (m,n))

 ÃŽÂ´ v w | v == w = 1 -- match
 | otherwise = -2 -- mismatch

 ÃÂ = 5
 ÃÂƒ = 0.5

Question

I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?

edited May 24 at 9:21

Mast

7,32663484

asked May 23 at 13:09

ngmir

112

The following is a summary from the top of my head so please correct me if I'm wrong. See the Wikipedia article for more.

This is about determining the (score / edit distance of the) most similar substrings of two given strings.

For example, given the two sequences

TGTTACGG and
GGTTGACTA

the optimal (w.r.t. to a certain scoring function) local alignment would be

GTTA-C
GTTGAC

Given two input words a = a1 a2, ..., an, b = b1 b2 ... bm, let s_ij denote the score of the optimal local alignment of a1..ai and b1..bj. This then gives the following recurrence:

recurrence

This can be solved using Dynamic Programming.

On the basis of this blog article I wrote the following Haskell code:

import qualified Data.Array as Array
import Data.Array

scoreLocal :: String -> String -> Float
scoreLocal a b = maximum mid
 where (m, n) = (length a, length b)

 -- use input as arrays for fast direct indexing
 v = Array.listArray (1, m) a
 w = Array.listArray (1, n) b

 -- declare scoring fns for each "edit graph level"
 scmid i 0 = 0
 scmid 0 j = 0
 scmid i j = maximum [ 0 -- start here ("free" edges from (0,0)) 
 ,(mid ! (i-1, j-1)) + ÃŽÂ´ (v!i) (w!j) -- match or mismatch
 , low ! (i , j) -- end deletion / gap
 , upp ! (i , j) -- end insertion / gap
 ]

 scupp i 0 = 0
 scupp 0 j = 0
 scupp i j = maximum [
 (upp ! (i, j-1)) - ÃÂƒ -- cont gap in v
 , (mid ! (i, j-1)) - ÃÂ -- start gap in v
 ]

 sclow i 0 = 0
 sclow 0 j = 0
 sclow i j = maximum [
 (low ! (i-1, j)) - ÃÂƒ -- cont gap in w
 , (mid ! (i-1, j)) - ÃÂ -- start gap in w
 ]

 -- declare content of "edit graph levels"
 mid = Array.listArray bounds
 [scmid i j | (i, j) <- Array.range bounds]
 low = Array.listArray bounds
 [sclow i j | (i, j) <- Array.range bounds]
 upp = Array.listArray bounds
 [scupp i j | (i, j) <- Array.range bounds]

 bounds = ((0,0), (m,n))

 ÃŽÂ´ v w | v == w = 1 -- match
 | otherwise = -2 -- mismatch

 ÃÂ = 5
 ÃÂƒ = 0.5

Question

I dont have much experience yet with Haskell. Can (parts of this) be written in a nicer way?

edited May 24 at 9:21

Mast

7,32663484

asked May 23 at 13:09

ngmir

112

edited May 24 at 9:21

Mast

7,32663484

edited May 24 at 9:21

Mast

7,32663484

edited May 24 at 9:21

Mast

7,32663484

asked May 23 at 13:09

ngmir

112

asked May 23 at 13:09

ngmir

112

asked May 23 at 13:09

ngmir

112

Why does the scoring function prefer the A before the gap and not after the gap?
â€“Â Vogel612â™¦
May 23 at 13:16

2

Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
â€“Â Dannnno
May 23 at 13:48

I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
â€“Â Mast
May 24 at 9:22

add a commentÂ |Â

Why does the scoring function prefer the A before the gap and not after the gap?
â€“Â Vogel612â™¦
May 23 at 13:16

2

Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
â€“Â Dannnno
May 23 at 13:48

I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
â€“Â Mast
May 24 at 9:22

Why does the scoring function prefer the A before the gap and not after the gap?
â€“Â Vogel612â™¦
May 23 at 13:16

Your second question, of how to modify the code to compute that value, is off-topic; we only review existing code, we don't enhance or provide enhancements to the functionality.
â€“Â Dannnno
May 23 at 13:48

I removed the off-topic part of your question, it was either that or getting the question closed. Please take a look at the help center. Feature requests are not something we do.
â€“Â Mast
May 24 at 9:22

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195020%2fsmith-waterman-algorithm-for-local-sequence-alignment-with-affine-gap-penalties%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr