Haskell sentence segregation


I am trying to implement sentence segregation (segmentation) in Haskell. I have achieved a decent bulk of it using the NLP.FullStop library, but it doesn't seem to account for sentences with full stops at the end of quotes like this." or like this.', or at the end of bracketed sentences like this.) I also want to deal with the character ” in much the same way as ", since a lot of the content I am dealing with uses it. I've been unable to get a successful regex match on this character, so I have resorted to replacing it with " before the regex...

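import qualified Data.ByteString.Char8 as BC
import Data.List.Split
import qualified NLP.FullStop as FS
import Text.Regex.PCRE ((=~)) -- assumed: the post omitted this import and does not name the regex backend

-- | Split text into sentences: first break after words that end in a full
-- stop plus a closing quote or bracket, then let FS.segment handle the rest.
splitter :: String -> [String]
splitter = concatMap FS.segment . splitPunc
  where
    splitPunc = map unwords . split puncSplitter . words
    -- keep each matching word at the end of the chunk it closes
    puncSplitter = keepDelimsR $ whenElt (\word -> BC.pack (splitPrep word) =~ puncExpr :: Bool)
    -- treat the right double quotation mark like a plain double quote
    splitPrep = replace_ '”' '"'
    puncExpr = "\\.[)'\"][^\\w]?$" :: String

-- | Replace every occurrence of the first argument with the second.
replace_ :: Eq b => b -> b -> [b] -> [b]
replace_ a b = map (\x -> if a == x then b else x)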
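
For comparison, the word-level test can also be written without a regex, which avoids both the `=~` import and the ByteString round trip. The following is only a sketch under assumptions: `closers`, `endsSentence`, `chunkAfter` and `splitOnClosers` are hypothetical names that belong to neither NLP.FullStop nor Data.List.Split, the set of closing characters is a guess at what the input needs, and `isWordChar` only approximates the regex class `\w`.

```haskell
import Data.Char (isAlphaNum)

-- | Characters that may close a sentence after its full stop.
-- '\8221' is the right double quotation mark, compared as an ordinary
-- Char, so no replacement step is needed. (Assumed set; extend as needed.)
closers :: String
closers = ")'\"\8221"

-- | Rough stand-in for the regex class \w.
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- | True when the word ends in a full stop, one closer, and at most one
-- trailing non-word character (mirroring the shape of the regex above).
endsSentence :: String -> Bool
endsSentence word =
  case reverse word of
    (c : '.' : _)     | c `elem` closers                       -> True
    (t : c : '.' : _) | not (isWordChar t), c `elem` closers   -> True
    _                                                          -> False

-- | Group elements into chunks, closing a chunk after every element for
-- which the predicate holds. Similar in spirit to keepDelimsR with
-- whenElt, but it never produces empty chunks.
chunkAfter :: (a -> Bool) -> [a] -> [[a]]
chunkAfter p xs =
  case break p xs of
    (before, [])         -> [before | not (null before)]
    (before, hit : rest) -> (before ++ [hit]) : chunkAfter p rest

-- | Pre-split text after words that end a quoted or bracketed sentence.
splitOnClosers :: String -> [String]
splitOnClosers = map unwords . chunkAfter endsSentence . words
```

Under these assumptions, `splitOnClosers` could take the place of `splitPunc`, with `FS.segment` still applied to each resulting chunk.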

asked Mar 20 at 1:03

danbroooks

Does your code as posted work correctly to accomplish the task?
– Phrancis
Mar 20 at 3:42

Yes. The text in my post is there to give some context on why I have done certain things in this code, hopefully to aid whoever reads it, as it is not very readable to me.
– danbroooks
Mar 20 at 8:41

Your code is missing at least one import for =~.
– Zeta
Mar 29 at 9:07


1 Answer
While your code works and uses type signatures, it's missing documentation. It's not clear from your description or your code what splitter's intended result will be on a given input. Documentation and tests are therefore highly welcome.

Also, it's not clear why you've added an underscore to replace_. And your code is missing at least one import for =~. I assume that you just forgot to include that import line in your question and that it is in your actual code.

That being said, the fullstop library is, according to its own documentation, a placeholder library:

Note that this package is mostly a placeholder. I hope the Haskell/NLP
communities will run with it and upload a more sophisticated (family
of) segmenter(s) in its place. Patches (and new maintainers) would be
greeted with delight!

Your trouble with the sentence endings also comes from segment, since it hard-codes the allowed punctuation:

-- https://hackage.haskell.org/package/fullstop-0.1.4/docs/src/NLP-FullStop.html#stopPunctuation
stopPunctuation :: [Char]
stopPunctuation = [ '.', '?', '!' ] -- <<<<

Unfortunately, you cannot simply expand stopPunctuation, since content in parentheses (like this) does not lead to a new sentence. Note also that .) and ." aren't valid in some languages, which require ). and ". instead, so it's not clear what you are trying to achieve there (see the remark about documentation above).
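
To illustrate why a closing parenthesis cannot simply join the stop characters, here is a deliberately naive splitter. It is not how NLP.FullStop works internally; `splitAfterAny` and `sample` are hypothetical names used only for this demonstration.

```haskell
-- | Split a string after every character that belongs to the given set,
-- dropping the spaces that precede each piece. A naive stand-in, not the
-- implementation used by NLP.FullStop.
splitAfterAny :: String -> String -> [String]
splitAfterAny stops = go
  where
    go s =
      case break (`elem` stops) (dropWhile (== ' ') s) of
        ("", "")          -> []
        (piece, "")       -> [piece]
        (piece, c : rest) -> (piece ++ [c]) : go rest

-- | One sentence that contains a parenthetical remark.
sample :: String
sample = "Content in parentheses (like this) does not end a sentence."

-- With the usual stop characters the sentence stays whole:
--   splitAfterAny ".?!" sample
--     == [sample]
--
-- Once ')' counts as a stop character, the sentence is cut in two:
--   splitAfterAny ".?!)" sample
--     == ["Content in parentheses (like this)", "does not end a sentence."]
```

A closing bracket or quote therefore only makes sense as a boundary when it directly follows a full stop, which is what the pre-splitting step in the question tries to detect.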

So all in all, well written, but without additional explanation or documentation there is no way to check whether the function actually does what you want. I also suggest that you add some tests.
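
One lightweight way to combine documentation and tests is to write the intended behaviour down as data and check any candidate splitter against it. The sketch below uses only base. The names `Case`, `cases`, `failures` and `report` are hypothetical, and the expected results are assumptions about what the function is meant to return, not something stated in the question.

```haskell
-- | One documented expectation: an input text and the sentences it
-- should yield.
type Case = (String, [String])

-- | Intended behaviour, written down as data. These expectations are
-- assumptions and should be adjusted to the real requirements.
cases :: [Case]
cases =
  [ ("One. Two.",                    ["One.", "Two."])
  , ("He said \"stop.\" Then left.", ["He said \"stop.\"", "Then left."])
  , ("(A bracketed one.) Next.",     ["(A bracketed one.)", "Next."])
  ]

-- | Run a candidate splitter over the given cases and return every
-- failing one as (input, expected, actual).
failures :: (String -> [String]) -> [Case] -> [(String, [String], [String])]
failures f cs =
  [ (input, expected, actual)
  | (input, expected) <- cs
  , let actual = f input
  , actual /= expected
  ]

-- | Print a short report for a candidate splitter.
report :: (String -> [String]) -> IO ()
report f =
  case failures f cases of
    []  -> putStrLn "all cases pass"
    bad -> mapM_ print bad
```

With the function from the question in scope, `report splitter` in GHCi would then list every case where the actual output differs from the documented expectation.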

answered Mar 29 at 9:17

Zeta
