Haskell searching through a large textfile efficiently

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
1
down vote

favorite












I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.



import Data.Char
import System.IO.Strict
import qualified Data.Set as Set

vowel x = elem x "aeiou"

toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word


wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList

good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet

mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)


Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.



Now, I have three main concerns:



  1. When I run my code with a toyWordSet, it runs fine but I ultimately intend to replace it with /usr/share/dict/words, on which my PC freezes. How do I make my program more efficient?

  2. Writing a function like mapMBool is not the most elegant idea ever. How can I do this more elegantly?

  3. For some reason, I get some strange exceptions, when I try to use the lazy readFile instead of the strict version. What is happening? Why must I use strictness?

Other general comments on the coding style or practice is also welcome.







share|improve this question





















  • mapMBool is just mapM.
    – Gurkenglas
    Jan 8 at 19:14
















up vote
1
down vote

favorite












I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.



import Data.Char
import System.IO.Strict
import qualified Data.Set as Set

vowel x = elem x "aeiou"

toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word


wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList

good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet

mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)


Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.



Now, I have three main concerns:



  1. When I run my code with a toyWordSet, it runs fine but I ultimately intend to replace it with /usr/share/dict/words, on which my PC freezes. How do I make my program more efficient?

  2. Writing a function like mapMBool is not the most elegant idea ever. How can I do this more elegantly?

  3. For some reason, I get some strange exceptions, when I try to use the lazy readFile instead of the strict version. What is happening? Why must I use strictness?

Other general comments on the coding style or practice is also welcome.







share|improve this question





















  • mapMBool is just mapM.
    – Gurkenglas
    Jan 8 at 19:14












up vote
1
down vote

favorite









up vote
1
down vote

favorite











I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.



import Data.Char
import System.IO.Strict
import qualified Data.Set as Set

vowel x = elem x "aeiou"

toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word


wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList

good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet

mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)


Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.



Now, I have three main concerns:



  1. When I run my code with a toyWordSet, it runs fine but I ultimately intend to replace it with /usr/share/dict/words, on which my PC freezes. How do I make my program more efficient?

  2. Writing a function like mapMBool is not the most elegant idea ever. How can I do this more elegantly?

  3. For some reason, I get some strange exceptions, when I try to use the lazy readFile instead of the strict version. What is happening? Why must I use strictness?

Other general comments on the coding style or practice is also welcome.







share|improve this question













I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.



import Data.Char
import System.IO.Strict
import qualified Data.Set as Set

vowel x = elem x "aeiou"

toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word


wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList

good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet

mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)


Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.



Now, I have three main concerns:



  1. When I run my code with a toyWordSet, it runs fine but I ultimately intend to replace it with /usr/share/dict/words, on which my PC freezes. How do I make my program more efficient?

  2. Writing a function like mapMBool is not the most elegant idea ever. How can I do this more elegantly?

  3. For some reason, I get some strange exceptions, when I try to use the lazy readFile instead of the strict version. What is happening? Why must I use strictness?

Other general comments on the coding style or practice is also welcome.









share|improve this question












share|improve this question




share|improve this question








edited Jan 8 at 18:40









200_success

123k14143401




123k14143401









asked Jan 8 at 16:53









Agnishom Chattopadhyay

21818




21818











  • mapMBool is just mapM.
    – Gurkenglas
    Jan 8 at 19:14
















  • mapMBool is just mapM.
    – Gurkenglas
    Jan 8 at 19:14















mapMBool is just mapM.
– Gurkenglas
Jan 8 at 19:14




mapMBool is just mapM.
– Gurkenglas
Jan 8 at 19:14










1 Answer
1






active

oldest

votes

















up vote
5
down vote



accepted










I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.



Write your functions as though the external world doesn't exist. E.g.—



-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set


Write your actions to set up the values you need to use your IO-ignorant functions. E.g.—



main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need


When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.






share|improve this answer





















  • This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
    – Agnishom Chattopadhyay
    Jan 9 at 13:14










  • Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
    – bisserlis
    Jan 9 at 18:42










  • I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
    – Agnishom Chattopadhyay
    Jan 10 at 2:15






  • 1




    Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
    – bisserlis
    Jan 10 at 4:20










  • That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
    – Agnishom Chattopadhyay
    Jan 10 at 4:33










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184596%2fhaskell-searching-through-a-large-textfile-efficiently%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
5
down vote



accepted










I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.



Write your functions as though the external world doesn't exist. E.g.—



-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set


Write your actions to set up the values you need to use your IO-ignorant functions. E.g.—



main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need


When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.






share|improve this answer





















  • This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
    – Agnishom Chattopadhyay
    Jan 9 at 13:14










  • Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
    – bisserlis
    Jan 9 at 18:42










  • I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
    – Agnishom Chattopadhyay
    Jan 10 at 2:15






  • 1




    Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
    – bisserlis
    Jan 10 at 4:20










  • That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
    – Agnishom Chattopadhyay
    Jan 10 at 4:33














up vote
5
down vote



accepted










I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.



Write your functions as though the external world doesn't exist. E.g.—



-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set


Write your actions to set up the values you need to use your IO-ignorant functions. E.g.—



main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need


When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.






share|improve this answer





















  • This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
    – Agnishom Chattopadhyay
    Jan 9 at 13:14










  • Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
    – bisserlis
    Jan 9 at 18:42










  • I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
    – Agnishom Chattopadhyay
    Jan 10 at 2:15






  • 1




    Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
    – bisserlis
    Jan 10 at 4:20










  • That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
    – Agnishom Chattopadhyay
    Jan 10 at 4:33












up vote
5
down vote



accepted







up vote
5
down vote



accepted






I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.



Write your functions as though the external world doesn't exist. E.g.—



-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set


Write your actions to set up the values you need to use your IO-ignorant functions. E.g.—



main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need


When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.






share|improve this answer













I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.



Write your functions as though the external world doesn't exist. E.g.—



-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set


Write your actions to set up the values you need to use your IO-ignorant functions. E.g.—



main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need


When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.







share|improve this answer













share|improve this answer



share|improve this answer











answered Jan 8 at 17:39









bisserlis

2,7611614




2,7611614











  • This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
    – Agnishom Chattopadhyay
    Jan 9 at 13:14










  • Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
    – bisserlis
    Jan 9 at 18:42










  • I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
    – Agnishom Chattopadhyay
    Jan 10 at 2:15






  • 1




    Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
    – bisserlis
    Jan 10 at 4:20










  • That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
    – Agnishom Chattopadhyay
    Jan 10 at 4:33
















  • This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
    – Agnishom Chattopadhyay
    Jan 9 at 13:14










  • Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
    – bisserlis
    Jan 9 at 18:42










  • I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
    – Agnishom Chattopadhyay
    Jan 10 at 2:15






  • 1




    Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
    – bisserlis
    Jan 10 at 4:20










  • That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
    – Agnishom Chattopadhyay
    Jan 10 at 4:33















This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
– Agnishom Chattopadhyay
Jan 9 at 13:14




This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
– Agnishom Chattopadhyay
Jan 9 at 13:14












Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
– bisserlis
Jan 9 at 18:42




Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
– bisserlis
Jan 9 at 18:42












I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
– Agnishom Chattopadhyay
Jan 10 at 2:15




I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
– Agnishom Chattopadhyay
Jan 10 at 2:15




1




1




Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
– bisserlis
Jan 10 at 4:20




Humm... there's something strange going on. Change your last line to mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
– bisserlis
Jan 10 at 4:20












That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
– Agnishom Chattopadhyay
Jan 10 at 4:33




That works. Thanks. I also tried sequence_ $ map putStrLn pigHits. That works too
– Agnishom Chattopadhyay
Jan 10 at 4:33












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184596%2fhaskell-searching-through-a-large-textfile-efficiently%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods