Haskell searching through a large textfile efficiently

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
1
down vote
favorite
I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.
import Data.Char
import System.IO.Strict
import qualified Data.Set as Set
vowel x = elem x "aeiou"
toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word
wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList
good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet
mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)
Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.
Now, I have three main concerns:
- When I run my code with a
toyWordSet, it runs fine but I ultimately intend to replace it with/usr/share/dict/words, on which my PC freezes. How do I make my program more efficient? - Writing a function like
mapMBoolis not the most elegant idea ever. How can I do this more elegantly? - For some reason, I get some strange exceptions, when I try to use the lazy
readFileinstead of the strict version. What is happening? Why must I use strictness?
Other general comments on the coding style or practice is also welcome.
performance haskell file
add a comment |Â
up vote
1
down vote
favorite
I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.
import Data.Char
import System.IO.Strict
import qualified Data.Set as Set
vowel x = elem x "aeiou"
toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word
wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList
good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet
mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)
Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.
Now, I have three main concerns:
- When I run my code with a
toyWordSet, it runs fine but I ultimately intend to replace it with/usr/share/dict/words, on which my PC freezes. How do I make my program more efficient? - Writing a function like
mapMBoolis not the most elegant idea ever. How can I do this more elegantly? - For some reason, I get some strange exceptions, when I try to use the lazy
readFileinstead of the strict version. What is happening? Why must I use strictness?
Other general comments on the coding style or practice is also welcome.
performance haskell file
mapMBoolis justmapM.
â Gurkenglas
Jan 8 at 19:14
add a comment |Â
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.
import Data.Char
import System.IO.Strict
import qualified Data.Set as Set
vowel x = elem x "aeiou"
toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word
wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList
good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet
mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)
Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.
Now, I have three main concerns:
- When I run my code with a
toyWordSet, it runs fine but I ultimately intend to replace it with/usr/share/dict/words, on which my PC freezes. How do I make my program more efficient? - Writing a function like
mapMBoolis not the most elegant idea ever. How can I do this more elegantly? - For some reason, I get some strange exceptions, when I try to use the lazy
readFileinstead of the strict version. What is happening? Why must I use strictness?
Other general comments on the coding style or practice is also welcome.
performance haskell file
I have been given a list xs and a function f, and I want to know stuff like how many x <- xs satisfy (f x) `elem` xs or which ones do, etc. However, the problem is that the list xs is available only as an external file.
import Data.Char
import System.IO.Strict
import qualified Data.Set as Set
vowel x = elem x "aeiou"
toPigLatin :: String -> String
toPigLatin word
| vowel (head word) = word ++ "ay"
| not $ null [x | x <- word, vowel x] = let (v, c) = span (not.vowel) word in c ++ v ++ "ay"
| otherwise = word
wordList = lines <$> System.IO.Strict.readFile "toyWordSet"
wordSet = Set.fromList <$> wordList
good :: String -> IO Bool
good word = (Set.member (toPigLatin word)) <$> wordSet
mapMBool :: (a -> IO b) -> [a] -> IO [b]
mapMBool f l = do
case l of
-> return
(x:xs) -> do
b1 <- f x
bs <- mapMBool f xs
return (b1:bs)
Ultimately, I want to compute wordList >>= mapMBool good, which is a value of type IO [Bool] which have True at the positions where the desired property is satisfied.
Now, I have three main concerns:
- When I run my code with a
toyWordSet, it runs fine but I ultimately intend to replace it with/usr/share/dict/words, on which my PC freezes. How do I make my program more efficient? - Writing a function like
mapMBoolis not the most elegant idea ever. How can I do this more elegantly? - For some reason, I get some strange exceptions, when I try to use the lazy
readFileinstead of the strict version. What is happening? Why must I use strictness?
Other general comments on the coding style or practice is also welcome.
performance haskell file
edited Jan 8 at 18:40
200_success
123k14143401
123k14143401
asked Jan 8 at 16:53
Agnishom Chattopadhyay
21818
21818
mapMBoolis justmapM.
â Gurkenglas
Jan 8 at 19:14
add a comment |Â
mapMBoolis justmapM.
â Gurkenglas
Jan 8 at 19:14
mapMBool is just mapM.â Gurkenglas
Jan 8 at 19:14
mapMBool is just mapM.â Gurkenglas
Jan 8 at 19:14
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
5
down vote
accepted
I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.
Write your functions as though the external world doesn't exist. E.g.âÂÂ
-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set
Write your actions to set up the values you need to use your IO-ignorant functions. E.g.âÂÂ
main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need
When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
Humm... there's something strange going on. Change your last line tomapM_ putStrLn pigHits. Now change it toputStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
â bisserlis
Jan 10 at 4:20
That works. Thanks. I also triedsequence_ $ map putStrLn pigHits. That works too
â Agnishom Chattopadhyay
Jan 10 at 4:33
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
accepted
I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.
Write your functions as though the external world doesn't exist. E.g.âÂÂ
-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set
Write your actions to set up the values you need to use your IO-ignorant functions. E.g.âÂÂ
main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need
When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
Humm... there's something strange going on. Change your last line tomapM_ putStrLn pigHits. Now change it toputStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
â bisserlis
Jan 10 at 4:20
That works. Thanks. I also triedsequence_ $ map putStrLn pigHits. That works too
â Agnishom Chattopadhyay
Jan 10 at 4:33
add a comment |Â
up vote
5
down vote
accepted
I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.
Write your functions as though the external world doesn't exist. E.g.âÂÂ
-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set
Write your actions to set up the values you need to use your IO-ignorant functions. E.g.âÂÂ
main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need
When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
Humm... there's something strange going on. Change your last line tomapM_ putStrLn pigHits. Now change it toputStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
â bisserlis
Jan 10 at 4:20
That works. Thanks. I also triedsequence_ $ map putStrLn pigHits. That works too
â Agnishom Chattopadhyay
Jan 10 at 4:33
add a comment |Â
up vote
5
down vote
accepted
up vote
5
down vote
accepted
I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.
Write your functions as though the external world doesn't exist. E.g.âÂÂ
-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set
Write your actions to set up the values you need to use your IO-ignorant functions. E.g.âÂÂ
main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need
When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.
I think you've missed the forest for the trees. While the origin of your list might be IO, there's no need to carry values around inside the IO monad just because that's where you found them.
Write your functions as though the external world doesn't exist. E.g.âÂÂ
-- Was `good`
pigExists :: String -> Set String -> Bool
pigExists word set = Set.member (toPigLatin word) set
Write your actions to set up the values you need to use your IO-ignorant functions. E.g.âÂÂ
main :: IO ()
main = do
allWords <- words <$> readFile "wherever" -- equivalent to `allWords <- wordList`
let pigHits = map (word -> pigExists word (Set.fromList allWords)) allWords
print $ length $ filter id pigHits -- Or some other IO action that does what you need
When in doubt, parameterize. You tripped yourself up by defining wordSet directly on wordList instead of just accepting any [String] value you're passed.
answered Jan 8 at 17:39
bisserlis
2,7611614
2,7611614
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
Humm... there's something strange going on. Change your last line tomapM_ putStrLn pigHits. Now change it toputStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
â bisserlis
Jan 10 at 4:20
That works. Thanks. I also triedsequence_ $ map putStrLn pigHits. That works too
â Agnishom Chattopadhyay
Jan 10 at 4:33
add a comment |Â
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
Humm... there's something strange going on. Change your last line tomapM_ putStrLn pigHits. Now change it toputStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.
â bisserlis
Jan 10 at 4:20
That works. Thanks. I also triedsequence_ $ map putStrLn pigHits. That works too
â Agnishom Chattopadhyay
Jan 10 at 4:33
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
This solves the problem of getting frozen, but for a file with 45k words, this still takes forever ( > 30 minutes)
â Agnishom Chattopadhyay
Jan 9 at 13:14
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
Hm... that is wildly excessive. You're running from GHCi, aren't you? Try compiling it. Running my compiled version on a 45k word file takes my machine 0.097s.
â bisserlis
Jan 9 at 18:42
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
I did. It does not seem to help at all. Can I see your code? Here is mine: pastebin.com/qzeFri4H
â Agnishom Chattopadhyay
Jan 10 at 2:15
1
1
Humm... there's something strange going on. Change your last line to
mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.â bisserlis
Jan 10 at 4:20
Humm... there's something strange going on. Change your last line to
mapM_ putStrLn pigHits. Now change it to putStrLn $ unlines pigHits. The former finishes immediately for me, the latter runs until my patience is exceeded (so, longer than 15 seconds). I have no idea what's happening! How exciting. Try posting it as a mystery to Stack Overflow.â bisserlis
Jan 10 at 4:20
That works. Thanks. I also tried
sequence_ $ map putStrLn pigHits. That works tooâ Agnishom Chattopadhyay
Jan 10 at 4:33
That works. Thanks. I also tried
sequence_ $ map putStrLn pigHits. That works tooâ Agnishom Chattopadhyay
Jan 10 at 4:33
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184596%2fhaskell-searching-through-a-large-textfile-efficiently%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
mapMBoolis justmapM.â Gurkenglas
Jan 8 at 19:14