Markov chains to generate text

This is my Python 3 code to generate text using a Markov chain.
The chain first randomly selects a word from a text file. Out of all the occurrences of that word in the text file, the program then picks one of the words that followed it, so the most popular next words are the most likely to be chosen. It continues this process to form a fairly readable text.

The best thing about this code is that it copies the style of writing in the text file. For the first trial of the code, I fed it three of Shakespeare's most famous plays: Macbeth, Julius Caesar and The Comedy of Errors. When I generated text from it, the outcome read very much like a Shakespeare poem.

My knowledge of Python is between intermediate and expert. Please review my code and make changes as you like. I want suggestions from both experts and beginners.



# Markov Chain Poetry

import random
import sys

poems = open("text.txt", "r").read()
poems = ''.join([i for i in poems if not i.isdigit()]).replace("\n\n", " ").split(' ')
# This processes the list of poems. Double line breaks separate poems, so they are removed.
# Splitting along spaces creates a list of all words.

index = 1
chain = {}
count = 1000  # Desired word count of output

# This loop creates a dictionary called "chain". Each key is a word, and the value of each key
# is an array of the words that immediately followed it.
for word in poems[index:]:
    key = poems[index - 1]
    if key in chain:
        chain[key].append(word)
    else:
        chain[key] = [word]
    index += 1

word1 = random.choice(list(chain.keys()))  # random first word
message = word1.capitalize()

# Picks the next word over and over until the word count is reached
while len(message.split(' ')) < count:
    word2 = random.choice(chain[word1])
    word1 = word2
    message += ' ' + word2

# Creates a new file with the output and prints it to the terminal
with open("output.txt", "w") as file:
    file.write(message)
output = open("output.txt", "r")
print(output.read())


Thanks!!!







asked May 2 at 6:39 by AnanthaKrishna K, edited May 2 at 7:04 by Phrancis




















1 Answer






Functions

Split the code into functions, and separate the generation from the presentation. Your algorithm has some clear, distinct tasks, so split along these lines:



• read input

• assemble chain

• construct new poem

• output

This way you can reuse parts of the code, save intermediate results, and test the parts individually; a minimal sketch of how the pieces could fit together follows below.
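As an illustration only, here is one way the four parts could be wired together once they exist. The function names (read_file, assemble_chain, construct_poem) follow the suggestions further down in this answer, and the filename is just a placeholder:

def main():
    words = read_file('text.txt')                            # read input
    chain = assemble_chain(words)                            # assemble chain
    lines = construct_poem(chain, lines=4, line_length=10)   # construct new poem
    print('\n'.join(map(str.capitalize, lines)))             # output

if __name__ == '__main__':
    main()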



generators

Instead of keeping all the intermediate lists in memory, generators can be a lot more memory efficient. I try to use them as much as possible; turning them into a list or dict when needed is easy.
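A tiny, self-contained illustration of the difference (not from the original answer): a list comprehension materializes every element up front, while the equivalent generator expression only stores its iteration state.

import sys

squares_list = [n * n for n in range(1_000_000)]   # all values held in memory at once
squares_gen = (n * n for n in range(1_000_000))    # values produced one at a time on demand

print(sys.getsizeof(squares_list))  # on the order of megabytes
print(sys.getsizeof(squares_gen))   # roughly a hundred bytes, regardless of length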



read the input

There is no need to assemble the intermediate list in ''.join([i for i in poems if not i.isdigit()]). join is perfectly capable of handling any iterable, so a generator expression works just as well.
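For the snippet from the question that simply means dropping the square brackets:

poems = ''.join(i for i in poems if not i.isdigit())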



Use the with statement to open files:

def read_input(filename):
    """Reads `filename`, yields the consecutive words."""
    with open(filename, 'r') as file:
        for line in file:
            for word in line.split(' '):
                if word and not word.isdigit():
                    yield word


With regular expressions, and by hoisting the IO, you can simplify this method even more:

import re

def read_input_re(file):
    pattern = re.compile("[a-zA-Z][a-zA-Z']+")
    for line in file:
        for word in pattern.finditer(line):
            yield word.group()


which can then be called with a file:

def read_file(filename):
    with open(filename, 'r') as file:
        # yield from keeps the file open until the generator is exhausted
        yield from read_input_re(file)


or with any iterable that yields strings as argument. For example, if poem holds a multi-line string with a poem: words = read_input_re(poem.split('\n'))
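As a quick, made-up check that it really does accept any iterable of strings (note that this pattern skips one-letter words such as I and a):

sample_lines = ["Shall I compare thee to a summer's day?", "Thou art more lovely"]
print(list(read_input_re(sample_lines)))
# ['Shall', 'compare', 'thee', 'to', "summer's", 'day', 'Thou', 'art', 'more', 'lovely']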



This refactoring also makes loading the different poems from different text files almost trivial:

import itertools

filenames = ['file1.txt', 'file2.txt', ...]
parsed_files = (read_file(filename) for filename in filenames)
words = itertools.chain.from_iterable(parsed_files)


If you want all the words in the chain lowercase, so FROM and from are marked as the same word, just add

words = map(str.lower, words)


assemble the chain

Here a collections.defaultdict(list) is the natural data structure for the chain.
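A short demonstration of what that buys you (illustrative, not from the original answer): missing keys start out as empty lists, so the if key in chain branch from the original code disappears.

from collections import defaultdict

chain = defaultdict(list)
chain['the'].append('cat')   # 'the' was missing, so defaultdict created [] first
chain['the'].append('dog')
print(chain['the'])          # ['cat', 'dog']
print(chain['unseen'])       # [] (merely looking up a key also creates it)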



Instead of using hard indexing to get the subsequent words, which is impossible with a generator, you can do it like this:

from collections import defaultdict

def assemble_chain(words):
    chain = defaultdict(list)
    try:
        word, following = next(words), next(words)
        while True:
            chain[word].append(following)
            word, following = following, next(words)
    except StopIteration:
        return chain


or using some of itertools' useful functions:

from itertools import tee, islice

def assemble_chain_itertools(words):
    chain = defaultdict(list)
    words, followings = tee(words, 2)
    for word, following in zip(words, islice(followings, 1, None)):
        chain[word].append(following)
    return chain


Or even using a deque:

from collections import deque

def assemble_chain_deque(words):
    chain = defaultdict(list)
    queue = deque(islice(words, 1), maxlen=2)
    for new_word in words:
        queue.append(new_word)
        word, following = queue
        chain[word].append(following)
    return chain


Whichever is more clear is a matter of habit and experience. If performance is important, you will need to time them; a rough sketch of how you might do that follows.
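A rough timing sketch, assuming the three assemble_chain* variants above are defined and a text.txt corpus exists. The input is materialized into a list once so every variant sees the same words, and each call gets a fresh iterator:

import timeit

sample = list(read_file('text.txt'))   # read the corpus once

for builder in (assemble_chain, assemble_chain_itertools, assemble_chain_deque):
    seconds = timeit.timeit(lambda: builder(iter(sample)), number=100)
    print(f'{builder.__name__}: {seconds:.3f}s for 100 runs')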



create the poem

Since you will be asking for a new word a lot, it can pay to extract it into its own function:

import random

def get_random_word(choices):
    return random.choice(list(choices))


Then you can make an endless generator yielding subsequent words:

def generate_words(chain):
    word = get_random_word(chain)
    while True:
        yield word
        if word in chain:
            word = get_random_word(chain[word])
        else:
            word = get_random_word(chain)


We then use islice to gather the number of words we need, which can then be joined together with ' '.join():

length = 10
poem = islice(generate_words(chain), length)
poem = ' '.join(poem)



          "be tatter'd we desire famine where all eating ask'd where"



Once you have that, making a poem with a set number of lines of a set length is also easy:

def construct_poem(chain, lines, line_length):
    for _ in range(lines):
        yield ' '.join(islice(generate_words(chain), line_length))

lines = construct_poem(chain, 4, 10)
lines = map(str.capitalize, lines)
print('\n'.join(lines))



Be tatter'd we desire famine where all eating ask'd where
Deep trenches that thereby the riper substantial fuel shall beseige
Treasure of small pity the riper eyes were to the
Foe to the riper by time spring within and make



I think it makes sense to do the capitalization after the line has been assembled. Yet another separation of generation and presentation:

def construct_poem2(chain, line_lengths):
    for line_length in line_lengths:
        yield ' '.join(islice(generate_words(chain), line_length))

line_lengths = [10, 8, 8, 10]
lines = construct_poem2(chain, line_lengths)
lines = map(str.capitalize, lines)
print('\n'.join(lines))



Be tatter'd we desire famine where all eating ask'd where
Deep trenches that thereby the riper substantial fuel
Shall beseige treasure of small pity the riper
Eyes were to the riper memory but eyes were to






answered May 2 at 11:00 by Maarten Fabré, edited May 2 at 12:27