Word counter using a word list and some text files

My program works with two types of files.
File 1 contains 500,000 distinct words.
File set 2 contains 173 text files, each containing 500 paragraphs that I scraped from Wikipedia.
The program counts how many times each word from the first file appears in the second set of files.



The main problem is that it takes around 4 seconds per word to process, so it would take around 24 days to complete all 500k words on a 7th-gen Core i5 laptop with 8 GB of RAM. Is it possible to make it more efficient?



I am still learning Java, so my knowledge is not that vast. I am using Java 8, with IntelliJ as my IDE.



import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

public class Main {

    public static void main(String[] args) {

        //This is the map that will contain each word
        Map<String, Integer> map = new HashMap<>();
        //int that will count how many times the word appears in file set 2
        int wordCounter = 0;
        //List that contains the ~500k unrepeated words
        List<String> list = new ArrayList<>();
        //List that contains the current file's words
        List<String> list1 = new ArrayList<>();

        try {
            //scans the file that contains the 500k unrepeated words
            Scanner s = new Scanner(new File("C:\\Users\\filepath"));
            //adds the words to a list so they can be manipulated later on
            while (s.hasNext()) {
                list.add(s.next());
            }
            //output to see the list size
            System.out.println(list.size());

            //main loop that checks each word of the 500k-word file
            for (int i = 0; i < list.size(); i++) {
                //loop over each file of words
                for (int j = 0; j < 100; j++) {
                    try {
                        //read each file
                        Scanner d = new Scanner(new File("C:\\Users\\filepath" + j));
                        //add the contents of each file
                        while (d.hasNext()) {
                            list1.add(d.next());
                        }
                        d.close();
                        //count the occurrences of the current word in this file
                        wordCounter = wordCounter + Collections.frequency(list1, list.get(i).toLowerCase());
                        //clears the list so it does not run out of memory
                        list1.clear();
                    } catch (IOException k) {
                        k.printStackTrace();
                    }
                }

                //adds the result to the map
                map.put(list.get(i), wordCounter);
                //discards the words that have 1 or fewer matches
                if (wordCounter > 1) {
                    try {
                        FileWriter fw = new FileWriter("C:\\Users\\filePath", true);
                        PrintWriter pw = new PrintWriter(fw);
                        pw.append("\n");
                        pw.append(map.toString());
                        pw.close();
                    } catch (IOException f) {
                        f.printStackTrace();
                    }
                }

                //clears the map so it does not run out of memory
                map.clear();
                //resets the counter to 0
                wordCounter = 0;
                //progress display
                System.out.println(i);
            }
        } catch (IOException f) {
            f.printStackTrace();
        }
    }
}









Somewhere I read that, because Java runs on a virtual machine, it processes data much more slowly. Is that something I should consider?




























  • So you load each of the 173 files into an array, do your thing, and then load the next?
    – Raystafarian
    Feb 26 at 22:01










  • Yes, but one by one, because if I load them all at the same time it will run out of memory.
    – BLH-Maxx
    Feb 26 at 22:06










  • You say 4 seconds per word, so you do one word 173 times, then the next word 173 times, etc?
    – Raystafarian
    Feb 26 at 22:08










  • That's the only solution I could find so far, so yes, there are a lot of loops. For example: the first word is "book"; it looks in file 1, then file 2, then file 3... and so on until file 173, then sums the counts from all of the files and puts the total in a map (I use a map because I can store 2 values in it), then prints the map to a 3rd file, clears all the data, and continues with the next word, e.g. "balloon", and repeats.
    – BLH-Maxx
    Feb 26 at 22:17











  • Forgive me if I'm wrong (no java here), but the maximum array size in java is 2 147 483 639 items? So unless your paragraphs have over 4.2M words each, you should be able to read an entire article into a single array, right?
    – Raystafarian
    Feb 26 at 23:40
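That last point can be sanity-checked directly. Below is a minimal sketch (not from the question; the path is hypothetical) that reads one whole article file into a single list and prints its size:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReadWholeFile {
    public static void main(String[] args) throws IOException {
        // Hypothetical path; substitute one of the 173 scraped article files.
        List<String> lines = Files.readAllLines(Paths.get("C:\\Users\\filepath0"));
        // 500 paragraphs per file is nowhere near the ~2.1 billion element limit of a list.
        System.out.println("Lines read: " + lines.size());
    }
}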
















2 Answers
You should try to switch the inner and outer loops, because it will be much faster to read each Wikipedia article in once and count the word frequencies for all 500k words (you have the 500k-word list in memory the whole time anyway). What you do now is read all the articles into memory 500k times, which is time-consuming.

To sum up all usages of a word, you can use the map that already exists. Just read out the current sum for a given word, add the occurrences in the current article, and write it back to the map. Right now you are writing just one entry to the map, converting it to a string, and clearing it right afterwards. I assume you had in mind to do it the way I described.

Don't worry about Java execution speed in general; the code will eventually get compiled to machine code by the just-in-time compiler.
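
A minimal sketch of the inverted structure described above, not the poster's exact program: the file paths are the placeholders from the question, and Map.merge is just one convenient way to do the read-add-write step in a single call.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class InvertedWordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // Load the 500k-word list once; every word starts with a count of 0.
        Map<String, Integer> wordCount = new HashMap<>();
        try (Scanner wordList = new Scanner(new File("C:\\Users\\filepath"))) {
            while (wordList.hasNext()) {
                wordCount.put(wordList.next().toLowerCase(), 0);
            }
        }

        // Outer loop: each article file is read exactly once.
        for (int j = 0; j < 173; j++) {
            try (Scanner article = new Scanner(new File("C:\\Users\\filepath" + j))) {
                while (article.hasNext()) {
                    String word = article.next().toLowerCase();
                    // Only words from the list are counted.
                    if (wordCount.containsKey(word)) {
                        wordCount.merge(word, 1, Integer::sum);
                    }
                }
            }
        }

        // Print the words that occurred more than once, as in the original program.
        wordCount.forEach((word, count) -> {
            if (count > 1) {
                System.out.println(word + " " + count);
            }
        });
    }
}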






answered Feb 26 at 21:34 by Markus Meyer

  • Thanks for replying. Regarding the inner and outer loop, how should I do that? The idea is that it should loop 173 times (the number of files I have that contain the Wikipedia articles). Regarding the map: it is actually saved into a new file that will contain all the words and the number of occurrences they have across the 173 files. For example, the first word is "a" and it appears 1202 times, so I save that to a file; then it goes to the word "aa", which appears 0 times, so it's discarded; and so on for 500,000 words.
    – BLH-Maxx
    Feb 26 at 21:53











  • First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
    – Markus Meyer
    Feb 26 at 22:09










  • The thing is that it's not 1 Wikipedia article; there are 173 files containing 500 articles each, so the loop is needed to read each file.
    – BLH-Maxx
    Feb 26 at 22:15










  • And yes, I am only using 1 word in the map each time; I print it to a 3rd file and then clear everything.
    – BLH-Maxx
    Feb 26 at 22:21

















First of all, you should really use more meaningful names for your variables.

wordCount says a lot more than map, uniqueWords a lot more than list, and the same goes for wordsInCurrentFile instead of list1.

Just renaming these makes it much easier to follow what your program is doing.

You should follow Markus's advice to flip the loops. In the outer loop, iterate over each file. Then, for each file, count the occurrences of each of the words.

We can also optimise the use of our variables a bit so we don't even need the 2 lists at all.

Here's the main idea:

Map<String, Integer> wordCount = new HashMap<>();
//open the file with the unique words
Scanner wordFile = new Scanner(new File( ... ));
while (wordFile.hasNext()) {
    wordCount.put(wordFile.next(), 0);
}

At this point we have an entry in our count map for each of the 500k words.

for (each file with wiki text) {

Loop over each file in the outer loop: either the same way as now, opening the file based on a fixed name with a number appended, or by putting all the files in a specific directory and, in Java, iterating over all the files you find in that directory.

    while (file.hasNext()) {
        String word = file.next();

At this point we're looping over each word in the file and want to update the total count.

        if (wordCount.containsKey(word)) {
            wordCount.put(word, wordCount.get(word) + 1);
        }
    }
}

It may also be worth looking up how to list all the files in a directory in Java 8 (I'm not familiar with this myself yet).

It's also a good idea to look up try-with-resources (available since Java 7), which handles closing the file for you after you're done with it. The way you wrote it now doesn't properly close the file if you hit an error.
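
Not part of the original answer, just a rough illustration of those last two points: a sketch that lists the article files from a directory with the Java 8 Files.list API and opens each one with try-with-resources. The directory path is hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListAndReadFiles {
    public static void main(String[] args) throws IOException {
        // Hypothetical directory containing the 173 scraped article files.
        Path articleDir = Paths.get("C:\\Users\\articles");

        // Java 8: list every regular file in the directory.
        List<Path> articleFiles;
        try (Stream<Path> paths = Files.list(articleDir)) {
            articleFiles = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // try-with-resources closes each Scanner even if an exception is thrown.
        for (Path file : articleFiles) {
            try (Scanner scanner = new Scanner(file)) {
                int words = 0;
                while (scanner.hasNext()) {
                    scanner.next();
                    words++;
                }
                System.out.println(file.getFileName() + ": " + words + " words");
            }
        }
    }
}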






answered Feb 27 at 15:43 by Imus