Word counter using a word list and some text files

My program works with two types of files.
File 1 contains 500,000 distinct words.
File set 2 contains 173 text files, each containing 500 paragraphs that I scraped from Wikipedia.
The program counts how many times each word from the first file appears in the second set of files.



The main problem is that it takes around 4 seconds per word to process, so it would take around 24 days to complete all 500k words on a 7th-gen Core i5 laptop with 8 GB of RAM. Is it possible to make it more efficient?



I am still learning Java, so my knowledge is not that vast. I am using Java 8, with IntelliJ as my IDE.



import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

public class Main {

    public static void main(String[] args) {

        //This is the map that will contain each word
        Map<String, Integer> map = new HashMap<>();
        //int that will count how many times the word appears in file set 2
        int wordCounter = 0;
        //List that contains the ~500k unrepeated words
        List<String> list = new ArrayList<>();
        //List that contains the current file's words
        List<String> list1 = new ArrayList<>();

        try {
            //scans the file that contains the 500k unrepeated words
            Scanner s = new Scanner(new File("C:\\Users\\filepath"));
            //adds the words to a list so they can be manipulated later on
            while (s.hasNext()) {
                list.add(s.next());
            }
            //output to see the list size
            System.out.println(list.size());

            //main loop that checks each word of the 500k-word file
            for (int i = 0; i < list.size(); i++) {
                //loop over each file of words
                for (int j = 0; j < 100; j++) {
                    try {
                        //read each file
                        Scanner d = new Scanner(new File("C:\\Users\\filepath" + j));
                        //add the contents of each file
                        while (d.hasNext()) {
                            list1.add(d.next());
                        }
                        d.close();
                        //count the occurrences of the current word in this file
                        wordCounter = wordCounter + Collections.frequency(list1, list.get(i).toLowerCase());
                        //clears the list so it does not run out of memory
                        list1.clear();
                    } catch (IOException k) {
                        k.printStackTrace();
                    }
                }

                //adds the result to the map
                map.put(list.get(i), wordCounter);
                //discards the words that have 1 or fewer matches
                if (wordCounter > 1) {
                    try {
                        FileWriter fw = new FileWriter("C:\\Users\\filePath", true);
                        PrintWriter pw = new PrintWriter(fw);
                        pw.append("\n");
                        pw.append(map.toString());
                        pw.close();
                    } catch (IOException f) {
                        f.printStackTrace();
                    }
                }

                //clears the map so it does not run out of memory
                map.clear();
                //resets the counter to 0
                wordCounter = 0;
                //progress display
                System.out.println(i);
            }
        } catch (IOException f) {
            f.printStackTrace();
        }
    }
}









Somewhere I read that, because Java runs on a virtual machine, it processes data much more slowly. Is that something I should consider?




























  • So you load each of the 173 files into an array, do your thing, and then load the next?
    – Raystafarian
    Feb 26 at 22:01










  • Yes, but one by one, because if I load them all at the same time it will run out of memory.
    – BLH-Maxx
    Feb 26 at 22:06










  • You say 4 seconds per word, so you do one word 173 times, then the next word 173 times, etc?
    – Raystafarian
    Feb 26 at 22:08










  • That's the only solution I could find so far, so yes, there are a lot of loops. For example: the first word is "book"; it looks in file 1, then file 2, then file 3... and so on until file 173, then sums the counts from all of the files and puts the total in a map (I use a map because I can store 2 values in it), then prints the map to a 3rd file, clears all the data, and continues with the next word, e.g. "balloon", and repeats.
    – BLH-Maxx
    Feb 26 at 22:17











  • Forgive me if I'm wrong (no java here), but the maximum array size in java is 2 147 483 639 items? So unless your paragraphs have over 4.2M words each, you should be able to read an entire article into a single array, right?
    – Raystafarian
    Feb 26 at 23:40
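That last point can be sanity-checked directly. Below is a minimal sketch (not from the question; the path is hypothetical) that reads one whole article file into a single list and prints its size:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReadWholeFile {
    public static void main(String[] args) throws IOException {
        // Hypothetical path; substitute one of the 173 scraped article files.
        List<String> lines = Files.readAllLines(Paths.get("C:\\Users\\filepath0"));
        // 500 paragraphs per file is nowhere near the ~2.1 billion element limit of a list.
        System.out.println("Lines read: " + lines.size());
    }
}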
















2 Answers
You should try to switch the inner and outer loops, because it will be much faster to read each Wikipedia article in once and count the word frequencies for all 500k words (you have the 500k-word list in memory the whole time anyway). What you do now is read all the articles into memory 500k times, which is time-consuming.

To sum up all usages of a word, you can use the map that already exists. Just read out the current sum for a given word, add the occurrences in the current article, and write it back to the map. Right now you are writing just one entry to the map, converting it to a string, and clearing it right afterwards. I assume you had in mind to do it the way I described.

Don't worry about Java execution speed in general; the code will eventually get compiled to machine code by the just-in-time compiler.
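
A minimal sketch of the inverted structure described above, not the poster's exact program: the file paths are the placeholders from the question, and Map.merge is just one convenient way to do the read-add-write step in a single call.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class InvertedWordCount {
    public static void main(String[] args) throws FileNotFoundException {
        // Load the 500k-word list once; every word starts with a count of 0.
        Map<String, Integer> wordCount = new HashMap<>();
        try (Scanner wordList = new Scanner(new File("C:\\Users\\filepath"))) {
            while (wordList.hasNext()) {
                wordCount.put(wordList.next().toLowerCase(), 0);
            }
        }

        // Outer loop: each article file is read exactly once.
        for (int j = 0; j < 173; j++) {
            try (Scanner article = new Scanner(new File("C:\\Users\\filepath" + j))) {
                while (article.hasNext()) {
                    String word = article.next().toLowerCase();
                    // Only words from the list are counted.
                    if (wordCount.containsKey(word)) {
                        wordCount.merge(word, 1, Integer::sum);
                    }
                }
            }
        }

        // Print the words that occurred more than once, as in the original program.
        wordCount.forEach((word, count) -> {
            if (count > 1) {
                System.out.println(word + " " + count);
            }
        });
    }
}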






answered Feb 26 at 21:34 by Markus Meyer

  • Thanks for replying. Regarding the inner and outer loop, how should I do that? The idea is that it should loop 173 times (the number of files I have that contain the Wikipedia articles). Regarding the map: it is actually saved into a new file that will contain all the words and the number of occurrences they have across the 173 files. For example, the first word is "a" and it appears 1202 times, so I save that to a file; then it goes to the word "aa", which appears 0 times, so it's discarded; and so on for 500,000 words.
    – BLH-Maxx
    Feb 26 at 21:53











  • First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
    – Markus Meyer
    Feb 26 at 22:09










  • The thing is that it's not 1 Wikipedia article; there are 173 files containing 500 articles each, so the loop is needed to read each file.
    – BLH-Maxx
    Feb 26 at 22:15










  • And yes, I am only using 1 word in the map each time; I print it to a 3rd file and then clear everything.
    – BLH-Maxx
    Feb 26 at 22:21

















First of all, you should really use more meaningful names for your variables.

wordCount says a lot more than map, uniqueWords a lot more than list, and the same goes for wordsInCurrentFile instead of list1.

Just renaming these makes it much easier to follow what your program is doing.

You should follow Markus's advice to flip the loops. In the outer loop, iterate over each file. Then, for each file, count the occurrences of each of the words.

We can also optimise the use of our variables a bit so we don't even need the 2 lists at all.

Here's the main idea:

Map<String, Integer> wordCount = new HashMap<>();
//open the file with the unique words
Scanner wordFile = new Scanner(new File( ... ));
while (wordFile.hasNext()) {
    wordCount.put(wordFile.next(), 0);
}

At this point we have an entry in our count map for each of the 500k words.

for (each file with wiki text) {

Loop over each file in the outer loop: either the same way as now, opening the file based on a fixed name with a number appended, or by putting all the files in a specific directory and, in Java, iterating over all the files you find in that directory.

    while (file.hasNext()) {
        String word = file.next();

At this point we're looping over each word in the file and want to update the total count.

        if (wordCount.containsKey(word)) {
            wordCount.put(word, wordCount.get(word) + 1);
        }
    }
}

It may also be worth looking up how to list all the files in a directory in Java 8 (I'm not familiar with this myself yet).

It's also a good idea to look up try-with-resources (available since Java 7), which handles closing the file for you after you're done with it. The way you wrote it now doesn't properly close the file if you hit an error.
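
Not part of the original answer, just a rough illustration of those last two points: a sketch that lists the article files from a directory with the Java 8 Files.list API and opens each one with try-with-resources. The directory path is hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListAndReadFiles {
    public static void main(String[] args) throws IOException {
        // Hypothetical directory containing the 173 scraped article files.
        Path articleDir = Paths.get("C:\\Users\\articles");

        // Java 8: list every regular file in the directory.
        List<Path> articleFiles;
        try (Stream<Path> paths = Files.list(articleDir)) {
            articleFiles = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // try-with-resources closes each Scanner even if an exception is thrown.
        for (Path file : articleFiles) {
            try (Scanner scanner = new Scanner(file)) {
                int words = 0;
                while (scanner.hasNext()) {
                    scanner.next();
                    words++;
                }
                System.out.println(file.getFileName() + ": " + words + " words");
            }
        }
    }
}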






answered Feb 27 at 15:43 by Imus