Word counter using a word list and some text files
My program works with two types of files.
File 1 contains 500,000 distinct words.
File set 2 contains 173 text files, each containing 500 paragraphs, that I scraped from Wikipedia.
The program counts how many times each word from the first file appears in the second set of files.
The main problem is that it takes around 4 seconds to process each word, so it will take around 24 days to get through all 500k words on a 7th-gen Core i5 laptop with 8 GB of RAM. Is it possible to make it more efficient?
I am still learning Java so my knowledge is not that vast. I am using Java 8, with IntelliJ as my IDE.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.*;

    public class Main {
        public static void main(String[] args) {
            // This is the map that will contain each word
            Map<String, Integer> map = new HashMap<>();
            // int that will count how many times the word is in File set 2
            int wordCounter = 0;
            // List that contains around 500k unrepeated words
            List<String> list = new ArrayList<>();
            // List that contains the current file's words
            List<String> list1 = new ArrayList<>();
            try {
                // scans the file that contains the 500k unrepeated words
                Scanner s = new Scanner(new File("C:\\Users\\filepath"));
                // while loop that adds the words to a list so it can manipulate them later on
                while (s.hasNext()) {
                    list.add(s.next());
                }
                // random output to see the list size
                System.out.println(list.size());
                // main loop that will check each word in the 500k file
                for (int i = 0; i < list.size(); i++) {
                    // loop to see each file of words
                    for (int j = 0; j < 100; j++) {
                        try {
                            // read each file
                            Scanner d = new Scanner(new File("C:\\Users\\filepath" + j));
                            // add the information of each file
                            while (d.hasNext()) {
                                list1.add(d.next());
                            }
                            d.close();
                            // this code counts the number of words in all the files
                            wordCounter = wordCounter + Collections.frequency(list1, list.get(i).toLowerCase());
                            // clears the list so it has more space and does not run out of it
                            list1.clear();
                        } catch (IOException k) {
                            k.printStackTrace();
                        }
                    }
                    // adds the information to the map
                    map.put(list.get(i), wordCounter);
                    // this sorts the information and discards the words that have 1 or fewer matches
                    if (wordCounter > 1) {
                        try {
                            FileWriter fw = new FileWriter("C:\\Users\\filePath", true);
                            PrintWriter pw = new PrintWriter(fw);
                            pw.append("\n");
                            pw.append(map.toString());
                            pw.close();
                        } catch (IOException f) {
                            f.printStackTrace();
                        }
                    }
                    // this cleans the map so it doesn't run out of memory
                    map.clear();
                    // resets the counter to 0
                    wordCounter = 0;
                    // simple display so it seems nice
                    System.out.println(i);
                }
            } catch (IOException f) {
                f.printStackTrace();
            }
        }
    }
I read somewhere that Java processes data much more slowly because it runs on a virtual machine. Is that something I should consider?
java beginner time-limit-exceeded file
So you load each of the 173 files into an array, do your thing, and then load the next?
– Raystafarian, Feb 26 at 22:01
Yes, but one by one, because if I load them all at the same time it will run out of memory.
– BLH-Maxx, Feb 26 at 22:06
You say 4 seconds per word, so you do one word 173 times, then the next word 173 times, etc?
– Raystafarian, Feb 26 at 22:08
That's the only solution I could find so far, so yes, there are a lot of loops. For example, for the first word, "book", it looks in file 1, then file 2, then file 3, and so on until file 173, then sums the counts from each file and puts the total in a map (I use a map because I can put 2 values in it). Then I print the map to a 3rd file, clear all the data, and continue with the next word, e.g. "balloon", and so on.
– BLH-Maxx, Feb 26 at 22:17
Forgive me if I'm wrong (no Java here), but the maximum array size in Java is 2 147 483 639 items? So unless your paragraphs have over 4.2M words each, you should be able to read an entire article into a single array, right?
– Raystafarian, Feb 26 at 23:40
Question asked Feb 26 at 20:49 by BLH-Maxx; edited Feb 26 at 20:59 by 200_success
2 Answers
You should try switching the inner and outer loops: it will be much faster to read each file of Wikipedia articles once and count the frequencies of all 500k words in it (you have the 500k-word list in memory the whole time anyway). What you do now is read all the articles into memory 500k times, which is time-consuming.
To sum up all usages of a word, you can use the map that already exists. Just read the current sum for a given word, add the occurrences in the current article, and write it back to the map. Right now you write a single entry to the map, convert it to a string, and clear it right afterwards. I assume you had in mind to do it the way I described.
Don't worry about Java's execution speed in general, because the code eventually gets compiled to machine code (just-in-time compilation).
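A minimal sketch of the loop swap described above, assuming the word list and the text are already in memory as collections. The names `SinglePassCounter` and `countKnownWords` are made up for illustration and do not come from the question. Each text token is looked up once in a `HashMap`, so the text is walked once rather than once per word. As in the question's code, only the known words are lower-cased; tokens are compared as they are.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration, not code from the question.
final class SinglePassCounter {

    // Counts how often each known word occurs among the tokens.
    // Tokens that are not known words are ignored.
    static Map<String, Integer> countKnownWords(List<String> knownWords, Iterable<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : knownWords) {
            counts.put(word.toLowerCase(), 0); // every known word starts at zero
        }
        for (String token : tokens) {
            // computeIfPresent only updates keys that already exist in the map
            counts.computeIfPresent(token, (key, old) -> old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("book", "balloon", "aa");
        List<String> text = Arrays.asList("a", "book", "about", "a", "balloon", "book");
        System.out.println(countKnownWords(known, text));
    }
}
```

With this shape the work is roughly proportional to the number of tokens in the text, not to the number of tokens multiplied by the number of known words.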
Thanks for replying. Regarding the inner and outer loop, how should I do that? The idea is that it should loop 173 times (the number of files I have that contain the Wikipedia articles). Regarding the map, it is actually saved into a new file that will contain all the words and the number of occurrences they have in the 173 files. For example, the first word is "a"; it appears 1202 times, so I save that in a file. Then it goes to the word "aa"; it appears 0 times, so it's discarded. And so on, 500,000 times.
– BLH-Maxx, Feb 26 at 21:53
First, switch the two for-loops. Then move the part where the Wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read the old sum from the map, add the two, and write the new sum back to the map. But if the current word count is < 1, don't add anything.
– Markus Meyer, Feb 26 at 22:09
The thing is that it's not 1 Wikipedia article; there are 173 files containing 500 articles each, so the loop is needed to read each file.
– BLH-Maxx, Feb 26 at 22:15
And yes, I am only using 1 word in the map each time; I print it to a 3rd file and then clear everything.
– BLH-Maxx, Feb 26 at 22:21
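The output step discussed in this thread could look roughly like the sketch below. Once all totals sit in one map, the result only needs to be written once at the end, skipping rare words. The class name `CountReport`, the `minExclusive` parameter, and the "word count" line format are assumptions made for illustration. The threshold of 1 mirrors the `wordCounter > 1` check in the question.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration, not code from the question.
final class CountReport {

    // Writes one "word count" line for every entry whose count is above minExclusive.
    static void write(Map<String, Integer> counts, int minExclusive, Writer out) {
        // Copy into a TreeMap so the lines come out in alphabetical order.
        Map<String, Integer> sorted = new TreeMap<>(counts);
        PrintWriter pw = new PrintWriter(out);
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            if (entry.getValue() > minExclusive) {
                pw.print(entry.getKey() + " " + entry.getValue() + "\n");
            }
        }
        pw.flush(); // the caller owns the writer, so flush but do not close it
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 1202);
        counts.put("aa", 0);
        counts.put("book", 2);

        StringWriter out = new StringWriter();
        write(counts, 1, out);
        System.out.print(out); // prints "a 1202" and "book 2" on separate lines
    }
}
```

Passing a `FileWriter` instead of the `StringWriter` would send the same lines to a file, opened once for the whole run instead of once per word.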
First of all, you should really use more meaningful names for your variables. `wordCount` says a lot more than `map`, `uniqueWords` a lot more than `list`, and the same goes for `wordsInCurrentFile` instead of `list1`. Just renaming these makes it so much easier to follow what your program is doing.
You should follow Markus's advice to flip the loops. In the outer loop, you should iterate over each file. Then, for each file, count the occurrences of each of the words.
We can also optimise the use of our variables a bit so that we don't need the 2 lists at all.
Here's the main idea:
    Map<String, Integer> wordCount = new HashMap<>();
    Scanner wordFile = new Scanner(new File( ...)); // open the file with unique words
    while (wordFile.hasNext()) {
        wordCount.put(wordFile.next(), 0);
    }
At this point we have an entry in our count map for each of the 500k words.
    for (each file with wiki text) {
Loop over each file in the outer loop. Either do it the same way you do now, opening each file based on a fixed name with a number appended, or put all the files in a specific directory and iterate in Java over all the files you find in that directory.
        while (file.hasNext()) {
            String word = file.next();
At this point we're looping over each word in the file and want to update the total count.
            if (wordCount.containsKey(word)) {
                wordCount.put(word, wordCount.get(word) + 1);
            }
        }
    }
It may also be worth looking up how to list all files in a directory in Java 8 (I'm not familiar with this myself yet).
It's also a good idea to look up try-with-resources (available since Java 7), which takes care of closing the file once you're done with it. The way you wrote it now doesn't properly close the file if you encounter an error.
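The two lookups suggested above could be combined along these lines. This is only a sketch under stated assumptions: the class and method names (`DirectoryWordCount`, `listTextFiles`, `countFile`), the UTF-8 encoding, and the temporary demo directory are all made up for illustration. `Files.list` has been available since Java 8, and both the directory stream and the `Scanner` are closed by try-with-resources even if reading fails.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration, not code from the question.
final class DirectoryWordCount {

    // Returns the regular files directly inside dir, sorted by path.
    static List<Path> listTextFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    // Adds the occurrences found in one file to the running totals.
    // Only words that already have an entry in wordCount are counted.
    static void countFile(Path file, Map<String, Integer> wordCount) throws IOException {
        try (Scanner in = new Scanner(file, "UTF-8")) {
            while (in.hasNext()) {
                wordCount.computeIfPresent(in.next(), (word, n) -> n + 1);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny throwaway corpus so the sketch runs anywhere.
        Path dir = Files.createTempDirectory("wiki");
        Files.write(dir.resolve("part0.txt"), "a book about a balloon".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("part1.txt"), "another book".getBytes(StandardCharsets.UTF_8));

        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("book", 0);
        wordCount.put("balloon", 0);

        for (Path file : listTextFiles(dir)) {
            countFile(file, wordCount);
        }
        System.out.println(wordCount);
    }
}
```

Reading token by token with a `Scanner` keeps only one word of the current file in memory at a time, so the per-file list from the question is not needed.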
First answer: answered Feb 26 at 21:34 by Markus Meyer
thnx for replaying. regarding the inner and outer loop how should i do that? the idea is that it should be looping the 173 times ( the number of files i have that contains the wikipedia articles. ) regarding the map actually is saving it into a new file that will contain all the words and the number of occurrence they have in the 173 files for example First word is "a" it appears 1202 times i save that in a file and then it goes to word "aa" it appears 0 times so its discarded. and so on for 500,000 times.
â BLH-Maxx
Feb 26 at 21:53
First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
â Markus Meyer
Feb 26 at 22:09
the thing is that its not 1 wikipedia article they are 173 files containing 500 articles each so the loop is needed to read each file.
â BLH-Maxx
Feb 26 at 22:15
And yes i am only using 1 word in the map each time i print it in a 3rd file and then clear all
â BLH-Maxx
Feb 26 at 22:21
add a comment |Â
thnx for replaying. regarding the inner and outer loop how should i do that? the idea is that it should be looping the 173 times ( the number of files i have that contains the wikipedia articles. ) regarding the map actually is saving it into a new file that will contain all the words and the number of occurrence they have in the 173 files for example First word is "a" it appears 1202 times i save that in a file and then it goes to word "aa" it appears 0 times so its discarded. and so on for 500,000 times.
â BLH-Maxx
Feb 26 at 21:53
First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
â Markus Meyer
Feb 26 at 22:09
the thing is that its not 1 wikipedia article they are 173 files containing 500 articles each so the loop is needed to read each file.
â BLH-Maxx
Feb 26 at 22:15
And yes i am only using 1 word in the map each time i print it in a 3rd file and then clear all
â BLH-Maxx
Feb 26 at 22:21
thnx for replaying. regarding the inner and outer loop how should i do that? the idea is that it should be looping the 173 times ( the number of files i have that contains the wikipedia articles. ) regarding the map actually is saving it into a new file that will contain all the words and the number of occurrence they have in the 173 files for example First word is "a" it appears 1202 times i save that in a file and then it goes to word "aa" it appears 0 times so its discarded. and so on for 500,000 times.
â BLH-Maxx
Feb 26 at 21:53
thnx for replaying. regarding the inner and outer loop how should i do that? the idea is that it should be looping the 173 times ( the number of files i have that contains the wikipedia articles. ) regarding the map actually is saving it into a new file that will contain all the words and the number of occurrence they have in the 173 files for example First word is "a" it appears 1202 times i save that in a file and then it goes to word "aa" it appears 0 times so its discarded. and so on for 500,000 times.
â BLH-Maxx
Feb 26 at 21:53
First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
â Markus Meyer
Feb 26 at 22:09
First, switch the two for-loops. Then move the part where the wikipedia article is read out of the inner loop. In the inner loop, count the word occurrences in one article, read out the old sum from the map, add both, and write the new sum back to the map. But, if the current word count is < 1, don't add anything.
â Markus Meyer
Feb 26 at 22:09
the thing is that its not 1 wikipedia article they are 173 files containing 500 articles each so the loop is needed to read each file.
â BLH-Maxx
Feb 26 at 22:15
the thing is that its not 1 wikipedia article they are 173 files containing 500 articles each so the loop is needed to read each file.
â BLH-Maxx
Feb 26 at 22:15
And yes i am only using 1 word in the map each time i print it in a 3rd file and then clear all
â BLH-Maxx
Feb 26 at 22:21
And yes i am only using 1 word in the map each time i print it in a 3rd file and then clear all
â BLH-Maxx
Feb 26 at 22:21
add a comment |Â
up vote
0
down vote
First of all you should really use more meaningful names for your variables.
wordCount
says a lot more than map
, uniqueWords
a lot more than list
and same for wordsInCurrentFile
instead of list1
.
Just renaming these makes it so much easier to follow what your program is doing.
You should follow Markus's advice to flip the loops. In the outer loop, you should iterate over each file. And then for each file, count the occurences for each of the words.
We can also optimise the use of our variables a bit so we don't even need the 2 lists at all.
Here's the main idea:
Map<String, int> wordCount = new HashMap<>();
File wordFile = new File( ...); //open the file with unique words
while(wordFile.hasNext())
wordCount.put(wordFile.next(), 0);
At this point we have an entry in our count map for each of the 500k words.
for( each file with wiki text)
Loop over each file in the outer loop, either the same way you do now, opening each file based on a fixed name with a number appended, or by putting all the files in a specific directory and iterating in Java over all the files you find in that directory.
while(file.hasNext())
String word = file.next();
At this point we're looping over each word in the file and want to update the total count.
//wordCount is the map filled from the word list above; don't create a new one here
if (wordCount.containsKey(word))
    wordCount.put(word, wordCount.get(word) + 1);
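Put together, the steps above could look roughly like the sketch below. The class name WordCountSketch and the helpers loadWords and countWords are made up for illustration, and main feeds the scanners from in-memory strings instead of the real files so that it runs on its own; for the real data you would construct each Scanner from a File, as in the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class WordCountSketch {

    // Puts every word from the word list into the map with a starting count of 0.
    static Map<String, Integer> loadWords(Scanner wordList) {
        Map<String, Integer> wordCount = new HashMap<>();
        while (wordList.hasNext()) {
            wordCount.put(wordList.next(), 0);
        }
        return wordCount;
    }

    // Reads one text source a single time and increments the count
    // of every token that is present in the word list.
    static void countWords(Scanner text, Map<String, Integer> wordCount) {
        while (text.hasNext()) {
            String word = text.next();
            if (wordCount.containsKey(word)) {
                wordCount.put(word, wordCount.get(word) + 1);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the 500k word list (assumption: whitespace-separated words).
        Map<String, Integer> wordCount = loadWords(new Scanner("book balloon cat"));

        // Stand-ins for the wiki text files; each one is read exactly once.
        String[] texts = { "the book is a book", "a balloon and a book" };
        for (String text : texts) {
            countWords(new Scanner(text), wordCount);
        }

        System.out.println("book " + wordCount.get("book"));       // prints "book 3"
        System.out.println("balloon " + wordCount.get("balloon")); // prints "balloon 1"
        System.out.println("cat " + wordCount.get("cat"));         // prints "cat 0"
    }
}
```

Each text is scanned once and every token costs a single hash lookup, so the running time grows with the total size of the text files rather than with the number of words multiplied by the number of files.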
It may also be worth looking up how to list all files in a directory in Java 8 (I'm not familiar with this myself yet).
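One way to do that in Java 8 is Files.list from java.nio.file, which returns a stream of the entries in a directory. The sketch below is only an illustration: the class name ListFilesSketch and the helper regularFilesIn are made up, and it assumes the wiki text files sit directly in one directory (subdirectories are skipped, not descended into).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ListFilesSketch {

    // Returns the regular files directly inside dir, sorted by path.
    static List<Path> regularFilesIn(Path dir) throws IOException {
        // The stream holds an open directory handle, so close it when done.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to list; falls back to the current directory (assumption).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : regularFilesIn(dir)) {
            System.out.println(file);
        }
    }
}
```

Each Path from the list can be turned into a File with toFile() and handed to a Scanner, which replaces the loop over a fixed name with a number appended.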
It's also a good idea to look up how to use try-with-resources, available since Java 7, which handles closing the file for you once you're done with it. The way you wrote it now doesn't properly close the file if you encounter an error.
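As a rough illustration of try-with-resources, the sketch below opens a Scanner in the try header, so it is closed automatically when the block ends, whether normally or through an exception. The class name TryWithResourcesSketch and the helper countTokens are made up for this example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class TryWithResourcesSketch {

    // Counts the whitespace-separated tokens in a file.
    static int countTokens(File file) throws FileNotFoundException {
        int tokens = 0;
        // The Scanner declared here is closed automatically at the end of
        // the try block, even if the loop body throws.
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNext()) {
                scanner.next();
                tokens++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length < 1) {
            System.err.println("usage: java TryWithResourcesSketch <file>");
            return;
        }
        System.out.println(countTokens(new File(args[0])));
    }
}
```

Compared with calling close() by hand at the end of the loop, there is no path through the method that leaves the file open.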
answered Feb 27 at 15:43
Imus
So you load each of the 173 files into an array, do your thing, and then load the next?
– Raystafarian
Feb 26 at 22:01
Yes, but one by one, because if I load them all at the same time it runs out of memory.
– BLH-Maxx
Feb 26 at 22:06
You say 4 seconds per word, so you do one word 173 times, then the next word 173 times, etc?
– Raystafarian
Feb 26 at 22:08
That's the only solution I could find so far, so yes, there are a lot of loops. For example, for the first word, "book", it looks in file 1, then file 2, then file 3, and so on until file 173, then sums the counts from each file and puts the total in a map (I use a map because I can put 2 values in it). Then I print the map to a 3rd file, clear all the data, and continue with the next word, e.g. "balloon", and repeat.
– BLH-Maxx
Feb 26 at 22:17
Forgive me if I'm wrong (no Java here), but the maximum array size in Java is 2 147 483 639 items? So unless your paragraphs have over 4.2M words each, you should be able to read an entire article into a single array, right?
– Raystafarian
Feb 26 at 23:40