Scan files and pick overlapped hex codes

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
4
down vote

favorite

Problem

Two folders having files with nameEfile that store lines starting with hex codes.

Print those hex codes that overlap in both folders.

Solution

import os
import sys
import re

def process_cache(cache):
 event_code_set = set()
 for file_name, multi_line_content in cache.items():
 if file_name.endswith('Efile'):
 for line in multi_line_content.splitlines():
 line = line.rstrip('\') # trailing backslash, if exist
 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])
 return event_code_set



def scan_files(dir):
 cache = 
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name in ('Efile'):
 path = os.path.join(root, name)
 with open(path,'r') as file:
 cache[path] = file.read()
 return cache

cache1 = scan_files(sys.argv[1])
cache2 = scan_files(sys.argv[2])

cache1_event_code_set = process_cache(cache1)
cache2_event_code_set = process_cache(cache2)

overlap_event_codes = cache1_event_code_set & cache2_event_code_set 
print(overlap_event_codes)

For a typical three entries in a file, with comments(#),

0x00010d35 D 11 G 3,0x10009,N R XY.Condition, "(a 0x40001 == H 0x166) && (a 0x11ff8 == I 15)","0x0763ffc2 "
# Below event codes are from vendor xyz
0x84900c5 M 22 Y 1,0x03330469,4,5,6,7,8
0x04b60ff6 L 50 U 
 0x04c60e07,102 && (a 0x11ff8 == I 15)","0x0763ffc2 "

Picking 0x00010d35, 0x04b60ff6 & 0x84900c5 is the task. Rest of the line is supposed to be ignored

Some entries are multi-line with trailing backslash.

Each file is in megabytes. Total files in both folders - 80

1) Please suggest optimization in the below code, because pattern check is done twice.

 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])

2) Please suggest coding style optimizations for tree walk, cache and command line arguments.

edited Jul 21 at 13:46

Mast

7,30663483

asked Jul 18 at 15:03

overexchange

1,59021945

Returning cache is a bug and wrong practice, despite it works... cache1 should not point to local cache
â€“Â overexchange
Jul 21 at 12:48

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
â€“Â Mast
Jul 21 at 13:46

add a commentÂ |Â

up vote
4
down vote

favorite

Problem

Two folders having files with nameEfile that store lines starting with hex codes.

Print those hex codes that overlap in both folders.

Solution

import os
import sys
import re

def process_cache(cache):
 event_code_set = set()
 for file_name, multi_line_content in cache.items():
 if file_name.endswith('Efile'):
 for line in multi_line_content.splitlines():
 line = line.rstrip('\') # trailing backslash, if exist
 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])
 return event_code_set



def scan_files(dir):
 cache = 
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name in ('Efile'):
 path = os.path.join(root, name)
 with open(path,'r') as file:
 cache[path] = file.read()
 return cache

cache1 = scan_files(sys.argv[1])
cache2 = scan_files(sys.argv[2])

cache1_event_code_set = process_cache(cache1)
cache2_event_code_set = process_cache(cache2)

overlap_event_codes = cache1_event_code_set & cache2_event_code_set 
print(overlap_event_codes)

For a typical three entries in a file, with comments(#),

0x00010d35 D 11 G 3,0x10009,N R XY.Condition, "(a 0x40001 == H 0x166) && (a 0x11ff8 == I 15)","0x0763ffc2 "
# Below event codes are from vendor xyz
0x84900c5 M 22 Y 1,0x03330469,4,5,6,7,8
0x04b60ff6 L 50 U 
 0x04c60e07,102 && (a 0x11ff8 == I 15)","0x0763ffc2 "

Picking 0x00010d35, 0x04b60ff6 & 0x84900c5 is the task. Rest of the line is supposed to be ignored

Some entries are multi-line with trailing backslash.

Each file is in megabytes. Total files in both folders - 80

1) Please suggest optimization in the below code, because pattern check is done twice.

 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])

2) Please suggest coding style optimizations for tree walk, cache and command line arguments.

edited Jul 21 at 13:46

Mast

7,30663483

asked Jul 18 at 15:03

overexchange

1,59021945

Returning cache is a bug and wrong practice, despite it works... cache1 should not point to local cache
â€“Â overexchange
Jul 21 at 12:48

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
â€“Â Mast
Jul 21 at 13:46

add a commentÂ |Â

up vote
4
down vote

favorite

Problem

Two folders having files with nameEfile that store lines starting with hex codes.

Print those hex codes that overlap in both folders.

Solution

import os
import sys
import re

def process_cache(cache):
 event_code_set = set()
 for file_name, multi_line_content in cache.items():
 if file_name.endswith('Efile'):
 for line in multi_line_content.splitlines():
 line = line.rstrip('\') # trailing backslash, if exist
 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])
 return event_code_set



def scan_files(dir):
 cache = 
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name in ('Efile'):
 path = os.path.join(root, name)
 with open(path,'r') as file:
 cache[path] = file.read()
 return cache

cache1 = scan_files(sys.argv[1])
cache2 = scan_files(sys.argv[2])

cache1_event_code_set = process_cache(cache1)
cache2_event_code_set = process_cache(cache2)

overlap_event_codes = cache1_event_code_set & cache2_event_code_set 
print(overlap_event_codes)

For a typical three entries in a file, with comments(#),

0x00010d35 D 11 G 3,0x10009,N R XY.Condition, "(a 0x40001 == H 0x166) && (a 0x11ff8 == I 15)","0x0763ffc2 "
# Below event codes are from vendor xyz
0x84900c5 M 22 Y 1,0x03330469,4,5,6,7,8
0x04b60ff6 L 50 U 
 0x04c60e07,102 && (a 0x11ff8 == I 15)","0x0763ffc2 "

Picking 0x00010d35, 0x04b60ff6 & 0x84900c5 is the task. Rest of the line is supposed to be ignored

Some entries are multi-line with trailing backslash.

Each file is in megabytes. Total files in both folders - 80

1) Please suggest optimization in the below code, because pattern check is done twice.

 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])

2) Please suggest coding style optimizations for tree walk, cache and command line arguments.

edited Jul 21 at 13:46

Mast

7,30663483

asked Jul 18 at 15:03

overexchange

1,59021945

Problem

Two folders having files with nameEfile that store lines starting with hex codes.

Print those hex codes that overlap in both folders.

Solution

import os
import sys
import re

def process_cache(cache):
 event_code_set = set()
 for file_name, multi_line_content in cache.items():
 if file_name.endswith('Efile'):
 for line in multi_line_content.splitlines():
 line = line.rstrip('\') # trailing backslash, if exist
 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])
 return event_code_set



def scan_files(dir):
 cache = 
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name in ('Efile'):
 path = os.path.join(root, name)
 with open(path,'r') as file:
 cache[path] = file.read()
 return cache

cache1 = scan_files(sys.argv[1])
cache2 = scan_files(sys.argv[2])

cache1_event_code_set = process_cache(cache1)
cache2_event_code_set = process_cache(cache2)

overlap_event_codes = cache1_event_code_set & cache2_event_code_set 
print(overlap_event_codes)

For a typical three entries in a file, with comments(#),

0x00010d35 D 11 G 3,0x10009,N R XY.Condition, "(a 0x40001 == H 0x166) && (a 0x11ff8 == I 15)","0x0763ffc2 "
# Below event codes are from vendor xyz
0x84900c5 M 22 Y 1,0x03330469,4,5,6,7,8
0x04b60ff6 L 50 U 
 0x04c60e07,102 && (a 0x11ff8 == I 15)","0x0763ffc2 "

Picking 0x00010d35, 0x04b60ff6 & 0x84900c5 is the task. Rest of the line is supposed to be ignored

Some entries are multi-line with trailing backslash.

Each file is in megabytes. Total files in both folders - 80

1) Please suggest optimization in the below code, because pattern check is done twice.

 if bool(re.search(r'^0[xX][0-9a-fA-F]+', line)):
 # Take the hexcode
 obj = re.search(r'^0[xX][0-9a-fA-F]+', line)
 event_code_set.add(obj.string[obj.start():obj.end()])

2) Please suggest coding style optimizations for tree walk, cache and command line arguments.

edited Jul 21 at 13:46

Mast

7,30663483

asked Jul 18 at 15:03

overexchange

1,59021945

edited Jul 21 at 13:46

Mast

7,30663483

edited Jul 21 at 13:46

Mast

7,30663483

edited Jul 21 at 13:46

Mast

7,30663483

asked Jul 18 at 15:03

overexchange

1,59021945

asked Jul 18 at 15:03

overexchange

1,59021945

asked Jul 18 at 15:03

overexchange

1,59021945

Returning cache is a bug and wrong practice, despite it works... cache1 should not point to local cache
â€“Â overexchange
Jul 21 at 12:48

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
â€“Â Mast
Jul 21 at 13:46

add a commentÂ |Â

Returning cache is a bug and wrong practice, despite it works... cache1 should not point to local cache
â€“Â overexchange
Jul 21 at 12:48

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
â€“Â Mast
Jul 21 at 13:46

Returning cache is a bug and wrong practice, despite it works... cache1 should not point to local cache
â€“Â overexchange
Jul 21 at 12:48

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
â€“Â Mast
Jul 21 at 13:46

add a commentÂ |Â

2 Answers
2

active

oldest

votes

up vote
4
down vote

accepted

Possible bugs

In one place, you test for if file_name.endswith('Efile'), and elsewhere, you test for if name in ('Efile'). What do you really intend to do? Are they supposed to be the same test? Can't you just test the filename once? The name in ('Efile') test is particularly weird, since 'fil' would pass the test.

Your regex is case insensitive, which implies that you expect to be able to handle both uppercase and lowercase hex strings. But your set operations would treat uppercase and lowercase versions of the same hex code as distinct from each other, which is counterintuitive.

Your sample file shows that the hex codes may have leading zeroes (e.g. 0x00010d35), and that the codes may have varying lengths. So, 0x00010d35 would be treated differently from 0x10d35, which is counterintuitive.

The solution to both of these issues is to normalize the codes when constructing the set. One way would be to strip any leading zeroes and convert the string to lowercase. A better efficient solution would be to parse them as integers, since integer comparisons would be more efficient than string comparisons.

What's the point of line = line.rstrip('\')? But what's the point of meddling with the ends of the lines, when you only care about the beginnings of the lines?

On the other hand, your example file suggests that a backslash at the end of the line would be used to indicate that the following line is a continuation of the same record. In that case, what if the continuation line starts with something that looks like a hex code (with no leading whitespace)? Would you count that or not?

Efficiency

You said that each of the ~100 files may be several megabytes long. Your cache object, reads the entire contents of all of the files into memory! There is no good reason to store that much data, when all you care about is the hex codes at the beginning of each line.

If you want to split the work into two functions, I would have one of them be responsible for discovering the relevant file paths, and the other one responsible for extracting the codes.

If you run a regex search many times, it's worth compiling it first. I would design the regex so that it works on the entire file contents at once, rather than line by line.

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199759%2fscan-files-and-pick-overlapped-hex-codes%23new-answer', 'question_page');

);

Post as a guest

Name

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
4
down vote

accepted

Possible bugs

In one place, you test for if file_name.endswith('Efile'), and elsewhere, you test for if name in ('Efile'). What do you really intend to do? Are they supposed to be the same test? Can't you just test the filename once? The name in ('Efile') test is particularly weird, since 'fil' would pass the test.

Your regex is case insensitive, which implies that you expect to be able to handle both uppercase and lowercase hex strings. But your set operations would treat uppercase and lowercase versions of the same hex code as distinct from each other, which is counterintuitive.

Your sample file shows that the hex codes may have leading zeroes (e.g. 0x00010d35), and that the codes may have varying lengths. So, 0x00010d35 would be treated differently from 0x10d35, which is counterintuitive.

The solution to both of these issues is to normalize the codes when constructing the set. One way would be to strip any leading zeroes and convert the string to lowercase. A better efficient solution would be to parse them as integers, since integer comparisons would be more efficient than string comparisons.

What's the point of line = line.rstrip('\')? But what's the point of meddling with the ends of the lines, when you only care about the beginnings of the lines?

On the other hand, your example file suggests that a backslash at the end of the line would be used to indicate that the following line is a continuation of the same record. In that case, what if the continuation line starts with something that looks like a hex code (with no leading whitespace)? Would you count that or not?

Efficiency

If you want to split the work into two functions, I would have one of them be responsible for discovering the relevant file paths, and the other one responsible for extracting the codes.

If you run a regex search many times, it's worth compiling it first. I would design the regex so that it works on the entire file contents at once, rather than line by line.

Possible bugs

In one place, you test for if file_name.endswith('Efile'), and elsewhere, you test for if name in ('Efile'). What do you really intend to do? Are they supposed to be the same test? Can't you just test the filename once? The name in ('Efile') test is particularly weird, since 'fil' would pass the test.

Your regex is case insensitive, which implies that you expect to be able to handle both uppercase and lowercase hex strings. But your set operations would treat uppercase and lowercase versions of the same hex code as distinct from each other, which is counterintuitive.

Your sample file shows that the hex codes may have leading zeroes (e.g. 0x00010d35), and that the codes may have varying lengths. So, 0x00010d35 would be treated differently from 0x10d35, which is counterintuitive.

The solution to both of these issues is to normalize the codes when constructing the set. One way would be to strip any leading zeroes and convert the string to lowercase. A better efficient solution would be to parse them as integers, since integer comparisons would be more efficient than string comparisons.

What's the point of line = line.rstrip('\')? But what's the point of meddling with the ends of the lines, when you only care about the beginnings of the lines?

On the other hand, your example file suggests that a backslash at the end of the line would be used to indicate that the following line is a continuation of the same record. In that case, what if the continuation line starts with something that looks like a hex code (with no leading whitespace)? Would you count that or not?

Efficiency

If you want to split the work into two functions, I would have one of them be responsible for discovering the relevant file paths, and the other one responsible for extracting the codes.

If you run a regex search many times, it's worth compiling it first. I would design the regex so that it works on the entire file contents at once, rather than line by line.

Possible bugs

In one place, you test for if file_name.endswith('Efile'), and elsewhere, you test for if name in ('Efile'). What do you really intend to do? Are they supposed to be the same test? Can't you just test the filename once? The name in ('Efile') test is particularly weird, since 'fil' would pass the test.

Your regex is case insensitive, which implies that you expect to be able to handle both uppercase and lowercase hex strings. But your set operations would treat uppercase and lowercase versions of the same hex code as distinct from each other, which is counterintuitive.

Your sample file shows that the hex codes may have leading zeroes (e.g. 0x00010d35), and that the codes may have varying lengths. So, 0x00010d35 would be treated differently from 0x10d35, which is counterintuitive.

The solution to both of these issues is to normalize the codes when constructing the set. One way would be to strip any leading zeroes and convert the string to lowercase. A better efficient solution would be to parse them as integers, since integer comparisons would be more efficient than string comparisons.

What's the point of line = line.rstrip('\')? But what's the point of meddling with the ends of the lines, when you only care about the beginnings of the lines?

On the other hand, your example file suggests that a backslash at the end of the line would be used to indicate that the following line is a continuation of the same record. In that case, what if the continuation line starts with something that looks like a hex code (with no leading whitespace)? Would you count that or not?

Efficiency

If you want to split the work into two functions, I would have one of them be responsible for discovering the relevant file paths, and the other one responsible for extracting the codes.

If you run a regex search many times, it's worth compiling it first. I would design the regex so that it works on the entire file contents at once, rather than line by line.

Possible bugs

In one place, you test for if file_name.endswith('Efile'), and elsewhere, you test for if name in ('Efile'). What do you really intend to do? Are they supposed to be the same test? Can't you just test the filename once? The name in ('Efile') test is particularly weird, since 'fil' would pass the test.

Your regex is case insensitive, which implies that you expect to be able to handle both uppercase and lowercase hex strings. But your set operations would treat uppercase and lowercase versions of the same hex code as distinct from each other, which is counterintuitive.

Your sample file shows that the hex codes may have leading zeroes (e.g. 0x00010d35), and that the codes may have varying lengths. So, 0x00010d35 would be treated differently from 0x10d35, which is counterintuitive.

The solution to both of these issues is to normalize the codes when constructing the set. One way would be to strip any leading zeroes and convert the string to lowercase. A better efficient solution would be to parse them as integers, since integer comparisons would be more efficient than string comparisons.

What's the point of line = line.rstrip('\')? But what's the point of meddling with the ends of the lines, when you only care about the beginnings of the lines?

On the other hand, your example file suggests that a backslash at the end of the line would be used to indicate that the following line is a continuation of the same record. In that case, what if the continuation line starts with something that looks like a hex code (with no leading whitespace)? Would you count that or not?

Efficiency

If you want to split the work into two functions, I would have one of them be responsible for discovering the relevant file paths, and the other one responsible for extracting the codes.

If you run a regex search many times, it's worth compiling it first. I would design the regex so that it works on the entire file contents at once, rather than line by line.

Suggested solution

import os
import re
import sys

def efiles(dir):
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name.endswith('Efile'):
 yield os.path.join(root, name)

def event_codes(file_paths):
 hex_re = re.compile(r'(?<!\n)^0[xX][0-9a-fA-F]+', re.MULTILINE)
 for path in file_paths:
 with open(path) as f:
 for event_code in hex_re.findall(f.read()):
 yield int(event_code, 16)

event_code_set1 = set(event_codes(efiles(sys.argv[1])))
event_code_set2 = set(event_codes(efiles(sys.argv[2])))
overlap_event_codes = set(hex(n) for n in (event_code_set1 & event_code_set2))
print(overlap_event_codes)

answered Jul 18 at 22:14

200_success

123k14143399

answered Jul 18 at 22:14

200_success

123k14143399

answered Jul 18 at 22:14

200_success

123k14143399

answered Jul 18 at 22:14

200_success

123k14143399

for rstrip? stackoverflow.com/q/51409385/3317808
â€“Â overexchange
Jul 19 at 14:10

continuation line starts with something that looks like a hex code (with no leading whitespace)? Am not suppose to count... you can say my code is relying on indentation with white space... which is vulnerable... How do I deal with this? Do i need to remember previous line read?
â€“Â overexchange
Jul 19 at 14:11

I actually do not need endwith because the file name must be Efile.. So name == 'Efile' should work
â€“Â overexchange
Jul 19 at 14:16

Yes, I am saying that you are relying on the whitespace. My suggested solution solves it by using a negative look-behind assertion in the regex, and applying the regex to the entire file at once rather than line by line.
â€“Â 200_success
Jul 19 at 14:22

As file is MB size, can I avoid opening of complete file?
â€“Â overexchange
Jul 19 at 23:25

Â |Â
show 2 more comments

for rstrip? stackoverflow.com/q/51409385/3317808
â€“Â overexchange
Jul 19 at 14:10

continuation line starts with something that looks like a hex code (with no leading whitespace)? Am not suppose to count... you can say my code is relying on indentation with white space... which is vulnerable... How do I deal with this? Do i need to remember previous line read?
â€“Â overexchange
Jul 19 at 14:11

I actually do not need endwith because the file name must be Efile.. So name == 'Efile' should work
â€“Â overexchange
Jul 19 at 14:16

Yes, I am saying that you are relying on the whitespace. My suggested solution solves it by using a negative look-behind assertion in the regex, and applying the regex to the entire file at once rather than line by line.
â€“Â 200_success
Jul 19 at 14:22

As file is MB size, can I avoid opening of complete file?
â€“Â overexchange
Jul 19 at 23:25

for rstrip? stackoverflow.com/q/51409385/3317808
â€“Â overexchange
Jul 19 at 14:10

continuation line starts with something that looks like a hex code (with no leading whitespace)? Am not suppose to count... you can say my code is relying on indentation with white space... which is vulnerable... How do I deal with this? Do i need to remember previous line read?
â€“Â overexchange
Jul 19 at 14:11

I actually do not need endwith because the file name must be Efile.. So name == 'Efile' should work
â€“Â overexchange
Jul 19 at 14:16

Yes, I am saying that you are relying on the whitespace. My suggested solution solves it by using a negative look-behind assertion in the regex, and applying the regex to the entire file at once rather than line by line.
â€“Â 200_success
Jul 19 at 14:22

As file is MB size, can I avoid opening of complete file?
â€“Â overexchange
Jul 19 at 23:25

Â |Â
show 2 more comments

up vote
5
down vote

Repeating the match can be avoided by storing it in a variable first:

match = re.search(r'^0[xX][0-9a-fA-F]+', line)
if match:
 event_codes.add(match.group())

Python also caches your regex, once you have used it, but you may still squeeze out a bit of performance with re.compile.

I also used the match.group() method, which immediately returns the matching part if the string.

import os
import sys
import re

def event_codes_file(path):
 with open(path) as file:
 for line in file:
 match = re.match(r'0[xX][0-9a-fA-F]+', line)
 if match:
 yield match.group()

def event_codes_dir(dir):
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name.endswith('Efile'):
 path = os.path.join(root, name)
 yield from event_codes_file(path)

if __name__ == "__main__":
 event_codes1 = set(event_codes_dir(sys.argv[1]))
 event_codes2 = set(event_codes_dir(sys.argv[2]))

 overlap_event_codes = event_codes1 & event_codes2
 print(overlap_event_codes)

yield from x was introduced in Python 3.3 and is mostly equivalent to for i in x: yield i.

On a file "test", containing your example content, this produces:

set(event_codes_file("test"))
# '0x00010d35', '0x04b60ff6', '0x84900c5'

edited Jul 19 at 7:24

answered Jul 18 at 16:34

Graipher

20.4k42981

Is Generator enabling the code to avoid cache?
â€“Â overexchange
Jul 18 at 17:17

@overexchange Basically yes, but it could have been done differently. The important part is processing each file as you discover it, avoiding having to load all of them into memory at once. The generator really only avoids having to carry around a separate set per file.
â€“Â Graipher
Jul 18 at 17:31

Yes... Pipe line processing...
â€“Â overexchange
Jul 18 at 17:44

Does line not have trailing backslash? Because you are passing to re.search(). re.search(r'\$', 'hellothere') does not work
â€“Â overexchange
Jul 18 at 18:51

2

name in ('Efile') is not the same as name == 'Efile'. For example, 'fi' in ('Efile') is True.
â€“Â 200_success
Jul 18 at 21:28

Â |Â
show 9 more comments

up vote
5
down vote

Repeating the match can be avoided by storing it in a variable first:

match = re.search(r'^0[xX][0-9a-fA-F]+', line)
if match:
 event_codes.add(match.group())

Python also caches your regex, once you have used it, but you may still squeeze out a bit of performance with re.compile.

I also used the match.group() method, which immediately returns the matching part if the string.

import os
import sys
import re

def event_codes_file(path):
 with open(path) as file:
 for line in file:
 match = re.match(r'0[xX][0-9a-fA-F]+', line)
 if match:
 yield match.group()

def event_codes_dir(dir):
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name.endswith('Efile'):
 path = os.path.join(root, name)
 yield from event_codes_file(path)

if __name__ == "__main__":
 event_codes1 = set(event_codes_dir(sys.argv[1]))
 event_codes2 = set(event_codes_dir(sys.argv[2]))

 overlap_event_codes = event_codes1 & event_codes2
 print(overlap_event_codes)

yield from x was introduced in Python 3.3 and is mostly equivalent to for i in x: yield i.

On a file "test", containing your example content, this produces:

set(event_codes_file("test"))
# '0x00010d35', '0x04b60ff6', '0x84900c5'

edited Jul 19 at 7:24

answered Jul 18 at 16:34

Graipher

20.4k42981

Is Generator enabling the code to avoid cache?
â€“Â overexchange
Jul 18 at 17:17

@overexchange Basically yes, but it could have been done differently. The important part is processing each file as you discover it, avoiding having to load all of them into memory at once. The generator really only avoids having to carry around a separate set per file.
â€“Â Graipher
Jul 18 at 17:31

Yes... Pipe line processing...
â€“Â overexchange
Jul 18 at 17:44

Does line not have trailing backslash? Because you are passing to re.search(). re.search(r'\$', 'hellothere') does not work
â€“Â overexchange
Jul 18 at 18:51

2

name in ('Efile') is not the same as name == 'Efile'. For example, 'fi' in ('Efile') is True.
â€“Â 200_success
Jul 18 at 21:28

Â |Â
show 9 more comments

up vote
5
down vote

Repeating the match can be avoided by storing it in a variable first:

match = re.search(r'^0[xX][0-9a-fA-F]+', line)
if match:
 event_codes.add(match.group())

Python also caches your regex, once you have used it, but you may still squeeze out a bit of performance with re.compile.

I also used the match.group() method, which immediately returns the matching part if the string.

import os
import sys
import re

def event_codes_file(path):
 with open(path) as file:
 for line in file:
 match = re.match(r'0[xX][0-9a-fA-F]+', line)
 if match:
 yield match.group()

def event_codes_dir(dir):
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name.endswith('Efile'):
 path = os.path.join(root, name)
 yield from event_codes_file(path)

if __name__ == "__main__":
 event_codes1 = set(event_codes_dir(sys.argv[1]))
 event_codes2 = set(event_codes_dir(sys.argv[2]))

 overlap_event_codes = event_codes1 & event_codes2
 print(overlap_event_codes)

yield from x was introduced in Python 3.3 and is mostly equivalent to for i in x: yield i.

On a file "test", containing your example content, this produces:

set(event_codes_file("test"))
# '0x00010d35', '0x04b60ff6', '0x84900c5'

edited Jul 19 at 7:24

answered Jul 18 at 16:34

Graipher

20.4k42981

Repeating the match can be avoided by storing it in a variable first:

match = re.search(r'^0[xX][0-9a-fA-F]+', line)
if match:
 event_codes.add(match.group())

Python also caches your regex, once you have used it, but you may still squeeze out a bit of performance with re.compile.

I also used the match.group() method, which immediately returns the matching part if the string.

import os
import sys
import re

def event_codes_file(path):
 with open(path) as file:
 for line in file:
 match = re.match(r'0[xX][0-9a-fA-F]+', line)
 if match:
 yield match.group()

def event_codes_dir(dir):
 for root, dirs, files in os.walk(dir):
 for name in files:
 if name.endswith('Efile'):
 path = os.path.join(root, name)
 yield from event_codes_file(path)

if __name__ == "__main__":
 event_codes1 = set(event_codes_dir(sys.argv[1]))
 event_codes2 = set(event_codes_dir(sys.argv[2]))

 overlap_event_codes = event_codes1 & event_codes2
 print(overlap_event_codes)

yield from x was introduced in Python 3.3 and is mostly equivalent to for i in x: yield i.

On a file "test", containing your example content, this produces:

set(event_codes_file("test"))
# '0x00010d35', '0x04b60ff6', '0x84900c5'

edited Jul 19 at 7:24

answered Jul 18 at 16:34

Graipher

20.4k42981

edited Jul 19 at 7:24

answered Jul 18 at 16:34

Graipher

20.4k42981

answered Jul 18 at 16:34

Graipher

20.4k42981

answered Jul 18 at 16:34

Graipher

20.4k42981

Is Generator enabling the code to avoid cache?
â€“Â overexchange
Jul 18 at 17:17

@overexchange Basically yes, but it could have been done differently. The important part is processing each file as you discover it, avoiding having to load all of them into memory at once. The generator really only avoids having to carry around a separate set per file.
â€“Â Graipher
Jul 18 at 17:31

Yes... Pipe line processing...
â€“Â overexchange
Jul 18 at 17:44

Does line not have trailing backslash? Because you are passing to re.search(). re.search(r'\$', 'hellothere') does not work
â€“Â overexchange
Jul 18 at 18:51

2

name in ('Efile') is not the same as name == 'Efile'. For example, 'fi' in ('Efile') is True.
â€“Â 200_success
Jul 18 at 21:28

Â |Â
show 9 more comments

Is Generator enabling the code to avoid cache?
â€“Â overexchange
Jul 18 at 17:17

@overexchange Basically yes, but it could have been done differently. The important part is processing each file as you discover it, avoiding having to load all of them into memory at once. The generator really only avoids having to carry around a separate set per file.
â€“Â Graipher
Jul 18 at 17:31

Yes... Pipe line processing...
â€“Â overexchange
Jul 18 at 17:44

Does line not have trailing backslash? Because you are passing to re.search(). re.search(r'\$', 'hellothere') does not work
â€“Â overexchange
Jul 18 at 18:51

2

name in ('Efile') is not the same as name == 'Efile'. For example, 'fi' in ('Efile') is True.
â€“Â 200_success
Jul 18 at 21:28

Is Generator enabling the code to avoid cache?
â€“Â overexchange
Jul 18 at 17:17

@overexchange Basically yes, but it could have been done differently. The important part is processing each file as you discover it, avoiding having to load all of them into memory at once. The generator really only avoids having to carry around a separate set per file.
â€“Â Graipher
Jul 18 at 17:31

Yes... Pipe line processing...
â€“Â overexchange
Jul 18 at 17:44

Does line not have trailing backslash? Because you are passing to re.search(). re.search(r'\$', 'hellothere') does not work
â€“Â overexchange
Jul 18 at 18:51

name in ('Efile') is not the same as name == 'Efile'. For example, 'fi' in ('Efile') is True.
â€“Â 200_success
Jul 18 at 21:28

Â |Â
show 9 more comments

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr

Scan files and pick overlapped hex codes

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

2 Answers
2

Possible bugs

Efficiency

Suggested solution

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Post as a guest

Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods

Scan files and pick overlapped hex codes

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

2 Answers 2

Possible bugs

Efficiency

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Possible bugs

Efficiency

Suggested solution

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods

2 Answers
2

2 Answers
2

2 Answers
2