Optimizing MD5 OpenSSL Implementation in C for Precomputed Hash

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
4
down vote

favorite

I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.

The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.

I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).

I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.

I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.

Any suggestion to increase the performance?

Bash Version:

#/!bin/bash

while read line

do

printf $line | md5

done

Python Version:

import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

 for line in inf:

 outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')

C Version:

//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
 FILE *infile, *outfile;
 char *string = NULL;
 size_t len = 0;
 ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

 //Part of MD5 Hash Function (Taken out of While Loop for Optimization)
 int md5;
 unsigned char result[MD5_DIGEST_LENGTH];
 //

 infile = fopen("file.txt", "r");

 if (infile == NULL)
 exit(EXIT_FAILURE);

 outfile = fopen("MD.txt","w");

 if (outfile == NULL)
 exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
 while ((read = getline(&string, &len, infile)) != -1) 

 string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


 MD5(string, strlen(string), result);

 //output
 for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

 fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

 fprintf(outfile,"n"); //newline for the output file

 

 free(string); //free string
 fclose(infile); // close file streams
 fclose(outfile);
 exit(EXIT_SUCCESS); //Program Ends

I can provide the execution time with different size of input if requested. Any help will be appreciated.

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

asked Apr 17 at 16:18

Alex Lo

587

Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â€“Â Martin R
Apr 17 at 18:40

add a commentÂ |Â

up vote
4
down vote

favorite

I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.

The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.

I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.

I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.

Any suggestion to increase the performance?

Bash Version:

#/!bin/bash

while read line

do

printf $line | md5

done

Python Version:

import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

 for line in inf:

 outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')

C Version:

//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
 FILE *infile, *outfile;
 char *string = NULL;
 size_t len = 0;
 ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

 //Part of MD5 Hash Function (Taken out of While Loop for Optimization)
 int md5;
 unsigned char result[MD5_DIGEST_LENGTH];
 //

 infile = fopen("file.txt", "r");

 if (infile == NULL)
 exit(EXIT_FAILURE);

 outfile = fopen("MD.txt","w");

 if (outfile == NULL)
 exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
 while ((read = getline(&string, &len, infile)) != -1) 

 string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


 MD5(string, strlen(string), result);

 //output
 for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

 fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

 fprintf(outfile,"n"); //newline for the output file

 

 free(string); //free string
 fclose(infile); // close file streams
 fclose(outfile);
 exit(EXIT_SUCCESS); //Program Ends

I can provide the execution time with different size of input if requested. Any help will be appreciated.

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

asked Apr 17 at 16:18

Alex Lo

587

Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â€“Â Martin R
Apr 17 at 18:40

add a commentÂ |Â

up vote
4
down vote

favorite

I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.

The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.

I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.

I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.

Any suggestion to increase the performance?

Bash Version:

#/!bin/bash

while read line

do

printf $line | md5

done

Python Version:

import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

 for line in inf:

 outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')

C Version:

//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
 FILE *infile, *outfile;
 char *string = NULL;
 size_t len = 0;
 ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

 //Part of MD5 Hash Function (Taken out of While Loop for Optimization)
 int md5;
 unsigned char result[MD5_DIGEST_LENGTH];
 //

 infile = fopen("file.txt", "r");

 if (infile == NULL)
 exit(EXIT_FAILURE);

 outfile = fopen("MD.txt","w");

 if (outfile == NULL)
 exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
 while ((read = getline(&string, &len, infile)) != -1) 

 string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


 MD5(string, strlen(string), result);

 //output
 for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

 fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

 fprintf(outfile,"n"); //newline for the output file

 

 free(string); //free string
 fclose(infile); // close file streams
 fclose(outfile);
 exit(EXIT_SUCCESS); //Program Ends

I can provide the execution time with different size of input if requested. Any help will be appreciated.

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

asked Apr 17 at 16:18

Alex Lo

587

I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.

The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.

I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.

I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.

Any suggestion to increase the performance?

Bash Version:

#/!bin/bash

while read line

do

printf $line | md5

done

Python Version:

import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

 for line in inf:

 outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')

C Version:

//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
 FILE *infile, *outfile;
 char *string = NULL;
 size_t len = 0;
 ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

 //Part of MD5 Hash Function (Taken out of While Loop for Optimization)
 int md5;
 unsigned char result[MD5_DIGEST_LENGTH];
 //

 infile = fopen("file.txt", "r");

 if (infile == NULL)
 exit(EXIT_FAILURE);

 outfile = fopen("MD.txt","w");

 if (outfile == NULL)
 exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
 while ((read = getline(&string, &len, infile)) != -1) 

 string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


 MD5(string, strlen(string), result);

 //output
 for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

 fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

 fprintf(outfile,"n"); //newline for the output file

 

 free(string); //free string
 fclose(infile); // close file streams
 fclose(outfile);
 exit(EXIT_SUCCESS); //Program Ends

I can provide the execution time with different size of input if requested. Any help will be appreciated.

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

asked Apr 17 at 16:18

Alex Lo

587

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

edited Apr 18 at 3:44

Jamalâ™¦

30.1k11114225

asked Apr 17 at 16:18

Alex Lo

587

asked Apr 17 at 16:18

Alex Lo

587

asked Apr 17 at 16:18

Alex Lo

587

Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â€“Â Martin R
Apr 17 at 18:40

add a commentÂ |Â

Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â€“Â Martin R
Apr 17 at 18:40

Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â€“Â Martin R
Apr 17 at 18:40

add a commentÂ |Â

1 Answer
1

active

oldest

votes

up vote
7
down vote

accepted

I'll start with a review of your current code before suggesting possible performance
improvements.

Various comments in your code do not add information and can be removed, for example

//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends

Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.

Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example

FILE *infile = fopen("file.txt", "r");
if (infile == NULL) 
 exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL) 
 exit(EXIT_FAILURE);

for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...

Declaring md5 at the top does not increase the performance.

Some variable names can be improved (of course this is partially opinion-based):

char *string is actually the current line.

size_t len is not the length of the current string,
but the capacity of the buffer (re)allocated by getline().

ssize_t read is Ã¢Â€Â“ strictly speaking Ã¢Â€Â“ not the number of bytes read into the
buffer because it excludes the NUL character.

int md5 is not an MD5 value but an index into the buffer containing the
MD5 hash.

I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.

Fix compiler warnings:

MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign

You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.

According to the C11 standard, the declaration of main should be one of

int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */

(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.

Performance improvements

After reading a line from the input file, the string is traversed twice:

while ((read = getline(&string, &len, infile)) != -1) 
 string[strcspn(string, "n")] = 0; // Remove newline 'n'
 MD5(string, strlen(string), result);

 // ...

First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:

while ((read = getline(&string, &len, infile)) != -1) 
 // Remove trailing newline character
 if (read > 0 && string[read - 1] == 'n') 
 read -= 1;
 string[read] = 0;
 

 MD5(string, read, result);

 // ...

However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.

In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)

This immediately revealed that most of the time is spent in fprintf():

enter image description here

Possible reasons are:

String formatting is slow.

All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.

The solution is to:

Write a custom function for converting the MD5 hash to a hex string.

Call printf for the entire string instead of each single byte so reduce the number
of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.

Putting it together

With all those modifications, we have

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer) 
 const char *hexDigits = "0123456789abcdef";
 char *dest = buffer;
 for (size_t i = 0; i < length; i++) 
 *dest++ = hexDigits[data[i] >> 4];
 *dest++ = hexDigits[data[i] & 0x0F];
 
 *dest = 0;
 return buffer;


int main(void) 
 FILE *infile = fopen("file.txt", "r");
 if (infile == NULL) 
 perror("Cannot open input file");
 exit(EXIT_FAILURE);
 
 FILE *outfile = fopen("MD.txt","w");
 if (outfile == NULL) 
 perror("Cannot open output file");
 exit(EXIT_FAILURE);
 

 // Read file line-by-line
 char *line = NULL;
 size_t linecap = 0;
 ssize_t lineLength;
 while ((lineLength = getline(&line, &linecap, infile)) != -1) 
 if (lineLength > 0 && line[lineLength - 1] == 'n') 
 // Remove newline character
 lineLength -= 1;
 line[lineLength] = 0;
 

 // Compute MD5 hash
 unsigned char md5hash[MD5_DIGEST_LENGTH];
 MD5((unsigned char*)line, lineLength, md5hash);

 // Print hash as hex string
 char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
 fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
 fputc('n', outfile);
 
 free(line);

 // Close output files
 fclose(infile);
 fclose(outfile);

Further suggestions

The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.

Implement a -h option to show a short help/usage of the program.

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

1

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

1

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

Â |Â
show 3 more comments

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f192306%2foptimizing-md5-openssl-implementation-in-c-for-precomputed-hash%23new-answer', 'question_page');

);

Post as a guest

Name

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
7
down vote

accepted

I'll start with a review of your current code before suggesting possible performance
improvements.

Various comments in your code do not add information and can be removed, for example

//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends

Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.

Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example

FILE *infile = fopen("file.txt", "r");
if (infile == NULL) 
 exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL) 
 exit(EXIT_FAILURE);

for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...

Declaring md5 at the top does not increase the performance.

Some variable names can be improved (of course this is partially opinion-based):

char *string is actually the current line.

size_t len is not the length of the current string,
but the capacity of the buffer (re)allocated by getline().

ssize_t read is Ã¢Â€Â“ strictly speaking Ã¢Â€Â“ not the number of bytes read into the
buffer because it excludes the NUL character.

int md5 is not an MD5 value but an index into the buffer containing the
MD5 hash.

I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.

Fix compiler warnings:

MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign

You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.

According to the C11 standard, the declaration of main should be one of

int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */

(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.

Performance improvements

After reading a line from the input file, the string is traversed twice:

while ((read = getline(&string, &len, infile)) != -1) 
 string[strcspn(string, "n")] = 0; // Remove newline 'n'
 MD5(string, strlen(string), result);

 // ...

while ((read = getline(&string, &len, infile)) != -1) 
 // Remove trailing newline character
 if (read > 0 && string[read - 1] == 'n') 
 read -= 1;
 string[read] = 0;
 

 MD5(string, read, result);

 // ...

However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.

In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)

This immediately revealed that most of the time is spent in fprintf():

enter image description here

Possible reasons are:

String formatting is slow.

All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.

The solution is to:

Write a custom function for converting the MD5 hash to a hex string.

Call printf for the entire string instead of each single byte so reduce the number
of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.

Putting it together

With all those modifications, we have

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer) 
 const char *hexDigits = "0123456789abcdef";
 char *dest = buffer;
 for (size_t i = 0; i < length; i++) 
 *dest++ = hexDigits[data[i] >> 4];
 *dest++ = hexDigits[data[i] & 0x0F];
 
 *dest = 0;
 return buffer;


int main(void) 
 FILE *infile = fopen("file.txt", "r");
 if (infile == NULL) 
 perror("Cannot open input file");
 exit(EXIT_FAILURE);
 
 FILE *outfile = fopen("MD.txt","w");
 if (outfile == NULL) 
 perror("Cannot open output file");
 exit(EXIT_FAILURE);
 

 // Read file line-by-line
 char *line = NULL;
 size_t linecap = 0;
 ssize_t lineLength;
 while ((lineLength = getline(&line, &linecap, infile)) != -1) 
 if (lineLength > 0 && line[lineLength - 1] == 'n') 
 // Remove newline character
 lineLength -= 1;
 line[lineLength] = 0;
 

 // Compute MD5 hash
 unsigned char md5hash[MD5_DIGEST_LENGTH];
 MD5((unsigned char*)line, lineLength, md5hash);

 // Print hash as hex string
 char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
 fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
 fputc('n', outfile);
 
 free(line);

 // Close output files
 fclose(infile);
 fclose(outfile);

Further suggestions

The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.

Implement a -h option to show a short help/usage of the program.

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

1

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

1

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

Â |Â
show 3 more comments

up vote
7
down vote

accepted

I'll start with a review of your current code before suggesting possible performance
improvements.

Various comments in your code do not add information and can be removed, for example

//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends

Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.

Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example

FILE *infile = fopen("file.txt", "r");
if (infile == NULL) 
 exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL) 
 exit(EXIT_FAILURE);

for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...

Declaring md5 at the top does not increase the performance.

Some variable names can be improved (of course this is partially opinion-based):

char *string is actually the current line.

size_t len is not the length of the current string,
but the capacity of the buffer (re)allocated by getline().

ssize_t read is Ã¢Â€Â“ strictly speaking Ã¢Â€Â“ not the number of bytes read into the
buffer because it excludes the NUL character.

int md5 is not an MD5 value but an index into the buffer containing the
MD5 hash.

I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.

Fix compiler warnings:

MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign

You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.

According to the C11 standard, the declaration of main should be one of

int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */

(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.

Performance improvements

After reading a line from the input file, the string is traversed twice:

while ((read = getline(&string, &len, infile)) != -1) 
 string[strcspn(string, "n")] = 0; // Remove newline 'n'
 MD5(string, strlen(string), result);

 // ...

while ((read = getline(&string, &len, infile)) != -1) 
 // Remove trailing newline character
 if (read > 0 && string[read - 1] == 'n') 
 read -= 1;
 string[read] = 0;
 

 MD5(string, read, result);

 // ...

However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.

In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)

This immediately revealed that most of the time is spent in fprintf():

enter image description here

Possible reasons are:

String formatting is slow.

All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.

The solution is to:

Write a custom function for converting the MD5 hash to a hex string.

Call printf for the entire string instead of each single byte so reduce the number
of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.

Putting it together

With all those modifications, we have

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer) 
 const char *hexDigits = "0123456789abcdef";
 char *dest = buffer;
 for (size_t i = 0; i < length; i++) 
 *dest++ = hexDigits[data[i] >> 4];
 *dest++ = hexDigits[data[i] & 0x0F];
 
 *dest = 0;
 return buffer;


int main(void) 
 FILE *infile = fopen("file.txt", "r");
 if (infile == NULL) 
 perror("Cannot open input file");
 exit(EXIT_FAILURE);
 
 FILE *outfile = fopen("MD.txt","w");
 if (outfile == NULL) 
 perror("Cannot open output file");
 exit(EXIT_FAILURE);
 

 // Read file line-by-line
 char *line = NULL;
 size_t linecap = 0;
 ssize_t lineLength;
 while ((lineLength = getline(&line, &linecap, infile)) != -1) 
 if (lineLength > 0 && line[lineLength - 1] == 'n') 
 // Remove newline character
 lineLength -= 1;
 line[lineLength] = 0;
 

 // Compute MD5 hash
 unsigned char md5hash[MD5_DIGEST_LENGTH];
 MD5((unsigned char*)line, lineLength, md5hash);

 // Print hash as hex string
 char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
 fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
 fputc('n', outfile);
 
 free(line);

 // Close output files
 fclose(infile);
 fclose(outfile);

Further suggestions

The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.

Implement a -h option to show a short help/usage of the program.

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

1

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

1

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

Â |Â
show 3 more comments

up vote
7
down vote

accepted

I'll start with a review of your current code before suggesting possible performance
improvements.

Various comments in your code do not add information and can be removed, for example

//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends

Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.

Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example

FILE *infile = fopen("file.txt", "r");
if (infile == NULL) 
 exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL) 
 exit(EXIT_FAILURE);

for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...

Declaring md5 at the top does not increase the performance.

Some variable names can be improved (of course this is partially opinion-based):

char *string is actually the current line.

size_t len is not the length of the current string,
but the capacity of the buffer (re)allocated by getline().

ssize_t read is Ã¢Â€Â“ strictly speaking Ã¢Â€Â“ not the number of bytes read into the
buffer because it excludes the NUL character.

int md5 is not an MD5 value but an index into the buffer containing the
MD5 hash.

I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.

Fix compiler warnings:

MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign

You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.

According to the C11 standard, the declaration of main should be one of

int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */

(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.

Performance improvements

After reading a line from the input file, the string is traversed twice:

while ((read = getline(&string, &len, infile)) != -1) 
 string[strcspn(string, "n")] = 0; // Remove newline 'n'
 MD5(string, strlen(string), result);

 // ...

while ((read = getline(&string, &len, infile)) != -1) 
 // Remove trailing newline character
 if (read > 0 && string[read - 1] == 'n') 
 read -= 1;
 string[read] = 0;
 

 MD5(string, read, result);

 // ...

However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.

In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)

This immediately revealed that most of the time is spent in fprintf():

enter image description here

Possible reasons are:

String formatting is slow.

All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.

The solution is to:

Write a custom function for converting the MD5 hash to a hex string.

Call printf for the entire string instead of each single byte so reduce the number
of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.

Putting it together

With all those modifications, we have

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer) 
 const char *hexDigits = "0123456789abcdef";
 char *dest = buffer;
 for (size_t i = 0; i < length; i++) 
 *dest++ = hexDigits[data[i] >> 4];
 *dest++ = hexDigits[data[i] & 0x0F];
 
 *dest = 0;
 return buffer;


int main(void) 
 FILE *infile = fopen("file.txt", "r");
 if (infile == NULL) 
 perror("Cannot open input file");
 exit(EXIT_FAILURE);
 
 FILE *outfile = fopen("MD.txt","w");
 if (outfile == NULL) 
 perror("Cannot open output file");
 exit(EXIT_FAILURE);
 

 // Read file line-by-line
 char *line = NULL;
 size_t linecap = 0;
 ssize_t lineLength;
 while ((lineLength = getline(&line, &linecap, infile)) != -1) 
 if (lineLength > 0 && line[lineLength - 1] == 'n') 
 // Remove newline character
 lineLength -= 1;
 line[lineLength] = 0;
 

 // Compute MD5 hash
 unsigned char md5hash[MD5_DIGEST_LENGTH];
 MD5((unsigned char*)line, lineLength, md5hash);

 // Print hash as hex string
 char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
 fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
 fputc('n', outfile);
 
 free(line);

 // Close output files
 fclose(infile);
 fclose(outfile);

Further suggestions

The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.

Implement a -h option to show a short help/usage of the program.

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

I'll start with a review of your current code before suggesting possible performance
improvements.

Various comments in your code do not add information and can be removed, for example

//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends

Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.

Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example

FILE *infile = fopen("file.txt", "r");
if (infile == NULL) 
 exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL) 
 exit(EXIT_FAILURE);

for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...

Declaring md5 at the top does not increase the performance.

Some variable names can be improved (of course this is partially opinion-based):

char *string is actually the current line.

size_t len is not the length of the current string,
but the capacity of the buffer (re)allocated by getline().

ssize_t read is Ã¢Â€Â“ strictly speaking Ã¢Â€Â“ not the number of bytes read into the
buffer because it excludes the NUL character.

int md5 is not an MD5 value but an index into the buffer containing the
MD5 hash.

I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.

Fix compiler warnings:

MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign

You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.

According to the C11 standard, the declaration of main should be one of

int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */

(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.

Performance improvements

After reading a line from the input file, the string is traversed twice:

while ((read = getline(&string, &len, infile)) != -1) 
 string[strcspn(string, "n")] = 0; // Remove newline 'n'
 MD5(string, strlen(string), result);

 // ...

while ((read = getline(&string, &len, infile)) != -1) 
 // Remove trailing newline character
 if (read > 0 && string[read - 1] == 'n') 
 read -= 1;
 string[read] = 0;
 

 MD5(string, read, result);

 // ...

However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.

In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)

This immediately revealed that most of the time is spent in fprintf():

enter image description here

Possible reasons are:

String formatting is slow.

All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.

The solution is to:

Write a custom function for converting the MD5 hash to a hex string.

Call printf for the entire string instead of each single byte so reduce the number
of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.

Putting it together

With all those modifications, we have

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer) 
 const char *hexDigits = "0123456789abcdef";
 char *dest = buffer;
 for (size_t i = 0; i < length; i++) 
 *dest++ = hexDigits[data[i] >> 4];
 *dest++ = hexDigits[data[i] & 0x0F];
 
 *dest = 0;
 return buffer;


int main(void) 
 FILE *infile = fopen("file.txt", "r");
 if (infile == NULL) 
 perror("Cannot open input file");
 exit(EXIT_FAILURE);
 
 FILE *outfile = fopen("MD.txt","w");
 if (outfile == NULL) 
 perror("Cannot open output file");
 exit(EXIT_FAILURE);
 

 // Read file line-by-line
 char *line = NULL;
 size_t linecap = 0;
 ssize_t lineLength;
 while ((lineLength = getline(&line, &linecap, infile)) != -1) 
 if (lineLength > 0 && line[lineLength - 1] == 'n') 
 // Remove newline character
 lineLength -= 1;
 line[lineLength] = 0;
 

 // Compute MD5 hash
 unsigned char md5hash[MD5_DIGEST_LENGTH];
 MD5((unsigned char*)line, lineLength, md5hash);

 // Print hash as hex string
 char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
 fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
 fputc('n', outfile);
 
 free(line);

 // Close output files
 fclose(infile);
 fclose(outfile);

Further suggestions

The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.

Implement a -h option to show a short help/usage of the program.

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

edited Apr 18 at 7:49

answered Apr 17 at 20:01

Martin R

14k12157

answered Apr 17 at 20:01

Martin R

14k12157

answered Apr 17 at 20:01

Martin R

14k12157

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

1

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

1

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

Â |Â
show 3 more comments

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

1

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

1

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
â€“Â Alex Lo
Apr 18 at 3:25

Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
â€“Â Alex Lo
Apr 18 at 3:26

Thanks for the adding the error handling part of the code and making the program C11 compliant.
â€“Â Alex Lo
Apr 18 at 3:32

@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â€“Â Martin R
Apr 18 at 5:59

@AlexLo: See update.
â€“Â Martin R
Apr 18 at 7:29

Â |Â
show 3 more comments

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr