Optimizing MD5 OpenSSL Implementation in C for Precomputed Hash

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
4
down vote

favorite












I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.



The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.



I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).



I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.



I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.



Any suggestion to increase the performance?



Bash Version:




#/!bin/bash

while read line

do

printf $line | md5

done



Python Version:




import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

for line in inf:

outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')



C Version:



//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//

infile = fopen("file.txt", "r");

if (infile == NULL)
exit(EXIT_FAILURE);

outfile = fopen("MD.txt","w");

if (outfile == NULL)
exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)

string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


MD5(string, strlen(string), result);

//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

fprintf(outfile,"n"); //newline for the output file



free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends



I can provide the execution time with different size of input if requested. Any help will be appreciated.







share|improve this question





















  • Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
    – Martin R
    Apr 17 at 18:40
















up vote
4
down vote

favorite












I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.



The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.



I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).



I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.



I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.



Any suggestion to increase the performance?



Bash Version:




#/!bin/bash

while read line

do

printf $line | md5

done



Python Version:




import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

for line in inf:

outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')



C Version:



//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//

infile = fopen("file.txt", "r");

if (infile == NULL)
exit(EXIT_FAILURE);

outfile = fopen("MD.txt","w");

if (outfile == NULL)
exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)

string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


MD5(string, strlen(string), result);

//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

fprintf(outfile,"n"); //newline for the output file



free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends



I can provide the execution time with different size of input if requested. Any help will be appreciated.







share|improve this question





















  • Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
    – Martin R
    Apr 17 at 18:40












up vote
4
down vote

favorite









up vote
4
down vote

favorite











I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.



The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.



I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).



I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.



I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.



Any suggestion to increase the performance?



Bash Version:




#/!bin/bash

while read line

do

printf $line | md5

done



Python Version:




import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

for line in inf:

outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')



C Version:



//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//

infile = fopen("file.txt", "r");

if (infile == NULL)
exit(EXIT_FAILURE);

outfile = fopen("MD.txt","w");

if (outfile == NULL)
exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)

string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


MD5(string, strlen(string), result);

//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

fprintf(outfile,"n"); //newline for the output file



free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends



I can provide the execution time with different size of input if requested. Any help will be appreciated.







share|improve this question













I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.



The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.



I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).



I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.



I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.



Any suggestion to increase the performance?



Bash Version:




#/!bin/bash

while read line

do

printf $line | md5

done



Python Version:




import hashlib

infile = 'wordlist'

outfile = open("precomputed","a")

with open(infile, "r") as inf:

for line in inf:

outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')



C Version:



//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>


//------------------------------Main Function------------------------//
int main()


//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//

//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//

infile = fopen("file.txt", "r");

if (infile == NULL)
exit(EXIT_FAILURE);

outfile = fopen("MD.txt","w");

if (outfile == NULL)
exit(EXIT_FAILURE);

//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)

string[strcspn(string, "n")] = 0; // Remove newline 'n'

//-------------------------MD5 Hash Function-------------------------//


MD5(string, strlen(string), result);

//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)

fprintf(outfile,"%02x",result[md5]); //convert the hash to hex

fprintf(outfile,"n"); //newline for the output file



free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends



I can provide the execution time with different size of input if requested. Any help will be appreciated.









share|improve this question












share|improve this question




share|improve this question








edited Apr 18 at 3:44









Jamal♦

30.1k11114225




30.1k11114225









asked Apr 17 at 16:18









Alex Lo

587




587











  • Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
    – Martin R
    Apr 17 at 18:40
















  • Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
    – Martin R
    Apr 17 at 18:40















Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
– Martin R
Apr 17 at 18:40




Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
– Martin R
Apr 17 at 18:40










1 Answer
1






active

oldest

votes

















up vote
7
down vote



accepted










I'll start with a review of your current code before suggesting possible performance
improvements.



Various comments in your code do not add information and can be removed, for example



//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends



Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.




Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example



FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);



or



for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ... 


Declaring md5 at the top does not increase the performance.




Some variable names can be improved (of course this is partially opinion-based):




  • char *string is actually the current line.


  • size_t len is not the length of the current string,
    but the capacity of the buffer (re)allocated by getline().


  • ssize_t read is – strictly speaking – not the number of bytes read into the
    buffer because it excludes the NUL character.


  • int md5 is not an MD5 value but an index into the buffer containing the
    MD5 hash.

  • I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.


Fix compiler warnings:



MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign



You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.




According to the C11 standard, the declaration of main should be one of



int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */


(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.



Performance improvements



After reading a line from the input file, the string is traversed twice:



while ((read = getline(&string, &len, infile)) != -1) 
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);

// ...



First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:



while ((read = getline(&string, &len, infile)) != -1) 
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;


MD5(string, read, result);

// ...



However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.



In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)



This immediately revealed that most of the time is spent in fprintf():



enter image description here



Possible reasons are:



  • String formatting is slow.

  • All stdio print operations are thread-safe, and therefore have to aquire and release
    locks for each call.

The solution is to:



  • Write a custom function for converting the MD5 hash to a hex string.

  • Call printf for the entire string instead of each single byte so reduce the number
    of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.



Putting it together



With all those modifications, we have



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];

*dest = 0;
return buffer;


int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);

FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);


// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;


// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);

// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);

free(line);

// Close output files
fclose(infile);
fclose(outfile);



Further suggestions




  • The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are



    • Pass the file names as arguments on the command line, or

    • Make your program read from standard input and write to standard output.


  • Implement a -h option to show a short help/usage of the program.






share|improve this answer























  • See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
    – Alex Lo
    Apr 18 at 3:25











  • Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
    – Alex Lo
    Apr 18 at 3:26










  • Thanks for the adding the error handling part of the code and making the program C11 compliant.
    – Alex Lo
    Apr 18 at 3:32






  • 1




    @AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
    – Martin R
    Apr 18 at 5:59






  • 1




    @AlexLo: See update.
    – Martin R
    Apr 18 at 7:29










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f192306%2foptimizing-md5-openssl-implementation-in-c-for-precomputed-hash%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
7
down vote



accepted










I'll start with a review of your current code before suggesting possible performance
improvements.



Various comments in your code do not add information and can be removed, for example



//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends



Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.




Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example



FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);



or



for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ... 


Declaring md5 at the top does not increase the performance.




Some variable names can be improved (of course this is partially opinion-based):




  • char *string is actually the current line.


  • size_t len is not the length of the current string,
    but the capacity of the buffer (re)allocated by getline().


  • ssize_t read is – strictly speaking – not the number of bytes read into the
    buffer because it excludes the NUL character.


  • int md5 is not an MD5 value but an index into the buffer containing the
    MD5 hash.

  • I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.


Fix compiler warnings:



MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign



You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.




According to the C11 standard, the declaration of main should be one of



int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */


(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.



Performance improvements



After reading a line from the input file, the string is traversed twice:



while ((read = getline(&string, &len, infile)) != -1) 
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);

// ...



First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:



while ((read = getline(&string, &len, infile)) != -1) 
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;


MD5(string, read, result);

// ...



However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.



In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)



This immediately revealed that most of the time is spent in fprintf():



enter image description here



Possible reasons are:



  • String formatting is slow.

  • All stdio print operations are thread-safe, and therefore have to aquire and release
    locks for each call.

The solution is to:



  • Write a custom function for converting the MD5 hash to a hex string.

  • Call printf for the entire string instead of each single byte so reduce the number
    of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.



Putting it together



With all those modifications, we have



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];

*dest = 0;
return buffer;


int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);

FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);


// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;


// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);

// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);

free(line);

// Close output files
fclose(infile);
fclose(outfile);



Further suggestions




  • The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are



    • Pass the file names as arguments on the command line, or

    • Make your program read from standard input and write to standard output.


  • Implement a -h option to show a short help/usage of the program.






share|improve this answer























  • See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
    – Alex Lo
    Apr 18 at 3:25











  • Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
    – Alex Lo
    Apr 18 at 3:26










  • Thanks for the adding the error handling part of the code and making the program C11 compliant.
    – Alex Lo
    Apr 18 at 3:32






  • 1




    @AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
    – Martin R
    Apr 18 at 5:59






  • 1




    @AlexLo: See update.
    – Martin R
    Apr 18 at 7:29














up vote
7
down vote



accepted










I'll start with a review of your current code before suggesting possible performance
improvements.



Various comments in your code do not add information and can be removed, for example



//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends



Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.




Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example



FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);



or



for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ... 


Declaring md5 at the top does not increase the performance.




Some variable names can be improved (of course this is partially opinion-based):




  • char *string is actually the current line.


  • size_t len is not the length of the current string,
    but the capacity of the buffer (re)allocated by getline().


  • ssize_t read is – strictly speaking – not the number of bytes read into the
    buffer because it excludes the NUL character.


  • int md5 is not an MD5 value but an index into the buffer containing the
    MD5 hash.

  • I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.


Fix compiler warnings:



MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign



You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.




According to the C11 standard, the declaration of main should be one of



int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */


(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.



Performance improvements



After reading a line from the input file, the string is traversed twice:



while ((read = getline(&string, &len, infile)) != -1) 
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);

// ...



First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:



while ((read = getline(&string, &len, infile)) != -1) 
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;


MD5(string, read, result);

// ...



However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.



In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)



This immediately revealed that most of the time is spent in fprintf():



enter image description here



Possible reasons are:



  • String formatting is slow.

  • All stdio print operations are thread-safe, and therefore have to aquire and release
    locks for each call.

The solution is to:



  • Write a custom function for converting the MD5 hash to a hex string.

  • Call printf for the entire string instead of each single byte so reduce the number
    of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.



Putting it together



With all those modifications, we have



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];

*dest = 0;
return buffer;


int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);

FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);


// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;


// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);

// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);

free(line);

// Close output files
fclose(infile);
fclose(outfile);



Further suggestions




  • The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are



    • Pass the file names as arguments on the command line, or

    • Make your program read from standard input and write to standard output.


  • Implement a -h option to show a short help/usage of the program.






share|improve this answer























  • See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
    – Alex Lo
    Apr 18 at 3:25











  • Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
    – Alex Lo
    Apr 18 at 3:26










  • Thanks for the adding the error handling part of the code and making the program C11 compliant.
    – Alex Lo
    Apr 18 at 3:32






  • 1




    @AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
    – Martin R
    Apr 18 at 5:59






  • 1




    @AlexLo: See update.
    – Martin R
    Apr 18 at 7:29












up vote
7
down vote



accepted







up vote
7
down vote



accepted






I'll start with a review of your current code before suggesting possible performance
improvements.



Various comments in your code do not add information and can be removed, for example



//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends



Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.




Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example



FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);



or



for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ... 


Declaring md5 at the top does not increase the performance.




Some variable names can be improved (of course this is partially opinion-based):




  • char *string is actually the current line.


  • size_t len is not the length of the current string,
    but the capacity of the buffer (re)allocated by getline().


  • ssize_t read is – strictly speaking – not the number of bytes read into the
    buffer because it excludes the NUL character.


  • int md5 is not an MD5 value but an index into the buffer containing the
    MD5 hash.

  • I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.


Fix compiler warnings:



MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign



You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.




According to the C11 standard, the declaration of main should be one of



int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */


(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.



Performance improvements



After reading a line from the input file, the string is traversed twice:



while ((read = getline(&string, &len, infile)) != -1) 
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);

// ...



First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:



while ((read = getline(&string, &len, infile)) != -1) 
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;


MD5(string, read, result);

// ...



However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.



In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)



This immediately revealed that most of the time is spent in fprintf():



enter image description here



Possible reasons are:



  • String formatting is slow.

  • All stdio print operations are thread-safe, and therefore have to aquire and release
    locks for each call.

The solution is to:



  • Write a custom function for converting the MD5 hash to a hex string.

  • Call printf for the entire string instead of each single byte so reduce the number
    of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.



Putting it together



With all those modifications, we have



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];

*dest = 0;
return buffer;


int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);

FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);


// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;


// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);

// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);

free(line);

// Close output files
fclose(infile);
fclose(outfile);



Further suggestions




  • The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are



    • Pass the file names as arguments on the command line, or

    • Make your program read from standard input and write to standard output.


  • Implement a -h option to show a short help/usage of the program.






share|improve this answer















I'll start with a review of your current code before suggesting possible performance
improvements.



Various comments in your code do not add information and can be removed, for example



//------------------------------Main Function------------------------//
int main()

free(string); //free string

exit(EXIT_SUCCESS); //Program Ends



Always use curly braces with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.




Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example



FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);


FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);



or



for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ... 


Declaring md5 at the top does not increase the performance.




Some variable names can be improved (of course this is partially opinion-based):




  • char *string is actually the current line.


  • size_t len is not the length of the current string,
    but the capacity of the buffer (re)allocated by getline().


  • ssize_t read is – strictly speaking – not the number of bytes read into the
    buffer because it excludes the NUL character.


  • int md5 is not an MD5 value but an index into the buffer containing the
    MD5 hash.

  • I would rename unsigned char result[MD5_DIGEST_LENGTH] to md5hash.


Fix compiler warnings:



MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign



You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.




According to the C11 standard, the declaration of main should be one of



int main(void) /* ... */ 
int main(int argc, char *argv) /* ... */


(see for example What should main() return in C and C++?
on Stack Overflow). The final return statement can be omitted, there is an implicit
return 0.



Performance improvements



After reading a line from the input file, the string is traversed twice:



while ((read = getline(&string, &len, infile)) != -1) 
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);

// ...



First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline() returns the number of characters written to
string, i.e. a newline character can only be at position read - 1:



while ((read = getline(&string, &len, infile)) != -1) 
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;


MD5(string, read, result);

// ...



However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.



In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890)



This immediately revealed that most of the time is spent in fprintf():



enter image description here



Possible reasons are:



  • String formatting is slow.

  • All stdio print operations are thread-safe, and therefore have to aquire and release
    locks for each call.

The solution is to:



  • Write a custom function for converting the MD5 hash to a hex string.

  • Call printf for the entire string instead of each single byte so reduce the number
    of function calls.

On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.



Putting it together



With all those modifications, we have



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];

*dest = 0;
return buffer;


int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);

FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);


// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;


// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);

// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);

free(line);

// Close output files
fclose(infile);
fclose(outfile);



Further suggestions




  • The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are



    • Pass the file names as arguments on the command line, or

    • Make your program read from standard input and write to standard output.


  • Implement a -h option to show a short help/usage of the program.







share|improve this answer















share|improve this answer



share|improve this answer








edited Apr 18 at 7:49


























answered Apr 17 at 20:01









Martin R

14k12157




14k12157











  • See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
    – Alex Lo
    Apr 18 at 3:25











  • Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
    – Alex Lo
    Apr 18 at 3:26










  • Thanks for the adding the error handling part of the code and making the program C11 compliant.
    – Alex Lo
    Apr 18 at 3:32






  • 1




    @AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
    – Martin R
    Apr 18 at 5:59






  • 1




    @AlexLo: See update.
    – Martin R
    Apr 18 at 7:29
















  • See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
    – Alex Lo
    Apr 18 at 3:25











  • Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
    – Alex Lo
    Apr 18 at 3:26










  • Thanks for the adding the error handling part of the code and making the program C11 compliant.
    – Alex Lo
    Apr 18 at 3:32






  • 1




    @AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
    – Martin R
    Apr 18 at 5:59






  • 1




    @AlexLo: See update.
    – Martin R
    Apr 18 at 7:29















See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
– Alex Lo
Apr 18 at 3:25





See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command: gcc filename.c -O3 -L /usr/lib -lssl -lcrypto.
– Alex Lo
Apr 18 at 3:25













Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
– Alex Lo
Apr 18 at 3:26




Note: I generated the infile using crunch. Command: crunch 7 7 1234567890.
– Alex Lo
Apr 18 at 3:26












Thanks for the adding the error handling part of the code and making the program C11 compliant.
– Alex Lo
Apr 18 at 3:32




Thanks for the adding the error handling part of the code and making the program C11 compliant.
– Alex Lo
Apr 18 at 3:32




1




1




@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
– Martin R
Apr 18 at 5:59




@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
– Martin R
Apr 18 at 5:59




1




1




@AlexLo: See update.
– Martin R
Apr 18 at 7:29




@AlexLo: See update.
– Martin R
Apr 18 at 7:29












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f192306%2foptimizing-md5-openssl-implementation-in-c-for-precomputed-hash%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Greedy Best First Search implementation in Rust

Function to Return a JSON Like Objects Using VBA Collections and Arrays

C++11 CLH Lock Implementation