Optimizing MD5 OpenSSL Implementation in C for Precomputed Hash
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
4
down vote
favorite
I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.
The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.
I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).
I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.
I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.
Any suggestion to increase the performance?
Bash Version:
#/!bin/bash
while read line
do
printf $line | md5
done
Python Version:
import hashlib
infile = 'wordlist'
outfile = open("precomputed","a")
with open(infile, "r") as inf:
for line in inf:
outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')
C Version:
//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
//------------------------------Main Function------------------------//
int main()
//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//
//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//
infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
//-------------------------MD5 Hash Function-------------------------//
MD5(string, strlen(string), result);
//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)
fprintf(outfile,"%02x",result[md5]); //convert the hash to hex
fprintf(outfile,"n"); //newline for the output file
free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends
I can provide the execution time with different size of input if requested. Any help will be appreciated.
performance algorithm c openssl
add a comment |Â
up vote
4
down vote
favorite
I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.
The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.
I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).
I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.
I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.
Any suggestion to increase the performance?
Bash Version:
#/!bin/bash
while read line
do
printf $line | md5
done
Python Version:
import hashlib
infile = 'wordlist'
outfile = open("precomputed","a")
with open(infile, "r") as inf:
for line in inf:
outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')
C Version:
//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
//------------------------------Main Function------------------------//
int main()
//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//
//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//
infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
//-------------------------MD5 Hash Function-------------------------//
MD5(string, strlen(string), result);
//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)
fprintf(outfile,"%02x",result[md5]); //convert the hash to hex
fprintf(outfile,"n"); //newline for the output file
free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends
I can provide the execution time with different size of input if requested. Any help will be appreciated.
performance algorithm c openssl
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40
add a comment |Â
up vote
4
down vote
favorite
up vote
4
down vote
favorite
I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.
The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.
I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).
I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.
I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.
Any suggestion to increase the performance?
Bash Version:
#/!bin/bash
while read line
do
printf $line | md5
done
Python Version:
import hashlib
infile = 'wordlist'
outfile = open("precomputed","a")
with open(infile, "r") as inf:
for line in inf:
outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')
C Version:
//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
//------------------------------Main Function------------------------//
int main()
//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//
//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//
infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
//-------------------------MD5 Hash Function-------------------------//
MD5(string, strlen(string), result);
//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)
fprintf(outfile,"%02x",result[md5]); //convert the hash to hex
fprintf(outfile,"n"); //newline for the output file
free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends
I can provide the execution time with different size of input if requested. Any help will be appreciated.
performance algorithm c openssl
I am trying to port a piece of code from Python to C. What the code does is that it generates a list of precomputed MD5 hash from a plaintext wordlist.
The code also has its variations in SHA-1 and SHA-2 families in all 3 programming languages. The structure is the same.
I originally wrote the code in Bash, which was slow. So I ported it to Python which was significantly faster. Now, I have successfully ported the code to C, in hoping that it will run even faster. However, even with
-Ofaster flag on in gcc, the code still run slower than the Python version (the difference in execution time increases exponentially with the input size).
I have also doubted about the efficiency of the OpenSSL crypto libraries, but it seems to me that they are relatively well-established after reading through their Documentation.
I am guessing that it is the nested loop that I implemented in the C version of the code that is slowing the whole thing down.
Any suggestion to increase the performance?
Bash Version:
#/!bin/bash
while read line
do
printf $line | md5
done
Python Version:
import hashlib
infile = 'wordlist'
outfile = open("precomputed","a")
with open(infile, "r") as inf:
for line in inf:
outfile.write(hashlib.md5(line.strip().encode('utf8')).hexdigest()+'n')
C Version:
//-----------Libraries
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
//------------------------------Main Function------------------------//
int main()
//------Define infile, outfile, file length. Define string to be read.---//
FILE *infile, *outfile;
char *string = NULL;
size_t len = 0;
ssize_t read;
//------Open File stream for read(r) and write (w). Error Handling.--//
//Part of MD5 Hash Function (Taken out of While Loop for Optimization)
int md5;
unsigned char result[MD5_DIGEST_LENGTH];
//
infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
//-------------Read line-by-line in using a while loop.--------------//
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
//-------------------------MD5 Hash Function-------------------------//
MD5(string, strlen(string), result);
//output
for(md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++)
fprintf(outfile,"%02x",result[md5]); //convert the hash to hex
fprintf(outfile,"n"); //newline for the output file
free(string); //free string
fclose(infile); // close file streams
fclose(outfile);
exit(EXIT_SUCCESS); //Program Ends
I can provide the execution time with different size of input if requested. Any help will be appreciated.
performance algorithm c openssl
edited Apr 18 at 3:44
Jamalâ¦
30.1k11114225
30.1k11114225
asked Apr 17 at 16:18
Alex Lo
587
587
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40
add a comment |Â
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
7
down vote
accepted
I'll start with a review of your current code before suggesting possible performance
improvements.
Various comments in your code do not add information and can be removed, for example
//------------------------------Main Function------------------------//
int main()
free(string); //free string
exit(EXIT_SUCCESS); //Program Ends
Always use curly braces
with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.
Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
or
for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...
Declaring md5
at the top does not increase the performance.
Some variable names can be improved (of course this is partially opinion-based):
char *string
is actually the current line.size_t len
is not the length of the current string,
but the capacity of the buffer (re)allocated bygetline()
.ssize_t read
is â strictly speaking â not the number of bytes read into the
buffer because it excludes the NUL character.int md5
is not an MD5 value but an index into the buffer containing the
MD5 hash.- I would rename
unsigned char result[MD5_DIGEST_LENGTH]
tomd5hash
.
Fix compiler warnings:
MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign
You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.
According to the C11 standard, the declaration of main
should be one of
int main(void) /* ... */
int main(int argc, char *argv) /* ... */
(see for example What should main() return in C and C++?
on Stack Overflow). The final return
statement can be omitted, there is an implicitreturn 0
.
Performance improvements
After reading a line from the input file, the string is traversed twice:
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);
// ...
First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline()
returns the number of characters written tostring
, i.e. a newline character can only be at position read - 1
:
while ((read = getline(&string, &len, infile)) != -1)
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;
MD5(string, read, result);
// ...
However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.
In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890
)
This immediately revealed that most of the time is spent in fprintf()
:
Possible reasons are:
- String formatting is slow.
- All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.
The solution is to:
- Write a custom function for converting the MD5 hash to a hex string.
- Call
printf
for the entire string instead of each single byte so reduce the number
of function calls.
On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.
Putting it together
With all those modifications, we have
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];
*dest = 0;
return buffer;
int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);
// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;
// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);
// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);
free(line);
// Close output files
fclose(infile);
fclose(outfile);
Further suggestions
The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.
Implement a
-h
option to show a short help/usage of the program.
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.
â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:crunch 7 7 1234567890
.
â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
 |Â
show 3 more comments
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
7
down vote
accepted
I'll start with a review of your current code before suggesting possible performance
improvements.
Various comments in your code do not add information and can be removed, for example
//------------------------------Main Function------------------------//
int main()
free(string); //free string
exit(EXIT_SUCCESS); //Program Ends
Always use curly braces
with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.
Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
or
for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...
Declaring md5
at the top does not increase the performance.
Some variable names can be improved (of course this is partially opinion-based):
char *string
is actually the current line.size_t len
is not the length of the current string,
but the capacity of the buffer (re)allocated bygetline()
.ssize_t read
is â strictly speaking â not the number of bytes read into the
buffer because it excludes the NUL character.int md5
is not an MD5 value but an index into the buffer containing the
MD5 hash.- I would rename
unsigned char result[MD5_DIGEST_LENGTH]
tomd5hash
.
Fix compiler warnings:
MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign
You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.
According to the C11 standard, the declaration of main
should be one of
int main(void) /* ... */
int main(int argc, char *argv) /* ... */
(see for example What should main() return in C and C++?
on Stack Overflow). The final return
statement can be omitted, there is an implicitreturn 0
.
Performance improvements
After reading a line from the input file, the string is traversed twice:
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);
// ...
First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline()
returns the number of characters written tostring
, i.e. a newline character can only be at position read - 1
:
while ((read = getline(&string, &len, infile)) != -1)
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;
MD5(string, read, result);
// ...
However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.
In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890
)
This immediately revealed that most of the time is spent in fprintf()
:
Possible reasons are:
- String formatting is slow.
- All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.
The solution is to:
- Write a custom function for converting the MD5 hash to a hex string.
- Call
printf
for the entire string instead of each single byte so reduce the number
of function calls.
On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.
Putting it together
With all those modifications, we have
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];
*dest = 0;
return buffer;
int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);
// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;
// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);
// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);
free(line);
// Close output files
fclose(infile);
fclose(outfile);
Further suggestions
The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.
Implement a
-h
option to show a short help/usage of the program.
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.
â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:crunch 7 7 1234567890
.
â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
 |Â
show 3 more comments
up vote
7
down vote
accepted
I'll start with a review of your current code before suggesting possible performance
improvements.
Various comments in your code do not add information and can be removed, for example
//------------------------------Main Function------------------------//
int main()
free(string); //free string
exit(EXIT_SUCCESS); //Program Ends
Always use curly braces
with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.
Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
or
for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...
Declaring md5
at the top does not increase the performance.
Some variable names can be improved (of course this is partially opinion-based):
char *string
is actually the current line.size_t len
is not the length of the current string,
but the capacity of the buffer (re)allocated bygetline()
.ssize_t read
is â strictly speaking â not the number of bytes read into the
buffer because it excludes the NUL character.int md5
is not an MD5 value but an index into the buffer containing the
MD5 hash.- I would rename
unsigned char result[MD5_DIGEST_LENGTH]
tomd5hash
.
Fix compiler warnings:
MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign
You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.
According to the C11 standard, the declaration of main
should be one of
int main(void) /* ... */
int main(int argc, char *argv) /* ... */
(see for example What should main() return in C and C++?
on Stack Overflow). The final return
statement can be omitted, there is an implicitreturn 0
.
Performance improvements
After reading a line from the input file, the string is traversed twice:
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);
// ...
First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline()
returns the number of characters written tostring
, i.e. a newline character can only be at position read - 1
:
while ((read = getline(&string, &len, infile)) != -1)
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;
MD5(string, read, result);
// ...
However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.
In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890
)
This immediately revealed that most of the time is spent in fprintf()
:
Possible reasons are:
- String formatting is slow.
- All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.
The solution is to:
- Write a custom function for converting the MD5 hash to a hex string.
- Call
printf
for the entire string instead of each single byte so reduce the number
of function calls.
On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.
Putting it together
With all those modifications, we have
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];
*dest = 0;
return buffer;
int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);
// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;
// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);
// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);
free(line);
// Close output files
fclose(infile);
fclose(outfile);
Further suggestions
The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.
Implement a
-h
option to show a short help/usage of the program.
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.
â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:crunch 7 7 1234567890
.
â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
 |Â
show 3 more comments
up vote
7
down vote
accepted
up vote
7
down vote
accepted
I'll start with a review of your current code before suggesting possible performance
improvements.
Various comments in your code do not add information and can be removed, for example
//------------------------------Main Function------------------------//
int main()
free(string); //free string
exit(EXIT_SUCCESS); //Program Ends
Always use curly braces
with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.
Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
or
for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...
Declaring md5
at the top does not increase the performance.
Some variable names can be improved (of course this is partially opinion-based):
char *string
is actually the current line.size_t len
is not the length of the current string,
but the capacity of the buffer (re)allocated bygetline()
.ssize_t read
is â strictly speaking â not the number of bytes read into the
buffer because it excludes the NUL character.int md5
is not an MD5 value but an index into the buffer containing the
MD5 hash.- I would rename
unsigned char result[MD5_DIGEST_LENGTH]
tomd5hash
.
Fix compiler warnings:
MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign
You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.
According to the C11 standard, the declaration of main
should be one of
int main(void) /* ... */
int main(int argc, char *argv) /* ... */
(see for example What should main() return in C and C++?
on Stack Overflow). The final return
statement can be omitted, there is an implicitreturn 0
.
Performance improvements
After reading a line from the input file, the string is traversed twice:
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);
// ...
First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline()
returns the number of characters written tostring
, i.e. a newline character can only be at position read - 1
:
while ((read = getline(&string, &len, infile)) != -1)
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;
MD5(string, read, result);
// ...
However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.
In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890
)
This immediately revealed that most of the time is spent in fprintf()
:
Possible reasons are:
- String formatting is slow.
- All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.
The solution is to:
- Write a custom function for converting the MD5 hash to a hex string.
- Call
printf
for the entire string instead of each single byte so reduce the number
of function calls.
On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.
Putting it together
With all those modifications, we have
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];
*dest = 0;
return buffer;
int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);
// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;
// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);
// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);
free(line);
// Close output files
fclose(infile);
fclose(outfile);
Further suggestions
The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.
Implement a
-h
option to show a short help/usage of the program.
I'll start with a review of your current code before suggesting possible performance
improvements.
Various comments in your code do not add information and can be removed, for example
//------------------------------Main Function------------------------//
int main()
free(string); //free string
exit(EXIT_SUCCESS); //Program Ends
Always use curly braces
with if-statements, even if the if or else part consists only
of a single statement. That helps to avoid errors if the code is edited later.
Declare variables at the narrowest scope where they are used, and not at the top
of the function. For example
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
exit(EXIT_FAILURE);
or
for (int md5 = 0; md5 < MD5_DIGEST_LENGTH; md5++) ...
Declaring md5
at the top does not increase the performance.
Some variable names can be improved (of course this is partially opinion-based):
char *string
is actually the current line.size_t len
is not the length of the current string,
but the capacity of the buffer (re)allocated bygetline()
.ssize_t read
is â strictly speaking â not the number of bytes read into the
buffer because it excludes the NUL character.int md5
is not an MD5 value but an index into the buffer containing the
MD5 hash.- I would rename
unsigned char result[MD5_DIGEST_LENGTH]
tomd5hash
.
Fix compiler warnings:
MD5(string, strlen(string), result);
// Passing 'char *' to parameter of type 'const unsigned char *' converts between pointers to integer types with different sign
You already use different exit codes to indicate success or failure of the program,
which is good. In addition, it is helpful to print some message (to standard error)
in the error case.
According to the C11 standard, the declaration of main
should be one of
int main(void) /* ... */
int main(int argc, char *argv) /* ... */
(see for example What should main() return in C and C++?
on Stack Overflow). The final return
statement can be omitted, there is an implicitreturn 0
.
Performance improvements
After reading a line from the input file, the string is traversed twice:
while ((read = getline(&string, &len, infile)) != -1)
string[strcspn(string, "n")] = 0; // Remove newline 'n'
MD5(string, strlen(string), result);
// ...
First to find a terminating newline character, and then again to determine the length.
This is not necessary because getline()
returns the number of characters written tostring
, i.e. a newline character can only be at position read - 1
:
while ((read = getline(&string, &len, infile)) != -1)
// Remove trailing newline character
if (read > 0 && string[read - 1] == 'n')
read -= 1;
string[read] = 0;
MD5(string, read, result);
// ...
However, the impact of this change depends on the line length, and I could not observe
a significant difference in my test.
In order to find further performance bottlenecks, I profiled the program now with
Xcode/Instruments (using the input file generated by ./crunch 7 7 1234567890
)
This immediately revealed that most of the time is spent in fprintf()
:
Possible reasons are:
- String formatting is slow.
- All stdio print operations are thread-safe, and therefore have to aquire and release
locks for each call.
The solution is to:
- Write a custom function for converting the MD5 hash to a hex string.
- Call
printf
for the entire string instead of each single byte so reduce the number
of function calls.
On my computer (a 1.2 GHz Intel Core m5 MacBook) this reduced the time for processing
the above file from 24.5 seconds to 4.4 seconds.
Putting it together
With all those modifications, we have
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
// Format the data as a hexadecimal string. The buffer must have
// space for `2 * length + 1` characters.
const char *hexString(unsigned char *data, size_t length, char *buffer)
const char *hexDigits = "0123456789abcdef";
char *dest = buffer;
for (size_t i = 0; i < length; i++)
*dest++ = hexDigits[data[i] >> 4];
*dest++ = hexDigits[data[i] & 0x0F];
*dest = 0;
return buffer;
int main(void)
FILE *infile = fopen("file.txt", "r");
if (infile == NULL)
perror("Cannot open input file");
exit(EXIT_FAILURE);
FILE *outfile = fopen("MD.txt","w");
if (outfile == NULL)
perror("Cannot open output file");
exit(EXIT_FAILURE);
// Read file line-by-line
char *line = NULL;
size_t linecap = 0;
ssize_t lineLength;
while ((lineLength = getline(&line, &linecap, infile)) != -1)
if (lineLength > 0 && line[lineLength - 1] == 'n')
// Remove newline character
lineLength -= 1;
line[lineLength] = 0;
// Compute MD5 hash
unsigned char md5hash[MD5_DIGEST_LENGTH];
MD5((unsigned char*)line, lineLength, md5hash);
// Print hash as hex string
char hexBuffer[2 * MD5_DIGEST_LENGTH + 1];
fputs(hexString(md5hash, MD5_DIGEST_LENGTH, hexBuffer), outfile);
fputc('n', outfile);
free(line);
// Close output files
fclose(infile);
fclose(outfile);
Further suggestions
The names of input and output file are compiled into the program, which makes it inflexible. Possible alternatives are
- Pass the file names as arguments on the command line, or
- Make your program read from standard input and write to standard output.
Implement a
-h
option to show a short help/usage of the program.
edited Apr 18 at 7:49
answered Apr 17 at 20:01
Martin R
14k12157
14k12157
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.
â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:crunch 7 7 1234567890
.
â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
 |Â
show 3 more comments
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.
â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:crunch 7 7 1234567890
.
â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:
gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.â Alex Lo
Apr 18 at 3:25
See the picture. MD5.py is the Python version. CNormal is my original code, and COptimized is the revised version you suggested. Although yours is slightly faster than mine, both of our codes are running slower than the Python one. Compiler command:
gcc filename.c -O3 -L /usr/lib -lssl -lcrypto
.â Alex Lo
Apr 18 at 3:25
Note: I generated the infile using crunch. Command:
crunch 7 7 1234567890
.â Alex Lo
Apr 18 at 3:26
Note: I generated the infile using crunch. Command:
crunch 7 7 1234567890
.â Alex Lo
Apr 18 at 3:26
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
Thanks for the adding the error handling part of the code and making the program C11 compliant.
â Alex Lo
Apr 18 at 3:32
1
1
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
@AlexLo: Both the Python program and the compiled C program run in approx 25 seconds on my MacBook. I'll do further investigations later.
â Martin R
Apr 18 at 5:59
1
1
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
@AlexLo: See update.
â Martin R
Apr 18 at 7:29
 |Â
show 3 more comments
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f192306%2foptimizing-md5-openssl-implementation-in-c-for-precomputed-hash%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Some information about the input file size and timing would be interesting. In my tests, the C program is faster than the Python program.
â Martin R
Apr 17 at 18:40