Java MD5

Good morning. I woke up around 04:30 AM. Read for a while and then decided to try the Java MD5 challenge from HackerRank.

MD5 is a message-digest algorithm used to produce the same 16-byte string when using the same string. The input string is typically converted to bytes using a predefined format. This is done because we want the same digest to be produced when we input the same string.

This mechanism is used to determine if a document has been altered. It can also server to determine to determine if two strings are the same not knowing the actual strings. This last use is not foolproof. In 2004 Chinese researchers given a string used to create an MD5 digest, were able to create a completely different string (document) that would generate the same MD5 digest. That was one of the reasons today there a longer and more complicated mechanisms to produce different digest. In my Java code you can see some listed in the definition of DIGEST_ALGORITH.

I have worked for a long time with digests. I architected them into a storage server to verify that the stored and retrieved copies that a user would retrieve matched. At the time the Chinese researchers had not demonstrated an approach on how to generate the same MD5 digest using two different streams.

Allow me to explain what I did. I called Dr. Donald Rivest and discussed the issue of two different streams of bytes producing the same MD5 digest. He agreed that it was possible but at the time, it was highly unlikely. So I generated a new 32-byte string to identify documents in the storage server that had nothing to do with the MD5 digest. In the metadata for the document I stored the MD5, the run-length encoding and the size. That worked well at the time and it still does today.

After the storage product was released to the market, a Fortune 500 company came out with a competing product. The key feature was called deduplication. What they did was store documents as they were stored and when the same MD5 was computed, they would delete it and refer to the stored document for which the MD5 digest matched. That appeared to vendor to be a great idea. When a company distributes internal documents, different employees might save it to the storage server. On a company with thousands of employees a considerable amount of disk space could be saved.

A few years go by and different instances of their product would retrieve the wrong document. By chance, when the documents are in the millions or billions, MD5 digest may collide for two or more totally different documents.

The storage server I architected and designed, from the start kept additional information (i.e., MD5, size and run-length encoding) so if two documents would have the same MD5 digest were encountered, the software would check their size and run-length encodings. The issue has never occurred and chances are it never will. I consider the approach I took to be elegant and simple, something the developers of the Fortune 500 company where not able to realize or prevent.

Today there are new digests with more bytes. That reduces the chance of collisions. The issue is that MD5 is a cryptographic digest and governments have computer power and intellect to apply on problems of no interest to companies and academia.

The challenge provides two sample strings. In this case one string would suffice for testing. There are MD5 digest generators on the web that you can use to check an MD5 digest. For example, MD5 Hash Generator returns the following:

Enter your text below:

HelloWorld

MD5 Hash of your string:

68E109F0F40CA72A15E05CC22786F8E6

Our Java program, written using the Eclipse IDE, generates the following MD5 digests for the two sample strings:

HelloWorld
digest.length: 16
68e109f0f40ca72a15e05cc22786f8e6

Javarmi123
digest.length: 16
2da2d1e0ce7b4951a858ed2d547ef485

In this challenge we are using the MD5 digest, but Java supports several other standards. You can see this in the following code:

	public static void main(String[] args) {

		// **** select a digest algorithm ****
		final String DIGEST_ALGORITHM	= "MD5";
//		final String DIGEST_ALGORITHM	= "SHA-1";
//		final String DIGEST_ALGORITHM	= "SHA-256";	
//		final String DIGEST_ALGORITHM	= "SHA-512";

		// **** open scanner ****
		Scanner sc = new Scanner(System.in);
		
		// **** read string ****
		String s = sc.nextLine();
		
//		System.out.println("s ==>" + s + "<==");
		
		// **** convert string to byte array ****
		byte[] sb = null;
		try {
			sb = s.getBytes("UTF-8");
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
		
		// **** instantiate the message digest ****
		MessageDigest md = null;
		try {
			md = MessageDigest.getInstance(DIGEST_ALGORITHM);
		} catch (NoSuchAlgorithmException e) {
			e.printStackTrace();
		}
		
		// **** generate the digest ****
		byte[] digest = md.digest(sb);
		
		System.out.println("digest.length: " + digest.length);
		
		// **** display the digest in hex ****
		System.out.println(byteArrayToHex(digest));

		// **** close scanner ****
		sc.close();
	}

I selected MD5 but, with the same set of steps you can use other algorithms and get different and longer digests.

If you are interest in my complete solution, you can find it in my GitHub repository.

Hope you found this post interesting, not only for the code but for the insights in architecting and designing software. I am a firm believer in thinking before writing code. Once you have an idea, you need it to write code and verify it. I have a hard time with the approach of developing a solution on paper and then writing the code for it. We all know that does not work. Think about the waterfall approach.

OK, enough is enough. Keep on reading and experimenting. Stand on the shoulders of giants when possible. In the above example, I did not think MD5 or for that matter no other digest algorithm that I know of, would work for de duplication. Discussed it with Dr. Rivest and came with a very elegant and simple approach to detect collisions and avoid deleting documents.

If you have comments or questions regarding this or any other post in this blog, or if you need my help on a software development project, please leave me a note below. Requests will not be made public.

Regards;

John

Follow me on Twitter:  @john_canessa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.