Arman Gungor's Blog Litigation Support and Technology

16 Dec 09

IPRO eCapture de-Duplication

Many of you are familiar with eCapture as an ESI processing tool. It can be frustrating that you have to run a data extract or processing job on discovered material before you can identify duplicates. What if you need to run a quick report prior to processing? If you are not de-duplicating compound documents (i.e., you are not maintaining compound document structure), this is fairly easy: go to the Items table in your eCapture database and de-duplicate the documents based on the MD5Hash column.
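As a rough illustration of document-level de-duplication, here is a minimal Python sketch. It assumes you have already pulled (ItemID, MD5Hash) pairs out of the Items table by whatever means you prefer; the function name `duplicate_groups` and the sample rows are hypothetical, and normalizing the hash to lowercase is my own precaution rather than anything eCapture requires.

```python
from collections import defaultdict


def duplicate_groups(rows):
    """Group item IDs by MD5 hash and keep only hashes seen more than once.

    rows: iterable of (item_id, md5_hash) pairs, e.g. exported from the
    Items table (the export step itself is not shown here).
    """
    groups = defaultdict(list)
    for item_id, md5_hash in rows:
        # Normalize so "ABC..." and "abc..." count as the same hash.
        groups[md5_hash.strip().lower()].append(item_id)
    return {h: sorted(ids) for h, ids in groups.items() if len(ids) > 1}


# Made-up sample rows for illustration only.
sample_rows = [
    (1, "900150983CD24FB0D6963F7D28E17F72"),
    (2, "0cc175b9c0f1b6a831c399e269772661"),
    (3, "900150983cd24fb0d6963f7d28e17f72"),
]

for md5_hash, item_ids in duplicate_groups(sample_rows).items():
    print(md5_hash, item_ids)
```

In each group, the lowest ItemID can be treated as the original and the rest as duplicates, although which copy you keep is a project decision.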

However, if you are looking to de-duplicate at the attachment family level, you will find that the FamilyHash column is not populated until a data extract or processing job is run. This is still not a big deal, as you can create family-level hashes outside of eCapture and run your report. If you need to de-duplicate against a previous job, though, you will have to make sure that your family hashes are calculated exactly the same way eCapture calculates them. As of version 4, eCapture calculates family hashes by hashing each document individually, concatenating the hash values in an attachment family in ItemID order, and hashing the resulting string.

For example, if your attachment family consists of files F1, F2 and F3 (in ItemID order) with MD5 hashes H1, H2 and H3, and md5() is an MD5 hash function, the family hash value will be md5(H1&H2&H3), where & denotes string concatenation. Once you establish a workflow to do this efficiently, I would highly recommend running your own de-duplication outside of eCapture on a previous project and verifying the results against the family hashes eCapture produced.
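The calculation described above can be sketched in a few lines of Python. Treat this as a sketch under assumptions, not a confirmed reproduction of eCapture's internals: the function name `family_hash` is hypothetical, and I am assuming the hashes are concatenated as hex strings with no separator, in the letter case you pass in, and encoded as ASCII before the final hash. If the result does not match the FamilyHash of an already processed job, letter case and text encoding are the first things I would check.

```python
import hashlib


def family_hash(items):
    """Compute a family-level hash from per-item MD5 hashes.

    items: iterable of (item_id, md5_hex) pairs for ONE attachment family.
    Assumptions (not confirmed by eCapture documentation): hex strings are
    joined without a separator and encoded as ASCII before hashing.
    """
    # Order the family members by ItemID, then keep only the hash strings.
    ordered = [md5_hex for _, md5_hex in sorted(items, key=lambda p: p[0])]
    joined = "".join(ordered)  # H1 & H2 & H3
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Hypothetical family: a parent e-mail (ItemID 10) and two attachments.
family = [
    (12, hashlib.md5(b"attachment two").hexdigest()),
    (10, hashlib.md5(b"parent e-mail").hexdigest()),
    (11, hashlib.md5(b"attachment one").hexdigest()),
]
print(family_hash(family))
```

Because the order is taken from ItemID, the same three files loaded in a different order would produce a different family hash, which is worth keeping in mind when comparing across jobs.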

Another important point to consider

When eCapture calculates family hashes, it combines the hashes of every item in the attachment family. This includes extracted embedded documents if the option is selected. This has two consequences worth considering:

1- If you are de-duplicating against a previous job where embedded document extraction options were set differently (so that the jobs have a different number of extracted embedded items), eCapture will produce different family hashes for the two attachment families. This prevents the same group of original native documents from being de-duplicated, simply because it was handled differently during the two processing sessions.

2- I have also run into cases where extracting embedded items from the same file results in extracted items that look identical but have different MD5 hashes. This will also prevent two identical e-mail families from being properly de-duplicated against each other.
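When two families that should match do not, it helps to know which of the two situations above you are in. The following is a small diagnostic sketch; the function name `compare_families` and its return format are hypothetical. It assumes you have each family's item hashes as a list in ItemID order, one list per job.

```python
def compare_families(hashes_a, hashes_b):
    """Explain why two attachment families would get different family hashes.

    hashes_a, hashes_b: lists of per-item MD5 hex strings in ItemID order,
    one list per processing job. Returns (status, differing_positions).
    """
    a = [h.lower() for h in hashes_a]
    b = [h.lower() for h in hashes_b]
    if a == b:
        return ("match", [])
    if len(a) != len(b):
        # Case 1: e.g. embedded item extraction was set differently.
        return ("item count differs", [])
    # Case 2: same number of items, but some item hashes are not equal.
    positions = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return ("item hashes differ", positions)


# Hypothetical example: job B extracted one extra embedded item.
job_a = ["aaa", "bbb"]
job_b = ["aaa", "bbb", "ccc"]
print(compare_families(job_a, job_b))
```

A result of "item hashes differ" that points only at extracted embedded items, with the parent and regular attachments matching, would be consistent with the second case described above.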
