HASH IT OUT!
This article will discuss the idea of data and forensic hashes and hashing.
One part will discuss the need for hash values when dealing with forensic and electronic evidence. The second part will discuss how to process hash values for your specific needs.
Disclaimer: The mention of any program, website or algorithm in no way should be taken as an endorsement of same. And in some cases, I may even point out a flaw or limit to its actions.
Before we start, let us agree on what a hash value is. I’m in no way a mathematician, so any description used will hopefully be in plain English and layman’s terms. That being said, let us examine some websites that attempt to define hash values, hashing, hash algorithms, etc.
Here are some sites I found which explain hash. There are many, both scientific and common definitions. So take these definitions with whatever grain of salt and rebuttal you wish. Many may seem redundant, but explain hash in their own way.
A Hash Value (also called as Hashes or Checksum) is a string value (of specific length), which is the result of calculation of a Hashing Algorithm. Hash Values have different uses. One of the main uses of Hash Values is to determine the Integrity of any Data (which can be a file, folder, email, attachments, downloads etc).
Hash values can be thought of as fingerprints for files. The contents of a file are processed through a cryptographic algorithm, and a unique numerical value – the hash value - is produced that identifies the contents of the file. If the contents are modified in any way, the value of the hash will also change significantly. Two algorithms are currently widely used to produce hash values: the MD5 and SHA1 algorithms.
Hash values are also useful for verifying the integrity of data sent through insecure channels. The hash value of received data can be compared to the hash value of data as it was sent to determine whether the data was altered.
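The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name hash_bytes and the sample messages are made up for the example, and SHA1 is used simply because it is one of the two algorithms mentioned above.

```python
import hashlib

def hash_bytes(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of data using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()

# Made-up sample data standing in for what was sent and what arrived.
sent = b"Quarterly report, final version."
received_ok = b"Quarterly report, final version."
received_bad = b"Quarterly report, final version!"  # one character changed

# Identical content always produces the identical hash value.
print(hash_bytes(sent) == hash_bytes(received_ok))   # True
# A single changed byte produces a different hash value.
print(hash_bytes(sent) == hash_bytes(received_bad))  # False
```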
Google: A hash value can be used to uniquely identify secret information. This requires that the hash function is collision-resistant, which means that it is very hard to find data that will generate the same hash value.
Wikipedia: A cryptographic hash function is a special class of hash function that has certain properties which make it suitable for use in cryptography. It is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (a hash) and is designed to be a one-way function, that is, a function which is infeasible to invert.
Cryptographic hash functions have many information-security applications, notably in digital signatures, message authentication codes (MACs), and other forms of authentication.
Now that you have a number of definitions to choose from, let us talk turkey or hash.
Evidence collection, preservation, and presentation in reports and court. Read on:
Everyone who deals with “digital evidence” should be aware that no matter what or when you obtain the evidence, your ultimate goal or expected end point is to present this evidence in court. So treat all electronic evidence as if it will end up as court evidence. If you don’t do it from the beginning, it will be hard to backtrack later.
Even when a company suspects an employee of wrongdoing involving their computer data, and the personnel department (yes, I said personnel department, I’m not politically correct) decides to secure the employee’s data, they should always expect that down the road this data may turn out to be forensic evidence in court, and it should be treated as such. If we assume the worst, and it becomes evidence in court, we must ensure that the original evidence is not tampered with or altered in any way.
As one of the above definitions advised, we must verify the integrity of the data. Any alteration by the examiner could lead to a plausible defense. So how do we ensure this “non-alteration” integrity and validity? We hash all the data/evidence collected by the examiners. Whether you encounter a network breach and capture network traffic to files, a theft of the company keys to the kingdom, improper email (pornography etc.), a virus, extortion, or anything else that could get a person fired or arrested, you should make sure the original evidence is not tampered with and that you can verify the integrity of the original data. The way to do this is to create a hash of every piece of evidence from the get-go.
The original data, any important intermediate product, and above all the final product that will be sent to attorneys or produced in court should all be hashed.
Hashing original evidence is justified and almost mandatory. But let the attorneys argue that one. What about any report provided to outsiders? Are you certain the recipient will not alter the content and present the alteration as original? Yes, you will say, the recipient has integrity. DUH!
Here is a simple solution I have used to help guarantee the integrity of your report/data:
1: Hash the report or appropriate data.
2: Take that value, put it in a file.
3: Then encrypt the file with a password only you have.
Send the evidence and the file of encrypted hash value(s) to the opposition. Let them play with the data. When it comes time later to validate the integrity of what they received, decrypt the encrypted hash value which you sent to them, and which they had in their possession during the entire process. Then compare your original hash value with the hash of the data they are working with. It is a simple three-step process to ensure the integrity of the original data, especially where “images” are used.
Now that you have found a way and set a process to verify and hopefully guarantee the integrity of the data you are working with, let’s talk about the actual hashing process and/or programs.
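The three steps above can be sketched roughly as follows. A loud caveat: Python’s standard library has no symmetric cipher, so this sketch swaps in a password-keyed HMAC in place of the encrypted file. The effect is similar (nobody without the password can produce a matching seal for altered data), but it is not the encryption step described above. The function names, the choice of SHA-256, and the iteration count are all assumptions made for the example.

```python
import hashlib
import hmac
import os
from typing import Optional

def file_sha256(path: str) -> str:
    """Step 1: hash the report, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def seal_hash(hash_hex: str, password: str, salt: Optional[bytes] = None) -> str:
    """Steps 2 and 3 (substituted): bind the hash value to a password.

    Returns "salt:tag", which can be written to a file and sent along
    with the evidence. This is an HMAC seal, not encryption.
    """
    salt = salt if salt is not None else os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    tag = hmac.new(key, hash_hex.encode("ascii"), hashlib.sha256).hexdigest()
    return salt.hex() + ":" + tag

def verify_seal(seal: str, hash_hex: str, password: str) -> bool:
    """Later: re-hash the data in question and check it against the seal."""
    salt_hex, _tag = seal.split(":")
    expected = seal_hash(hash_hex, password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(seal, expected)
```

At validation time you would run file_sha256 on the copy the other side has been working with and pass the result to verify_seal along with the password only you hold. If the data was altered, the check fails.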
HASHING PROCESS, VERIFICATION and CULLING
You will need software that will produce a valid hash of the data. As I see it, there are two main types of data that are important to hash. First is the entire physical device, most notably the hard drives being used by the subject of the investigation, though cell phones also come into this mix. There are a number of hardware and software procedures available to “image” the entire hard drive. Any device/software capable of doing this should also provide you with the hash value of the entire drive. This value lets you confirm that the image you are working with is exactly the data originally collected: what you see is what you got.
In some instances, you will decide to only copy or use specific files (e.g., virus programs, documents, images, emails, etc.). In these cases, find a reliable product that can hash individual data items. It’s your responsibility to determine that the program you use actually does what it advertises.
Don't always rely on replies from a listserv you belong to.
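To show what hashing individual data items involves, here is a minimal Python sketch that hashes every file under a folder. The function names hash_file and hash_tree are invented for this example, and it is no substitute for a vetted tool; it only illustrates the mechanics.

```python
import hashlib
import os

def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash one file in chunks so large files never need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(root, algorithm="md5"):
    """Return {relative path: hex digest} for every file under root."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            results[os.path.relpath(full, root)] = hash_file(full, algorithm)
    return results
```

Calling hash_tree on a folder of collected files gives you a listing you can save with the evidence and recompute later to show nothing has changed.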
Now, proprietary devices/software often compress the data and place headers, footers, etc. into the final image “file”, thus making it a little difficult to independently confirm the hash value of the evidence image. I personally, even though it takes up a lot more room, prefer when possible to create and produce what is called a dd image of the drive. Think about it. A dd image can be processed/looked at by almost any “forensic” process, not only its originator. This way, the hash could be independently confirmed and validated by any software program capable of calculating the hash of “raw” data. Proprietary packages are good in that they compress and make images manageable, but what happens 5 years down the road, when that company no longer supports the compressed image format you created 5 years before? Just my $.02.
Also, think about it. If you ask product ‘A’ to confirm its own calculation, isn’t that like asking the fox if he raided the hen house? What if the product had an internal flaw that no one (except the defense) knew about? He (product A) will always confirm his answer. Use products which produce images and values that can be independently verified/validated. (Another $.02.)
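Because a dd image is just the raw bytes of the drive, an independent check can be sketched in a few lines of Python. The function names here are hypothetical, and the value passed as "reported" stands for whatever hash your imaging tool printed. The sketch computes MD5 and SHA1 in a single pass over the image and compares the result to the tool’s figure.

```python
import hashlib

def hash_image(path, algorithms=("md5", "sha1"), chunk_size=4 * 1024 * 1024):
    """Hash a raw image once, feeding every chunk to each algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as image:
        while True:
            chunk = image.read(chunk_size)
            if not chunk:
                break
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}

def matches_reported(path, reported, algorithm="md5"):
    """Compare an independently computed hash with the tool-reported one."""
    computed = hash_image(path, algorithms=(algorithm,))[algorithm]
    return computed == reported.strip().lower()
```

The point is not this particular script but that a second, unrelated program arrives at the same value as the imaging product.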
Once we have the image hashed, then it comes to hashing individual files. You may wish to hash files to confirm they are inconsequential (or important) to your investigation.
One way to do this is to obtain the NIST, NSRL (National Software Reference Library) data set.
NIST NSRL software library
The NIST data set contains over 125 million hash values of “programs” which it considers “known” entities. “The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, i.e. steganography tools and hacking scripts. There are no hash values of illicit data, i.e. child abuse images.” Notice NIST didn’t say good, bad, or ugly. It is up to the user to determine the file’s provenance and usefulness in the investigation.
Your analysis, whether it be through a forensic software suite (which does everything but cook dinner) or individual packages that calculate file hashes, can use the NSRL data set to “hopefully” eliminate non-essential files, or identify important ones. Find and learn how to use appropriate software to calculate, confirm, massage and work with the hash data that is generated.
The problem with the NIST format is that it contains a lot of information in a csv format which may not be of any use to most people, and which is hard to manipulate because of its size. Also:
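The culling idea can be sketched as a simple set lookup. This is a toy illustration, not how any particular forensic suite does it: the function name cull, the file paths, and the tiny “known” list are all made up for the example.

```python
def cull(file_hashes, known_hashes):
    """Split {path: hash} into files found in the known set and the rest."""
    known = {h.strip().lower() for h in known_hashes if h.strip()}
    matched, unmatched = {}, {}
    for path, digest in file_hashes.items():
        target = matched if digest.strip().lower() in known else unmatched
        target[path] = digest
    return matched, unmatched

# Hypothetical hashes from a case, and a hypothetical reference list.
case_files = {
    "windows/notepad.exe": "D41D8CD98F00B204E9800998ECF8427E",
    "users/bob/plan.doc": "900150983cd24fb0d6963f7d28e17f72",
}
reference_list = ["d41d8cd98f00b204e9800998ecf8427e"]

known_files, files_to_review = cull(case_files, reference_list)
print(sorted(known_files))      # ['windows/notepad.exe']
print(sorted(files_to_review))  # ['users/bob/plan.doc']
```

Remember that a match against the NSRL only means “known”, not “good”. Whether the matched pile is eliminated or examined more closely is still your call.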
The problem with the forensic suites is that sometimes they require a specific format for the hash values you provide as reference. For instance, one suite (which may have changed its requirement since I last had a license) required a header line of “MD5” if you supplied a list of MD5 values. It just seems that a program with the smarts to analyze hash values doesn’t need the explicit MD5 as a column header. Again, my $.02. Another product, I recently heard, requires yet another specific format before a list can be ingested into its analysis. On my soapbox again: these packages are so good, why require any special format? A list of MD5 hashes is just that, a simple data list. (Now I’m up to $.04.)
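Reducing a big csv to a plain list, with or without the header line a given suite wants, can be sketched like this. It assumes the csv has a header row with a column literally named MD5; that column name is an assumption for this example, so check the file you actually download. The function name extract_md5_list is also made up.

```python
import csv

def extract_md5_list(csv_path, out_path, column="MD5", header=None):
    """Write the unique MD5 values from a csv file, one per line.

    If header is given (for example "MD5"), it is written as the first
    line for tools that insist on one. Returns the count of unique values.
    """
    unique = set()
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as src:
        for row in csv.DictReader(src):
            value = (row.get(column) or "").strip().upper()
            if len(value) == 32:  # an MD5 is 32 hex characters
                unique.add(value)
    with open(out_path, "w", encoding="ascii", newline="\n") as dst:
        if header:
            dst.write(header + "\n")
        for value in sorted(unique):
            dst.write(value + "\n")
    return len(unique)
```

Note that holding tens of millions of values in an in-memory set takes several gigabytes, so for the full data set a sort-and-dedupe on disk may be the more practical route.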
So, when dealing with hashes, whether to confirm importance or irrelevance, find a process that works for you, and one you have thoroughly vetted, tested and can testify to. That being said, I have tested about 10 stand-alone hashing programs which are very well known and respected, and I hate to say that about 60% of them fail on differing aspects of their operation. I’m not pointing fingers here, just giving enough information to make you take notice and test all your software. Otherwise you may use a package that is not fully operational, depend on its output for your report, and face a lot of hard questions on the witness stand, such as “Did you fully test and vet the piece of software you are using?” Push the envelope, or test the ends of the bell curve.
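One modest way to start that kind of testing is to feed a tool inputs whose MD5 values are published (the vectors below come from the MD5 specification, RFC 1321), including the empty input, which sits at one end of the bell curve. In this sketch the “tool” is any Python function that takes bytes and returns a hex string; wrapping your real program in such a function is left to you, and broken_md5 is a deliberately faulty stand-in invented for the demonstration.

```python
import hashlib

# Published MD5 test vectors from RFC 1321.
MD5_VECTORS = {
    b"": "d41d8cd98f00b204e9800998ecf8427e",
    b"a": "0cc175b9c0f1b6a831c399e269772661",
    b"abc": "900150983cd24fb0d6963f7d28e17f72",
    b"message digest": "f96b697d7cb7938d525a2f31aaf161d0",
}

def failed_vectors(hasher, vectors=MD5_VECTORS):
    """Return (input, expected, actual) for every vector the hasher gets wrong."""
    failures = []
    for data, expected in vectors.items():
        actual = hasher(data).strip().lower()
        if actual != expected:
            failures.append((data, expected, actual))
    return failures

def reference_md5(data):
    return hashlib.md5(data).hexdigest()

def broken_md5(data):
    # Hypothetical faulty tool: mishandles empty input.
    return hashlib.md5(data or b"\n").hexdigest()

print(len(failed_vectors(reference_md5)))  # 0
print(len(failed_vectors(broken_md5)))     # 1
```

A real test plan would go further (very large files, odd file names, locked files and so on), but even a handful of known answers will catch some failures.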
For those of you who wish to obtain a clean subset of the NIST data, check out my website -
SAMPLE NSRL DATA
My site covers my own hashing programs and the culled NIST data sets, which are clean, fixed-length records, compatible with any piece of software worth its salt. (The 125 million records cull down to about 42 million unique items. Think a spreadsheet can handle that?) Also, as an aside, for the 125 million total and 42 million unique, I DID NOT FIND any collisions.
For questions or answers (no flames please) regarding the hashing software or the NIST data records on my site, email firstname.lastname@example.org.
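For readers wondering what a collision check might look like, here is one possible reading, sketched in Python: treat it as a collision when two records share an MD5 value but carry different SHA1 values. This is an assumption about the method, not a description of how the check above was actually done, and the function name and the short sample values are placeholders, not real hashes.

```python
from collections import defaultdict

def find_md5_collisions(records):
    """Return {md5: [sha1, ...]} for any MD5 seen with more than one SHA1.

    records is an iterable of (sha1, md5) pairs. Exact duplicate records
    share both values and are therefore not reported.
    """
    sha1_by_md5 = defaultdict(set)
    for sha1, md5 in records:
        sha1_by_md5[md5.strip().lower()].add(sha1.strip().lower())
    return {md5: sorted(s) for md5, s in sha1_by_md5.items() if len(s) > 1}

# Placeholder values only, shortened for readability.
sample = [
    ("aaa1", "m1"),
    ("AAA1", "M1"),  # same record repeated in different case: not a collision
    ("bbb2", "m2"),
]
print(find_md5_collisions(sample))  # {}
```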
Author - Dan Mares. Dan is a respected Friend and very knowledgeable digital forensic investigator.
Dan Mares founded Mares and Company, LLC in 1998 after retiring from a
27-year career as a federal law enforcement agent. During that
time he became interested in and obtained training in computer science. He
began developing software programs designed to analyze large amounts of
data retrieved from mainframes. Those programs were the
precursors to the current Maresware data analysis software. Around 1986, he
began working in the area of what is now termed 'computer forensics.' In
the search for tools more suitable to the specific needs of computer
forensic investigations, he began developing software that was later called
Maresware computer forensic software.
While serving as a federal agent, Dan assisted in the development of the
Seized Computer Evidence Recovery Specialist (SCERS) course at the Federal
Law Enforcement Training Center in Glynco, Georgia. He also served
as a guest SCERS instructor. He also assisted in the development and teaching of
the Basic Data Recovery and Advanced Data Recovery classes at the National
White Collar Crime Center.
A few of the organizations he has appeared before as guest speaker include:
International Association of Computer Investigative Specialists (IACIS); University of Texas, Austin; Kennesaw State College; U.S. Secret Service; FBI Academy in Quantico, Va.; High Technology Crime Investigation Association (HTCIA); and Norwegian National Police Academy.