Hash: what it is and why it is needed

Brother · Nov 20, 2021

Whatever cryptographic calculations you do, you will hardly manage without hashing. Electronic signatures, data integrity checks, comparing two texts, and so on: hashing is used everywhere. And, of course, a blockchain without hashes is simply unthinkable.

Today I will try to explain in simple terms what it is and why it is needed.

With this post I am opening a mini-series, "cryptography in simple terms". I would like to understand whether my readers need it, so please actively use likes / dislikes to give me feedback.

A hash is also called a convolution (in the sense of "folding") or a digest, and this describes the essence of the mathematical operation very accurately.

We all know that any information in digital form is a sequence of numbers. A block of information can be very large: it can be the text of "War and Peace", or even the whole of "Santa Barbara". Or it can be very short. The modern world holds a huge number of such blocks, and we often need some simple way to uniquely identify each one. Or, as people sometimes say, to get a "fingerprint" of the text.

In this respect a hash is very similar to a person's fingerprint. A fingerprint does not contain all the information about a person, but it is enough to identify that person uniquely. A hash function does something similar for blocks of information.

We take the huge novel "War and Peace" and get from it a short, thirty-two-byte (256-bit) hash. And this hash (the result of calculating the hash function) identifies the text of "War and Peace", for all practical purposes uniquely.

Let's say someone has changed this text, for example by changing one letter somewhere in the middle of the novel. We compute the hash again from the new, modified text. The resulting value will be very different from the previous one. That is, the new text gives a different hash, even if only one letter has changed!
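To see this with your own hands, here is a minimal sketch in Python. It assumes SHA-256 as the 256-bit hash function (the post does not name a specific one), and the sample text is just a made-up stand-in for a long novel.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a text as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# A stand-in for a long text (hypothetical sample, not the real novel).
original = "The quick brown fox jumps over the lazy dog. " * 1000

# Change a single letter somewhere in the middle.
middle = len(original) // 2
modified = original[:middle] + "X" + original[middle + 1:]

hash_original = sha256_hex(original)
hash_modified = sha256_hex(modified)

print(len(original), "characters:", hash_original)
print(len(modified), "characters:", hash_modified)
print("hashes differ:", hash_original != hash_modified)
```

Both hashes have the same fixed length (32 bytes, printed as 64 hex characters), no matter how long the input text is.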

That is why this operation is also called convolution, or folding. It folds a large block of information down to a fixed size. And, of course, you won't be able to unfold it back: it is not possible to recover a multi-megabyte source text from a thirty-two-byte value.

Actually, this is the main purpose of a hash: to get a practically unique numerical characteristic of a block of information. This characteristic has a fixed size and does not depend on the size of the original block. This is what the hash was originally invented for.

How do we compare two different texts now? We just calculate the hash of each. If the hashes are different, then the source texts are different too, even if the difference is a single character in a huge novel. If the hashes match, the texts are almost certainly identical. By the way, programmers do this from time to time: instead of comparing huge texts character by character, they simply compare the hashes calculated from them. Hashes are also the basis of the hash table, which is used to store data as key-value pairs.

Sometimes on websites you can see interesting labels like "MD5 checksum". The meaning is this: you download a file, and a doubt arises: has someone replaced the file? You can check this in different ways, and one of them is to calculate a checksum. A checksum, by the way, is a close relative of a hash, and the idea is almost the same: get a small numerical characteristic from a large text. And MD5 is the name of a specific hashing algorithm.

So, you download the file, calculate its MD5 hash (there are utilities for this) and compare it with what is written on the site. Does it match? Perfect! No? Oops... Either the file was not downloaded completely, or someone changed it. By the way, antiviruses keep a database that holds a hash for each system file. If a system file on the user's computer has a different hash, the file is damaged or infected. That is why an antivirus sometimes says "checksums of system files do not match" or "file hashes do not match".
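The check described above can be sketched in a few lines of Python. The function names md5_of_file and matches_published are my own, made up for illustration; only hashlib.md5 comes from the standard library.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hash of a file, reading it in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read up to chunk_size bytes at a time until the file ends.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published_hex: str) -> bool:
    """Compare the computed hash with the value published on the site."""
    return md5_of_file(path) == published_hex.strip().lower()
```

You would call matches_published with the path of the downloaded file and the hex string copied from the download page. A result of False means the file on disk is not the one the site describes.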

How is a hash calculated? You can imagine the simplest algorithm like this. We take the original block of numbers and simply multiply them together one after another. When the number becomes too large (exceeds the size of the required hash), we cut off the excess. Then we keep multiplying. Too large again? Cut off again. As a result, you get some number, which will be the hash. If one value is changed in the original byte sequence (in the original text), the product will most likely change, and (most likely) our truncated value will change too. This is an example of a primitive hash function. In practice, no one uses such an algorithm, because it has a number of weaknesses. For example, if you just swap two adjacent symbols, the product will not change. But the text has changed!
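Here is a rough sketch of that primitive algorithm in Python. The name primitive_hash and the 32-bit size are my own assumptions, chosen only to keep the numbers small. It is a toy to show the weakness, not something to use.

```python
def primitive_hash(data: bytes, bits: int = 32) -> int:
    """Toy hash: multiply the byte values together, cutting off the excess."""
    mask = (1 << bits) - 1       # keeps only the lowest `bits` bits
    result = 1
    for value in data:
        result = result * value  # multiply by the next number
        result = result & mask   # too large? cut off the excess
    return result

print(primitive_hash(b"hash"))
print(primitive_hash(b"hush"))   # one letter changed: a different value
print(primitive_hash(b"hsah"))   # two adjacent letters swapped: same as "hash"
```

Multiplication does not care about the order of its factors, so any rearrangement of the same symbols gives the same result.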

Therefore, real hashing algorithms involve many additional steps: mixing bits, multiplying by different coefficients, and so on. All of this is done so that even the slightest change in the source text changes the hash value dramatically. Only in cryptography they don't say "dramatically", they call it the "avalanche effect".
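The avalanche can be measured directly. This small sketch assumes SHA-256 as the real algorithm (my choice for illustration) and counts how many of the 256 output bits flip when one letter of the input changes. The helper name differing_bits is made up.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the SHA-256 hashes of a and b differ."""
    x = int.from_bytes(hashlib.sha256(a).digest(), "big")
    y = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(x ^ y).count("1")  # XOR leaves a 1 wherever the bits differ

changed = differing_bits(b"War and Peace", b"War and Peaca")
print(f"{changed} of 256 bits changed")
```

For a good hash function you would expect roughly half of the bits to change, whichever letter you touch.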

Generally speaking, it is possible for two different texts to give the same hash. This situation is called a collision. When developing algorithms, designers strive to make the probability of this extremely small. But from time to time vulnerabilities (ways to construct a collision) are found in old algorithms, and so newer hashing algorithms keep being developed. The well-known ones alone number several dozen.

To imagine how it is possible at all to assign a unique numerical characteristic to each text, I once came up with the following picture for myself. Let's say every text ever written by a person has a serial number. Let this serial number be the unique characteristic (hash) of the text. If the hash is 256 bits long (common algorithms produce hashes from 128 to 512 bits), then the supply of such numbers is astronomically large: two to the power of 256 is about 10 to the power of 77, which is within a few orders of magnitude of the roughly 10 to the power of 80 electrons estimated to exist in the universe. It is clear that humanity is not even close to producing that many texts, so each of them can get its own separate number out of this huge supply.
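Python works with arbitrarily large integers, so the two numbers from this picture can simply be printed and compared. The figure of 10 to the power of 80 is the rough estimate quoted above, not something the code derives.

```python
hash_values = 2 ** 256   # how many different 256-bit hashes exist
electrons = 10 ** 80     # rough estimate quoted in the text

print(hash_values)
print("2^256 has", len(str(hash_values)), "decimal digits")
print("10^80 has", len(str(electrons)), "decimal digits")
```

Both numbers are far beyond the count of texts that people could ever write.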

But as you can imagine, you cannot get the original text from its serial number. The same is true of a hash: information is discarded when the hash is calculated. Therefore, knowing the hash, it is impossible to recover the source text. This property of hash functions is very important and very widely used. For example, a properly built system does not store passwords themselves, only their hashes. The user enters a password, its hash is calculated, and if it matches the one in the database, great, the password is correct.
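A minimal sketch of such a password check in Python follows. Note two things that go beyond the description above and are my own additions: a random salt is stored next to each hash, and the hash is computed with PBKDF2 (many repeated rounds of SHA-256) instead of a single pass, which is the usual practice for passwords. The function names and the iteration count are assumptions for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch

def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte hash from the password and a per-user salt."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)

def register(password: str) -> tuple[bytes, bytes]:
    """Return (salt, hash): this pair is what the system stores."""
    salt = os.urandom(16)
    return salt, hash_password(password, salt)

def check(password: str, salt: bytes, stored: bytes) -> bool:
    """Hash the entered password again and compare with the stored hash."""
    return hmac.compare_digest(hash_password(password, salt), stored)

salt, stored = register("correct horse")
print(check("correct horse", salt, stored))  # the right password
print(check("wrong horse", salt, stored))    # a wrong password
```

The password itself is never written to the database. Only the salt and the hash are kept, and the password cannot be read back from them.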

And, of course, thanks to this property of irreversibility, hashes are very widely used in blockchains. What matters there is exactly this: it is practically impossible to work out an input value knowing only the requirements that its hash must satisfy. I plan to talk about this in a little more detail in one of the following posts.
