
A plain-language technical look at network disks: how resource blocking and "second transfer" (instant upload) are implemented, and the MD5 algorithm

Original link: https://www.itylq.com/netdisk-second-transmission-md5.html

Release date: 2022-09-28 Migration time: 2026-03-21

Network disk (cloud disk) products play a growing part in our daily work and life. From travel photos and casually shot short videos, to the Korean, Japanese, European and American dramas and blockbusters we follow, to the PPT documents and electronic materials we use at work, we have gradually become accustomed to uploading everything to a network disk and accessing it at any time. When we upload a common software installation package or a popular film or TV show, the upload often finishes in seconds even though the file is large. How is this achieved?

1 How network disks block resources and transfer them in seconds

In plain terms: a string of values can be computed from a file in a certain way, and that value is basically unique to the file (why only "basically"? That will be analyzed later). To make things easier to follow, we will call it the characteristic value for the time being.

Second transfer implementation method: When you upload a file, the network disk client first calculates the characteristic value of the file and then looks it up in the network disk's database. If the same characteristic value is already there, the file is treated as the same file, and it is "transferred" to your network disk in seconds. In fact, nothing new is stored: you are still using the copy that already exists on the network disk server.
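The lookup described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `ToyNetDisk`, its `upload` method, and the in-memory dictionaries are hypothetical stand-ins for a real server and database, and MD5 is used only because it is the algorithm this article discusses.

```python
import hashlib
import io


def file_fingerprint(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest (the "characteristic value") of a binary stream."""
    digest = hashlib.md5()
    # Read in chunks so that large files never have to fit in memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class ToyNetDisk:
    """Hypothetical in-memory stand-in for a network disk server."""

    def __init__(self):
        self.blobs = {}       # fingerprint -> stored bytes (one copy per unique file)
        self.user_files = {}  # (user, filename) -> fingerprint

    def upload(self, user, filename, stream):
        fingerprint = file_fingerprint(stream)
        already_stored = fingerprint in self.blobs
        if not already_stored:
            # Only a file the server has never seen is actually transferred.
            stream.seek(0)
            self.blobs[fingerprint] = stream.read()
        # Either way, the user's entry simply points at the shared copy.
        self.user_files[(user, filename)] = fingerprint
        return "instant" if already_stored else "uploaded"


disk = ToyNetDisk()
movie = b"popular movie bytes" * 1000
print(disk.upload("alice", "movie.mkv", io.BytesIO(movie)))      # uploaded
print(disk.upload("bob", "same-movie.mkv", io.BytesIO(movie)))   # instant
```

The second upload finishes "in seconds" because only the fingerprint is compared; the file name plays no part in the decision.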

Blocking implementation method: When a reviewer on the network disk backend decides that a certain file is illegal, the characteristic value of that file is "blacklisted". Whether the file is already in your network disk or you plan to fetch it through an offline download (for example, from a magnet link), it will be blocked. This is why an offline download sometimes still fails even after we change the sensitive words in the file name: the characteristic value does not change.
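A minimal sketch of this blacklist check, again in Python. The names `blocked_fingerprints`, `block`, and `check_upload` are hypothetical and invented for illustration; a real service would keep the blacklist in its backend rather than in a module-level set.

```python
import hashlib

# Hypothetical blacklist of characteristic values (hex MD5 digests).
blocked_fingerprints = set()


def fingerprint(data: bytes) -> str:
    """Characteristic value of the file content; the file name is not involved."""
    return hashlib.md5(data).hexdigest()


def block(data: bytes) -> None:
    """What a reviewer's "blacklist this file" action might boil down to."""
    blocked_fingerprints.add(fingerprint(data))


def check_upload(filename: str, data: bytes) -> str:
    # The decision depends only on the content's fingerprint.
    if fingerprint(data) in blocked_fingerprints:
        return "blocked"
    return "accepted"


banned = b"bytes of a file the reviewer rejected"
block(banned)

print(check_upload("sensitive-title.mp4", banned))    # blocked
print(check_upload("harmless-title.mp4", banned))     # blocked: renaming changes nothing
print(check_upload("harmless-title.mp4", banned + b"\x00"))  # accepted: content changed
```

Renaming leaves the bytes untouched, so the fingerprint and the verdict stay the same; only a change to the content itself produces a different value.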

2 MD5

Above we mentioned that the characteristic value is determined in a certain way. That way is the MD5 algorithm we are going to talk about today (in fact, SHA-256 and SHA-512 are more commonly used now, so I won't go into details here). I won't cover much textbook knowledge here, such as the internals of MD5 or how the algorithm is implemented, only the most critical point: **MD5 is an algorithm that takes input of variable length and produces a fixed-length 128-bit output.**

To put it bluntly: no matter how long the input is, the output length is fixed. Put another way, a 2 GB file and a 20 GB file are both just sequences of 0s and 1s, and the characteristic values that the MD5 algorithm calculates for them have the same length.
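The fixed output length is easy to check with Python's standard `hashlib` module. The helper name `md5_lengths` and the sample inputs are made up for this illustration.

```python
import hashlib


def md5_lengths(data: bytes):
    """Return (input length in bytes, digest length in bits, hex digest length)."""
    digest = hashlib.md5(data)
    return len(data), digest.digest_size * 8, len(digest.hexdigest())


# Inputs of very different sizes: empty, one byte, one million bytes.
for sample in (b"", b"a", b"a" * 1_000_000):
    print(md5_lengths(sample))
# Each line shows a different input length but always 128 bits / 32 hex characters.
```

The 128 bits are usually written as 32 hexadecimal characters, which is the form most download pages show next to a file.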

At this point someone will say: "The characteristic value is only so long, and there are so many files in the world. What if two different files have the same MD5 value?" Congratulations, you are already one step ahead! Of course this situation exists: there are only 2^128 possible MD5 values but an unlimited number of possible files, so some different files must share a value. For files that were not deliberately crafted, however, the probability is very small. The reason involves the algorithm itself. If you are interested, I suggest reading "Applied Cryptography: Protocols, Algorithms, and Source Code in C (2nd Edition)" (actually, I haven't read it carefully myself; there are cryptography experts on the forum who can explain it for everyone).
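To get a feel for "very small", here is a rough estimate based on the standard birthday-bound approximation. It assumes that MD5 values behave like uniformly random 128-bit numbers and that nobody is crafting files on purpose; the function name `accidental_collision_probability` is invented for this sketch.

```python
import math


def accidental_collision_probability(num_files: int, digest_bits: int = 128) -> float:
    """Approximate chance that at least two of num_files random files share a digest.

    Birthday-bound approximation: p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1)).
    Assumes digests are uniformly random, which ignores deliberate attacks.
    """
    exponent = -num_files * (num_files - 1) / 2 ** (digest_bits + 1)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is extremely close to 0.
    return -math.expm1(exponent)


for n in (10**6, 10**9, 10**12):
    print(n, accidental_collision_probability(n))
```

Even with a trillion stored files, the estimate is on the order of 10^-15, which is why an accidental clash is not a practical worry. Deliberately constructed collisions are a separate matter that this estimate does not cover.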

Back to the question just now: what happens if two different files have the same MD5 value? Imagine this scenario. You are getting married today, and the wedding video you shot is very precious. You save the video to the network disk, and it reports that the "second transfer" succeeded. A few years later you want to watch it again, so you open the network disk and click on the video, only to find that it is some other, unrelated video, such as the seven Calabash Brothers... Similarly, when you upload a perfectly legal and compliant file, you may be told "This file violates the regulations and has been blocked"...

3 Summary

The example given above is somewhat alarmist, and it is even possible that the network disk does not use MD5 at all. But the principle is the same. In practice, the probability that two different files accidentally have the same MD5 value is very small (although techniques now exist for deliberately constructing such files). Today MD5 is basically used for integrity checks, that is, for detecting whether a file was corrupted in transit, and stronger algorithms (such as SHA-512) have replaced it elsewhere. I believe that with the development of technology and the advancement of cryptography these problems will be solved, and maybe a perfect algorithm will appear in the near future to serve these products.

4 Extension

When it comes to distributing public, general-purpose files, the network disk's instant transfer is commendable because it is convenient and saves time; for files with confidentiality or privacy requirements, however, it is a big hidden danger. A little trick is to create a separate small txt file and then package it together with the files that need to be uploaded. The MD5 value of the compressed package will then almost certainly be different, which effectively reduces the probability of it being recognized as a resource that can be transferred instantly... Of course, it is better to upload as little important or sensitive data as possible. No system is 100% reliable!
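The packaging trick can be sketched with Python's standard `zipfile` module. The helper `pack_with_salt`, the member name `salt.txt`, and the use of random bytes as the content of the small txt file are all assumptions made for this illustration.

```python
import hashlib
import io
import os
import zipfile


def pack_with_salt(filename: str, data: bytes) -> bytes:
    """Zip the file together with a small random txt file and return the archive bytes."""
    salt = os.urandom(16).hex()  # random text, so every archive is different
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, data)
        archive.writestr("salt.txt", salt)
    return buffer.getvalue()


document = b"contents of a private document"
first = pack_with_salt("private.pdf", document)
second = pack_with_salt("private.pdf", document)

print(hashlib.md5(document).hexdigest())  # value of the bare file
print(hashlib.md5(first).hexdigest())     # value of the first archive
print(hashlib.md5(second).hexdigest())    # differs again, even for the same document
```

Because the small txt file differs each time, two archives of the same document end up with different characteristic values, so neither matches a copy that someone else has already uploaded.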


This article was moved from WordPress to MkDocs