Content addressing is a different way to organize and store files online and in distributed systems. Here, we dive into what content addressing is, how it works, and some of its benefits and challenges.
What is Content Addressing?
Location-based addressing, what we're all used to, uses soft links (URLs) to point to where the file is located.
Content addressing, on the other hand, uses cryptographic hashing to derive a unique identifier, called a content identifier (CID), from the data itself. The CID acts as a hard link that points to a specific version of a file.
Say your friend recommends a book to you. They tell you the book is in the local library in the third row on the right, fifth bookcase, second shelf, third book from the left. So you follow their directions, hoping to find the book. The book might be there, or another book might have replaced it. On the other hand, if your friend gave you the ISBN, you could use that to find the correct book. The first example is how location addressing works today, and the second is how content addressing works.
How Content Addressing Works
Let's dive a bit deeper.
The first step is to generate a content identifier. As previously mentioned, CIDs are hashes. To generate a hash for a piece of data, we use a hash function.
Hash functions take an input (for example, a file in binary format) and produce a fixed-size string of characters, typically a sequence of numbers and letters. The output is called the "hash value" or "hash code." The same input always produces the same output, no matter how large the input is.
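As a minimal sketch of the idea, the snippet below hashes two inputs of very different sizes with SHA-256 from Python's standard library. The helper name `sha256_hex` is made up for this example, and SHA-256 is only one possible choice of hash function.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Two inputs of very different sizes.
short_digest = sha256_hex(b"hello")
long_digest = sha256_hex(b"hello" * 100_000)

# Both digests have the same fixed length (64 hex characters for SHA-256),
# and hashing the same input again gives the same digest.
print(len(short_digest), len(long_digest))
print(sha256_hex(b"hello") == short_digest)
```

Whatever the input size, the output length stays the same, which is what makes a hash practical as an identifier.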
Next, we want to associate the content identifier with the content. This involves choosing a storage medium and linking the content identifier to the stored content. If we choose to store a file on IPFS, for example, there are four steps:
- Chunking: IPFS breaks down the file into smaller fixed-size chunks.
- Content Addressing: IPFS calculates the cryptographic hash of each chunk using a secure hash function. This hash value uniquely represents the content of that specific chunk.
- Merkle DAG: IPFS constructs a Merkle DAG (directed acyclic graph) that organizes the chunks and their relationships. Each chunk is represented as a node in the graph, and the edges between nodes represent the relationships between chunks.
- Root CID: The root of the Merkle DAG is computed by hashing the hash values of its children, which produces a single hash value. This hash value is the final Content Identifier (CID) for the file.
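The four steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how IPFS actually encodes data: real CIDs carry extra metadata and IPFS uses its own node formats and much larger chunks. The names `split_into_chunks`, `hash_hex`, and `root_id`, the 4-byte chunk size, and the single-level tree are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, chosen so the demo is easy to follow


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Step 1: break the data into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hash_hex(data: bytes) -> str:
    """Step 2: compute the cryptographic hash of a piece of data."""
    return hashlib.sha256(data).hexdigest()


def root_id(data: bytes, size: int = CHUNK_SIZE) -> str:
    """Steps 3 and 4: link chunk hashes under one root and hash that root."""
    child_hashes = [hash_hex(chunk) for chunk in split_into_chunks(data, size)]
    # The root node here is simply the list of its children's hashes;
    # hashing that list gives an identifier for the whole file.
    return hash_hex("".join(child_hashes).encode("ascii"))


print(split_into_chunks(b"hello world"))
print(root_id(b"hello world"))
```

Because the root hash depends on every child hash, changing any chunk changes the root identifier as well.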
The final step is retrieving content using the content identifier. This is a two-step process: first, locate the content based on the content identifier; second, verify the content's integrity using the identifier.
To continue with the IPFS example, you can either connect to IPFS using the CLI or use the Brave browser's IPFS gateway and enter the CID. This will fetch your file. If you want to verify that this particular version of the file is the one you're looking for, you can recalculate its hash and compare it with the one you expected. If they match, it's the same data. IPFS's Merkle DAG structure means you can also traverse the graph to trace back your file.
CIDs are immutable hard links, so if even one byte of data is altered, a new CID will be generated during the hashing process. This helps ensure content authenticity in a decentralized and trustless environment.
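The locate-then-verify flow can be sketched with a plain dictionary standing in for the network. This is an illustration under stated assumptions: `content_id` and `fetch_verified` are hypothetical helpers, and a bare SHA-256 hex digest stands in for a real CID.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Simplified stand-in for a CID: the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(store: dict[str, bytes], cid: str) -> bytes:
    """Locate content by its identifier, then check it against that identifier."""
    data = store[cid]  # step 1: locate (raises KeyError if it is missing)
    if content_id(data) != cid:  # step 2: verify integrity
        raise ValueError("content does not match its identifier")
    return data


original = b"volcano photo bytes"
cid = content_id(original)
store = {cid: original}

print(fetch_verified(store, cid) == original)
```

If a single byte of the stored data were altered, for example `store[cid] = b"volcano photo bytez"`, the recomputed hash would no longer match and `fetch_verified` would raise `ValueError`.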
Benefits of Content Addressing
There are several benefits to using content addressing. As previously mentioned, data cannot be tampered with without it being immediately obvious since any change to a file will generate a new hash for that file.
This also opens the door to immutable data storage because when we use content-addressed links, we create persistent data structures. This is especially necessary for archives, blockchains, and version control systems.
Data deduplication is another advantage. With the location-addressed Internet we have today, the same file can be stored any number of times because there is nothing to tell the storage system it's redundant. It just creates a new variation of the file name and moves on (e.g., File.jpg, File(1).jpg, File(2).jpg). But with content addressing, if Jack stores a picture of a volcano on IPFS and Jill stores the same file, both copies will have the same hash, so the file only needs to be stored once. This reduces storage costs by eliminating duplicate content and can also improve data retrieval speed in distributed systems.
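A minimal sketch of why deduplication falls out naturally: when the storage key is the hash of the content, identical files map to the same key. `ContentStore` is a hypothetical in-memory class written for this example, not part of any IPFS library.

```python
import hashlib


class ContentStore:
    """Toy in-memory store keyed by the SHA-256 hash of the content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content yields the same key, so it is kept only once.
        self._blobs.setdefault(key, data)
        return key

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
jack_key = store.put(b"picture of a volcano")
jill_key = store.put(b"picture of a volcano")

print(jack_key == jill_key, len(store))
```

Jack and Jill get the same key back, and the store holds a single copy of the picture.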
There are numerous other benefits, including secure sharing with encrypted hashes, efficient distribution, and more.
Challenges of Content Addressing
While content addressing makes the links to files immutable, it does not persist the data on the distributed network. If the file is no longer available (deleted or garbage collected), it won't be retrievable using the CID. Therefore, to ensure a file remains accessible, it's important to save it to your own node or pin it using a storage service.
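The difference between a valid identifier and available data can be illustrated with a toy model of one node. `LocalNode` and its methods are hypothetical and only mimic the idea of pinning and garbage collection; they are not the IPFS API.

```python
import hashlib
from typing import Optional


class LocalNode:
    """Toy model of a single node's block store with pinning."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.pinned: set[str] = set()

    def add(self, data: bytes, pin: bool = False) -> str:
        cid = hashlib.sha256(data).hexdigest()  # simplified stand-in for a CID
        self.blocks[cid] = data
        if pin:
            self.pinned.add(cid)
        return cid

    def collect_garbage(self) -> None:
        """Drop every block that has not been pinned."""
        self.blocks = {c: d for c, d in self.blocks.items() if c in self.pinned}

    def get(self, cid: str) -> Optional[bytes]:
        return self.blocks.get(cid)


node = LocalNode()
kept = node.add(b"pinned report", pin=True)
dropped = node.add(b"unpinned draft")
node.collect_garbage()

print(node.get(kept) is not None, node.get(dropped) is None)
```

After garbage collection, the identifier for the unpinned data is still a perfectly valid hash, but no copy of the data remains on this node to serve.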
Discoverability is another challenge. Since CIDs are hashes, they're difficult to remember. Users can link their CIDs to a human-readable name using DNSLink, but it's an extra step.
In summary, content addressing is helping to fix some of the problems that have formed due to traditional location-based addressing, such as duplicate or unreliable files. It also forms the basis of persistent data structures, enabling immutable storage solutions and version control within file systems (like Fission's WNFS!).
If you'd like to learn more about content addressing, Protocol Labs' ProtoSchool tutorial is clear and easy to understand and acts as a jumping-off point for more detailed technical topics.