data.dedupestor #
DedupeStore
DedupeStore is a content-addressable key-value store with built-in deduplication. It uses blake2b-160 content hashing to identify and deduplicate data, making it ideal for storing files or data blocks where the same content might appear multiple times.
Features
- Content-based deduplication using blake2b-160 hashing
- Efficient storage using RadixTree for hash lookups
- Persistent storage using OurDB
- Maximum value size limit of 1MB
- Fast retrieval of data using content hash
- Automatic deduplication of identical content
Usage
import freeflowuniverse.herolib.data.dedupestor
// Create a new dedupestore
mut ds := dedupestor.new(
	path: 'path/to/store'
	reset: false // Set to true to reset existing data
)!
// Store some data, tagged with a reference that identifies the caller
data := 'Hello, World!'.bytes()
ref := dedupestor.Reference{
	owner: 1
	id: 100
}
id := ds.store(data, ref)!
println('Stored data with id: ${id}')
// Retrieve data using the id
retrieved := ds.get(id)!
println('Retrieved data: ${retrieved.bytestr()}')
// Check if data exists
exists := ds.id_exists(id)
println('Data exists: ${exists}')
// Storing the same data again under another reference returns the same id
same_id := ds.store(data, dedupestor.Reference{ owner: 1, id: 101 })!
assert id == same_id // True, data was deduplicated
Implementation Details
DedupeStore uses two main components for storage:
- RadixTree: Stores mappings from content hashes to data location IDs
- OurDB: Stores the actual data blocks
When storing data:
1. The data is hashed using blake2b-160
2. If the hash exists in the RadixTree, the reference is added to the existing entry and the existing ID is returned
3. If the hash is new:
   - Data is stored in OurDB, getting a new location ID
   - Hash -> ID mapping is stored in RadixTree
   - The new ID is returned
When retrieving data:
1. The RadixTree is queried with the hash to get the data location ID
2. The data is retrieved from OurDB using the ID
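The store and retrieve steps above can be sketched with a small in-memory model. This is not the module's implementation: `ToyStore` is a hypothetical type in which a plain map stands in for the RadixTree and another for OurDB, and reference tracking is left out. It assumes `blake2b.sum160` from V's `crypto.blake2b` module for the 160-bit hash.

```v
import crypto.blake2b

// ToyStore is a hypothetical in-memory stand-in for DedupeStore:
// `index` plays the role of the RadixTree (hash -> id) and
// `blocks` plays the role of OurDB (id -> data).
struct ToyStore {
mut:
	index   map[string]u32
	blocks  map[u32][]u8
	next_id u32
}

// store returns the existing id when the content hash is already known,
// otherwise it saves the data under a fresh id and records hash -> id.
fn (mut s ToyStore) store(data []u8) u32 {
	hash := blake2b.sum160(data).hex()
	if hash in s.index {
		return s.index[hash]
	}
	id := s.next_id
	s.next_id++
	s.blocks[id] = data.clone()
	s.index[hash] = id
	return id
}

// get_from_hash resolves hash -> id -> data, mirroring the retrieval steps.
fn (s ToyStore) get_from_hash(hash string) ?[]u8 {
	id := s.index[hash] or { return none }
	return s.blocks[id] or { return none }
}

fn main() {
	mut s := ToyStore{}
	a := s.store('Hello, World!'.bytes())
	b := s.store('Hello, World!'.bytes())
	c := s.store('something else'.bytes())
	println('same id for same content: ${a == b}')
	println('different id for different content: ${a != c}')
	println('blocks physically stored: ${s.blocks.len}')
}
```

Keying the index by the hex digest keeps the lookup independent of where the data lives, which is why the second `store` call never touches `blocks`.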
Size Limits
- Maximum value size: 1MB (max_value_size = 1024 * 1024 bytes)
- Attempting to store a larger value returns an error and nothing is stored
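A size check of this kind might look like the sketch below. `check_size` is a hypothetical helper, not a function of the module, and whether the real limit is inclusive (a value of exactly `max_value_size` bytes being accepted) is an assumption here.

```v
// Mirrors the documented constant: 1024 * 1024 bytes.
const max_value_size = 1024 * 1024

// check_size is a hypothetical helper, not part of the module.
// ASSUMPTION: a value of exactly max_value_size bytes is still accepted.
fn check_size(data []u8) ! {
	if data.len > max_value_size {
		return error('value size ${data.len} exceeds maximum of ${max_value_size} bytes')
	}
}

fn main() {
	ok := []u8{len: max_value_size}
	too_big := []u8{len: max_value_size + 1}
	check_size(ok) or { panic(err) }
	check_size(too_big) or { println('rejected: ${err}') }
}
```

Callers of the real `store` method can handle the error the same way, with an `or { ... }` block or by propagating it with `!`.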
The Reference Struct
In the dedupestor system, the Reference struct is defined with two fields:
pub struct Reference {
pub:
	owner u16
	id    u32
}
Each field has a distinct role:
- owner (u16): Identifies which entity or system component owns or references the data. This could be an application, a user, or a subsystem that uses the dedupestor.
- id (u32): A unique identifier within that owner's domain. This allows each owner to keep its own independent numbering for the data it references.
Together, a combination such as {owner: 1, id: 100} forms a unique reference that:
- Tracks which entities are referencing a particular piece of data
- Allows the system to know when data can be safely deleted (when no references remain)
- Provides a way for different components to maintain their own ID systems without conflicts
The dedupestor uses these references to implement reference counting. When data is stored, a reference is attached to it. When all references to a piece of data have been removed (via the delete method), the data itself is deleted from storage.
This design keeps deduplication efficient: if the same data is stored multiple times with different references, it is physically stored only once, while the system keeps track of every reference to it.
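The reference counting described above can be illustrated with a minimal sketch. `Entry` is a hypothetical stand-in for the per-hash metadata, and the `Reference` struct is redeclared locally so the example is self-contained; the module's own `Metadata` methods have different signatures.

```v
struct Reference {
	owner u16
	id    u32
}

// Entry is a hypothetical stand-in for the metadata kept per stored value.
struct Entry {
mut:
	refs []Reference
}

// add_reference appends ref unless an identical one is already present.
fn (mut e Entry) add_reference(ref Reference) {
	if ref !in e.refs {
		e.refs << ref
	}
}

// remove_reference drops ref and reports whether any references remain.
// A false result means the underlying data can be deleted.
fn (mut e Entry) remove_reference(ref Reference) bool {
	e.refs = e.refs.filter(it != ref)
	return e.refs.len > 0
}

fn main() {
	mut e := Entry{}
	a := Reference{owner: 1, id: 100}
	b := Reference{owner: 2, id: 100}
	e.add_reference(a)
	e.add_reference(b)
	e.add_reference(a) // duplicate, ignored
	println('references held: ${e.refs.len}')
	println('still referenced after removing a: ${e.remove_reference(a)}')
	println('still referenced after removing b: ${e.remove_reference(b)}')
}
```

Note that `a` and `b` share the same `id` but differ in `owner`, so they count as two distinct references.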
Testing
The module includes tests covering:
- Basic store/retrieve operations
- Deduplication functionality
- Size limit enforcement
- Edge cases
Run tests with:
v test lib/data/dedupestor/
Constants #
const max_value_size = 1024 * 1024 // 1MB
fn bytes_to_metadata #
fn bytes_to_metadata(b []u8) Metadata
bytes_to_metadata converts bytes back to Metadata
fn bytes_to_reference #
fn bytes_to_reference(b []u8) Reference
bytes_to_reference converts bytes to Reference
fn new #
fn new(args NewArgs) !&DedupeStore
new creates a new deduplication store
struct DedupeStore #
struct DedupeStore {
mut:
	radix &radixtree.RadixTree // For storing hash -> id mappings
	data  &ourdb.OurDB // For storing the actual data
}
DedupeStore provides a key-value store with deduplication based on content hashing
fn (DedupeStore) store #
fn (mut ds DedupeStore) store(data []u8, ref Reference) !u32
store stores data with its reference and returns its id. If the data already exists (same hash), it returns the existing id without storing the data again, and appends the reference to the radix tree entry of the hash to track references.
fn (DedupeStore) get #
fn (mut ds DedupeStore) get(id u32) ![]u8
get retrieves a value by its id
fn (DedupeStore) get_from_hash #
fn (mut ds DedupeStore) get_from_hash(hash string) ![]u8
get_from_hash retrieves a value by its hash
fn (DedupeStore) id_exists #
fn (mut ds DedupeStore) id_exists(id u32) bool
id_exists checks if a value with the given id exists
fn (DedupeStore) hash_exists #
fn (mut ds DedupeStore) hash_exists(hash string) bool
hash_exists checks if a value with the given hash exists
fn (DedupeStore) delete #
fn (mut ds DedupeStore) delete(id u32, ref Reference) !
delete removes a reference from the hash entry. If it is the last reference, it also removes the hash entry and its data.
struct Metadata #
struct Metadata {
pub:
	id u32
pub mut:
	references []Reference
}
Metadata represents a stored value with its ID and references
fn (Metadata) to_bytes #
fn (m Metadata) to_bytes() []u8
to_bytes converts Metadata to bytes for storage
fn (Metadata) add_reference #
fn (mut m Metadata) add_reference(ref Reference) !Metadata
add_reference adds a new reference if it doesn't already exist
fn (Metadata) remove_reference #
fn (mut m Metadata) remove_reference(ref Reference) !Metadata
remove_reference removes a reference if it exists
struct NewArgs #
struct NewArgs {
pub mut:
	path  string // Base path for the store
	reset bool // Whether to reset existing data
}
struct Reference #
struct Reference {
pub:
	owner u16
	id    u32
}
Reference represents a reference to stored data
fn (Reference) to_bytes #
fn (r Reference) to_bytes() []u8
to_bytes converts Reference to bytes
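As an illustration of how a Reference could round-trip through bytes, here is a minimal sketch. The 6-byte little-endian layout (2 bytes of owner followed by 4 bytes of id) is an assumption made for this example; the module's actual on-disk format is not documented here and may differ. The decoder also returns a result type for the length check, unlike the module's `bytes_to_reference`.

```v
struct Reference {
	owner u16
	id    u32
}

// to_bytes packs owner (2 bytes) then id (4 bytes), little-endian.
// ASSUMPTION: this layout is illustrative, not the module's real format.
fn (r Reference) to_bytes() []u8 {
	return [
		u8(r.owner & 0xff),
		u8((r.owner >> 8) & 0xff),
		u8(r.id & 0xff),
		u8((r.id >> 8) & 0xff),
		u8((r.id >> 16) & 0xff),
		u8((r.id >> 24) & 0xff),
	]
}

// bytes_to_reference reverses to_bytes under the same assumed layout.
fn bytes_to_reference(b []u8) !Reference {
	if b.len != 6 {
		return error('expected 6 bytes, got ${b.len}')
	}
	return Reference{
		owner: u16(b[0]) | (u16(b[1]) << 8)
		id: u32(b[2]) | (u32(b[3]) << 8) | (u32(b[4]) << 16) | (u32(b[5]) << 24)
	}
}

fn main() {
	r := Reference{owner: 1, id: 100}
	encoded := r.to_bytes()
	decoded := bytes_to_reference(encoded) or { panic(err) }
	println('encoded length: ${encoded.len}')
	println('round trip ok: ${decoded == r}')
}
```

A fixed-width encoding like this would let a list of references be stored back to back and split again by length alone.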