This first edition of our Fission Reactor monthly research update features a first look at Dialog, a far edge database for local-first applications and autonomous computing agents.
Say hello to Quinn Wilton (@wilton_quinn), who joined the Fission team in mid February as an Applied Researcher. Reactor is the name of our applied research group, where we work on challenges related to Fission's edge computing and protocol engineering work. Below is her post summarizing some of the projects goals, background research, and design decisions made so far. A video of the call along with slides are available at the end of the post.
Dialog is an edge database we're building, meant for modelling rich application state and relational data in local-first software. It extends our WebNative SDK with support for concurrent access to structured data, and enables the creation of decentralized, collaborative applications.
Before describing the system itself, it's important to understand how some of the unique constraints faced by local-first databases have shaped our design decisions.
To start, our goal is to replace backends entirely, and that means that a far edge database of the type we're building must embrace uncertainty. This is an inescapable fact of the space, and it represents an extreme realization of exactly the problems that make more traditional distributed systems challenging. To name a few examples of what I mean:
- Devices may be offline for months or years at a time
- The set of devices is unknown and unbounded
- The capabilities of those devices vary
These three facts together mean that many of the techniques available to us in more standard distributed systems—garbage collection based on causal stability, and the use of vector clocks, for example—simply don't apply.
Worse yet, recognizing that network partitions represent the default state of the network, rather than transient blips, means acknowledging the risk of concurrent updates spanning long-lived branches of history. If you've ever performed conflict resolution on a many-months old Git branch, then you likely understand how difficult this is to get right, especially while ensuring that no work is lost in the process.
We propose that moving past these issues in an edge database means upending the way we've historically discussed these sorts of systems, and where normally we'd ask when a distributed system will converge, here we must instead question what the world looks like when convergence is infeasible.
...and that's where Dialog comes in. Dialog is an edge database we're building that uses a dialect of Datalog to query data in local-first applications. Its design goals include:
- Exposing Byzantine fault tolerant CRDTs built on a locally bitemporal store
- Modeling application state as views over these data types, with incremental view maintenance
- Enabling data integration over heterogenous and encrypted data sets
I know, that was a lot of buzzwords. I won't be able to do them justice in one blog post, but they're all ideas we'll be digging into more deeply during future updates.
Let's see how far we can get though!
Datalog is a declarative language from the 70s, with roots in logic programming. It began as a subset of Prolog, before developing into an independent field of study in the early 80s. It's since found applications in fields as diverse as databases, distributed systems, static analysis, and artificial intelligence.
Much of Datalog's early use stemmed from its connection to relational algebra, with pure Datalog being computationally equivalent to SQL with recursion. This makes it a powerful language for constructing complex queries, as demonstrated by modern implementations like DataScript.
- Logically monotonic distributed code is eventually consistent without any need for coordination protocols (distributed locks, two-phase commit, paxos, etc.)
- Eventual consistency can be guaranteed in any program by protecting non-monotonic statements (“points of order”) with coordination protocols.
This raises the question of whether we can tackle some of Dialog's consistency concerns by structuring our data types as logically monotonic queries. Fortunately, this is a question that Martin Kleppmann explored in "Data structures as queries: Expressing CRDTs using Datalog".
The idea is to use Datalog as the basis for building CRDTs, or conflict-free replicated data types. These are data structures that offer strong eventual consistency without coordination: roughly, two peers that have received the same events will converge on the same state. This concept is what prompted a lot of our early interest in Datalog, and we've since validated and extended some of Martin's original ideas.
Further complicating things for our use-case is that we're building a trustless database, and so our CRDTs must converge given invalid events, without any centralized validation. In particular, this means that we must be able to guarantee three essential properties in the presence of Byzantine faults:
- IDs must be unambiguous and unforgeable
- Causal dependencies must be acyclic
- Semantic invariants must be verifiable at read
These are all goals that can be achieved by explicitly modeling causality over a content-addressable DAG, like that provided by IPFS. This is the approach we've taken in the design of our earlier WebNative FileSystem, and is an idea that has also been discussed, again by Martin Kleppmann, in this year's "Making CRDTs Byzantine Fault Tolerant".
In our case, this means representing Datalog facts as 4-tuples, of (entity, attribute, value, causality), with the causality component denoting the causal dependencies for the fact. This parallels Datomic's choice of EAVT datoms, that instead make use of transaction IDs. By encoding facts in this way, we're able to directly embed them into a directed-acyclic graph, that explicitly models the causal history for the database. This also conveniently sidesteps the issue of decentralized logical clocks that I touched upon earlier!
Everything I've discussed so far comes together to enable what is essentially time travel over the database. We're able to query the state of the database at any point in history, and also fork off alternative timelines entirely—to run long-running experiments, or safely develop new features against production data, for example.
Both of these abilities follow directly from the explicit embedding of causality into a DAG, with time travel being analogous to a traversal over that graph: exactly as the idea is expressed in Git. This idea is further developed in a recent paper by Nicholas Schiefer and Geoffrey Litt, "Merge What You Can, Fork What You Can't: Managing Data Integrity in Local-First Software".
Time travel poses some practical concerns though: how do we efficiently recompute views from arbitrary points without starting over from the beginning? This is another place where we can lean on Datalog, through incremental view maintenance. This is a technique given by a class of algorithms for recomputing views in response to changing inputs; in essence, stepping directly from one materialization of a view to another as facts are added or removed from the local view.
In Dialog we plan to do this using an algorithm called Delete/Rederive (DRed). At a high-level, the algorithm takes as input a set of facts to be deleted from a database. It then deletes those facts, along with any that can be derived from them. Since some facts may have multiple derivations, this initial deletion is an overestimate, and so you follow-up by rederiving any facts with an alternate derivation. Altogether, this means that we're able to quickly speed through time in response to changing data.
Lastly, I want to briefly touch upon the idea of heterogenous data integration. Dialog takes inspiration from Christopher Meiklejohn's, "A Certain Tendency Of The Database Community", and considers every device as holding its own valid interpretation of the world.
In the real-world, networks rely on asymmetry and privacy; they form and disband based on the values and backgrounds of their participants; and they shift in response to new knowledge or ideas. Ignoring these realities means to totalize uncountably many distinct views of the world into a single experience: but that's exactly what it means to seek global consensus.
Dialog inverts this model by instead constructing views out of disparate sets of potentially encrypted data. Each device may be pulling from different data sources—the data feeds they trust, the encrypted data they've shared with their network, or the moderation services they subscribe to—but because all of that data is eventually consistent, the views over that data remain compatible.
In effect, we trade global convergence for locally deterministic and mutually compatible interpretations of data. This is a radical shift in thinking, but we believe that it allows us more flexibility in modeling the inherent chaos and disorder of real-world systems.
We're still early in the development of Dialog, but we're excited by what we're building, and all of the pieces of the design are starting to come together!
Video & Slides
Thanks for checking out the recap of our first Fission Reactor community call. This initial call was centered on the high-level ideas underlying the technology, but we'll be having similar sessions on the last Thursday of every month: each focused on diving into and discussing some aspect of the research we're doing.