A year and a half ago, in a cramped coworking space in the heart of the Scottish Highlands, other members of a new stealth team at Ziverge and I sat huddled around a MacBook Pro.
We witnessed something so pivotal that I recorded it on my phone.
We had just watched as an ordinary program that we compiled to WebAssembly was forcibly terminated mid-way through execution.
The forcible termination wasn’t the interesting part. The part that got us so excited was what happened after the termination.
We watched as some new open source software, codenamed Golem, brought back the failed program and restored its state to the moment before termination.
As if by magic, the program resumed exactly where it left off!
We had taken an ordinary program, and without any SDK or DSL, transformed it into one that was automatically fault-tolerant.
This early prototype became the heart of Golem, the world’s newest entry into the exciting new space of durable computing.
Today, after a year and a half of work by one of the most technically capable teams I’ve had the pleasure of working with, we are excited to announce the release of 100% open source Golem 1.0.
The idea for Golem grew out of another open source project that Ziverge launched, called ZIO Flow. Sponsored by a company in the insurance industry, this project was designed to bring a highly reliable workflow engine to the ZIO ecosystem.
By highly reliable, I mean a workflow engine that durably executes workflows (conceptually equivalent to continuous, whole-system snapshots), so that in the event of any restart, update, or fault, the workflow can be restored and resume activity where it left off.
Durable execution promises to give developers a bulletproof foundation for building highly reliable distributed systems–be they checkout workflows, financial transaction processing, stateful and long-lived AI agents, or just backend APIs that need to coordinate updates and activities across many systems.
It promises a massive reduction of architectural, infrastructure, and engineering costs, primarily because engineering this level of fault-tolerance atop commodity hardware generally implies adopting complex and behemoth event-driven architectures.
With ZIO Flow, we knew we were delivering something valuable, but after much deliberation, I concluded that ZIO Flow would never become mainstream.
The reasons were simple:
Despite ZIO Flow’s niche market, the idea remained intriguing to me. But after completing the first version of the project, we parked it for months.
Sometime in late 2022, I started digging into WebAssembly (WASM) for its potential to simplify cloud-native development, deployment, and operations.
Not long thereafter, I got a crazy idea: could we implement durable execution not with an SDK written in a specific language, but with an execution engine for any program compiled to WASM?
In theory, this would allow developers in any programming language and with any technology stack (so long as it compiles to WASM) to “push a button” and get durable execution for free.
The idea that you could push a button, and now your running programs would automatically survive restarts, updates, and faults, with zero changes–and on commodity clouds, without the need for specialized hardware or virtualization–seemed magical.
Possibly too magical to actually exist.
So after some weeks, I put together a tiny team inside Ziverge to fly out to Scotland for a couple of weeks of intensive hacking.
The rest is history. Or at least, in the process of becoming history!
The launch event is today, August 23rd, 2024, which marks just over a year since we launched the Developer Preview of Golem.
The Developer Preview represented the bare minimum functionality necessary for developers to preview the technology.
Over the course of the developer preview, we gained early users, who built example systems on Golem, ranging from trading platforms to streaming analytics to campaign orchestration. We acquired a design partner, who worked closely with us to meet their needs on the platform.
With feedback from these early users and our design partner, we have incrementally matured and refined the developer preview into a package we believe is ready for production usage.
We are rolling this out as 1.0, and equipping it with guarantees appropriate both for production usage and for the early-stage nature of the open source project.
In the next section, I will tour the major features of the 1.0 release.
As the home page says, Golem is a durable computing platform that runs serverless workers invincibly, impervious to faults, restarts, updates, and transient failures.
Let’s break down each of these components in more detail:
These features imply a lot of power–much more than can be explored in this post. But together, they enable you to build highly reliable distributed systems with impossibly simple code.
Now let’s take a look at some of the features we’ve managed to incorporate into 1.0.
Golem executes your workers transactionally, all the way from beginning to end. This guarantee holds even in the event of restarts, updates, and faults–including hardware failure, operating system failure, even power failure!
Transactional execution eliminates partial updates and inconsistent states, providing a robust foundation for building highly reliable distributed systems.
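As a rough illustration of how execution can survive a crash without repeating work, here is a toy sketch in Python. It is not Golem's implementation, which this post does not describe. It assumes a journal-and-replay approach, and the names `Journal`, `DurableContext`, `run_until_done`, and `SimulatedCrash` are all hypothetical.

```python
class SimulatedCrash(Exception):
    """Stands in for a restart, fault, or power failure."""


class Journal:
    """Hypothetical durable log; a real one would live on persistent storage."""

    def __init__(self):
        self.entries = []


class DurableContext:
    """Records the result of each step so a re-run can skip completed steps."""

    def __init__(self, journal):
        self.journal = journal
        self.position = 0  # index of the next step within this run

    def step(self, action):
        if self.position < len(self.journal.entries):
            # This step finished before the crash: reuse its recorded result.
            result = self.journal.entries[self.position]
        else:
            # First time this step is reached: perform it and record the result.
            result = action()
            self.journal.entries.append(result)
        self.position += 1
        return result


def run_until_done(program, journal):
    """Re-run the program after every simulated crash until it completes."""
    while True:
        try:
            return program(DurableContext(journal))
        except SimulatedCrash:
            continue


charges = []        # an external side effect that must not be repeated
crashes_left = [1]  # crash exactly once, mid-way through


def checkout(ctx):
    total = ctx.step(lambda: 40 + 2)
    ctx.step(lambda: charges.append(total))
    if crashes_left[0] > 0:
        crashes_left[0] -= 1
        raise SimulatedCrash()
    return ctx.step(lambda: f"charged {total}")


result = run_until_done(checkout, Journal())
```

After the simulated crash, the second run replays the two recorded steps instead of performing them again, so the charge is applied only once. The toy ignores a crash that lands between performing an action and recording it, which a real system would have to handle.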
Because Golem executes workers transactionally, any data stored in memory is persistent. This includes local variables in code that is currently executing, global variables, and even the point in the code that is currently executing.
Durable state provides a way to reduce your application’s dependency on databases, key-value stores, and caches, because all in-memory data is as persistent as a database.
In other serverless platforms, “workers” are functions that are invoked a single time. In Golem, however, workers are instantiated software components, potentially with many functions, that can be invoked repeatedly and live as long as you need them to.
Compared to “one-shot” workers, like lambdas, reactive workers allow much more sophisticated distributed systems to be built as pure code.
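To make the idea of a stateful, repeatedly invoked worker more concrete, here is a minimal sketch in plain Python. It does not use any Golem API; `ShoppingCart` and its methods are hypothetical names. The point is only that, on a platform where in-memory state is durable, the fields below would not need a database behind them.

```python
class ShoppingCart:
    """Hypothetical worker: one long-lived instance per user, many functions."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.items = {}           # plain in-memory state, no database calls
        self.checked_out = False

    def add_item(self, sku, quantity=1):
        if self.checked_out:
            raise RuntimeError("cart already checked out")
        self.items[sku] = self.items.get(sku, 0) + quantity

    def remove_item(self, sku):
        # Removing an item that is not in the cart is a no-op.
        self.items.pop(sku, None)

    def checkout(self):
        self.checked_out = True
        return dict(self.items)   # return a copy of the final order


# The same instance is invoked several times and keeps its state in between.
cart = ShoppingCart("user-1")
cart.add_item("book")
cart.add_item("book", 2)
cart.add_item("pen")
cart.remove_item("pen")
order = cart.checkout()
```

A one-shot function would have to load and save the cart on every call. Here the state simply lives in the object between invocations.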
Serverless platforms like AWS Lambda time out long-running workers. Golem, on the other hand, can execute workers for milliseconds, days, or even years–reliably and without loss of progress, state, or any data.
Golem’s support for long-running workers makes it easy to build and deploy workflows on the platform, including business process automation, ETL, report generation, stateful AI orchestration, user onboarding, and many other long-running business processes.
Golem shards worker execution across any number of nodes for horizontal scalability. Workers that are inactive, due to lack of use or because they are scheduled to activate in the future, are suspended and moved out of memory to conserve CPU and memory.
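One common way to spread workers across nodes is to hash each worker's identifier into a fixed number of shards and then assign shards to nodes. The sketch below shows that general technique only. Whether Golem uses this exact scheme is an assumption, and `shard_of`, `assign_shards`, and `node_for` are hypothetical names.

```python
import hashlib


def shard_of(worker_id, shard_count):
    """Map a worker ID to a shard using a stable hash."""
    digest = hashlib.sha256(worker_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def assign_shards(shard_count, nodes):
    """Distribute shards over the available nodes in round-robin order."""
    return {shard: nodes[shard % len(nodes)] for shard in range(shard_count)}


def node_for(worker_id, shard_count, assignment):
    """Find the node responsible for a given worker."""
    return assignment[shard_of(worker_id, shard_count)]


shard_count = 8
assignment = assign_shards(shard_count, ["node-a", "node-b", "node-c"])
first_lookup = node_for("cart-user-1", shard_count, assignment)
second_lookup = node_for("cart-user-1", shard_count, assignment)
```

Because the hash is stable, every lookup for the same worker lands on the same node. Adding a node only requires recomputing the shard-to-node assignment, not rehashing the workers.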
Golem’s built-in support for horizontal scalability, together with easy deployment using Kubernetes, makes it possible to scale to very large workloads.
Most cloud systems interact with each other through protocols like HTTP and gRPC. Golem instead allows workers to directly communicate with others in a type-safe way, without the need for JSON or gRPC serialization.
Worker-to-worker communication lets you perform internal communication across different stateful workers, without having to build all of the traditionally required protocol boilerplate.
Golem has two separate mechanisms for reliable communication. For communication within Golem (worker-to-worker), Golem guarantees reliable, exactly-once invocation, without possibility of failure. For external communication, Golem supports idempotency keys (which provide exactly-once semantics for APIs that support them) and automatic retries for transient failures.
By taking care of making communication reliable, including supporting exactly-once semantics (automatically for internal communication, opt-in for external), Golem makes it much easier to build highly reliable distributed systems.
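The combination of idempotency keys and retries can be sketched as follows. This is a generic illustration, not Golem's API: `PaymentApi`, `TransientError`, and `call_with_retries` are hypothetical, and the fake API simulates a response that is lost after the charge has already been applied.

```python
import uuid


class TransientError(Exception):
    """Stands in for a timeout or dropped connection."""


class PaymentApi:
    """Hypothetical external API that deduplicates by idempotency key."""

    def __init__(self, failures_before_success):
        self.processed = {}  # idempotency key -> amount charged
        self.failures_left = failures_before_success

    def charge(self, idempotency_key, amount):
        if idempotency_key not in self.processed:
            # The effect is applied only the first time a key is seen.
            self.processed[idempotency_key] = amount
        if self.failures_left > 0:
            # Simulate the response being lost after the charge was applied.
            self.failures_left -= 1
            raise TransientError("response lost")
        return {"key": idempotency_key, "amount": self.processed[idempotency_key]}


def call_with_retries(operation, max_attempts=5):
    """Retry on transient errors; real code would also back off between tries."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientError as error:
            last_error = error
    raise last_error


api = PaymentApi(failures_before_success=2)
key = str(uuid.uuid4())  # one key for the whole logical operation
response = call_with_retries(lambda: api.charge(key, 42))
```

The request is sent three times, but because every attempt carries the same key, the charge is recorded once. Without the key, each retry after a lost response would risk a duplicate charge.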
Golem has built-in support for triggering workers from HTTP events, but there are many scenarios where these “automatic APIs” are insufficient. To make it easy to create custom APIs, Golem allows you to bind routes in OpenAPI definitions to workers, using a lightweight scripting language called Rib to do any data massaging.
With support for custom APIs, Golem lets you deliver any API that you want for front-end teams or third-parties, without having to contaminate your business logic with HTTP protocol code.
Golem can update running workers to a newer version of their code, which is useful for fixing bugs or adding features to long-running (or indefinitely running) workers.
Golem is an early stage open source project. Although we have done our best to learn from early users and incorporate this feedback into the project, we know we will have missed some important features, and that the architecture of the project will continue to improve.
Yet, given that one of the primary use cases for durable execution platforms like Golem is long-running workflows, we feel we need to provide some guarantees that will encourage early adoption of Golem 1.0 in mission-critical use cases.
So we are providing the following guarantees with the Golem 1.x line:
In addition to these guarantees, we will make a best effort to maintain backward compatibility for components outside the core of Golem, including Rib and custom API definitions.
For an early stage project, these guarantees mean you can build and deploy a wide variety of projects on Golem with the confidence that you will have extensive backward compatibility and bug fix guarantees well into the future.
Golem 1.0 is the starting point of a journey, not the end. Although the actual set of improvements made to the project depends on user feedback and third-party contributors, there are a number of areas we plan to focus resources on in the coming months.
Some of the most important focus areas are as follows:
If you experiment with Golem and find you need some feature in order to go into production, then please just reach out to me or any of the core Golem contributors.
Later on today (August 23rd), at 12 noon Eastern Time, we are planning a special launch event, where we introduce Golem and give you a taste of its power for solving complex problems.
Attending the launch event also enables you to participate in a Golem Hackathon, scheduled for August 30th, where you can win more than $5k in cash by building cool applications on Golem.
Beyond attending the demo, if this post has got you excited about Golem, then be sure to check out the following resources:
Cloud computing has transformed software engineering, triggering a cascade of change that has left many developers struggling with the complexities of building highly reliable distributed systems.
We believe the answer to the complexity and limitations of cloud computing is the simplicity and power of durable computing.
We believe that Golem, along with other solutions in the space, will lead to a massive shift in the way developers build reliable distributed systems.
We hope you are as ready for this future as we are!
Join the Golem community on Discord.
Subscribe to the Golem Open Source Newsletter to learn about improvements to Golem, and to hear about the latest articles, talks, and conferences that show you how to build reliable applications using Golem.