Almost all modern programming languages apart from Rust (and I guess Swift, with its reference counting) rely on garbage collection: a "background thread" locates memory the process has forgotten about and marks it as available for re-use. However, this doesn't seem to be a sensible scheme in a distributed system where multiple processing devices each have local memories. That raises the question: if you want to design a programming language that can be transparently distributed over multiple devices, does it need to have a fancy type system (like Rust's) that enforces correct manual memory management?
One reason I'm thinking about this: most upcoming AI chips are using a "network-on-chip" architecture, which could also be called a "distributed system on a chip". A garbage collection algorithm on these chips would have to involve a message-passing protocol wherein different parts of the chip communicate to identify forgotten memory. This seems like an unnecessarily complicated and expensive approach to memory management.
Thoughts? 🦄
I like the sentiment and think Rust and Swift are on a better path, mainly because they manage memory deterministically. But if people don't find garbage collection wasteful today in basic single-threaded, CPU-bound scenarios, I doubt that will stop anybody from bringing it to a distributed environment.
Classic garbage collection makes a lot of sense for a simpler, centralized memory model, like a heap. I don't know how to adapt it to a massively parallel execution environment, but I'm sure it can be done somehow, likely involving an order-of-magnitude increase in (wasted) memory along the way, and it'll no doubt be much more convenient to use.
I don't know much about "AI chips", but if they are optimizing for parallel execution and work anything like modern GPUs, they probably already manage memory quite differently from CPUs: explicit descriptors (a form of type system), buffer hierarchies, and thread grouping with localized access to buffers. That way memory gets bound to certain computations (shaders) and freed once those computations are finished. Well, it's a little more complex, as these are subdivided into workgroups, thread groups, and subgroups, but you get the idea.
If you squint you might see some parallels to Rust's ownership model, and why deterministic memory management is so attractive even in languages that mainly target CPUs. The future is more value semantics (move/copy) and fewer reference semantics ("objects"), with clear ownership to help avoid, or at least minimize, shared state, so we can reap the benefits of parallel processing. That's a promising approach to keeping complexity in check, even though it might not quite feel like it yet when you're trying to configure a shader execution pipeline today.
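To make the ownership parallel concrete, here's a tiny Rust sketch (the `Tensor` type and the stage functions are made up purely for illustration, nothing to do with any real GPU API): each stage owns its buffer, hands it on by move, and the memory is freed deterministically when its owner is done, with no collector involved.

```rust
// Toy illustration of value semantics with clear ownership.
struct Tensor {
    data: Vec<f32>,
}

fn stage_one() -> Tensor {
    Tensor { data: vec![0.0; 1024] }
}

fn stage_two(input: Tensor) -> Tensor {
    // `input` is moved in, so this stage owns it exclusively; nothing else
    // can alias it, and no GC has to discover when it becomes garbage.
    // It is freed deterministically when this function returns.
    Tensor { data: input.data.iter().map(|x| x + 1.0).collect() }
}

fn main() {
    let a = stage_one();
    let b = stage_two(a); // `a` is moved; using it afterwards is a compile error
    println!("{}", b.data.len());
}
```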
Pony's Garbage Collection sounds the closest I can think of: https://tutorial.ponylang.io/appendices/garbage-collection.html
In my limited (yet significant) experience building a distributed system, the challenge wasn't so much garbage as ferrying relevant partial results between compute nodes. So an ownership model may match bandwidth constraints better than potentially costly dereferences.
I guess it's an eager/lazy distinction. I mean, GC is certainly a performance win if you never actually need to collect. Granularity matters: a lot of programs operate in a sort of loop, with a lot of objects allocated per frame or per request. So then it makes sense to have an allocator that only tracks when references cross the boundary. Squint and you can think of that as an eager generational collector.
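To make that concrete, here's a toy Rust sketch of the per-frame pattern (the `FrameArena` type is hypothetical): allocations made during a frame are reclaimed all at once at the frame boundary, instead of being traced individually.

```rust
// Toy per-frame arena: everything allocated during a frame is reclaimed
// in one go at the end of the frame, like a very eager, single-generation GC.
struct FrameArena {
    scratch: Vec<Vec<u8>>, // frame-local allocations
}

impl FrameArena {
    fn new() -> Self {
        FrameArena { scratch: Vec::new() }
    }

    // Hand out a buffer that only lives until end_frame().
    fn alloc(&mut self, size: usize) -> &mut [u8] {
        self.scratch.push(vec![0; size]);
        self.scratch.last_mut().unwrap()
    }

    // The "collection": one cheap operation at the frame boundary.
    fn end_frame(&mut self) {
        self.scratch.clear();
    }
}

fn main() {
    let mut arena = FrameArena::new();
    for frame in 0..3u8 {
        let buf = arena.alloc(256);
        buf[0] = frame;    // use the per-frame allocation
        arena.end_frame(); // boundary: all frame-local memory goes away
    }
}
```

Only references that need to outlive the frame would have to be tracked across that boundary.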
I'm not sure I understand the problem: you can have actor-local heaps and run GC independently while communicating via message passing. Erlang does this.
S.M Mukarram Nainar That's a solution if your programming model is the actor model. By "transparent distribution" I mean something more implicit: your programming language doesn't have a means to talk about "nodes" and "messages", so there are no clear boundaries for a garbage collector.
Note that in Erlang, you essentially have automatic memory management (GC) within an actor, and manual memory management between actors. If you have an actor that is holding data that will later be queried by other actors, you need to know when it is safe to delete the data (or even the entire actor), i.e. you need to know when other actors are no longer holding "references" to it (of some kind).
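In Rust terms (with std threads and channels standing in for actors; this is a toy sketch, not Erlang), the within-actor vs. between-actor split looks like this: each thread owns its own data, and sending a value moves ownership across the boundary, so nothing needs a cross-actor collector.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u64>>();

    // "Actor" 1: builds some data locally, then hands it off.
    let producer = thread::spawn(move || {
        let data = vec![1, 2, 3, 4];
        tx.send(data).unwrap(); // ownership moves to the receiver;
                                // the sender can no longer touch `data`
    });

    // "Actor" 2: receives the data and now owns it exclusively.
    let consumer = thread::spawn(move || {
        let data = rx.recv().unwrap();
        println!("sum = {}", data.iter().sum::<u64>());
        // `data` is freed deterministically when this actor is done with it
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```

The hard case is exactly the one described above: data that stays put while other actors hold some kind of handle to it. That's where the manual bookkeeping (or some distributed reference counting) between actors comes in.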
It seems like Pony has an approach for inter-actor memory management. Thanks Mariano Guerra for the link. Though as I said before, the distribution model is still not quite as implicit as I had in mind 🙂. I don't think you want to program a 1000-core AI chip with the actor model. ML models don't want to be written as actor systems, and general-purpose programs even less so.
I'm bringing up AI chips because some of them will be capable of running general-purpose programs. Think of them as the next step beyond GPGPU. Tenstorrent is an example. From their FAQ:
- "Our computers are optimized for neural network inference and training. They can also execute other types of parallel computation."
- "Network communication hardware is present in each processor, and they talk with one another directly over (on-chip) networks, instead of through DRAM."
- [Compared to GPUs] "Our computers are easier to program, scale better, and are excellent at handling run-time sparsity and conditional computation."
Hopefully that answers your question Stefan Lesser: the programming model is very different to that of GPUs.
Note the irony that Tenstorrent's chips implement an actor model at the hardware level, yet you're unlikely to want to program them using the actor model because of the sheer number of cores. What is our programming model for these machines? As stated in my original post, I think memory management needs to be explicit in the language (Rust-style), because you don't want all 1000+ cores running a distributed garbage collection scheme alongside the primary computation. It becomes less and less feasible the more cores you add. Tenstorrent is planning to have their chips plug directly together using high-bandwidth & low-latency interconnects so that you can have 100,000 cores or more. Imagine running a garbage collector on that. It seems very much the wrong direction to go in.
Nick Smith Oh, interesting! Do you have any other pointers to more technical resources from them or about other “AI chips”? Their FAQ doesn't go very deep, and the website strikes me as very marketing/VC oriented.
I don’t get the sense that their programming model is “very different” to that of GPUs; it's more like they're building on top of that model, but maybe that's what you mean? And they clearly know what makes GPU programming complicated and try to differentiate themselves from it. I'm wary of their marketing lingo…
As they mention PyTorch, one way to leverage multiple independent units could be https://pytorch.org/docs/stable/notes/ddp.html.
This also points towards things like https://mlir.llvm.org/. In other words, more power to the compiler (and yes, fancy type systems)! :)
Do you have an example application in mind? For instance, if you want a generally responsive language, memory-use efficiency, perhaps some convenience routines for objects going out of scope (e.g. automatic closing of file descriptors and sockets), and predictable runtime and memory use, then reference counting seems like a better fit. On the other hand: scope exit becomes slower and is potentially unbounded (imagine reclaiming a giant graph of data); overall runtime is higher because of all the accounting busywork and CPU cache disruption (it can be as much as ~30% slower, but it's complicated); and concurrency becomes harder: child processes will trigger copy-on-write (you can minimize the impact by storing the refcount in a small object header and keeping all headers in a contiguous memory block), and refcounts are also a contention issue for POSIX threads.
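In Rust terms, the trade-off looks roughly like this (a sketch, not a benchmark; `Fileish` is a made-up type): `Rc` gives you deterministic cleanup the moment the last reference goes away, but that last drop pays for all of the reclamation right there, and every clone/drop does a bit of accounting work (atomic, and therefore contended, if you use `Arc` across threads).

```rust
use std::rc::Rc;

struct Fileish(&'static str);

impl Drop for Fileish {
    fn drop(&mut self) {
        // Deterministic cleanup: runs exactly when the last Rc is dropped,
        // e.g. closing a file descriptor or socket.
        println!("closing {}", self.0);
    }
}

fn main() {
    let a = Rc::new(Fileish("handle.txt"));
    let b = Rc::clone(&a); // refcount bump: accounting busywork on every clone
    drop(a);               // not the last reference, so nothing is reclaimed yet
    println!("refcount = {}", Rc::strong_count(&b));
    drop(b);               // last reference: Drop runs right here, and for a
                           // giant graph this is where you'd pay to reclaim
                           // all of it at once
}
```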
You might want to read up on the various languages designed for the Thinking Machines CM series, including *Lisp and CM Lisp. I worked on one of these with 768 cores, but the top models had 65,000+ cores running in a parallel machine with perfectly pleasant high-level language support.
Stefan Lesser There are some interviews and videos of Tenstorrent online. Here's a recent one with Anandtech. Here's a long podcast episode with the CTO where he talks about a bunch of computing-related stuff, including hardware architectures for AI. Here's a short technical presentation on how their chips work.
@U025PBD75TM You're mentioning a lot of issues with conventional hardware architectures (cache and thread contention), and sure, there are some trade-offs when choosing between reference counting and tracing GCs in that context. But I'm focusing more on massively parallel architectures, which I think changes the rules a bit. For example, Tenstorrent's chips have no shared memory, no caches, and no threads. Instead they have a grid of compute units, each with a dedicated SRAM (not a cache) and capable of doing parallel matrix/tensor operations. This is my "application", if you like: writing programs that can run on this type of machine. Why? Because they're going to offer up 100x the compute power of CPUs and are more suited to heterogeneous workloads than GPUs. From what I can see, they have a chance at obsoleting the whole idea of a CPU, as long as we can program them. There are insane amounts of money pouring into these companies (for their applications to AI), and some of these chips are going to become widely deployed in data centers and (eventually) consumer devices.
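To give a flavour of what I mean by programming this type of machine with explicit ownership, here's a purely hypothetical Rust sketch (none of these types or functions correspond to any real Tenstorrent API): each core owns its local SRAM, and the only way to share a result is to move it, so there's never a lingering cross-core reference for a collector to chase over the on-chip network.

```rust
// Purely hypothetical: a core with a private scratchpad (not a cache).
struct Sram {
    buf: [f32; 1024],
}

struct Core {
    id: usize,
    sram: Sram,
}

impl Core {
    // The computation produces a result tile that this core owns exclusively.
    fn compute_tile(&self) -> Vec<f32> {
        self.sram.buf.iter().map(|x| x * 2.0).collect()
    }
}

// Sending a tile to a neighbour consumes it; on real hardware this would be
// a transfer over the on-chip network rather than a plain drop.
fn send_to(_neighbour: usize, tile: Vec<f32>) {
    drop(tile);
}

fn main() {
    let core = Core { id: 0, sram: Sram { buf: [1.0; 1024] } };
    let tile = core.compute_tile();
    send_to(core.id + 1, tile); // `tile` is moved; core 0 can't touch it again
}
```

The point is that the type system, rather than a runtime protocol running across 1000+ cores, is what establishes when memory can be reused.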
Nick Smith FYI, the successor to Pony is a Microsoft Research project called "Project Verona". I think it's one of the most interesting research projects in the more traditional PL world right now. It might not be what you're looking for, but I think the heap model it uses is a lot more generalised and flexible than the actor model.
https://www.microsoft.com/en-us/research/project/project-verona/
There's not much information up yet, but I think there are some talks explaining the memory model a bit, if you're interested.
In reference to your original question, I would say that both Pony and Verona have type systems comparable to Rust's in fanciness (they also rely on concepts like linearity), but they aim to find a sweet spot in usability by permitting a more free-form programming style within regions.