
Nick Smith 2021-06-23 08:32:41

All modern programming languages apart from Rust (and I guess Swift, with its reference-counting) rely on garbage collection: a "background thread" locates memory the process has forgotten about and marks it as available for reuse. However, this doesn't seem to be a sensible scheme in a distributed system where multiple processing devices each have local memories. That raises the question: if you want to design a programming language that can be transparently distributed over multiple devices, does it need to have a fancy type system (like Rust's) that enforces correct manual memory management?

One reason I'm thinking about this: most upcoming AI chips are using a "network-on-chip" architecture, which could also be called a "distributed system on a chip". A garbage collection algorithm on these chips would have to involve a message-passing protocol wherein different parts of the chip communicate to identify forgotten memory. This seems like an unnecessarily complicated and expensive approach to memory management.

Thoughts? 🦄

Stefan Lesser 2021-06-23 09:47:55

I like the sentiment and think Rust and Swift are on a better path, mainly because they manage memory deterministically. But if people don't find garbage collection wasteful today in basic single-threaded, CPU-bound scenarios, I doubt its cost will stop anybody from bringing it to a distributed environment.

Classic garbage collection makes a lot of sense for a simpler, centralized memory model, like a heap. I don't know how to adapt it to a massively parallel execution environment, but I'm sure it can be done somehow, likely involving an order-of-magnitude increase in (wasted) memory along the way, but it'll be much more convenient to use I'm sure.

I don't know much about "AI chips", but if they are optimizing for parallel execution and work anything like modern GPUs, they probably already manage memory quite differently from CPUs: with explicit descriptors (a form of type system), buffer hierarchies, and thread grouping with localized access to buffers. That way memory gets bound to certain computations (shaders) and freed once those computations are finished. Well, it's a little more complex, as these are subdivided into workgroups, thread groups, and subgroups, but you get the idea.

If you squint you might see some parallels to Rust's ownership model, and why deterministic memory management is so attractive even in languages that mainly target CPUs. The future is more value semantics (move/copy) and less reference semantics ("objects"), with clear ownership to help avoid, or at least minimize, shared state, so we can reap the benefits of parallel processing. That's a promising approach to keep complexity in check, even though it might not quite feel like that yet when you're trying to configure a shader execution pipeline today.
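To make that concrete, here's a tiny Rust sketch of the value-semantics-with-clear-ownership idea (just an illustration of the general pattern, nothing specific to GPUs or AI chips):

```rust
// Ownership makes the lifetime of a buffer explicit: there is exactly one
// owner at a time, and the compiler enforces it, so there's no shared state
// to coordinate and no collector needed to figure out when to free anything.
fn process(buffer: Vec<f32>) -> Vec<f32> {
    // `buffer` was moved into this function; the caller can no longer touch it.
    buffer.into_iter().map(|x| x * 2.0).collect()
}

fn main() {
    let input = vec![1.0, 2.0, 3.0];
    let output = process(input); // ownership moves into `process` here
    // println!("{:?}", input);  // compile error: `input` was moved away
    println!("{:?}", output);    // freed deterministically when `output`
                                 // goes out of scope
}
```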

Mariano Guerra 2021-06-23 12:32:48

Pony's Garbage Collection sounds the closest I can think of: https://tutorial.ponylang.io/appendices/garbage-collection.html

Mariano Guerra 2021-06-23 12:34:23

There are some papers and talks about it

Deech 2021-06-23 14:19:40

Nim & ATS do this as well.

William Taysom 2021-06-23 15:54:40

In my limited (yet significant) experience making a distributed system, the challenge wasn't so much garbage as ferrying relevant partial results between compute nodes. So an ownership model may match bandwidth constraints better than potentially costly dereferences.

I guess it's an eager/lazy distinction. I mean GC is certainly a performance win if you never need to actually collect it. Granularity matters. A lot of programs operate in a sort of loop with a lot of objects allocated per frame or per request. So then it makes sense to have an allocator that only tracks when references cross the boundary. Squint and you can think of that as an eager generational collector.
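A minimal sketch of that per-frame idea in Rust (hand-rolled and deliberately toy-sized; a real engine would use an arena allocator, but the shape is the same): everything allocated during a frame is dropped wholesale at the frame boundary, and only values that cross the boundary are tracked explicitly.

```rust
// Toy per-frame scratch space: allocations made during a frame live in
// `scratch` and are dropped together when the frame ends. Anything that must
// outlive the frame is moved into `persistent` explicitly, which plays the
// role of "a reference crossing the boundary".
struct Frame {
    scratch: Vec<String>,
}

fn main() {
    let mut persistent: Vec<String> = Vec::new();

    for frame_number in 0..3 {
        let mut frame = Frame { scratch: Vec::new() };

        // Per-frame work allocates freely into the scratch space.
        for i in 0..4 {
            frame.scratch.push(format!("frame {frame_number}, item {i}"));
        }

        // Only the survivors cross the frame boundary.
        if let Some(keep) = frame.scratch.pop() {
            persistent.push(keep);
        }

        // `frame` is dropped here: the whole scratch space goes away
        // deterministically, without a collector having to discover
        // which objects are garbage.
    }

    println!("{persistent:?}");
}
```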

S.M Mukarram Nainar 2021-06-23 18:30:08

I'm not sure I understand the problem—you can have actor-local heaps and run GC independently while communicating via message passing. Erlang does this.

Nick Smith 2021-06-24 00:57:13

S.M Mukarram Nainar That's a solution if your programming model is the actor model. By "transparent distribution" I mean something more implicit: your programming language doesn't have a means to talk about "nodes" and "messages", so there are no clear boundaries for a garbage collector.

Note that in Erlang, you essentially have automatic memory management (GC) within an actor, and manual memory management between actors. If you have an actor that is holding data that will later be queried by other actors, you need to know when it is safe to delete the data (or even the entire actor), i.e. you need to know when other actors are no longer holding "references" to it (of some kind).
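To illustrate that boundary problem, here's a hypothetical Rust sketch (not how Erlang actually implements it): inside an "actor" you can reclaim memory locally, but data handed to other actors needs some explicit cross-actor protocol, crudely modeled here with an atomic reference count and threads standing in for actors.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Data owned by one "actor" but readable by others. The Arc's atomic
    // reference count is the crude cross-actor protocol: the vector is only
    // freed once every actor has dropped its handle, and knowing when that
    // happens is exactly the part a purely actor-local GC can't see.
    let shared = Arc::new(vec![1u64, 2, 3]);

    let handles: Vec<_> = (0..4)
        .map(|id| {
            let view = Arc::clone(&shared); // hand a "reference" to another actor
            thread::spawn(move || {
                let sum: u64 = view.iter().sum();
                println!("actor {id} sees sum {sum}");
                // `view` is dropped here: this actor is done with the data.
            })
        })
        .collect();

    drop(shared); // the owning actor forgets the data too

    for handle in handles {
        handle.join().unwrap();
    }
    // The vector was freed when the last handle was dropped, wherever that was.
}
```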

It seems like Pony has an approach for inter-actor memory management. Thanks Mariano Guerra for the link. Though as I said before, the distribution model is still not quite as implicit as I had in mind 🙂. I don't think you want to program a 1000-core AI chip with the actor model. ML models don't want to be written as actor systems, and general-purpose programs even less so.

I'm bringing up AI chips because some of them will be capable of running general-purpose programs. Think of them as the next step beyond GPGPU. Tenstorrent is an example. From their FAQ:

  • "Our computers are optimized for neural network inference and training. They can also execute other types of parallel computation."
  • "Network communication hardware is present in each processor, and they talk with one another directly over (on-chip) networks, instead of through DRAM."
  • [Compared to GPUs] "Our computers are easier to program, scale better, and are excellent at handling run-time sparsity and conditional computation."

Hopefully that answers your question Stefan Lesser: the programming model is very different to that of GPUs.

Note the irony that Tenstorrent's chips are an actor model at the hardware level, yet you're unlikely to want to program them using the actor model because of the sheer number of cores. What is our programming model for these machines? As stated in my original post, I think memory management needs to be explicit in the language (Rust-style), because you don't want all 1000+ cores to be running a distributed garbage collection scheme alongside the primary computation. It becomes less and less feasible the more cores you add. Tenstorrent is planning to have their chips plug directly together using high-bandwidth & low-latency interconnects so that you can have 100,000 cores or more. Imagine running a garbage collector on that. Seems very much the wrong direction to go in.

Stefan Lesser 2021-06-24 07:04:39

Nick Smith Oh, interesting! Do you have any other pointers to more technical resources from them or about other “AI chips”? Their FAQ doesn’t go very deep and the website strikes me as very marketing/VC oriented.

I don’t get the sense that their programming model is “very different” to that of GPUs, more like they’re building on top of that model, but maybe that’s what you mean? And they clearly know what makes GPU programming complicated and try to differentiate themselves from it — I’m wary of their marketing lingo…

As they mention PyTorch, one way to leverage multiple independent units could be https://pytorch.org/docs/stable/notes/ddp.html.

This also points towards things like https://mlir.llvm.org/. In other words, more power to the compiler (and yes, fancy type systems)! :)


U025PBD75TM 2021-06-24 13:33:46

Do you have an example application in mind? For instance, if you want a generally responsive language, memory-use efficiency, some convenience routines for objects going out of scope (e.g. automatic closing of file descriptors and sockets), and predictable runtime and memory use, then reference counting seems like a better fit. OTOH: scope exit becomes slower and is potentially unbounded (imagine reclaiming a giant graph of data); overall runtime is higher because of all the accounting busywork and CPU cache disruption (can be as much as ~30% slower, but it's complicated); and concurrency becomes harder: child processes will trigger copy-on-write (you can minimize the impact by storing the refcount in a small object header and storing all headers in a contiguous memory block), and refcounts are also a contention issue for POSIX threads.
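To make the "unbounded scope exit" point concrete, here's a small Rust sketch (illustrative only, with deliberately arbitrary sizes): dropping the last handles to a large reference-counted structure pays the whole reclamation cost right there, at that point in the program.

```rust
use std::rc::Rc;

fn main() {
    // 100,000 reference-counted buffers, all kept alive by one Vec.
    let nodes: Vec<Rc<Vec<u8>>> = (0..100_000)
        .map(|_| Rc::new(vec![0u8; 1024]))
        .collect();

    // A few of them are also referenced from elsewhere.
    let still_shared: Vec<Rc<Vec<u8>>> = nodes.iter().take(10).cloned().collect();

    // Dropping `nodes` decrements 100,000 refcounts and frees ~99,990 buffers
    // right here: the cost of reclamation lands on whichever scope happens to
    // release the last reference, and it's proportional to what's being freed.
    drop(nodes);

    println!("still holding {} shared buffers", still_shared.len());
}
```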

Jack Rusher 2021-06-24 14:42:03

You might want to read up on the various languages designed for the Thinking Machines CM series, including *Lisp and CM Lisp. I worked on one of these with 768 cores, but the top models had 65,000+ cores running in a parallel machine with perfectly pleasant high-level language support.

Nick Smith 2021-06-24 22:59:14

Stefan Lesser There are some interviews and videos of Tenstorrent online. Here's a recent one with Anandtech. Here's a long podcast episode with the CTO where he talks about a bunch of computing-related stuff, including hardware architectures for AI. Here's a short technical presentation on how their chips work.

Nick Smith 2021-06-24 23:14:29

@U025PBD75TM You're mentioning a lot of issues on conventional hardware architectures (cache and thread contention), and sure, there are some trade-offs when choosing between reference counting and tracing GCs in that context. But I'm focusing more on massively-parallel architectures, which I think changes the rules a bit. For example, Tenstorrent's chips have no shared memory, no caches, and no threads. Instead they have a grid of compute units, each with a dedicated SRAM (not a cache) and capable of doing parallel matrix/tensor operations. This is my "application", if you like: writing programs that can run on this type of machine. Why? Because they're going to offer up 100x the compute power of CPUs and are more suited to heterogeneous workloads than GPUs. From what I can see, they have a chance at obsoleting the whole idea of a CPU, as long as we can program them. There are insane amounts of money pouring into these companies (for their applications to AI), and some of these chips are going to become widely deployed in data centers and (eventually) consumer devices.

Nick Smith 2021-06-24 23:16:58

Jack Rusher I'll look them up, thank you 🙂

Andrew Martin 2021-06-26 22:33:44

Nick Smith FYI, the successor to Pony is a Microsoft Research project called "Project Verona". I think it's one of the most interesting research projects in the more traditional PL world right now. It might not be what you're looking for, but I think the heap model it uses is a lot more generalised and flexible than the actor model.

Andrew Martin 2021-06-26 22:38:37

https://www.microsoft.com/en-us/research/project/project-verona/

There's not much information up yet, but I think there are some talks explaining the memory model a bit, if you're interested.

Andrew Martin 2021-06-26 22:50:24

In reference to your original question, I would say that both Pony and Verona have type systems that are comparable to Rust's in fanciness (they also rely on concepts like linearity), but they aim to find a sweet spot in usability by permitting a more free-form programming style within regions.

Rob Haisfield 2021-06-24 13:31:07

I keep hearing about how Common Lisp has insanely cool tooling, but I'm struggling to find examples. Anyone have any links?

Haakon HR 2021-06-24 14:02:12

SLY (https://github.com/joaotavora/sly) and SLIME are a good place to start. I played around a bit with SLY last year and it has some pretty cool features, such as stickers.


Paul Tarvydas 2021-06-25 03:12:30

I would suggest looking at 3 lisp IDEs:

a) Lispworks (very well integrated set of tools)

b) emacs + slime (contains everything, see, esp. org-mode)

c) Racket

Then, you need to know what it is that you want to accomplish: [+]

1) rapid prototyping, or,

2) compilation.

Early lisps emphasized 1, CL emphasizes 2 but has more 1 than most languages (see, for example, restarts).

[Asides: Debugging is easier when the debugger language is the same as the language being debugged. Debugging is facilitated when a language has no syntax and is expression-based. Debugging is facilitated when the debugger and the language being debugged are the same thing.]

http://www.lispworks.com/products/index.html

https://orgmode.org/

https://racket-lang.org/

[+] I contend that you don’t want both at once. [IMO, 1 and 2 are different views on 0 (The Solution/Design)].


Chris Maughan 2021-06-25 14:53:37

I haven't tried this one yet, but I've been meaning to...

S.M Mukarram Nainar 2021-06-26 14:20:27

Seconding SLY, I use it every day. It is great. Another interesting piece of software is Clouseau: https://www.youtube.com/watch?v=-1LzFxTbU9E

I actually don't think the tooling for CL is all that great, as a user of the language. Most of the interesting workflows, etc that you can do aren't so much provided by tooling as augmented by tooling. More or less all the functionality exposed by stuff like SLY (including the debugger) is part of the base language standard. And I think that's the really interesting part about CL if you're trying to learn from it.

In particular, CL is designed as a whole to support interactive workflows; there's no one design decision you can point to here, but the combination of its image-based nature, the package system, the use of late binding, and the ability to have dynamic binding all work together to make it possible. It would be quite difficult to replicate what CL does without any one of those things. (Though it seems like Clojure has managed to, to some extent? I haven't used it, so I don't know.)

Take the example of editing your code while it's running: a common workflow when working with event-driven code (a webserver, game, etc.) is to have the event loop running and then recompile event handlers as you change them, so that you can mould the program on the fly. For that to be possible you need to be able to patch the function calls in the old code to point to the new code, and the CL runtime supports this, since it is image-based—all functions work this way; any file-related metadata is incidental.

Another example is post-mortem debugging. I actually learnt this term from a C programmer, and it's considered a pretty esoteric technique: debugging core dumps after your program crashes. In CL, this workflow doesn't warrant a term because that is just how everyone writes code—you let your program fail, it drops into the debugger, and then you fix the issue and move on. In particular, the condition and restart system, in addition to the ability to recompile code on the fly, is necessary here, and the former is implemented with dynamically bound (so-called "special") variables.

As a final note, while I said the tooling in CL isn't all that amazing, I meant that in absolute terms. I would say it's about on par with other ecosystems that have put work into tooling. The interesting thing is that CL tooling is mostly (entirely?) hobby projects, whereas, for example, stuff like ASan, DTrace, or rr, which let you replicate these workflows in other languages, are multimillion-dollar projects. So CL lets people be very productive when building tooling around the language, because it provides the tools to build tooling.

If you're interested in what the debugger workflow looks like, this is a good series walking through it: https://malisper.me/debugging-lisp-part-1-recompilation/