You are viewing archived messages.
Go here to search the history.

Luke Persola 2021-08-24 16:59:15

What minimal, universal AST specifications exist? Which are best?

Software that deals with code in multiple languages usually needs to handle that code in a common format. But trying to capture complete language semantics for all languages in a single format seems hopeless. Instead, we’re faced with a trade-off between the variety of languages we represent and the thoroughness of the representation. On one end, text data is essentially entirely universal (caveat: encodings), but is very low level. On the other extreme, a given language compiler’s AST format fully represents semantics, but only for that one language. There are also some complete AST formats for families of closely related languages. Language-independent compiler IRs like the LLVM IR represent code in many different languages, but I think their usefulness is limited because they require irreversible transformations (they always would, right?).

I’m interested in the spot just above text: interface specifications that can represent code as basic data structures for nearly all languages, with little to no specification of semantics. So generally little more than a tree spec with room left for language-specific data.

I’m only aware a few specifications like that:

  • as a baseline, s-expressions (not the textual format, just the concept of nested lists)
  • Babelfish’s UAST — they lost their web domain, so presumed dead
  • unifiedjs’s unist — I’m thinking of adopting this for my structure editor because its ecosystem seems healthy and my project is in JavaScript, but to date it’s largely used for markup languages

Do you know of any similar specifications? What are other projects using? Are you in a different spot on the trade-off spectrum?

Luke Persola 2021-08-24 22:22:47

On disk, I just store those data structures as JSON, but it could be any serialization language if you have a parser for it. unist (#3 above) has the same philosophy.

Kartik Agaram 2021-08-25 00:23:03

This might be more relevant to interop: https://dion.systems/blog_0002_roadmap.html

Luke Persola 2021-08-25 01:43:11

Kartik Agaram nice, seems like their doing something in this zone, I’m going to hop on their discord

Konrad Hinsen 2021-08-25 07:01:15

A bit of personal experience in this space: In the first iteration of my Digital Scientific Notation Leibniz (which is not a programming language, but still a formal language with syntax and semantics), I opted for an XML representation of the AST in order to permit processing Leibniz code more easily, without having to implement a parser. I picked XML over s-expressions because of better tooling support in various languages, and XML over JSON because of the richer structuring possibilities.

I had no intention of doing anything one could call universal, but then, XML or s-expressions may well be as close to universal as one can get: a tree structure. Now I wonder: is there anything beyond the tree structure that is universal about ASTs?

Konrad Hinsen 2021-08-25 13:27:56

Indeed. Forth and Factor (the only concatenative languages I know a bit) pose the same challenge for a universal AST as Common Lisp with its reader macros: how to handle languages that permit extending the parser?

Luke Persola 2021-08-26 02:37:18

@Kirill Chernyshov is the parser extension Konrad mentioned what you were suggesting there? (I thought for a second you meant the ASTs in those langs aren’t well represented as trees?)

Konrad Hinsen what could I search for learn more about that?

Kirill Chernyshov 2021-08-26 08:16:39

I meant to say that it is possible to have a stack of words and atoms as an another form of AST. I have been working on some sort of translator from clojure to concatenative language (in-house language based on joy) some time ago. And that was surprisingly nice.

Konrad Hinsen 2021-08-26 08:17:22

@Luke Persola Check out Reader Macros in Common Lisp, Parsing words from the Factor documentation, and PARSE from the Forth standard. All three are extension mechanisms for the language's parser, implemented in the language itself. Which means that a complete parser for either of these languages must include an implementation of the entire language.

In Common Lisp, this is not much of an issue except if you want to write a code processing tool in a different language. There is a well-defined representation of code after parsing, which is the level at which macros work. Moreover, reader macros are a rare in practice.

For Forth and Factor, parser extensions are very common and basically mean that there is no notion of an AST at all. As an example, consider the XML parser in Factor. It allows you to embed XML documents in Factor code, not as a string but as a syntax extension. Superficially, Factor code is a just a sequence of space-delimited words, but with embedded XML, it starts to look very different.

Luke Persola 2021-08-26 18:27:27

Konrad Hinsen thanks!

Kiril Videlov 2021-08-25 15:48:43

Hey! New to the community here (just intro’d myself in #introduce-yourself). Anybody interested in version control and coding workflows? In particular, the way Git is used industry wide at software companies. Here are some observations:

  • Despite the VCS being distributed, orgs use single source of truth
  • Units of collaboration are batches of change (PRs) and the smaller the better
  • Often code review is either nitpicking or shallow (“LGTM”)
  • “Getting” other people’s work-in-progress code (to help or just try it out) is laborious - (stashing own work, adding remote, fetching, checkout)

I am really into the idea of tight feedback loops when developing. Starting from how we help one another, feedback from CI all the way to actually shipping small, incremental changes to prod. I think there is a lot of power that could be unlocked for devs by continuous sync-ing of code in the cloud. WDYT?

Kiril Videlov 2021-08-25 19:38:30

Just did, thanks for sharing

Luke Persola 2021-08-26 02:46:07

@Kiril Videlov

By “single source of truth” here you mean having a master branch?

I think there is a lot of power that could be unlocked for devs by continuous sync-ing of code in the cloud.

So I guess the current best way to do this is having a culture where people try to commit smaller PRs more frequently, and frequently rebase their local changes on top of the shared branch. Are you imaging automated that a bit more to we don’t have to rely on individuals as much?

(I just read about the 2-way syncing so I think I understand now,)

Alexander Chichigin 2021-08-26 09:55:32

Also check out 3-way merge problem: https://tahoe-lafs.org/%7Ezooko/badmerge/simple.html

Alexander Chichigin 2021-08-26 09:56:21

Seems like the right way forward is a "patch algebra" like Darcs or better yet Pijul. 🙂

Alex 2021-08-28 07:59:35

One thing I've wondered about version control is what a system would look like that focussed on getting the code back to a previous state rather than the pervasive model of linear development towards some goal

Luke Persola 2021-08-28 20:16:26

So I kept coming back to

Normally people would just represent this change over time in the code itself

That is, why use something related to VCS as we know it for this purpose? I was thinking “we only ever actually run the most recent version of the code” and then realized that’s not true when I thought of migrating a cluster: then you have multiple versions of the same code running at once and (potentially) interacting with each other. Your system has to be thought of as including these different versions. Tangental but maybe relevant.

Alex 2021-08-28 20:19:11

It's not true when making music either. Making music is just as valid a use of code as making databases

Alex 2021-08-28 20:20:37

maybe more valid as we'll always make music.

Luke Persola 2021-08-28 20:24:18

Yeah, I’m all for artistic programming

Kartik Agaram 2021-08-28 20:26:14

More generally:

  • Drawing boundaries is an essential aspect of design. Boundaries allow you to make progress by making simplifying assumptions about a temporary area of concern. "Practical" programming often draws a box around "what's running in the computer right now". VCS conventionally lies outside that boundary.
  • Boundaries can also stifle creativity. Questioning boundaries is a super useful activity, as is happening in this thread.
Alex 2021-08-28 21:58:04

I think I disagree - creativity thrives off boundaries. Unbound creativity is noise

Alex 2021-08-28 21:59:14

Creativity is about pushing against and beyond boundaries but you still need the boundaries

Kartik Agaram 2021-08-28 22:00:11

I agree 🙂

Tak Tran 2021-08-28 23:48:57

As a hack, I wonder if you could use Continuous Integration (eg, github actions, CircleCI) to generate visualisations/aural representations, as you commit code (maybe generate new branches for units of time if need be)…so you can see snapshots of your past in whatever visual/aural form you’d like, to better “see” where you want to scrub back in time to. So to jump to a point in time, you’d just do git reset --hard [sha] . Interactive rebasing could work too, as long as CI regenerates the snapshots

Chris Knott 2021-08-29 08:35:01

If you are talking about live coding music like an algorave then I think you want something that somehow projects the code into a low dimensional space. In this space, "commits" would just be waypoint markers that are set down. So you could mark where your verse/chorus is then explore freely and get some sense of when you were getting back close to where you were.

While this appears ridiculous at first I've seen some really impressive dimensional reduction things on Chris Olah's blog. If you also limit yourself in terms of starting dimensionality this might work and could really add something to a live coding performance to see the path that the performer took.

I went to an algorave where there was a performance by two people editing the same code. They had two cursors and their edits showed up in different colours. It meant you could see exactly how each of them was remixing each others lines and how they were both contributing. It really added to it.

Kartik Agaram 2021-08-28 21:43:55

I have a problem I'd appreciate thoughts on, particularly from aficionados of visual programming. But it requires some setup, if you'll bear with me.. (see thread)

Kartik Agaram 2021-08-28 21:44:01

My Mu project (https://github.com/akkartik/mu) is building a computing stack up from machine code. The goal is partly to figure out why people don't do real prototyping more often. We all know we should throw the first one away, but we rarely do so. The hypothesis is that we'd be better about throwing the first one away if rewriting was less risky. By the time the prototyping phase ends, a prototype often tacitly encodes lots of information that is risky to rewrite.

To falsify this hypothesis, I want to make it super easy to turn any manual run into a reproducible automated test. If all the tacit knowledge had tests (stuff you naturally did as you built features), rewriting would become a non-risky activity, even if it still requires some effort.

Turning manual tests into automated ones requires carefully tracking dependencies and outlawing side-effects. In Mu, functions that modify the screen always take a screen object. That way I can start out with a manual test on the real screen, and easily swap in a fake screen to automate the test. Which brings us to my problem:

How do you represent screens in a test?

Currently I represent 2D screens as 2D arrays of characters. That is doable a lot of the time, but complicates many scenarios:

  • Character attributes. If I want to check the foreground or background color, I currently use primitives like check-screen-in-bg which ignores spaces in a 2D array, but checks that non-spaces match the given background attribute. In practice this leads frequently to tests that separately check characters and then go back to check colors and other attributes.
  • Non-text. Checking pixels scales poorly, either line at a time or pixel at a time. I've tried ideas like separate buffers for text vs pixels, so that at least text continues to test easily.
  • Proportional fonts. Treating the screen as a grid of characters works only when each character is the same width. If widths differ I end up having to go back to treating characters as collections of pixels.
  • Unicode. A font like unifont (http://unifoundry.com/unifont/index.html) is mostly fixed-width, but lots of graphemes (e.g. Chinese, Japanese, Korean, Indian) still require double-width to render. That takes us back to the problems of proportional fonts.

Can people think of solutions to any of these bullets in a text-based language? Or a more powerful non-text representation?

Chris Knott 2021-08-29 08:53:58

My instinct would be to accept that you are ultimately going to be checking pixels and see how far you can get with that. For example, saving draw calls instead of raw pixel values, tracking dirty regions/some kind of copy-on-write. Combining some zealous saving of state with some amount of recalculating can be pretty efficient.

Kartik Agaram 2021-08-29 13:10:18

Oh I'm actually not thinking about performance. I meant it's hard to make tests that are self evident when they're about pixel events. Extremely long lines, etc.

Luke Persola 2021-08-29 20:26:37

So you’ve already decided you need to test for all these things, right? You’re just trying to figure out the best way to model the state of the screen to cover all these cases?

Kartik Agaram 2021-08-29 20:43:03

Yeah. It's even just the syntax. In a complex app drawing to screen what's the cleanest way to represent desired screen state in the interface and get nice error messages (that use that representation) on test failure.