I've been noodling for the umpteenth time on a representation for programs that reduces the need to "play computer". My post last night on #two-minute-week (https://futureofcoding.slack.com/archives/C0120A3L30R/p1600587602007800) triggered enough unexpected thinking together to get me to write up my recent attempts and try to trigger more.
We all simulate programs in our heads. The activity seems to break down into two major use cases:
Forward path: Extracting functions out of arbitrary computations.
Backward path: Imagining the execution of arbitrary computations containing function calls.
The forward path fits very well with ideas like starting with concrete examples and emphasizing data at all times. Nobody should ever have to start with a function definition. Instead, start with an example computation like `18 * 9/5 + 32`, and incrementally end up at a function like `celsius-to-fahrenheit`.
The backward path fits with various metaphors for debugging programs: debug by print, debug by step, time-travel debugging. A key concern is how to uncoil static computations (loops, recursion) into dynamic metaphors (traces, stack frames, interactive movements).
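The forward path can be sketched minimally in Python (the names and refactoring steps here are my own illustration, not anything from the thread): start from the concrete computation, then name the varying part.

```python
# Forward path: start from a concrete example computation...
example = 18 * 9 / 5 + 32  # the throwaway calculation comes first

# ...then name the varying input, turning the example into a function.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# The original concrete computation becomes the first test case.
assert celsius_to_fahrenheit(18) == example
```

The point is the direction of travel: the function definition is the *last* artifact produced, not the first thing typed.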
Postfix expressions fit beautifully with the backward path. As the demo of Brief (https://www.youtube.com/watch?v=R3MNcA2dpts) showed, execution is already quite uncoiled, with no backward jumps. While the Brief demo didn't show it (it's easy to spot where the presenter plays computer in their head), it's reasonable to imagine a way to drill down into function calls, replacing words with their definitions. By contrast, conventional expressions -- tree-shaped and using names -- immediately throw up impediments to understanding what happens first.
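Here's a tiny Python sketch (my own illustration, not Brief's implementation) of why postfix execution is "uncoiled": evaluation is a single left-to-right pass with no jumps, and the trace is just a snapshot of the stack after every word.

```python
# Evaluate a postfix program left to right, recording the stack after
# each word. No backward jumps: the trace is as linear as the source.
def run_postfix(program):
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack, trace = [], []
    for word in program.split():
        if word in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[word](a, b))
        else:
            stack.append(int(word))
        trace.append(list(stack))  # snapshot for the backward path
    return stack, trace

stack, trace = run_postfix("3 4 + 5 *")
# stack == [35]; trace == [[3], [3, 4], [7], [7, 5], [35]]
```

Reading `trace` top to bottom is exactly the "debug by step" experience, with no mental re-ordering needed.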
However, the forward path is thornier:
It's common to claim that point-free programs make it easy to factor out new definitions, but that's only true when the definition consists of consecutive words. Consider how you would go from `* 3 3` to a definition of `square`, or from `3 4 + 5 *` to a definition of `(a+b)*c`. After they're extracted, point-free functions are harder to understand. What does the stack need to look like at the start? How many items, of what types, how many get consumed: all these questions require simulating programs in your head. Or a manual comment.
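To make the difficulty concrete, here's a Python sketch (my own framing, under the assumption of a toy stack machine): the point-free definition of square contains a word, `dup`, that appears nowhere in the concrete computation, so extraction isn't a simple cut of consecutive words.

```python
# The concrete computation mentions 3 twice. The point-free definition
# ": square dup * ;" must instead duplicate its single input, so
# extracting it means inventing a stack shuffle (dup) that was never
# present in the original program.
def run(words, stack=None):
    stack = list(stack or [])
    for w in words:
        if w == "dup":
            stack.append(stack[-1])     # duplicate top of stack
        elif w == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            stack.append(int(w))
    return stack

concrete = run(["3", "3", "*"])         # [9]: two literal 3s
square   = ["dup", "*"]                 # point-free: one input, plus dup
factored = run(["3"] + square)          # [9]: same answer, new word
assert concrete == factored
```

The refactoring is semantics-preserving, but it isn't *textually* present in the example, which is exactly what makes the forward path thorny.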
This was the idea maze in my head until I saw LoGlo (https://loglo.app/2020-06-16). The cool idea here has taken me weeks to articulate: lines have names and get separate stacks. Forth typically writes to names within lines with words like `!`. Limiting definitions to one per line strikes me as an advance. And having names gives us a way to make Forth words less point-free. I start imagining flows like turning `* 3 3` into `* x x` using two 'rename' operations, and then turning the entire line into a new function. Like, imagine a UI with a code side on the left, and a scratch computation on the right:
│
│ x: 3
│ * x x
│
After defining a function it might look like this:
│
: sq x │ sq 3
* x x │
│
Notice how the definition of `x:` above gets replaced by the call to `sq` below. That's kinda pleasing.
But there are issues. This manipulation requires modifying definitions of free variables. Worse, I ended up with the function call in prefix order. Trying to make things consistent got me stuck up a tree with a 2D layout, until I noticed I'd lost the benefits of postfix that got me down this road in the first place. I'll include it here just in case it sparks ideas for others, but I'm starting to think it's a dead end.
Anyways, that's where I am, still looking for a representation that's easy to uncoil and where inlining function calls is a 'smooth' visualization.
📷 tree.png
Wait, do I just need to switch how I define names?
│ 3 :x
│ * x x
│
=>
: sq x │ 3 sq
* x x │
│
i have to read this in more detail later. you may be interested in http://www.nsl.com/k/xy/xy.htm, and https://hypercubed.github.io/joy/html/jp-flatjoy.html, two concatenative languages that eschew nested quotations (what I think you’re dealing with that got you to the prefix stuff).
xy lets every program have both a stack (representing… the stack) and a queue (representing the stream of tokens that are the rest of the program). It’s an interesting route around the problem you’re talking about, and I feel like it’s either exactly right or maybe the inverse of right
also: this got posted here a while ago and seems like an interesting route around the named variable problem: https://suhr.github.io/papers/calg.html
if im understanding the way you’ve written the example code-with-scratchpad above, the forthy way to write assignment would be : DEFINE X 3; which could be sugared down to :x 3 and also maybe punned to read as “put the symbol :x on the stack” ? i dont think that would work actually
Yeah, the idea is that naming is special syntax that doesn't sugar down to stack ops, and that happens outside the scope of any stack.
^^ that’s the same as forth. : enters immediate mode. dont really understand what that means but you stop interacting with the stack
I was referring to Forth's two kinds of names: definitions and variables (memory locations). `x: 3` is intended to replace `3 x !`, not `: define x 3 ;`.
oops i must have skimmed past that part in the stuff i’ve attempted to read on forth
i always spend time thinking about the define semantics because 1. the scoping is really weird and interesting and seems to “just work” even though they have a funky way of handling shadowing and 2. i think you can use the same thing for data if you just use quotations a la joy
Here's a diagram I scrawled of a more fully worked example (sum of squares). It shows 2 ways to imagine inlining. In option A we have a standard Forth, and inlining replaces words with their definitions. The stack stays the same, it just gets more intermediate steps.
In option B we still have a stack, but execution state also includes a namespace of values for (immutable?) variables. For example, adding `:x` to a line saves the top of the stack as the value of `x`. Every line starts with an empty stack, but can share data with previous lines via variables.
Now inlining shows a second stack in its own row. We might even want to expand the stack of the caller to fit the callee in, just one 'row' down, to show that it's an independent, isolated stack. (The `x` in `sq` is unrelated to the `x` in the caller since each function has its own namespace.)
I think both are equally uncoiled. The big benefit of option B to me is that the accidental complexity of stack manipulation (`swap` and `dup`) has been eliminated.
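Here's a sketch of option B's semantics in Python (my own reading of the diagram, assuming `:x` pops the top of the stack into a namespace shared across lines, and each line starts fresh):

```python
# Option B: each line starts with a fresh, empty stack. ":x" pops the
# top of the stack and saves it as x in a namespace shared across
# lines; a bare name pushes its saved value. Later definitions shadow
# earlier ones.
def run_lines(lines):
    env = {}
    stack = []
    for line in lines:
        stack = []                       # fresh stack per line
        for word in line.split():
            if word.startswith(":"):     # ":x" -> save top of stack as x
                env[word[1:]] = stack.pop()
            elif word == "+":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif word == "*":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif word in env:            # bare name -> push its value
                stack.append(env[word])
            else:
                stack.append(int(word))
    return env, stack

# Sum of squares with no dup or swap anywhere:
env, stack = run_lines(["3 :x", "4 :y",
                        "x x * :x2", "y y * :y2",
                        "x2 y2 +"])
# stack == [25]
```

Every line's stack depth stays shallow, so there's nothing to shuffle; the sharing between lines is carried entirely by names.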
we’re both in agreement that stack manipulation is bad and makes everyone feel bad for sure.
i’m not opposed to variables but i do think that B is more coiled—there’s shadowing in the example, which is hard for beginners (and sometimes annoying for experts), and you did draw a coil with your arrows to the inline function 😉.
id argue that B is more readable/comprehensible but also more coiled. at the very least, it would be harder to communicate visually where exactly x and y are coming from with the same clean and regular table view as you were using before. and the definition of anything using x or y now depends on your environment which means it can’t be factored seamlessly anymore
factor, joy, and other modern concatenatives use quotations and combinators to get around a lot of the shuffling, but combinators IMO are just as bad (the canonical example is called “bi”, and i can never remember whether it expects two quotes on top of the stack and applies each to the next two items respectively, returning them in order, or some other permutation. even trying to describe my misunderstanding in words is hard).
i definitely don’t have any immediate solutions yet but the comma product article i posted is one attempt, i think pattern matching on the stack itself could be another, and (weirdly) thinking of variables as almost like vim macros that replace with values in-place could be a third direction (like some shells let you expand, eg, how `rm *.txt` <tab> can expand the text at your prompt to `rm hello.txt readme.txt getting_started.txt` in a directory containing all those files). i think i have about 8 total options for this but my notecards aren’t with me and those are the ones i both know of and can remember off the top of my head
another option might be to have every stack also hold an environment, which is almost like an interface in that it just “expects” an x or y to be defined. so like your variable dictionary gets carried around with your stacks. that’s appealing to me, since i have the conspiratorial belief that variable assignment is just nested record modification that’s been hidden behind some weird syntax
Yeah, you're mirroring many of my own concerns.
One thing I want to point out: the shadowing you mentioned is essential to this idea. Basically I rely on shadowing to avoid making decisions about variable names. And once a function is defined, it can be invoked with entirely different names.
Kartik Agaram Could you say a few more words about what you dislike in defining `square` as `dup *`? I always liked that sort of thing in FORTH, which is clouding my understanding.
Me too! I always enjoy stack manipulation puzzles. This isn't about fixing something broken in Forth, but about seeing what part of Forth I can take with me to spreadsheets and example-oriented programming.
I'm probably still failing to internalize the previous discussion at https://futureofcoding.slack.com/archives/C5U3SEW6A/p1599496562080600, so this would be a great place for a rebuttal-by-screencast. Assume you've already run `(* 13 13)` (or its Forth equivalent) at your favorite REPL. How do you get from that expression to a definition of the function `square` in persistent form?
The assumption I'm making is that nobody calculates the square of 13 at the REPL by typing `13 dup *`. So it seems to me that we need some way to nudge people to massage `13 13 *` into a form that needs a single copy of the input(s).
It's just a thought experiment in the end. It's a frame of reference I'm taking on for the duration, and in the process I'm working against some of my own discomfort with spreadsheets and other 2D representations.
[September 7th, 2020 9:36 AM] garth: https://news.ycombinator.com/item?id=23811382 great post on the lisp REPL’s difference from other REPLs. sidenote: can someone explain to me how lisp programmers go from talking to the REPL to updating the code in their source files? in talks and stuff it always seems to be with copy and paste. is that accurate?
This is an interesting question. When I look at `(* 13 13)` I see `169`. Or, to be more clear, when I see a form that operates only on constants, I think of it as a constant itself. So I would only execute a form like that one in a Lisp REPL using `eval-and-replace` semantics to calculate some constant I want to embed in the code. If, on the other hand, I were planning to parameterize such a calculation I'd use a lambda (say, `#(* % %)` in Clojure notation), which lifts quite naturally to a `def` or `defn` if needed.
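The same progression reads naturally in Python too (my own transliteration of the Clojure workflow described above): constant form, then lambda, then a persistent named definition.

```python
# A form over constants is just a constant: evaluate and embed it.
area = 13 * 13            # "eval-and-replace": the code keeps 169

# Planning to parameterize? Reach for a lambda first...
square = lambda x: x * x

# ...which lifts naturally to a named, persistent definition.
def square_fn(x):
    return x * x

assert area == square(13) == square_fn(13)
```

Each step keeps the previous form runnable, so there's never a moment where you have to play computer to check the refactoring.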
BTW, although I understand that you aren't trying to "fix FORTH" in any sense, some of what came up in these last few threads reminded me of this paper that talks about adding various functional programming constructs to FORTH starting from the fewest possible combinators:
http://soton.mpeforth.com/flag/jfar/vol4/no4/article6.pdf
... and this classic about implementing a linear logic Lisp that compiles to FORTH:
https://hashingit.com/elements/research-resources/1994-03-ForthStack.pdf
this is vague and maybe irresponsible spitballing from a staying-on-topic perspective, but what if there was a kind of a stack shuffling operator, that split the stack n items down, and then pattern matched them back on top. so maybe you’d pass a series of symbols into it and transform the stack. eg `( n1 n2 -- n1 n1 n2 )`
would that break the ability to break apart programs anywhere? i’m not sure but i suspect not
Garth Goldwater you can do that in gforth with `{ n1 n2 } n1 n1 n2` syntax. (`{ ... }` introduces local bindings which you can reference at any point later.) I find it quite a useful feature when the word definition grows and juggles 3+ pieces of data
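A Python sketch of that idea (mine, not gforth's implementation; it assumes the rightmost binding names the top of the stack, as gforth locals do): split the stack n items down by binding names, then push back any pattern of those names.

```python
# Pattern-matched stack shuffle: bind the top len(bindings) items to
# names (rightmost name = top of stack), then push an arbitrary
# pattern of those names back on. "{ n1 n2 } n1 n1 n2" duplicates the
# deeper of the two items.
def shuffle(stack, bindings, pattern):
    n = len(bindings)
    names = dict(zip(bindings, stack[-n:]))   # bind top n items in order
    return stack[:-n] + [names[p] for p in pattern]

# ( n1 n2 -- n1 n1 n2 ), with an untouched 7 below:
assert shuffle([7, 3, 4], ["n1", "n2"], ["n1", "n1", "n2"]) == [7, 3, 3, 4]
```

Arbitrary swaps and dups all become the one operation, e.g. `shuffle(stack, ["a", "b"], ["b", "a"])` is `swap`.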
One advantage of `dup` and shuffling is that they force you to pay the cost of non-linearity. Since named variables give you non-linearity for free, reference hydras sprout up easily.
I guess my bigger qualm is that concatenative programs tend to feel less concatenative and more conventionally referential as they get larger. How can one keep the concatenative spirit in bigger programs?
Maybe some kind of dynamic scope? Instead of one stack, each dynamic variable names the top of a stack.
With queues instead of stacks, you get to Lucid. Concatenative reactive programming, here we come!
lucid is really interesting, thanks for the reference! wish there was more than a short wikipedia article and a ~270 page book
that also led me to lustre, which has a feature i didn’t know i wanted: you name the variable you return in the function signature
seems obvious in retrospect since we basically do that with modules when we mark things as `public` or `export`
i just like it as like an outline—“heads up, this is the variable we’re going to finish with”. i guess it’s almost like a mental hyperlink to the answer to the question “ok... what are we doing this all for?”