Michael Dales 2022-11-14 10:36:11

As promised in my intro, here’s a little bit of current thinking I wrote up about what feels like it will end up being a DSL that lets ecologists work on large datasets.

digitalflapjack.com/blog/yirgacheffe

Currently it’s a Python library that lets ecologists just work with large geospatial files as if they’re variables, like numpy does, but manages the memory side of things, as these files can quickly cause you to run out of memory even on a 1TB RAM machine like we have in the group.

The next thing I have in the wings is trying to hide Python multiprocessing support behind my library, for two reasons:

  • I’m already chunking up the computation, so throwing it out to many cores to get concurrency seems like a natural fit. Early on in my time here I tried to encourage the ecologists to think in terms of small programs that would then be run concurrently, but that didn’t really work for how they think of programming, so I think letting them think single threaded whilst I hide concurrency is probably easiest.
  • GDAL, the library used for a bunch of the actual geospatial transforms, is leaky as anything (or rather, its Python bindings are), and so pushing that into child processes is a win, as it’s the only real way to clean up properly.
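
A minimal sketch of the two points above, not Yirgacheffe's actual implementation: the work is split into row chunks and handed to short-lived child processes, so anything a native library leaks is reclaimed when the worker exits. The names `chunk_bounds`, `process_chunk` and `chunked_sum` are hypothetical, and the per-chunk workload is a stand-in for reading and transforming part of a raster.

```python
import multiprocessing


def chunk_bounds(total_rows, rows_per_chunk):
    """Return (start, stop) row ranges that cover total_rows."""
    return [
        (start, min(start + rows_per_chunk, total_rows))
        for start in range(0, total_rows, rows_per_chunk)
    ]


def process_chunk(bounds):
    """Compute a partial result for one band of rows (hypothetical workload)."""
    start, stop = bounds
    # Stand-in for pixel values: four "pixels" per row. A real worker would
    # open the raster itself so that leaked memory dies with the process.
    return sum(range(start * 4, stop * 4))


def chunked_sum(total_rows, rows_per_chunk, workers=2):
    """Sum all chunks using worker processes that are replaced after each task."""
    # maxtasksperchild=1 retires a worker after a single chunk, so memory
    # leaked while handling that chunk is returned to the OS on exit.
    with multiprocessing.Pool(processes=workers, maxtasksperchild=1) as pool:
        return sum(pool.map(process_chunk, chunk_bounds(total_rows, rows_per_chunk)))


if __name__ == "__main__":
    print(chunked_sum(total_rows=10, rows_per_chunk=3))
```

The caller still writes what looks like single-threaded code (`chunked_sum(...)`), while the chunking and process management stay hidden inside the library function.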

But at this point I think what I’m now doing isn’t a good fit for Python anymore, so I kind of envisage a Go backend, where I handle the concurrency side of things, and a small front end language where I let the ecologists reason about geospatial files as opaque blobs, and possibly have CSV as a natural thing.

But I also think this is a good fit for a visual programming language, where you connect the CSVs and geospatial files by the kind of operators you’d do normally, and then having that be the ecologist view on the world.

I guess my main aim is to try and not do everything though - if you imagine this was wildly successful, then all I’m doing is making yet another data-processing system that is specialised in one thing that someone will then want to do what I’ve done for some other metric down the line. So I think I want to deliberately keep this focussed/niche rather than accidentally drift into building something generic that will inevitably not be good for other purposes.

Kartik Agaram 2022-11-14 23:53:00

From github.com/carboncredits/yirgacheffe#basic-layer-usage:

elevation_layer = Layer.layer_from_file('elevation.tiff')
area_layer = UniformAreaLayer('area.tiff')
validity_layer = Layer.layer_from_file('validity.tiff')

# Work out the common subsection of all these and apply it to the layers
intersection = Layer.find_intersection(elevation_layer, area_layer, validity_layer)
elevation_layer.set_window_for_intersection(intersection)
area_layer.set_window_for_intersection(intersection)
validity_layer.set_window_for_intersection(intersection)

# Work out the area where the data is valid and over 3000ft
def is_munro(data):
    # 1.0 where the pixel is over 3000ft, 0.0 elsewhere
    return numpy.where(data > 3000.0, 1.0, 0.0)
result = validity_layer * area_layer * elevation_layer.apply(is_munro)

result_band = result_gdal_dataset.GetRasterBand(1)
result.save(result_band)

I don't follow why the area_layer is in the computation of result. Or is it perhaps not needed in the definition of intersection, just result? Are the multiplications over matrices (so the ranks need to line up)? Just trying to make sure I understand the example. I don't know anything about GDAL.

Konrad Hinsen 2022-11-15 08:07:31

Interesting stuff, I'll have to look at it in more detail (I am also in scientific computing). But a first reaction: isn't it surprising that ecologists have a hard time thinking in terms of small interacting programs? Isn't that very similar to an ecosystem in nature? In other words, shouldn't it be possible to present the situation in a way that they understand it?

Michael Dales 2022-11-15 09:46:40

Kartik Agaram Oh, I’d not read too much into the example in detail, this was derived from a real example we worked on. The area_layer is a GeoTiff that contains in each pixel the real area of the planet covered by that pixel. So if that last pair of lines was replaced with:

result.sum()

The number you’d get would be the square meterage of planet covered by munros. (Assuming validity layer is a layer that just covers Scotland 🙂)

For the actual work we used this tiff in another set of calculations. In that work we’re reasoning about area of habitat of species, so the result in that case is a GeoTIFF per species that has a 0 value where the species isn’t, and an area of the pixel if the species is there.
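
A small sketch of that arithmetic in plain numpy, assuming the layer multiplication is element-wise over aligned pixels, as the description of the per-species GeoTIFFs suggests. The 2x3 arrays are made-up stand-ins for the three layers, and `is_munro` is written here to return 1.0 above 3000 ft, which is what the area interpretation needs.

```python
import numpy

# Hypothetical 2x3 rasters standing in for the three GeoTIFF layers.
elevation = numpy.array([[2500.0, 3100.0, 3500.0],
                         [1000.0, 3200.0, 2900.0]])  # feet
area = numpy.array([[10.0, 10.0, 10.0],
                    [12.0, 12.0, 12.0]])             # square metres per pixel
validity = numpy.array([[1.0, 1.0, 0.0],
                        [1.0, 1.0, 1.0]])            # 1.0 where data is valid


def is_munro(data):
    # 1.0 where the pixel is over 3000 ft, 0.0 elsewhere
    return numpy.where(data > 3000.0, 1.0, 0.0)


# Element-wise product: a pixel keeps its area only if it is valid and a munro.
result = validity * area * is_munro(elevation)
total_area = result.sum()

print(result)
print(total_area)  # 22.0
```

Each output pixel is either 0 or the real area of that pixel, so summing the result gives the total area that passes both masks.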

Michael Dales 2022-11-15 09:55:19

Konrad Hinsen I think the problem is less that ecologists have a hard time thinking about small programs, they’re quite happy scripting away, but that the side effects of actions are not always apparent. This, trying to bring it around to the Future of Coding’s most recent episode, to me is why I don’t think you can “Do The Right Thing” as readily as people think - there’s some unexpected side effect going to bite you that you didn’t think of (unless you wrote your OS and application from the ground up in a provable language I guess, but I’m not that person 🙂)

It’s then less about capability, more about time to be experts? Knowing all those side-effects (if you run out of memory Linux will vomit everywhere and die) is a computerist’s job to understand. The interaction of animal species and habitats is for the ecologists to understand; I don’t know how to do that, for similar reasons to why they don’t know about lazy evaluation or such. But I don’t think it should be that an ecologist needs me all the time - I want to build a system that Does The Right Thing by having me encode the grotty resource management inside.

Konrad Hinsen 2022-11-15 15:18:04

Doing "the right thing" is possible only if the abstractions you implement are actually implementable with the available resources and for the intended applications. Otherwise you implement leaky abstractions, which is not helping clients that much.

Konrad Hinsen 2022-11-15 15:26:06

I read your post in the meantime... Some remarks on NumPy (I was part of the team that designed its predecessor, Numeric): the goal of Numeric (inherited by NumPy) was to provide 1) "APL embedded into Python" and 2) a Python interface to array data used in C extensions. APL, in turn, was originally designed to be a new mathematical notation for use by humans, and only later implemented on computers. This implies the "right thing" approach to arrays as a high-level data model. Back in those days (mid-1990s), scientific computing applications tended to be more CPU-limited than memory-limited, so we thought of an escape hatch for performance bottlenecks (C extensions, and later Cython), but not so much for memory bottlenecks. That was exactly what caused the fork called "numarray", which catered for large-memory use cases but turned out to be very inefficient for small arrays. NumPy re-united the two, at the cost of a more complex interface, but didn't extend "large memory" to "doesn't fit into memory" datasets.

None of that is clear from the documentation, as so often. So people like you discover the unwritten assumptions at some cost. Sorry!

Michael Dales 2022-11-16 18:32:38

Konrad Hinsen that’s super interesting, thanks for the insight! No need to apologise, I learned a lot along the way 🙂

Kiril Videlov 2022-11-17 10:46:10

I would like to share with you my open source project - Semantic Code Search ( github.com/sturdy-dev/semantic-code-search )

The tool lets you search code with natural language (you don’t need to know the exact keywords), for example:

  • ‘Where are API requests authenticated?’
  • ‘Saving user objects to the database’
  • ‘Handling of webhook events’
  • ‘Where are jobs read from the queue?’

It’s a command line app and is fully local, and it was so much fun to build (and train)

Lu Wilson 2022-11-18 07:45:02

hello everyone I did a little spatial-programming demo at the London meetup yesterday and fortunately, the second half of it got filmed! You can watch it here: youtu.be/bqtVv9ts29c

and it was great to meet so many more people at the meetup!

Kartik Agaram 2022-11-18 08:37:36

Spatial programming is a nice name for this. It reminds me of this project by David Ackley: movablefeastmachine.org

Michael Dales 2022-11-18 08:56:28

That’s wonderful! Is there somewhere we can learn more about what’s happening for those of us not there?

Lu Wilson 2022-11-18 09:00:08

Kartik Agaram hey yes! Dave is a huge inspiration! In the first part of the talk I showed some of my influences (including Dave's language SPLAT)

Lu Wilson 2022-11-18 09:05:29

@Michael Dales thanks very much! I've made a few videos about it (which may or may not answer questions)

Michael Dales 2022-11-18 09:07:21

Thanks! I love the elemental demo, that was just there in passing in your talk (at least in the bit on video) but expressed something lovely.

Lu Wilson 2022-11-18 09:27:13

thanks haha yes - there were a couple of references to my videos in the talk :)

Nilesh Trivedi 2022-11-18 14:18:39

Not related to coding, but I have long been annoyed by how hard it was to discover and curate the best learning resources of the Web. Google/YouTube/Wikipedia do a terrible job of it and so do universities/coursera/edX by never linking out to the Web (eg: 3Blue1Brown or SmarterEveryDay). I wanted to collect links to amazing videos, interactive explorables, books, research papers and organize them by topics, formats, difficulty level etc.

This is what I made: learnawesome.org

Some highlights:

  • There is a custom topic taxonomy and a zoomable treemap.
  • For books and papers, I make it easy to find them on SciHub/LibGen etc. When possible, resources like videos/wikis are embedded directly.
  • I am planning aggregate reviews to help you select a resource among many available. So, under "Sapiens", you will see that Bill Gates recommends it highly.
  • There is no server and no user accounts. Your bookmarks are kept in browser's localStorage only.
  • Above all, this is a work-in-progress. Expect breakage.

Would love to hear thoughts and feedback. 🙏

Leonard Pauli 2022-11-18 17:41:33

Neat! Have been missing work in this direction :)

Maikel van de Lisdonk 2022-11-20 15:21:06

Hi, I've finally found time to make a small video.. I am showing a flow that represents a very small CRUD application where I reconnect multiple connections at once, as well as connections being animated when they are retriggering (like when using a timer). Hope you like it! youtu.be/N8gIblu1dgs

Jarno Montonen 2022-11-20 16:10:57

interesting take on using the nodes as kind of a user interface. Do you see this as something that end users could use or more of a debugging feature?

Maikel van de Lisdonk 2022-11-20 16:45:52

Hi, no these flows are not for end-users. Although I think these can be used in the broad development process and not just by "flow developers", also for analysts and other roles.. of course the flows that are shown are very simple, I've already got examples based on real scenarios which are much more complex. In the near future I'll demonstrate how these flows can be translated to a UI for end-users