Amateur Hour

What I've Been Reading (Aug 2022)

Or a short log of things I read last month.
Sep 4, 2022. Filed under Musings
Tags: reading

Books

A Thousand Acres

https://www.amazon.com/Thousand-Acres-Novel-Jane-Smiley/dp/1400033837

A Thousand Acres is basically an adaptation of King Lear, set in a farming community in rural Iowa. It’s a deliberate reinterpretation, portraying Larry / King Lear much less sympathetically (and trying to humanize the “evil” daughters Ginny / Goneril and Rose / Regan). The ending is also fairly different (although still quite bleak).

Overall, I enjoyed the book – it was well-written and engaging, but there was nothing that made me really fall in love with it. I did learn a surprising amount about farming though (although I question how accurate the depiction of rural life was, given that the author doesn’t seem to be from a farming background themselves). For instance, I had no idea that you needed to install drain tiles (essentially underground pipes that drain ground water) into fields to allow heavy machinery to work.

One interesting thing about the experience is that when I read the book’s summary (which references “dark truths” about the family that are brought to light), I immediately jumped to the idea that “the father (King Lear) sexually abused his daughters” (although I wasn’t certain whether Cordelia or Goneril + Regan were abused). It turns out I was correct (Goneril + Regan were raped in this telling), but I’m not entirely sure what this says about me or our literary culture (especially since A Thousand Acres is from 1991 and not particularly contemporary).

Articles / Blog Posts

Papers

Birds and Frogs

https://www.ams.org/notices/200902/rtx090200212p.pdf

This was a nice, fun paper contrasting two different “types” of thinkers: “frogs” who get down-and-dirty with very specific problems and “birds” who soar above and try to connect different areas of mathematics together. Dyson goes through a series of famous mathematicians/physicists, trying to classify them as either birds or frogs and arguing that both are necessary for forward progress. I found this paper fun and breezy to read (with many historical tidbits I wasn’t familiar with, like von Neumann’s “warmed up soup” lecture).

Photon: A Fast Query Engine for Lakehouse Systems

https://doi.org/10.1145/3514221.3526054

This paper describes Databricks' new (proprietary) query engine for Apache Spark, built for “lakehouse” architectures (i.e. architectures where queries are executed against “raw” data (e.g. Parquet files on S3) that hasn’t been cleaned, indexed, or curated).

This was definitely more of an engineering-oriented paper – I don’t think there was anything super novel on the design side (this seemed like a fairly “standard” C++ vectorized query engine with micro-adaptivity a la Vectorwise). The interesting meat of the paper was the actual engineering experience: things like “how to get the JVM and C++ memory management to play well?”, “how to ensure compatibility with existing Spark operators?”, or “what was our testing strategy?”.

The authors' arguments about why they chose a vectorized query engine instead of a query compiler were particularly interesting, since Databricks had already built a Spark query compiler before. Some of the tradeoffs are pretty similar to the ones noted in Kersten et al.’s bakeoff, namely that:

I do wonder how true these claims would be if there were somebody w/ expertise in JIT compilation (a la V8 or Graal) on the team, since those projects have to deal with pretty much all of these issues. I always assumed that this kind of thing was deep black magic, but maybe that’s just a measure of my own ignorance rather than any objective sense of complexity?

Although I thought these kinds of engineering considerations were really well written + explained, I found the experimental evaluation to be less satisfying – Databricks mostly benchmarked against fairly artificial micro-benchmarks. It would’ve been nice to dig into the TPC-H and TPC-DS results a little more and pinpoint (1) where time is still being spent in Photon, (2) how things stack up against an old-school “data warehouse” type architecture, and (3) how crucial the adaptive runtime is to the speed-ups. My impression was that the “sparsity adaptation” (e.g. adapting based on how much data is NULL) is super important, but I didn’t get a good sense of how common their “string” happy paths were (e.g. ASCII-only vs UTF-8, or UUID-specialized query paths), or whether there were other heuristics in store.

Fixes That Fail: Self-Defeating Improvements in Machine-Learning Systems

https://proceedings.neurips.cc/paper/2021/hash/619427579e7b067421f6aa89d4a8990c-Abstract.html

I would call this a “Machine Learning Engineering Vision” paper – I think it did a good job identifying a set of challenges for building real-world machine learning systems, but it didn’t go much further than that.

The problem setup is that, in practice, many machine learning systems are not trained end-to-end. Instead, you often have multiple independent machine learning algorithms that depend on each other’s outputs (e.g. a generic “embedding” model with different feature heads that are trained separately).

The authors introduce a decomposition of the “system-level” error into three parts:

The basic problem is that reducing the upstream error can actually make these errors worse, even after retraining the downstream model: for instance, the upstream model improvement might not be exploitable by the downstream model architecture (increasing the downstream approximation error), or the upstream model’s loss function might diverge from the downstream model’s loss function (increasing the upstream error).

They end with a self-driving car example, which I found a little artificial (although I’m also not familiar with the domain). They didn’t really introduce any solutions to the challenges presented here – the paper is more of a call-to-action.

Plor: General Transactions with Predictable, Low Tail Latency

https://doi.org/10.1145/3514221.3517879

This paper introduces a new concurrency control protocol for databases with an emphasis on lowering tail latency. Traditional 2PL can have poor throughput in low-contention scenarios due to lock overhead, while OCC can have extremely poor tail latencies since “unlucky” transactions can just endlessly retry. The paper introduces a “Pessimistic Locking Optimistic Reading” protocol that tries to blend the two ideas.

I found the paper’s description pretty hard to follow – I don’t know what it was, but I had to go through Section 4.1 three or four times before I felt like I really understood the protocol. I think of Plor as basically OCC, except we always try to abort the younger transaction (allowing the older transaction to make progress). They do this by (1) using locks to eagerly resolve write-write conflicts during the initial OCC “read” phase (using wound-wait) and (2) recording which transactions are reading/writing to each row so we know which transactions to abort when checking for conflicts.
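To make the “always favor the older transaction” rule concrete, here’s a minimal wound-wait sketch as I read it (my own toy code, not the paper’s implementation):

```python
# Toy wound-wait tie-breaker: on a write-write conflict, the older transaction
# always wins, so only younger transactions ever abort and retry.
def on_write_conflict(requester_ts, holder_ts):
    """Timestamps double as ages here: a smaller timestamp means an older txn."""
    if requester_ts < holder_ts:
        return "wound holder"      # older requester aborts the younger lock holder
    return "requester waits"       # younger requester queues behind the older holder

print(on_write_conflict(requester_ts=5, holder_ts=9))   # wound holder
print(on_write_conflict(requester_ts=9, holder_ts=5))   # requester waits
```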

Although this was an interesting idea, I didn’t find this paper convincing on a number of levels:

HINT: A Hierarchical Index for Intervals in Main Memory

https://doi.org/10.1145/3514221.3517873

This was a nice, sweet paper about intersection queries for contiguous ranges – basically, given a set of \([a, b]\) intervals, we want to find all intervals that intersect some \([l, u]\) query.

The image below is the key idea: given a \([0, 2^m]\) domain, we can recursively (and hierarchically) partition the space into “subintervals”. Each subinterval \(P_{l, f}\) stores the set of intervals whose integer bounding box is fully contained within \([f \cdot 2^{m-l}, (f + 1) \cdot 2^{m-l}]\). Note that any interval we want to store is going to be contained by at most 2 subintervals per level.

HINT partitioning diagram

To execute queries, we just loop through all “subpartitions” that intersect our query interval and check the intervals they store.
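Here’s a toy sketch of the flavor of the idea (my own simplification – each interval is stored only in the deepest partition that fully contains it, with none of the paper’s optimizations):

```python
# Toy hierarchical interval index over the integer domain [0, 2**m - 1].
class ToyHierarchicalIndex:
    def __init__(self, m):
        self.m = m
        self.partitions = [dict() for _ in range(m + 1)]  # level -> {f: [(a, b), ...]}

    def insert(self, a, b):
        # Store [a, b] in the deepest partition that fully contains it.
        for level in range(self.m, -1, -1):
            width = 1 << (self.m - level)                 # partition width at this level
            if a // width == b // width:                  # [a, b] fits in one partition
                self.partitions[level].setdefault(a // width, []).append((a, b))
                return

    def query(self, lo, hi):
        # Visit every partition overlapping [lo, hi] at every level.
        out = []
        for level in range(self.m + 1):
            width = 1 << (self.m - level)
            for f in range(lo // width, hi // width + 1):
                for (a, b) in self.partitions[level].get(f, ()):
                    if a <= hi and lo <= b:               # interval intersects query
                        out.append((a, b))
        return out

idx = ToyHierarchicalIndex(m=4)                           # domain [0, 15]
for interval in [(2, 5), (7, 7), (0, 15)]:
    idx.insert(*interval)
print(idx.query(4, 8))                                    # [(0, 15), (2, 5), (7, 7)]
```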

The rest of the paper describes how to generalize this scheme to arbitrary domains (essentially, define a remapping function into \([0, 2^m-1]\)), how to choose the optimal \(m\) (trading off the # of subintervals to visit against the # of stored intervals per subinterval you need to handle), and some implementation optimizations.

I didn’t read the performance evaluations super closely since I’m not familiar with the evaluation norms here, but the evaluation seemed detailed, plausible (a mix of real-world and synthetic data on a variety of conditions), and impressive (much faster queries than competitors at the cost of slower updates). Overall, I thought this was a neat idea!

Finding real bugs in big programs with incorrectness logic

https://doi.org/10.1145/3527325

This paper goes through some details about actually applying incorrectness logic to real-world static analysis.

Incorrectness logic is sort of the “dual” of Hoare logic – in Hoare logic, you have {P} C {Q}, which roughly means “if P is true and code C executes, then Q must be true”, while in incorrectness logic you have [P] C [Q], which means “if P is true and C executes, then Q can be true”. Hoare logic is often used for formal verification (e.g. Q is roughly “implemented correctly”), but the authors argue that incorrectness logic is more natural for bug-finding static analysis (Q is something like “is a bug”).

I don’t understand the type theory enough to fully grok the details of the analysis, but it roughly follows the shape of the “compositional analysis” described here. I think the interesting part for a non-specialist was the distinction between “latent bugs” and “manifest bugs”. Suppose you find a bug that “could” occur based on your analysis – how do you know whether this is a “real” bug or a missing pre-condition you didn’t know about (e.g. an undocumented assumption that a pointer is not null)? The authors argue that the “right” answer is based on social reasons (what do software engineers expect in practice?) and come up with some heuristics here.
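To make the contrast concrete, here’s my own toy example (loosely stated, and not from the paper):

```python
# A function with an obvious reachable failure:
def head(xs):
    return xs[0]          # raises IndexError when xs == []

# Hoare-style triple (over-approximate):
#   {xs is a non-empty list}  head(xs)  {no exception}
#   -- every execution from a state satisfying the precondition is safe.
#
# Incorrectness-style triple (under-approximate):
#   [xs == []]  head(xs)  [IndexError is raised]
#   -- the erroneous outcome is genuinely reachable, so a tool reporting it is
#   not emitting a false positive. Whether it counts as a "manifest" bug or
#   just a missing precondition ("callers never pass []") is exactly the
#   latent-vs-manifest question above.
```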

Decentralized Society: Finding Web3’s Soul

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4105763

This is a Web3 paper introducing the idea of “soulbound” tokens. A “soul” is basically a single wallet / digital entity, and a soulbound token is a token that can’t be transferred (only given out / revoked). The big idea is to use soulbound tokens to represent community identities (for instance, a school “soul” could issue soulbound tokens to its alumni) – if there are enough souls with overlapping communities, this could plausibly let you capture lots of social information and expose it to smart contracts (e.g. to distinguish real humans from bots).

I thought this was definitely a clever idea that opens up some interesting possibilities, but I was less convinced that any of the suggested applications were actually good ideas as-is. Like most of the work on Web3 here, this feels very much like a hammer looking for a nail. Like, these are definitely a cool set of ideas and I could totally believe that the “next big thing” will build on top of these, but I’m still sort of waiting for an actual concrete use-case to use this for (that isn’t self-referentially motivated by interoperability on some blockchain).

Enabling the Next Generation of Multi-Region Applications with CockroachDB

https://doi.org/10.1145/3514221.3526053

This was a neat paper about how CockroachDB built globally distributed replication into their database. This was a very well-written paper – I found it extremely well-motivated and easy to follow.

In particular, every table can be configured with different survivability configurations (e.g. SURVIVE ZONE FAILURE) and locality settings (GLOBAL, REGIONAL BY ROW, REGIONAL BY TABLE). Survivability settings determine how data is replicated across nodes, while the locality setting determines how the data is treated by the query optimizer (GLOBAL tables get low-latency reads from all regions at the expense of slower writes, while REGIONAL rows get low-latency reads from the “home” region at the cost of higher-latency reads everywhere else).

Obviously, building this information into the database reduces the burden on the application programmer, but it also allows CockroachDB to automatically enforce “best practices” (e.g. enforcement of global unique constraints) and to exploit region properties during query planning (e.g. minimizing data movement by exploiting UNIQUE constraints). I’m not sure they did a good job establishing that these settings are “sufficient” for all geo-replication use-cases, but the paper did a good job arguing that they work for at least a substantial portion of use-cases.

Formality Considered Harmful: Experiences, Emerging Themes, and Directions on the Use of Formal Representations in Interactive Systems

https://doi.org/10.1023/A:1008716330212, or accessible at https://www.csdl.tamu.edu/~shipman/papers/cscw.pdf

This is an HCI paper about the mismatches between human needs and the “formality” that most computer systems require. I think the key quote here is:

When formalisms are embedded in computer systems, users often must engage in activities that might not ordinarily be part of their tasks: breaking information into chunks, characterizing content with a name or keywords, categorizing information, or specifying how pieces of information are related.

The paper goes through a couple examples of this – although the specifics are fairly dated (the paper was written in 1999), analogies to modern systems are pretty obvious (who hasn’t dealt with “what folder hierarchy should I use” or “should I set up a tagging system”?).

The paper did a good job describing the problem, but I found the suggested solutions to be unhelpful. Basically, the suggestions boil down to “make sure to use the ‘right’ formalisms”, “see if you can make formalization incremental”, and “see if you can rely on informal organization (e.g. spatial organization, or textual search)”, but their examples of each were either (1) not very detailed or (2) too rooted in the 1990s for me to really understand the suggestion. It’d be interesting to see a refreshed version of this paper with more modern references + examples.

Relational E-matching

https://doi.org/10.1145/3498696

This is my favorite type of paper: something that applies a technique from one field to a problem in a separate field and gets large speedups.

E-graphs are data structures for representing equivalence classes of function calls, and can be used for compiler optimizations (google equality saturation). One key part of this is “e-matching”, which is essentially pattern matching on e-graphs (e.g. X - X => 0). This paper makes two observations:

  1. E-graphs can be easily represented as relational databases
  2. E-matching essentially reduces to conjunctive queries on these databases.

Conjunctive queries on relational databases are super well-studied, so this opens up a bunch of techniques. In particular, the authors apply the “generic” worst-case optimal join algorithm to actually solve these conjunctive queries, which is (as the name implies) asymptotically optimal. Unsurprisingly, this gives large speed-ups over the previous state-of-the-art.
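Here’s a toy sketch (mine, not the paper’s implementation) of what those two observations buy you, using the X - X pattern above:

```python
# The e-graph's subtraction e-nodes stored as a relation; e-matching the
# pattern  X - X  then becomes a conjunctive query (here just a self-equality
# filter over one relation; multi-node patterns become multi-way joins).
#   Sub(root, left, right)  -- one row per "-" e-node, columns are e-class ids
Sub = [
    (10, 1, 1),   # e-class 10 contains (e1 - e1)  -> should match X - X
    (11, 1, 2),   # e-class 11 contains (e1 - e2)  -> should not match
    (12, 3, 3),   # e-class 12 contains (e3 - e3)  -> should match
]

# Conjunctive query:  Q(root, X) :- Sub(root, X, X)
matches = [(root, left) for (root, left, right) in Sub if left == right]
print(matches)    # [(10, 1), (12, 3)]
```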

I think this paper was very well done, but I wish they went further on a number of axes:

A tree clock data structure for causal orderings in concurrent executions

https://doi.org/10.1145/3503222.3507734

This paper introduces a “tree clock”, which is a faster version of a vector clock. The key idea is that tracking the “provenance” of information lets you “skip” some updates. For example, suppose we have the following three operations:

  1. \(\mathbb{C}_2 = \mathbb{C}_1 \oplus \mathbb{C}_2\)
  2. \(\mathbb{C}_3 = \mathbb{C}_2 \oplus \mathbb{C}_3\)
  3. \(\mathbb{C}_3 = \mathbb{C}_1 \oplus \mathbb{C}_3\)

Then the 3rd join operation is entirely redundant – because \(\mathbb{C}_1\) hasn’t been updated since operation 1, any “new” information that it carries must have already been transmitted to \(\mathbb{C}_3\) in step 2! A tree clock essentially takes this idea and generalizes it to more complicated cases (essentially you form a “tree” of provenances which track from which clocks you learned about which information). Thus, while vector clock operations generally take \(O(k)\) time (where \(k\) is the # of threads), tree clocks can be asymptotically faster.
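As a minimal sketch of the redundancy (my own toy vector-clock code, ignoring local increments – this is not the paper’s tree clock):

```python
# Pointwise-max join: dst absorbs everything src knows. Always O(k) for a
# plain vector clock, even when the join adds no new information.
def join(dst, src):
    return [max(d, s) for d, s in zip(dst, src)]

k = 4                       # number of threads
C1 = [3, 0, 0, 0]
C2 = [0, 5, 0, 0]
C3 = [0, 0, 7, 0]

C2 = join(C2, C1)           # step 1: C2 = C1 ⊕ C2  ->  [3, 5, 0, 0]
C3 = join(C3, C2)           # step 2: C3 = C2 ⊕ C3  ->  [3, 5, 7, 0]
C3 = join(C3, C1)           # step 3: redundant -- C1 hasn't changed since
                            # step 1, so everything it knows already reached
                            # C3 via C2; a tree clock can skip this work.
print(C3)                   # [3, 5, 7, 0]
```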

I thought the idea was interesting, but found this paper to be pretty unsatisfying on a number of levels:

Handoff strategies in settings with high consequences for failure: lessons for health care operations

https://doi.org/10.1093/intqhc/mzh026

This paper is essentially a summary of a series of interviews about “handoffs” (transitions between shifts) in “high-stakes” settings (e.g. nuclear reactors, ambulance dispatch, etc.), trying to synthesize “common” strategies across them. Although I found it informative, I feel like more things would’ve stuck out to me if I were actually involved in an on-call rotation or consequence-heavy operations. As-is, most of the suggestions seemed pretty common-sensical and I don’t think I really remembered much. Someone more in-the-trenches would probably have a better sense of what’s actually important / distinctive and what’s just “obvious”.

Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics

https://doi.org/10.1257/jel.20191597

This was basically an economist comparing and contrasting the Rubin potential outcomes framework for causal inference with Pearl’s DAG framework. Imbens definitely comes down on the side of the potential outcomes framework, for a couple of reasons:

As someone who’s never done much causal inference work and has only really encountered the DAG framework, this was a super interesting read. Of these, I think only the 3rd is a technical weakness of DAGs (although it’s not clear to me how potential outcomes handles feedback cycles better) – the rest of the criticisms seem pretty cultural. That’s not to say they’re illegitimate of course – I definitely sympathize with the “toy” nature of most DAG introductions (which has also matched my experience). I think point 1 might also be an interesting commentary on “users” vs “developers” here – this strikes me as similar to how people treat statistics: while a lot of statisticians I know want to build fancy generative models for inference, most people seem to get by with cookie-cutter t-tests.

I hope the DAG and potential outcomes communities grow closer in the future, though, and harmonize their notation + jargon – as a total outsider to causal inference, it’s pretty frustrating to have to navigate 3 different frameworks (DAGs for computer scientists, potential outcomes for economists, and marginal structural models / G-models for epidemiologists), and it’d be great to have some kind of paradigm unification (if only to make it easier to understand what’s going on).

Understanding and exploiting optimal function inlining

https://doi.org/10.1145/3503222.3507744

This was a really fun paper, with a very simple idea: brute-force all possible inlining decisions and check how close Clang’s -Os setting gets to the smallest possible size. Naively, if you have a call graph of \(E\) edges there are \(2^E\) possibilities, which is clearly way too many to search for even mildly complicated programs. But the key insight is that if you can “split” the call graph into two components with a single edge between them (where \(E_1\) and \(E_2\) are the number of edges in the two components), then you only need to search \(2^{E-1} + 2^{E_1} + 2^{E_2} + 1\) possibilities (\(2^{E-1}\) possibilities if you inline that edge and \(2^{E_1} + 2^{E_2} + 1\) if you don’t). This recursive partitioning turns out to be “good enough” that you can actually search mildly complicated programs exhaustively (the authors chose the SPEC2017 benchmark suite).
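As a quick back-of-the-envelope check of that count (with made-up component sizes, just to see the numbers):

```python
# Counting configurations for one hypothetical split around a single bridge
# edge, using the rule above: 2^(E-1) if the bridge edge is inlined, plus
# 2^E1 + 2^E2 + 1 if it is not.
E1, E2 = 10, 12                                   # made-up component sizes
E = E1 + E2 + 1                                   # total edges incl. the bridge
naive = 2 ** E                                    # try every subset of edges
split = 2 ** (E - 1) + 2 ** E1 + 2 ** E2 + 1      # split the search instead
print(naive, split)                               # 8388608 4199425
```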