Flowing Images in the Editor

While building the math-conceal.nvim plugin, I needed a mechanism that could withstand heavy editing and remain theoretically defensible. The goal of the plugin is graphical formula rendering in Neovim: on one side, it preserves the standard ASCII-form math conceal and uses decoration-provider for finer-grained expansion; on the other, it places formula images back into the buffer as overlay conceal, making formula editing as smooth as possible.

[Figure: An example of formula images in math-conceal.nvim displayed along with the text flow.]

For a static buffer, this is not difficult. The classical mechanism is:

  • Scan the whole buffer with Tree-sitter and collect all nodes that need rendering;
  • Send the text in each Tree-sitter math node to the backend renderer and compile it into an image;
  • Use the location information carried by each Tree-sitter math node to place the image back where its source came from, via extmarks or an image protocol.
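
The three steps above can be sketched as a small model. This is only an illustration under stated assumptions, not the plugin's code: a regular expression over `$...$` stands in for the Tree-sitter query, `render` is a placeholder for the backend renderer, and the names `Node`, `scan`, `render`, and `place` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: a regex over `$...$` stands in for a real Tree-sitter math query.
MATH_PATTERN = re.compile(r"\$[^$\n]+\$")


@dataclass
class Node:
    row: int        # 0-based line index
    col_start: int  # 0-based, inclusive
    col_end: int    # 0-based, exclusive
    source: str     # text of the math node, delimiters included


def scan(lines):
    """Step 1: walk the whole buffer and collect every node to render."""
    nodes = []
    for row, line in enumerate(lines):
        for m in MATH_PATTERN.finditer(line):
            nodes.append(Node(row, m.start(), m.end(), m.group()))
    return nodes


def render(source):
    """Step 2: placeholder for the backend renderer; returns a fake image id."""
    return f"img<{source}>"


def place(nodes):
    """Step 3: bind each image to the location its node came from."""
    return {(n.row, n.col_start, n.col_end): render(n.source) for n in nodes}


lines = ["Let $x + y$ be given.", "Then $x^2$ and $y^2$ follow."]
placements = place(scan(lines))
```

Every location in `placements` is valid only for the exact text that was scanned, which is why this pipeline is enough for a static buffer and nothing more.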

This is entirely sufficient for a static buffer. For a dynamic buffer that is being edited, however, the problem becomes much harder.

The Flicker Window in Static Scanning

Given a static rendered configuration, we can roughly write:


            
buffer(i) = collection of { node_n(i): location + source + image binding }

After refreshing the configuration once, we need to examine the next version:


            
buffer(i+1) = collection of { node_n(i+1): location + source + image binding }

If we collect the formulas by rescanning the buffer every single time, every update must pass through a gap window in which the images flicker.

Here N₀ and B₀ denote the node and binding collections before scanning, and N₁ denotes the node collection after scanning.

Past approaches use something like a hashmap to shorten the flicker gap and reuse assets as much as possible: through cache hits, the bindings from before the scan are reattached to the nodes found after the scan.
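
A minimal sketch of that cache-based reattachment, assuming the cache is keyed by a hash of the node source. The helper names (`source_key`, `rebind`, `fake_render`) and the `(row, col)` location tuples are hypothetical and chosen only for illustration.

```python
import hashlib


def source_key(source):
    """Hash the node source so identical formulas share one cache entry."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def rebind(cache, new_nodes, render):
    """Attach images to freshly scanned nodes, rendering only on cache misses.

    cache:     dict mapping source hash -> image (the old bindings, keyed by source)
    new_nodes: list of (location, source) pairs from the new scan
    render:    callable source -> image, the expensive backend call
    """
    bindings = {}
    for location, source in new_nodes:
        key = source_key(source)
        if key not in cache:
            cache[key] = render(source)  # miss: no reusable image, render now
        bindings[location] = cache[key]  # bind the cached or fresh image
    return bindings


calls = []


def fake_render(source):
    """Stand-in renderer that records how often it is invoked."""
    calls.append(source)
    return f"img<{source}>"


cache = {}
b0 = rebind(cache, [((0, 4), "$x$"), ((2, 0), "$y$")], fake_render)
# The user inserts a line above: both formulas move but keep their source.
b1 = rebind(cache, [((1, 4), "$x$"), ((3, 0), "$y$")], fake_render)
```

In this run the renderer is called only twice, because the second scan hits the cache for both formulas. The new locations, however, still come from the full rescan, so the cache saves rendering work without telling us where an image belongs while the scan is in flight.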

The flicker problem can in fact be mitigated by reusing the rendered image from the previous version. The more serious problem comes from the misalignment between assets and source code.

It has the same origin as flicker: after an edit displaces a node, the source text has already changed by the time scanning starts, but because the new Tree-sitter span has not yet been computed, the new position of the image is unknown. This causes misalignment.

Unlike flicker, this problem is hard to solve through hacks unless we can predict the user's behavior. That is achievable only in limited scenarios, by intercepting user requests and similar means, and the result is extremely complex and extremely fragile. Under a large number of edit requests, it can also overload Neovim's main thread and cause lag.

We can roughly abstract the mathematical formula rendering of the entire buffer as a combination of two basic operations:

  • Editing a non-target node;
  • Editing the source code of a node.

In Typst, the target nodes are the math nodes for mathematical formulas and the code nodes for functions.

The second editing mode makes stale assets very hard to avoid: once the source code changes, the old asset is inevitably stale and must be regenerated by the renderer. We could, of course, let the old asset remain bound to the node until the new asset arrives, but since the node itself is also changing, this requires a series of heuristic judgments. Fortunately, whoever edits a formula is necessarily looking at its source, so while the source of a node is being edited we can simply leave the asset unbound until the cursor leaves the node.

The first editing mode is completely different: if the edit happens outside the formula, the asset itself has not changed. At this point, the most ideal mode should be that the asset always moves with its bound node, instead of being constantly rescanned, rematched, and remounted during editing. If we still rely on all kinds of heuristic behaviors to cover up flicker and geometric misalignment, then we have not solved the problem; we have only covered up this essential problem.
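
The split between the two editing modes can be sketched as a simple overlap test. This is a hedged illustration that assumes nodes are tracked as half-open `(start, end)` offset ranges; `classify_edit` is a hypothetical helper, not a function of the plugin.

```python
def classify_edit(node_spans, edit_start, edit_end):
    """Return indices of nodes whose source the edit [edit_start, edit_end) touches.

    An empty result means the first mode: the edit lies outside every node,
    so all assets stay valid and only need to move with the text.
    A non-empty result means the second mode for the listed nodes: their
    assets are stale and should stay unbound until editing leaves the node.
    """
    touched = []
    for i, (start, end) in enumerate(node_spans):
        # Overlap test on half-open ranges. A pure insertion
        # (edit_start == edit_end) counts only when strictly inside the node.
        if edit_start < end and edit_end > start:
            touched.append(i)
    return touched


spans = [(4, 11), (20, 25)]          # two math nodes in one buffer
outside = classify_edit(spans, 0, 0)     # typing at the start of the buffer
inside = classify_edit(spans, 6, 6)      # typing inside the first formula
across = classify_edit(spans, 10, 21)    # replacing a range that hits both
boundary = classify_edit(spans, 11, 11)  # typing right after the first formula
```

Only the nodes returned by the test need their images hidden and re-rendered; everything else falls under the first mode and should never be rescanned because of this edit.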

Remark
When using a coding agent for debugging, this behavior of covering up problems with heuristics often appears. This is not the agent’s fault; after all, when the architecture is not being substantially changed, this kind of heuristic operation is the cheapest. Thanks to the agent’s extremely high code-generation speed, in the short term it really can look as though the problem has been solved, but in the long term it leaves behind a mountain of “shit code”.

Image Assets Should Move with the Text Flow

The ideal approach should be this: for edits that do not change the source code of a node, treat the image asset as the same kind of resource as the text it is bound to. When a text editing event happens in the editor, they should update at the same time.

For a text editor, text display should naturally satisfy this capability; otherwise, it is simply not a qualified text editor.

For example, when I insert text at [cursor-position] in


            
Do not believe in [cursor-position]miracles; believing is itself a miracle

the subsequent “miracles; believing is itself a miracle” should be “pushed” backward, rather than overlapping with the text I just entered. A mature text editor must have this function; otherwise, the text editor is unusable.

This model completely coincides with the image model:

  • Unedited text corresponds to the image assets of unedited nodes;
  • "Editing other text makes unedited text move" corresponds to "editing other text makes the image assets of unedited nodes move".

Therefore, if we can obtain the editor's mapping for "changes in text positions during edit events", then converting each text-position into the corresponding node gives us the displacement mapping of nodes during edit events. Binding images to those nodes in turn gives us the displacement mapping of each node's image asset during edit events.

The whole process does not need to use position-computation resources beyond the editor’s own capabilities. Therefore, in theory, its update-speed ceiling is equal to the text editor’s own rendering-update ceiling.
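
The chain "text position, then node, then image" can be modeled in a few lines. The sketch below assumes flat character offsets and a single edit that replaces `old_len` characters at `edit_start` with `new_len` characters. All function names are hypothetical, and the `right_gravity` flag only imitates the idea of anchor gravity; it is not the editor's implementation.

```python
def shift_position(pos, edit_start, old_len, new_len, right_gravity=True):
    """Map an offset in the old text to its offset after one edit."""
    edit_end = edit_start + old_len
    if pos < edit_start or (pos == edit_start and not right_gravity):
        return pos                        # before the edit: unchanged
    if pos >= edit_end:
        return pos + (new_len - old_len)  # after the edit: pushed along
    return edit_start                     # inside a replaced range: collapse


def shift_node(span, edit_start, old_len, new_len):
    """A node moves by mapping both endpoints of its half-open span.

    The start is pushed by an insertion at its position and the end is not,
    so text typed directly after a node does not become part of the node.
    """
    start, end = span
    return (shift_position(start, edit_start, old_len, new_len, True),
            shift_position(end, edit_start, old_len, new_len, False))


def shift_bindings(bindings, edit_start, old_len, new_len):
    """Images are keyed by node span, so they move whenever the node moves."""
    return {shift_node(span, edit_start, old_len, new_len): image
            for span, image in bindings.items()}


old_text = "see $x^2$ here"
bindings = {(4, 9): "img<$x^2$>"}

# Insert "also " (5 characters) at the start of the text.
new_text = "also " + old_text
moved = shift_bindings(bindings, edit_start=0, old_len=0, new_len=5)
```

No rescan and no render call is involved: the image follows its node purely through the position mapping, and slicing `new_text` with the moved span yields the original formula source again.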

Remark
Protocols such as the kitty graphics protocol, which position images through placeholder characters, naturally support "managing image positioning the way text positioning is managed", so the update speed of image assets is basically equal to the speed of updating text. By contrast, for protocols such as the iTerm2 image protocol, which draw images directly through terminal escape sequences, image positioning is not naturally part of Neovim's text layout system. We need to attach and detach image assets manually, so performance is worse.

Returning to First Principles

This in fact determines a basic principle: we cannot locate mathematical nodes by directly parsing the source code. Directly parsing the source code leads to the mismatch phenomenon described above. What truly needs to be maintained is the geometric configuration inside the editor’s rendering flow, which is then consumed by the subsequent UI pipeline.

At the implementation level, we need the text editor to expose the mapping of text position changes during edit events, or at least to let us piece together such a mapping ourselves. Fortunately, Neovim has long provided an official means to express the behavior of elements moving along with text edit events: Neovim's extmarks give us geometric anchors that automatically evolve with text-flow edits, while events such as on_lines allow us to obtain the edited region for subsequent use.

However, knowing the text structure alone is not enough, because when we define nodes such as math and use them for image rendering, we inevitably need some semantic way to extract the code blocks whose positions may change during editing. Tree-sitter in fact has exactly this capability, and this too is a native Neovim capability; we actually knew this from the very beginning. But within the reasonable abstraction above, we finally put it in the right place, rather than trying to overstep and handle every editor event.

In this abstraction, the only thing we need to do is connect the two: semantically use Tree-sitter to preserve the structure of the nodes we need, and use extmarks in the editing flow to ensure that they evolve correctly along with text-flow edits. Monkey-Patch? No such thing. Heuristic search? Not needed. Hook user keystrokes? You have overdesigned it. A truly healthy design is never a pile-up of one script, logic block, and framework after another. Instead, before the ocean of details, it asks about the deep structure behind them and thinks: beneath these tangled and complex appearances, what is truly essential? When the tangled appearances are diminished and diminished again, the remaining “non-action” is precisely where much can be done.
