Show HN: C discrete event SIM w stackful coroutines runs 45x faster than SimPy

submitted by ambonv+(OP) on 2026-02-03 16:09:07 | 68 points 18 comments
[view article] [source] [go to bottom]

Hi all,

I have built Cimba, a multithreaded discrete event simulation library in C.

Cimba uses POSIX pthread multithreading for parallel execution of multiple simulation trials, while coroutines provide concurrency inside each simulated trial universe. The simulated processes are based on asymmetric stackful coroutines with the context switching hand-coded in assembly.

The stackful coroutines make it natural to express agentic behavior by conceptually placing oneself "inside" that process and describing what it does. A process can run in an infinite loop or just act as a one-shot customer passing through the system, yielding and resuming execution from any level of its call stack, acting both as an active agent and a passive object as needed. This is inspired by my own experience programming in Simula67, many moons ago, where I found the coroutines more important than the deservedly famous object-orientation.

Cimba turned out to run really fast. In a simple benchmark, 100 trials of an M/M/1 queue run for one million time units each, it ran 45 times faster than an equivalent model built in SimPy + Python multiprocessing. The running time was reduced by 97.8 % vs the SimPy model. Cimba even processed more simulated events per second on a single CPU core than SimPy could do on all 64 cores.

The speed is not only due to the efficient coroutines. Other parts are also designed for speed, such as a hash-heap event queue (binary heap plus Fibonacci hash map), fast random number generators and distributions, memory pools for frequently used object types, and so on.

The initial implementation supports the AMD64/x86-64 architecture for Linux and Windows. I plan to target Apple Silicon next, then probably ARM.

I believe this may interest the HN community. I would appreciate your views on both the API and the code. Any thoughts on future target architectures to consider?

Docs: https://cimba.readthedocs.io/en/latest/

Repo: https://github.com/ambonvik/cimba

NOTE: showing posts with links only show all posts

>>anemat+JM
It is an assembly function that does not get called from anywhere. I pre-load the stack image with its intended register content from C, including the trampoline function address as the "return address". On the first transfer to the newly created coroutine, that gets loaded, which in turn calls the coroutine function that suddenly is in one of its registers along with its arguments. If the coroutine function ever returns, that just continues the trampoline function, which proceeds to call the coroutine_exit() function, whose address also just happens to be stored in another handy register.

https://github.com/ambonvik/cimba/blob/main/src/port/x86-64/...

>>ambonv+(OP)
How does this compare to Mojo?

Edit: nevermind. I answered the question for myself w/ vibe coding: https://pastes.io/mojo-event.

Workers: 1 | 2 | 4 | 8

Time: 12.01s | 8.01s | 5.67s | 5.49s

Events/sec: 16.7M | 25.0M | 35.3M | 36.4M

Obviously just a first pass & not optimized b/c I'm not a mojo expert.

>>measur+I81
I am not familiar with Mojo, so I do not know.

Compared to the coroutine implementations I do know, none of them quite met what I as looking for. The «trampoline» has been mentioned. I also needed a calling convention that fit the higher-level process model with a self pointer and a context, and a signal return value from each context switch. It also has to be thread safe to survive the pthreads. Not very difficult to do, but needs to be designed in from the beginning.

Same thing with random number generators. It will not work very well to keep state between calls in a static local variable or some global construction, needs to be kept thread local somewhere. Not difficult, but needs to be there from the start both for the basic generator and for the distributions on top of it.

Quite a bit more here: https://cimba.readthedocs.io/en/latest/background.html#

zlacker

Show HN: C discrete event SIM w stackful coroutines runs 45x faster than SimPy