Ideally OTel would be more than observability, imo. Traces would be event sources, things that beget more computing. The system that observes computing should in turn be the system that reacts and responds to it, begetting more computing elsewhere. That's the distributed system I want to see: local agents reporting their actions, other agents seeing that and responding. OTel would fit the need well, if only we could expand beyond thinking of observability as an operator affordance and start thinking of it as the system to track computation.
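To make the idea concrete, here's a minimal sketch of that reactive pattern. It's modeled loosely on the `on_end()` hook of an OTel SDK SpanProcessor, but everything here (`Span`, `ReactiveSpanProcessor`, the attribute names) is hypothetical plain Python, not the real SDK:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Span:
    """Toy stand-in for an OTel span: a name plus attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)

class ReactiveSpanProcessor:
    """Dispatches ended spans to subscribed handlers ("agents")."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[Span], None]] = []

    def subscribe(self, handler: Callable[[Span], None]) -> None:
        self._handlers.append(handler)

    def on_end(self, span: Span) -> None:
        # In a real OTel SDK this hook feeds an exporter; here the same
        # signal also begets more computing: each handler reacts to it.
        for handler in self._handlers:
            handler(span)

# One agent reports a deploy; another agent reacts to it.
processor = ReactiveSpanProcessor()
triggered: List[str] = []
processor.subscribe(
    lambda s: triggered.append(s.name) if s.attributes.get("event") == "deploy" else None
)
processor.on_end(Span("svc-a.release", {"event": "deploy"}))
print(triggered)  # → ['svc-a.release']
```

The point of the sketch is just that the same stream operators watch on a dashboard could fan out to subscribers that do work.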
Querying unfortunately has lots of room for innovation, and it's really hard to nail down in a spec especially when the vendors all want to compete.
Being able to bring the whole application up locally should be an absolute non-negotiable.
This usually doesn't work that well for larger systems with services split between multiple teams. And it's not typically the RAM/CPU limitations that are the problem, but the amount of configuration that needs to be customized (and, in some cases, data).
Sooner or later, you just start testing with the other teams' production/staging environments rather than deal with local incompatibilities.
That's probably about the time when your development pace goes downhill.
I think it's an interesting idea to consider: If some team interfaces with something outside of its control, they need to have a mock of it. That policy increases the development effort by at least a factor of two (you always have to create the mock alongside the thing), but it's just a linear increase.
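A sketch of what that policy looks like in practice, assuming a made-up external dependency (`RateService` and `MockRateService` are illustrative names, not from the thread): the team that owns the interface ships an in-memory mock alongside it, and other teams depend only on the interface.

```python
from typing import Dict, Protocol

class RateService(Protocol):
    """The interface a team exposes to the outside world."""
    def convert(self, amount: float, currency: str) -> float: ...

class MockRateService:
    """The mock shipped alongside the interface, runnable locally."""

    def __init__(self, rates: Dict[str, float]) -> None:
        self._rates = rates

    def convert(self, amount: float, currency: str) -> float:
        return amount * self._rates[currency]

def checkout_total(amount: float, currency: str, rates: RateService) -> float:
    # Business logic depends only on the Protocol, so local dev wires
    # in the mock while production wires in the real client.
    return round(rates.convert(amount, currency), 2)

print(checkout_total(10.0, "EUR", MockRateService({"EUR": 1.1})))  # → 11.0
```

The "factor of two" cost is visible here: every change to `RateService` means touching both the real implementation and `MockRateService`, but the cost stays linear in the size of the interface.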
Either way, once the local version exists, then the job becomes maintaining all the infrastructure that lets you bring up the pieces, populate them with reasonable state and wire them into whatever the bits are that are being actively hacked-on.
Oh, absolutely. But at this point, your team is probably around several dozen people and you have a product with paying customers. This naturally slows the development speed, however you organize the development process.
> I think it's an interesting idea to consider: If some team interfaces with something outside of its control, they need to have a mock of it. That policy increases the development effort by at least a factor of two (you always have to create the mock alongside the thing), but it's just a linear increase.
The problem is, you can't really recapture the actual behavior of a service in a mock.
To give you an example: AWS ships DynamoDB Local, an in-memory mock database for testing and development. It has nearly the same functionality as the real service, but stores all the data in RAM. So its simulated global secondary indexes (something like table views in classic SQL databases) are updated instantly, whereas on the real database they're eventually consistent and can take a fraction of a second to update.
So when you put your service into production, it can start breaking under load.
Perhaps we need better mocks that also simulate the behavior of the real services: delays, eventual consistency, retries, and so on.
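A minimal sketch of what such a mock could look like, assuming made-up names (`EventuallyConsistentTable`, `query_index`): writes land in the base table immediately, but the simulated secondary index only catches up after a configurable lag, mimicking the real behavior instead of the instantly-consistent local mock.

```python
import threading
import time
from typing import Dict, Optional

class EventuallyConsistentTable:
    """Toy mock whose simulated GSI lags behind the base table."""

    def __init__(self, index_lag_s: float) -> None:
        self._lag = index_lag_s
        self._base: Dict[str, dict] = {}
        self._index: Dict[str, dict] = {}

    def put(self, key: str, item: dict) -> None:
        self._base[key] = item  # base table: immediately consistent
        # Index copy becomes visible only after the configured lag.
        threading.Timer(self._lag, self._index.__setitem__, (key, item)).start()

    def get(self, key: str) -> Optional[dict]:
        return self._base.get(key)

    def query_index(self, key: str) -> Optional[dict]:
        return self._index.get(key)

table = EventuallyConsistentTable(index_lag_s=0.05)
table.put("order-1", {"status": "paid"})
print(table.query_index("order-1"))  # almost certainly None: index hasn't caught up
time.sleep(0.1)
print(table.query_index("order-1"))  # now {'status': 'paid'}
```

A test suite running against this mock is forced to handle the read-after-write gap, which is exactly the class of bug the instantly-consistent mock hides until production.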
At BigCo we have migrated a number of internal things to OTel, but I don't think it has been worth the effort.
So many projects come with Prometheus metrics, dashboards, and alerts out of the box that it becomes hard to use anything else. When I pick some random Helm chart to install, you can almost guarantee that it comes with Prometheus integrations.
With Grafana Mimir you can now scale easily to a few billion metric streams, so a lot of the issues with the old Prometheus model have been fixed.
Like you said, I don't think there is much to innovate on in this area, which is a good thing.
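Part of why the ecosystem gravitates there is how simple the contract is: a `/metrics` endpoint just serves the Prometheus text exposition format. A stdlib-only sketch (the metric name and labels are made up for illustration):

```python
from typing import Dict

def render_counter(name: str, help_text: str, value: int, labels: Dict[str, str]) -> str:
    """Render one counter in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )

print(render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    1027,
    {"method": "get", "code": "200"},
))
```

In practice you'd use an official client library (e.g. `prometheus_client`) rather than hand-rolling this, but the plain-text contract is what lets every random Helm chart ship scrape-ready metrics.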