"In Python, most of the bottleneck comes from having to fetch and deserialize Redis data."
This isn't a fair comparison. Of freaking course Postgres would be faster if it's not reaching out to another service.

For example, in NLP a huge amount of pre- and post-processing of data is needed outside of the GPU.
ML algorithms get a lot of focus and hype. Data retrieval, not as much.
If they wanted it to be a fair comparison, they should have used FDWs to connect to the same Redis and HTTP server that the Python benchmarks tested against (a sketch of what that baseline could look like follows the links below).
* https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...
* https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...
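For what it's worth, standing up that baseline is mostly a few lines of DDL. Here's a minimal sketch, assuming the redis_fdw extension and driving it from Python with psycopg2; the exact option names vary between FDW implementations, and the connection string and key are made up:

    # Hypothetical sketch: point Postgres at the same Redis instance the
    # Python benchmark hits, via the redis_fdw extension.
    import psycopg2

    DDL = """
    CREATE EXTENSION IF NOT EXISTS redis_fdw;

    CREATE SERVER IF NOT EXISTS redis_server
        FOREIGN DATA WRAPPER redis_fdw
        OPTIONS (address '127.0.0.1', port '6379');

    CREATE USER MAPPING IF NOT EXISTS FOR PUBLIC SERVER redis_server;

    -- A key/value view over Redis database 0.
    CREATE FOREIGN TABLE IF NOT EXISTS redis_features (key text, value text)
        SERVER redis_server
        OPTIONS (database '0');
    """

    conn = psycopg2.connect("dbname=postgresml")  # connection string is an assumption
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
        # The "fetch features" leg of the benchmark now goes through the FDW.
        cur.execute("SELECT value FROM redis_features WHERE key = %s", ("feature:42",))
        print(cur.fetchone())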
It's like if I told you to move to a place where you can walk five minutes to work, and you tell me it's not a fair comparison because right now you have to drive to the station and then get on a train, and you're interested in a comparison where you walk to the train instead. You don't need the train because you're already there!
You don't need the network hops exactly because the data is already there in the right way.
Although I dunno if it has good support for lots of floats. And I guess all the ML code would have to be Java.
Don't you think it would be incredibly useful as a baseline if they included a third test with FDWs against Redis and the HTTP server?
So it seems like what is needed is a better host for XGBoost models instead of having to install, maintain and launch an entire database? Or am I missing something here?
Calling this Postgres vs Flask is misleading at best. It’s more like “1-tier architecture vs. 2-tier architecture”.
Remember, this is not plain file serving -- this is actually invoking the XGBoost library, which does complex mathematical operations. The user does not get data from disk; they get inference results.
Unless you know of any other solution which can invoke XGBoost (or some other inference library), I don't see anything "embarrassingly overkill" there.
The Python runtime is slow in general. But anyone using it for ML is not actually using the Python runtime to do any of the heavy lifting. All of the popular ML/AI libraries for Python like TensorFlow, PyTorch, NumPy, etc. are just thin Python wrappers on top of tens of thousands of lines of C/C++ code. People just use Python because it's easy and there's a really good ecosystem of tools and libraries.
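A minimal illustration of that point -- the same dot product in pure Python and in NumPy, which dispatches to compiled C/BLAS:

    import timeit

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    a_list, b_list = a.tolist(), b.tolist()

    def python_dot():
        total = 0.0
        for x, y in zip(a_list, b_list):  # one interpreted bytecode loop per element
            total += x * y
        return total

    def numpy_dot():
        return a @ b  # a single call into compiled code

    print("pure Python:", timeit.timeit(python_dot, number=10))
    print("NumPy:      ", timeit.timeit(numpy_dot, number=10))

On a typical machine the NumPy version wins by roughly two orders of magnitude, which is why the interpreter's slowness mostly doesn't matter for the numeric inner loops.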
In fact, as far as I can tell, Postgres is not running as a microservice here. The data still has to be marshalled into some output other services can use.
Basically, PostgreSQL is a stateful service, and stateful services are always a major pain to manage -- you need to back them up, migrate them, think about scaling... Sometimes they are inevitable, but that does not seem to be the case here.
If you have CI/CD set up and do frequent deploys, it will be much easier and more reproducible to include the models in the build artifact and have them loaded from the filesystem along with the rest of the code.
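A sketch of that setup, assuming an XGBoost model file packaged next to the application code (the filename and layout are made up):

    from pathlib import Path

    import xgboost as xgb

    # The model ships inside the build artifact, right next to the code,
    # so a deploy is just a new artifact -- nothing stateful to migrate.
    MODEL_PATH = Path(__file__).parent / "model.json"

    booster = xgb.Booster()
    booster.load_model(str(MODEL_PATH))  # loaded once at process start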
The Polars homepage links to the "Database-like ops benchmark" of {Polars, data.table, DataFrames.jl, ClickHouse, cuDF*, spark, (py)datatable, dplyr, pandas, dask, Arrow, DuckDB, Modin,} but not yet PostgresML? https://h2oai.github.io/db-benchmark/
It depends on your task: if you have a large language model, the bottleneck will likely be in the ML part. It could be pre/post-processing if the model is shallow.
A suggestion: clean up the blog post's charts and headers to make it much, much clearer that what's being compared isn't Python vs PostgresML.
- replace JSON (storing data as strings? really?) with a binary format like Protobuf, or better yet Parquet
- replace Redis with DuckDB for zero-copy reads
- replace pandas with Polars for faster transformations
- use an asynchronous, modern web framework for microservices, like FastAPI
- tune XGBoost CPU resource usage with semaphores (a rough sketch of these last two points follows this list)
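For the last two points, one possible shape of it; the model path, feature width, and concurrency limit here are all made up:

    import asyncio

    import numpy as np
    import xgboost as xgb
    from fastapi import FastAPI

    app = FastAPI()
    booster = xgb.Booster()
    booster.load_model("model.json")  # hypothetical pre-trained model

    # Allow at most 4 in-flight predictions; the rest queue at the semaphore.
    inference_slots = asyncio.Semaphore(4)

    @app.post("/predict")
    async def predict(features: list[float]):
        async with inference_slots:
            dmatrix = xgb.DMatrix(np.asarray([features], dtype=np.float32))
            # Run the CPU-bound prediction off the event loop.
            result = await asyncio.get_running_loop().run_in_executor(
                None, booster.predict, dmatrix
            )
        return {"prediction": float(result[0])}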
- Multiple formats were compared
- DuckDB is not a production-ready service
- Pandas isn't used
You seem to be trolling.
> - Multiple formats were compared
Yes, but not a zero-copy or efficient format like FlatBuffers. It was mentioned as one of the highlights of PostgresML:
> PostgresML does one in-memory copy of features from Postgres
> - DuckDB is not a production-ready service
What issues did you have with DuckDB? You could use some other in-memory store like Plasma if you don't like DuckDB.
> - Pandas isn't used
That was responding to the point in the post:
> Since Python often uses Pandas to load and preprocess data, it is notably more memory hungry. Before even passing the data into XGBoost, we were already at 8GB RSS (resident set size); during actual fitting, memory utilization went to almost 12GB.
> You seem to be trolling.
By criticizing the blog post?
Or did you really write layers of for loops in Python?
Everything else is just speculation.
https://www.postgresql.org/docs/current/plpython.html
Naturally Rust or C functions will still be faster.
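For a feel of it, here is the pymax example from the linked docs, created and called from Python via psycopg2 (the connection string is an assumption; plpython3u is "untrusted", so installing the extension requires superuser):

    import psycopg2

    conn = psycopg2.connect("dbname=postgresml")  # connection string is an assumption
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE EXTENSION IF NOT EXISTS plpython3u;

            CREATE OR REPLACE FUNCTION pymax(a integer, b integer)
            RETURNS integer AS $$
                return max(a, b)  # plain Python running inside Postgres
            $$ LANGUAGE plpython3u;
        """)
        cur.execute("SELECT pymax(41, 42)")
        print(cur.fetchone()[0])  # -> 42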
One reason for Python in my eyes is maintainability: well-written Python code can be easily understood and nearly as easily modified. Well-written Python code becomes close to what pseudocode would look like.
This is the reason Python's weird jungle of dependency management tools is so out of place for the language: it is a maintenance nightmare. I would describe myself as someone who is very able to deal with those problems, yet they are such an utter waste of time and energy.
    body = request.json
    key = json.dumps(body)
in the prediction code to begin with: https://github.com/postgresml/postgresml/blob/15c8488ade86b0...

If Python is fast enough for your case, then fair enough. And yes, it is fast enough for a lot of cases out there. Especially, for example, if you batch requests (a toy comparison follows below).
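On the batching point, a toy measurement of why it helps -- one predict() over a batch amortizes the per-call overhead that dominates row-at-a-time serving (toy model, made-up shapes):

    import timeit

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X, y = rng.random((10_000, 20)), rng.integers(0, 2, 10_000)
    booster = xgb.train({"objective": "binary:logistic"},
                        xgb.DMatrix(X, label=y), num_boost_round=20)

    batch = rng.random((1_000, 20))

    def one_at_a_time():
        return [booster.predict(xgb.DMatrix(row[None, :])) for row in batch]

    def all_at_once():
        return booster.predict(xgb.DMatrix(batch))

    print("1000 single-row calls:", timeit.timeit(one_at_a_time, number=1))
    print("1 batched call:       ", timeit.timeit(all_at_once, number=1))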
For reference, we're aiming for 1-100 GB/second, per server, in our Python ETL+ML+viz pipelines.
Interestingly, DuckDB+Polars are nice for small non-ETL/ML perf, but once it's analytical processing, we use cuDF / dask_cudf for much more perf per watt / $. I'd love the low overhead & typing benefits of Polars, but as soon as you start looking at GB+/s and occasional bigger-than-memory workloads, the core software+hardware needs to change a bit, end-to-end.
(and if folks are into graph-based investigations, we're hiring backend/infra :) )
Sure, Python can get you started fast on any ML project, but when you have to deal with heavy-duty tasks, a switch to a pure C++/Rust/any-compiled-language implementation might be a good investment in terms of performance and cost savings, especially if those heavy tasks run on a cloud platform.
- I don't think doing one less `memcpy` will make Redis faster over the network.
- We didn't use Pandas during inference, only a Python list. You'd have to get pretty creative to do less work than that.
- That will use less CPU certainly, but I don't think it'll be faster, because we still have to wait on a network resource to serve a prediction or on the GIL to deserialize the response (see the toy measurement after this list).
- Tuning XGBoost is fun, but I don't think that's where the bottleneck is.
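For a rough sense of that deserialization cost, here's a toy measurement with a made-up 1,000-float feature row -- this is pure CPU time under the GIL, before the network hop is even counted:

    import json
    import timeit

    features = [0.1] * 1_000        # hypothetical feature row
    payload = json.dumps(features)  # what would come back from Redis

    n = 10_000
    per_call = timeit.timeit(lambda: json.loads(payload), number=n) / n
    print(f"json.loads of one row: {per_call * 1e6:.1f} microseconds")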