Python is slow for ML. It will take people time to realize this. The claim that most of the work is done on the GPU covers only a small fraction of cases.

For example, in NLP a huge amount of pre- and post-processing of data is needed outside of the GPU.

replies(6): >>__mhar+n >>minhaz+89 >>riku_i+7c >>est+Xj >>wdroz+dw >>raverb+Oz

>>habibu+(OP)
spaCy is much faster on the GPU. Many folks don't know that cuDF (a pandas implementation for GPUs) parallelizes string operations (these are notoriously slow in pandas)... shrug...

replies(1): >>westur+xa

>>habibu+(OP)
> Python is slow for ML

The Python runtime is slow in general. But anyone using it for ML is not actually using the Python runtime to do any of the heavy lifting. All of the popular ML/AI libraries for Python, like TensorFlow, PyTorch, NumPy, etc., are just thin Python wrappers on top of tens of thousands of lines of C/C++ code. People just use Python because it's easy and there's a really good ecosystem of tools and libraries.

replies(1): >>madduc+eM

>>__mhar+n
Apache Ballista and Polars both build on Apache Arrow and use SIMD.

The Polars homepage links to the "Database-like ops benchmark" of {Polars, data.table, DataFrames.jl, ClickHouse, cuDF*, spark, (py)datatable, dplyr, pandas, dask, Arrow, DuckDB, Modin}, but not yet PostgresML? https://h2oai.github.io/db-benchmark/

>>habibu+(OP)
> For example, in NLP a huge amount of pre and post processing of data is needed outside of the GPU.

It depends on your task. If you have a large language model, the bottleneck will likely be in the ML part. It could be the pre/post-processing if the model is shallow.

>>habibu+(OP)
That depends on whether you count NumPy as CPython or not.

Or did you really write layers of for loops in Python?
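
To make the contrast concrete, here is a minimal sketch of the same matrix-vector product written once with nested Python for loops and once as a single NumPy call. The function name `matvec_loops`, the 200x200 size, and the repeat count are arbitrary choices for illustration, and the timings will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np


def matvec_loops(matrix, vector):
    """Matrix-vector product using explicit nested Python loops."""
    result = []
    for row in matrix:
        total = 0.0
        for a, b in zip(row, vector):
            total += a * b
        result.append(total)
    return result


rng = np.random.default_rng(0)
matrix = rng.random((200, 200))
vector = rng.random(200)

# Plain Python lists for the loop version, so conversion is not timed.
matrix_list = matrix.tolist()
vector_list = vector.tolist()

# Same computation two ways: interpreted loops vs. one call into compiled code.
loop_result = matvec_loops(matrix_list, vector_list)
numpy_result = matrix @ vector

loop_time = timeit.timeit(lambda: matvec_loops(matrix_list, vector_list), number=10)
numpy_time = timeit.timeit(lambda: matrix @ vector, number=10)
print(f"loops: {loop_time:.4f}s  numpy: {numpy_time:.4f}s")
```

Both versions compute the same numbers; only the place where the inner loop runs differs.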

>>habibu+(OP)
In 2022, most people doing NLP use transformers from Hugging Face. The tokenizer it uses is written in Rust and is called transparently from Python.

>>habibu+(OP)
Premature optimization is an issue.

If Python is fast enough for your case, then fair enough. And yes, it is fast enough for a lot of cases out there, especially if, for example, you batch requests.
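
A rough sketch of why batching helps: any fixed per-call cost (interpreter overhead, a kernel launch, a network hop) is paid once per batch instead of once per request. `FakeModel` and `batched` below are hypothetical stand-ins, not part of any real library; the "prediction" is just a string length so the example runs anywhere.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class FakeModel:
    """Hypothetical stand-in for a model; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, inputs):
        self.calls += 1  # the fixed per-call cost is paid once per call
        return [len(text) for text in inputs]  # placeholder "prediction"


requests = [f"request {i}" for i in range(10)]

# One call per request.
one_by_one = FakeModel()
single_results = [one_by_one.predict([r])[0] for r in requests]

# One call per batch of up to 4 requests.
in_batches = FakeModel()
batch_results = [y for batch in batched(requests, 4) for y in in_batches.predict(batch)]

print(one_by_one.calls, in_batches.calls)
```

The results are identical either way; the batched path simply crosses the Python-to-model boundary fewer times.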

>>minhaz+89
You forgot that there's also an overhead in converting data between Python and C++.
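
As a small illustration of where that overhead comes from, the sketch below uses only the standard library rather than a real C++ extension, which is an assumption made to keep it self-contained. Turning a list of Python float objects into a contiguous C buffer is a per-element copy, while a `memoryview` shares an existing buffer without copying.

```python
from array import array

values = [float(i) for i in range(5)]  # list of boxed Python float objects

# Conversion: each element is unboxed and copied into a contiguous C double buffer.
packed = array("d", values)

# Zero-copy: a memoryview exposes the same buffer through the buffer protocol.
view = memoryview(packed)
view[0] = 42.0  # writes through to `packed`; no data was copied

# Converting back to Python objects is again a per-element copy.
unpacked = packed.tolist()
values[1] = -1.0  # the original list is independent of the packed copy

print(packed[0], unpacked[0], packed[1])
```

Libraries that exchange data through shared buffers avoid the per-element step; ones that build Python objects on every call pay it each time.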

Sure, Python lets you get started quickly on any ML project, but when you have to deal with heavy-duty tasks, switching to a pure C++/Rust/any-compiled-language implementation might be a good investment in terms of performance and cost savings, especially if those heavy tasks run on a cloud platform.