My other thought for extending this is that you could make it seamless. To start, it would simply pipe the user's requests through to OpenAI (or whatever model they already use), so it'd be a drop-in replacement - something like the sketch below. Then, every so often, it would offer the user: "hey, we think there's now enough data that a fine-tune might save you approx $x/month based on your current calls - click the button to start the fine-tune and we'll email you once we have the results." Later the user gets the email: "here are the results; based on them we recommend switching - click here to switch to calling your fine-tuned model." Helicone and the other monitoring platforms could offer something similar.

(Side note: I'm working on an "ai infra handbook" aimed at technical people in software orgs who are asked to deploy unspecified "AI" features and are trying to figure out what to do and what resources they'll need. It's a 20+ page google doc - if anyone can help review what I have so far, please let me know and I'll add you.)
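To make the "drop-in" part concrete, here's a minimal sketch of what the client side could look like, assuming the service exposes an OpenAI-compatible endpoint; the proxy URL and key below are made up for illustration:

```python
# Minimal sketch: the only change on the user's side is pointing the
# official OpenAI SDK at the proxy's OpenAI-compatible endpoint.
# The base_url below is hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-finetune-proxy.com/v1",  # hypothetical proxy endpoint
    api_key="YOUR_PROXY_KEY",
)

# Requests look exactly like normal OpenAI calls; the proxy forwards them
# to OpenAI (or your existing model) and logs the prompt/response pairs
# as potential fine-tuning data.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```

Once a fine-tune exists and you accept the switch, the proxy can just route the same calls to the fine-tuned model behind that endpoint, so nothing changes in the caller's code.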
If it's competitive on latency, error rate, and throughput, and cheaper, and equivalently accurate, then for anyone doing production-scale LLM API usage something like this would make sense: either the fine-tune is worse, so you keep using the regular API, or the fine-tune reaches parity with a cost and/or speed advantage, so you switch. (It wouldn't make sense at prototyping scale - the added complexity of the switch isn't worth it unless it saves you four, five, or more figures a year in API costs, I'd think.)
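Back-of-the-envelope version of that break-even logic; every price, volume, and overhead figure here is a placeholder assumption, not a real quote:

```python
# Rough break-even sketch for "is the switch worth it?".
# All numbers are placeholder assumptions.

def monthly_cost(calls_per_month, tokens_in, tokens_out, price_in, price_out):
    """Monthly cost in dollars, with prices given per 1M tokens."""
    return calls_per_month * (tokens_in * price_in + tokens_out * price_out) / 1_000_000

calls = 2_000_000  # assumed monthly call volume
base  = monthly_cost(calls, 800, 200, price_in=2.50, price_out=10.00)  # larger base model
tuned = monthly_cost(calls, 800, 200, price_in=0.30, price_out=1.20)   # fine-tuned small model

savings = base - tuned
overhead = 2_000  # assumed monthly cost of the added complexity (evals, monitoring, switch risk)

# Only switch if quality is at parity AND savings clearly beat the overhead,
# i.e. roughly the four-or-five-figures-a-year threshold mentioned above.
print(f"base=${base:,.0f}/mo  tuned=${tuned:,.0f}/mo  savings=${savings:,.0f}/mo")
print("switch" if savings > overhead else "stay on the regular API")
```

With these placeholder numbers the savings come out around $7k/month, which is the kind of gap where the switch pays for itself; at prototyping volumes the same formula gives savings in the tens of dollars, which it obviously doesn't.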
I didn't allow them to use my output to train theirs either, so fuck 'em.