zlacker

[parent] [thread] 19 comments
1. DrScie+(OP)[view] [source] 2025-01-03 11:46:33
> But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it

Interesting comparison - if a human viewed something, memorised it and reproduced it in a way recognisable as pretty much the same, wouldn't that still breach copyright?

i.e. in the human case it doesn't matter whether it went through an intermediate neural encoding - what matters is whether the output is sufficiently similar to be deemed a copy.

Surely the same applies to AI?

replies(4): >>omnimu+a4 >>Toucan+5i >>mystif+hZ >>Kim_Br+vI1
2. omnimu+a4[view] [source] 2025-01-03 12:28:59
>>DrScie+(OP)
This whole "AI learns like a human" thing is a trajectory of thought pushed by AI companies. They simultaneously try to humanize AI (it learns like a human would) and dehumanize humans (humans are just stochastic parrots anyway). If anything it's a distraction, if not straight-up anti-human.

But you are right that copyright is complex and in the end decided by humans (often in court). Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language yourself, there is a high chance it's still copyright infringement.

On the other hand, with images and art it's even more about cultural context. For example, the works of pop artists like Andy Warhol are for sure original works (even though some of them were disputed in court recently, and lost). Nobody considers Andy Warhol's work unoriginal, even if it often looks very similar to whatever it was riffing off, because its essence is different from the original's.

Compare that to people prompting directly with the name of the artist they want to replicate. That is direct copyright infringement in both essence and intention, no matter the resulting image. It's also different from a human wanting to replicate some artist's style, because humans can't do it 100% even if they want to - there is still a piece of their own "essence" in it. There are many people who try to fake some famous artist's style and sell it as the real thing, and simply can't pull it off. That is of course copyright infringement because of the intent, but it's still more original work than anything coming from LLMs.

replies(3): >>DrScie+Cd >>Terr_+Xh1 >>Kim_Br+LI1
◧◩
3. DrScie+Cd[view] [source] [discussion] 2025-01-03 13:52:55
>>omnimu+a4
It's both complex and extremely simple for the same reason - it's a human judgement in the end.

Just because you can't define something mathematically, doesn't mean it isn't obvious to most people in 99% of cases.

Reminds me of the endless games in tax law/avoidance/evasion and the almost pointless attempt to define something absolutely in words. To be honest you could simplify the whole thing with a 'taking the piss' test: if the jury thinks you are obviously 'taking the piss', then you are guilty. And if you whine about the law not being clear, and how it's unfair because you don't know whether or not you are breaking it - well, don't take the piss then. Don't pretend you don't know whether something is an aggressive tax dodge or not.

If you create some fake IP, and license it from some shell company in a low-tax regime to nuke your profits in the country you are actually doing business in - let's not pretend we can't all see what you're doing there - you are taking the piss.

Same goes for what some tech companies are doing right now - every reasonable person can see they are taking the piss - and highly paid lawyers arguing technicalities isn't going to change that.

4. Toucan+5i[view] [source] 2025-01-03 14:28:28
>>DrScie+(OP)
The difference is that an image generation algorithm does not consume images the way a human does, nor reproduce them that way. If you show a human several Rembrandts and ask them to duplicate them, you won't get exact copies, no matter how brilliant the human is: the human doesn't know how Rembrandt painted, and especially if you don't permit them to keep references, you won't get the exact painting. You'll get the elements of the original that most stuck out to them, combined with an ethereal but detectable sense of their own taste leaking through. That's how inspiration works.

If on the other hand you ask an image generator for a Rembrandt, you'll get several usable images, with good odds a few of them will be outright copies, and decent odds a few of them will be configured into an Etsy or eBay product image despite you not asking for that. And the better the generator is, the better it will do at making really good Rembrandt-style paintings, which, ironically, increases the odds of it just copying a real one that appeared many times in its training data.

People try to excuse this with explanations about how it doesn't store the images in its model, which is true, it doesn't. However, if you have a famous painting by any artist, or any work really, it's going to show up in the training data many, many times, and the more popular the artist, the more times it's going to be averaged in. So if the same piece appears in lots and lots of places, it creates a "rut" in the data, if you will, where the algorithm is likely to strike repeatedly. This is why it's possible to get fully copied artworks out of image generators with the right prompts.
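
That "rut" effect is easy to see in miniature. The sketch below is nothing like a diffusion model - it's just frequency sampling over an invented training set, with all names made up - but it shows how a work duplicated many times ends up dominating what comes out:

```python
import random

# Toy "training set": one famous work was scraped 50 times,
# 50 obscure works were scraped once each (all names invented).
training_data = ["famous_painting"] * 50 + [f"obscure_piece_{i}" for i in range(50)]

# A "model" that memorises nothing but frequencies: each output is a
# draw proportional to how often a work appeared in the training data.
rng = random.Random(0)
generated = [rng.choice(training_data) for _ in range(1000)]

# The duplicated work accounts for about half of everything "generated",
# even though it is only 1 of 51 distinct works.
share = generated.count("famous_painting") / len(generated)
print(share)
```

Scale the duplication up to millions of crawled copies of a famous painting and the rut gets correspondingly deep.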

replies(2): >>chii+Lq >>HanCli+Tr
◧◩
5. chii+Lq[view] [source] [discussion] 2025-01-03 15:29:43
>>Toucan+5i
> with the right prompts.

that phrase is doing a lot of heavy lifting. Just because you could "get the full copies" with the right prompts doesn't mean the weights and the training are copyright infringement.

I could also get a full copy of any work out of the digits of pi.
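
(The pi line is tongue-in-cheek - the full claim quietly assumes pi is a normal number, which is unproven - but for short strings the search is actually runnable. A throwaway sketch using Gibbons' streaming spigot algorithm, stdlib only:)

```python
def pi_digits():
    # Gibbons' streaming spigot algorithm: yields decimal digits of pi
    # (3, 1, 4, 1, 5, 9, ...) one at a time, using only integer arithmetic.
    q, r, t, j = 1, 180, 60, 2
    while True:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)
        yield y
        q, r, t, j = 10 * q * j * (2 * j - 1), 10 * u * (q * (5 * j - 2) + r - y * t), t * u, j + 1

gen = pi_digits()
digits = "".join(str(next(gen)) for _ in range(1000))

# Search the first 1000 digits for a "work": the famous run of six 9s
# (the "Feynman point") shows up surprisingly early.
print(digits.find("999999"))
```

Of course, for anything longer than a few digits the expected search depth explodes, which is rather the point: pi "containing" a work is not the same as distributing it.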

The point I would like to emphasize is that using data to train the model is not copyright infringement in and of itself. If you use the resulting model to output a copy of an existing work, then that act constitutes copyright infringement - in exactly the same way that using Photoshop to reproduce some work is.

What a lot of anti-AI arguments are trying to achieve is to make the act of training and model-making itself the infringing act, with the claim that the data is being copied while training happens.

replies(1): >>DrScie+Px
◧◩
6. HanCli+Tr[view] [source] [discussion] 2025-01-03 15:36:47
>>Toucan+5i
We have the problem of too-perfect-recall with humans too -- even beyond artists with (near) photographic memory, there's the more common case of things like reverse-engineering.

At times, developers on projects like WINE and ReactOS use "clean-room" reverse-engineering policies [0], where -- if Developer A reads a decompiled version of an undocumented routine in a Windows DLL (in order to figure out what it does), then they are now "contaminated" and not eligible to write the open-source replacement for this DLL, because we cannot trust them to not copy it verbatim (or enough to violate copyright).

So we need to introduce a barrier of safety, where Developer A then writes a plaintext translation of the code, describing and documenting its functionality in complete detail. They are then free to pass this to someone else (Developer B) who is now free to implement an open-source replacement for that function -- unburdened by any fear of copyright violation or contamination.

So your comment has me pondering -- what would the equivalent look like (mathematically) inside of an LLM? Is there a way to do clean-room reverse-engineering of images, text, videos, etc.? Obviously one couldn't use clean-room training for _everything_ -- there must be a shared context of language at some point between the two developers. But you have me wondering... could one build a system to train an LLM on copyrighted content in a way that doesn't violate copyright?

[0]: https://en.wikipedia.org/wiki/Clean-room_design

◧◩◪
7. DrScie+Px[view] [source] [discussion] 2025-01-03 16:20:17
>>chii+Lq
>The point i would like to emphasize is that the using data to train the model is not copyright infringement in and of itself.

Interesting point - though the law can be strange in some cases. For example, in UK court cases where people are effectively being charged with looking at illegal images, the actual crime can be 'making illegal images' - simply because precedent has been set that, since any OS/browser has to 'copy' the data of an image in order for someone to view it, the defendant is deemed to have copied it.

Here's an example. https://www.bbc.com/news/articles/cgm7dvv128ro

So to ingest something into your training model ('view' it), you have by definition had to copy it to your computer.

replies(1): >>xp84+DJ3
8. mystif+hZ[view] [source] 2025-01-03 19:15:54
>>DrScie+(OP)
Imagine I have a shit ton of data on the books people read, down to their favorite passage in each chapter.

I feed all of that into an algorithm that extracts the top n% of passages and uses NLP to string them into a semi-coherent new book. No AI or ML, just old-fashioned statistics. Since my new book is composed entirely of passages stolen wholesale from thousands of authors, clearly it's a transformative work that deserves its own copyright, and none of the original authors deserve a dime, right? (/s)

What if I then feed my book through some Markov chains to mix up the wording and phrasing. Is this a new work or am I still just stealing?

AI is not magic, it does not learn. It is purely statistics extracting the top n% of other people's work.
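
The Markov-chain step in that thought experiment fits in a few lines - the corpus, seed, and function names below are all made up for illustration, but note the key property: every word (and every word-to-word transition) the chain emits is lifted verbatim from its sources.

```python
import random
from collections import defaultdict

def build_chain(text):
    # Record, for every word, the words that ever follow it in the source.
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def remix(chain, start, length=12, seed=1):
    # Walk the chain: each next word is drawn from the recorded followers,
    # so nothing in the output exists that wasn't in the source texts.
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# Two famous openings mashed together as a stand-in "corpus of passages".
sources = ("call me ishmael some years ago never mind how long precisely "
           "it was the best of times it was the worst of times")
chain = build_chain(sources)
print(remix(chain, "it"))
```

The wording gets shuffled, but the chain can only ever replay fragments of what it ingested - which is exactly the question the comment poses about where "mixing" ends and a new work begins.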

◧◩
9. Terr_+Xh1[view] [source] [discussion] 2025-01-03 21:38:37
>>omnimu+a4
> This whole AI learns like a human is trajectory of thought pushed by AI companies.

My retort to the "it would be legal if a human did it" argument is that if the model gets personhood, then those companies are guilty of enslaving children.

> Compare that to pepople prompting directly with name of artist they want to replicate.

In that case, I would emphasize that the infringement is being done by the model; it's not illegal or infringing to ask for an unlicensed, copyright-infringing work. (Although it might become that way, if big corporations start lobbying for it.)

10. Kim_Br+vI1[view] [source] 2025-01-04 01:28:07
>>DrScie+(OP)
> as if a human viewed something, memorized it and reproduced in a recognisable way to be pretty much the same, wouldn't that still breach copyright?

> Surely the same is the case of AI?

That's close to my position.

Also, consider the case where you want to ask an image generator to not infringe copyright by eg saying "make the character look less like Donald Duck". In which case, the image generator still needs to know what Donald Duck looks like!

◧◩
11. Kim_Br+LI1[view] [source] [discussion] 2025-01-04 01:30:53
>>omnimu+a4
> Consider how code infringement is not about code itself but about what it does. If you saw somewhat original implementation of something and then you rewrite it in different language by yourself there is high chance its still copyright infringement.

Actually, if you rewrite it in a different language, you're well on your way to making it an independent expression (though beware structure, sequence and organization - unless you're implementing an API: see Google v. Oracle). Copyright protects specific expressions, not functionality.

> Compare that to pepople prompting directly with name of artist they want to replicate. This in direct copyright infringement in both essence and intention no matter the resulting image.

As far as I'm aware, an artist's style is not something that is protected by law; copyright protects specific works.

If you did want to protect artistic styles, how would you go about legally defining them?

replies(2): >>omnimu+xh2 >>omnimu+ej2
◧◩◪
12. omnimu+xh2[view] [source] [discussion] 2025-01-04 09:07:52
>>Kim_Br+LI1
I don't believe a rewrite in a different language is a new specific expression.

We will see, because we are well on our way to LLMs being able to translate whole codebases to a different stack without a hitch. If that's OK, then any of the copyleft, open-core or leaked codebases are up for grabs.

replies(1): >>Kim_Br+Iw2
◧◩◪
13. omnimu+ej2[view] [source] [discussion] 2025-01-04 09:31:28
>>Kim_Br+LI1
The fact that LLMs are generating any images at all is purely thanks to a database of source images that are copyright protected. It's a form of sophisticated automated photobashing. Photobashing is a grey zone, but often legal because of the other artist doing the (often original) work.

When you prompt for a Miyazaki image, that image can only exist thanks to his protected work being in the database (where he doesn't want it to be); otherwise the user wouldn't get the Miyazaki image they wanted.

We will see how it all plays out, but I think if Miyazaki took this to court there would be a solid case on the grounds that the resulting images breach the copyright of the source, are not original works, and are created with bad intent that goes against the protections of the original author.

What at least seems to be the current direction is that the resulting images cannot be copyrighted and fall automatically into the public domain, making them difficult to use commercially.

replies(2): >>Kim_Br+uB2 >>Kim_Br+RE2
◧◩◪◨
14. Kim_Br+Iw2[view] [source] [discussion] 2025-01-04 13:16:36
>>omnimu+xh2
A hand rewrite (or intelligent rewrite in general) will tend to become unique pretty quickly, especially when you start leaning into language features of the target language for improved efficiency. Your Structure and Organization will be different.

If you order an LLM (or a human) to do a straight 1:1 translation, you'll sort of pass one test (it's a completely different language after all!), but fail to show much difference wrt structure, sequence or organization. I'm also not entirely sure how good of an idea it is technically. If you start iterating on it you can probably get much better results anyway. But then you're doing real creative work!

◧◩◪◨
15. Kim_Br+uB2[view] [source] [discussion] 2025-01-04 14:22:35
>>omnimu+ej2
There's no such database, AFAICT.

If you've ever worked with open source models (eg one of the stable diffusion models or models based on them, using tools such as AUTOMATIC1111 or ComfyUI); you can inspect them yourself and simply see. If you haven't done so already, see if you can figure out the installation instructions for one of the tools and try!

Meanwhile ...

Ok, fine, I've heard some crazy compression conspiracy theories, but they're a bit too crazy to be credible.

I've also heard stories about these models being intelligent - a little artist living in your computer. I think that's going a bit too far in another direction.

In reality, I think it's better to install the software and take your time to learn about the way these models are actually built and work.

[ btw: If Miyazaki were to take this to court with the argument you put forward, he wouldn't get very far. "Please remove my images from your systems in whatever form you are holding them". The response for the defense would simply be: "We don't actually have them, and you are quite welcome to inspect all our systems". ]

(Incidentally, I've been here before. I play with synths as a hobby! ;-)

◧◩◪◨
16. Kim_Br+RE2[view] [source] [discussion] 2025-01-04 14:54:58
>>omnimu+ej2
Actually, while I just said "there is no database", maybe you're working from a very different mental model from mine...

What do you mean by "Database" in this context? What information do you think is being stored, (and how?)

replies(1): >>omnimu+PZ2
◧◩◪◨⬒
17. omnimu+PZ2[view] [source] [discussion] 2025-01-04 17:45:15
>>Kim_Br+RE2
I understand what the model is and how you get to it. I know the training data is not stored. But as far as I understand, the model is closer to a derived intermediary of the training data - like a database index, or, as you said, a form of compression.

That's why I deliberately call training data + model "the database": to non-programmers it makes more sense. To me there is an intentional sleight of hand in hiding the fact that the only reason LLMs can work as they do now is the source data. The way it's usually marketed, it seems like the model is a program that has generalised the principles of drawing by looking at other drawings, and that's why it can draw like Miyazaki when it wants to - not that it can draw Miyazaki because it preprocessed every Miyazaki drawing, stemmed patterns out of it, and can mash them up with other patterns (from the database).

That's why I intentionally say "database": to lead these discussions back to what I see as the core of these technologies.

replies(1): >>chii+1S3
◧◩◪◨
18. xp84+DJ3[view] [source] [discussion] 2025-01-05 02:05:52
>>DrScie+Px
That seems to be an artifact of the whole copyright framework predating all forms of computing and memory, but if we don't ignore that one, we've all been illegally copying copyrighted text, images and videos into our RAM every time we use the Internet. So I think the courts now basically acknowledge that that doesn't count as a "copy."

*Not a lawyer

replies(1): >>DrScie+VA6
◧◩◪◨⬒⬓
19. chii+1S3[view] [source] [discussion] 2025-01-05 04:06:51
>>omnimu+PZ2
What you're describing as a database is what I'd call information.
◧◩◪◨⬒
20. DrScie+VA6[view] [source] [discussion] 2025-01-06 12:53:18
>>xp84+DJ3
Except I've given you a concrete, real counter-example of where they do treat copying into memory as 'making a copy'.