zlacker

[parent] [thread] 43 comments
1. chii+(OP)[view] [source] 2025-01-03 04:15:54
> Meta "sublicensed" my image to someone else, I wouldn't have much to say there.

but you agreed to this when you agreed to the TOS.

> I post a picture on my site that is then used by a large publisher for ads, I would (at least in theory) have some recourse

for which you didn't sign any contract, and therefore it is a violation of copyright.

But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it (which you've implicitly given permission for, by putting it up on the internet). On the other hand, if you had put it behind a gate (no matter how trivial), then you could've at least legally protected yourself.

replies(6): >>immibi+3d >>ehnto+ed >>Pittle+Ce >>entrop+Yj >>DrScie+oD >>Terr_+cV1
2. immibi+3d[view] [source] 2025-01-03 06:40:58
>>chii+(OP)
> but you agreed to this

Yes, that was the point? You agree to this by using Meta. So don't.

3. ehnto+ed[view] [source] 2025-01-03 06:43:52
>>chii+(OP)
Strong disagree on the last paragraph. It's data online, your data, and it was used for commercial purposes without your consent.

In fact, I never consented for anyone to access my server. Just because it has an IP address does not make it a public service.

Obviously, in a practical sense, that is a silly position to take, and in prior cases there is usually an aggravating factor that got the person charged, e.g. breaking through access controls, violating ToS, or intellectual property violations.

But I don't retract the prior statement. Just because I have an address doesn't mean you can come in through any unlocked doors.

replies(2): >>ahtihn+Ff >>yencab+wq1
4. Pittle+Ce[view] [source] 2025-01-03 07:01:03
>>chii+(OP)
> but you agreed to this when you agreed to the TOS

The legal definition of agreement means basically zilch.

5. ahtihn+Ff[view] [source] [discussion] 2025-01-03 07:14:04
>>ehnto+ed
> In fact, I never consented for anyone to access my server. Just because it has an IP address does not make it a public service.

If you don't take any steps to make it clear that it's not public - an auth wall, or pages at unguessable paths - then it is public, because that is what everyone expects.

It's just like having a storefront: if the door is unlocked, you'd expect people to just come in, and no one would take you seriously if you complained that people keep coming in when you've done nothing to make clear they're not supposed to.

replies(1): >>DrScie+1E
6. entrop+Yj[view] [source] 2025-01-03 07:59:58
>>chii+(OP)
>But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it (which you've implicitly given permission for, by putting it up on the internet).

I don't understand how that matters. I thought that the whole idea of copyright and licences was that the holder of the rights can decide what is ok to do with the content and what is not. If the holder of the rights does not agree to a certain kind of use, what else is there to discuss?

It sure doesn't matter if I think that downloading a torrent is no more piracy than borrowing media from a friend.

replies(2): >>chii+Sn >>Terr_+6W1
7. chii+Sn[view] [source] [discussion] 2025-01-03 08:47:39
>>entrop+Yj
> If the holder of the rights does not agree to a certain kind of use, what else is there to discuss?

the holder of content does not automatically get to prescribe how i use said content, as long as i comply with copyright.

The holder does not get to dictate anything beyond that - for example, i can learn from the content. Or i can berate it. Copyright is not a right that covers every single conceivable use - it is a limited set of uses laid out in the law.

So the current arguments center on whether existing copyright covers the use of such works in ML training.

replies(2): >>chroma+Er >>TheOth+uu
8. chroma+Er[view] [source] [discussion] 2025-01-03 09:27:13
>>chii+Sn
yeah, it is called _copy_ right. The question is whether AI is making obfuscated copies or not.

interestingly, in German it is not called copyright but Urheberrecht, "author's rights". So the word itself implies more.

BTW, at least in Germany you can hold image rights to your artwork or building even when it stands in a public place.

9. TheOth+uu[view] [source] [discussion] 2025-01-03 09:57:22
>>chii+Sn
Copyright means the holder does automatically get to prescribe how content can be copied. That's literally the definition of copyright.

A typical copyright notice for a book says something like (to paraphrase...) "not to be stored, transmitted, or used by or on any electronic device without explicit permission."

That clearly includes use for training, because you can't train without making a copy, even if the copy is subsequently thrown away.

Any argument about this is trying to redefine copyright as the right to extract the semantic or cultural value of a document. In reality the definition is already clear - no copying of a document by any means for any purpose without explicit permission.

This is even implicitly acknowledged in the CC definitions. CC would be meaningless and pointless without it.

replies(3): >>rcxdud+Cx >>chii+vz >>rpdill+S91
10. rcxdud+Cx[view] [source] [discussion] 2025-01-03 10:39:43
>>TheOth+uu
This is a particularly extreme interpretation of copyright, and not one that has seen much support in the courts. You can put what you like in a copyright notice or license, but that doesn't mean it'll hold up, and the courts have generally taken a dim view of any argument relying on the fact that electronic data is technically copied many times just to make it viewable to a user. Copyright is probably better understood as distribution rights.

(Not saying training will necessarily fall in the same boat, just saying that the view 'copying to a screen or over the internet is necessarily a copy for the purposes of copyright' is reductive to the point of being outright incorrect)

11. chii+vz[view] [source] [discussion] 2025-01-03 11:03:07
>>TheOth+uu
> That clearly includes use for training, because you can't train without making a copy, even if the copy is subsequently thrown away.

a copy for ingestion purposes - such as viewing in a browser - is not the same as a distribution copy you make when sending it to another person.

> the right to extract the semantic or cultural value of a document.

this right does not belong to the author - in fact, it is not an explicit right granted by the copyright act. Therefore, the extraction of information from a work is not something the author can (nor should) control. Otherwise, how would anyone learn from a textbook, music, or art?

In the future, when the courts finally decide what the limits of ML training are, maybe it will become a new right granted to authors. But it isn't one atm.

12. DrScie+oD[view] [source] 2025-01-03 11:46:33
>>chii+(OP)
> But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it

Interesting comparison - if a human viewed something, memorized it, and reproduced it in a recognisable way so that it was pretty much the same, wouldn't that still breach copyright?

ie in the human case it doesn't matter whether it went through an intermediate neural encoding - what matters is whether the output is sufficiently similar to be deemed a copy.

Surely the same is the case of AI?

replies(4): >>omnimu+yH >>Toucan+tV >>mystif+FC1 >>Kim_Br+Tl2
13. DrScie+1E[view] [source] [discussion] 2025-01-03 11:52:45
>>ahtihn+Ff
Your shop might be open, sure - but aren't we talking about people coming in and taking whatever they like for free?

ie if you ran an art gallery, the expectation would be that people could come in and look, but you wouldn't expect them to come in, photograph everything, and then sell prints of everything online.

replies(1): >>chii+x11
14. omnimu+yH[view] [source] [discussion] 2025-01-03 12:28:59
>>DrScie+oD
This whole 'AI learns like a human' framing is a trajectory of thought pushed by AI companies. They simultaneously try to humanize AI (it learns like a human would) and dehumanize humans (humans are just stochastic parrots anyway). It's a distraction at best, if not straight-up anti-human.

But you are right that copyright is complex and in the end decided by humans (often in court). Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language yourself, there is a high chance it's still copyright infringement.

On the other hand, with images and art it's even more about cultural context. For example, the works of pop artists like Andy Warhol are certainly original works (even though some of this was disputed recently in court, and lost). Nobody considers Andy Warhol's work unoriginal, even if it often looks very similar to whatever it was riffing on, because its essence is different from the original.

Compare that to people prompting directly with the name of the artist they want to replicate. That is direct copyright infringement in both essence and intention, no matter the resulting image. It's also different from a human trying to replicate some artist's style, because humans can't do it 100% even if they want to - there is still a piece of their own "essence". Many people try to fake a famous artist's style and sell it as the real thing and simply can't do it. That is of course copyright infringement because of the intent, but it's still more original work than anything coming from LLMs.

replies(3): >>DrScie+0R >>Terr_+lV1 >>Kim_Br+9m2
15. DrScie+0R[view] [source] [discussion] 2025-01-03 13:52:55
>>omnimu+yH
It's both complex and extremely simple for the same reason - it's a human judgement in the end.

Just because you can't define something mathematically, doesn't mean it isn't obvious to most people in 99% of cases.

Reminds me of the endless games in tax law/avoidance/evasion and the almost pointless attempt to define something absolutely in words. To be honest, you could simplify the whole thing with a 'taking the piss' test: if the jury thinks you are obviously taking the piss, then you are guilty. And if you whine about the law not being clear, and how it's unfair because you don't know whether or not you are breaking it - well, don't take the piss then. Don't pretend you can't tell whether something is an aggressive tax dodge or not.

If you create some fake IP and license it from a shell company in a low-tax regime to nuke your profits in the country where you actually do business - let's not pretend we can't all see what you're doing there - you are taking the piss.

The same goes for what some tech companies are doing right now - every reasonable person can see they are taking the piss - and highly paid lawyers arguing technicalities isn't going to change that.

16. Toucan+tV[view] [source] [discussion] 2025-01-03 14:28:28
>>DrScie+oD
The difference is that an image generation algorithm does not consume images the way a human does, nor reproduce them that way. If you show a human several Rembrandts and ask them to duplicate them, you won't get exact copies, no matter how brilliant the human is: the human doesn't know how Rembrandt painted, and especially if you don't permit them to keep references, you won't get the exact painting. You'll get the elements of the original that most stuck out to them, combined with an ethereal but detectable sense of their own tastes leaking through. That's how inspiration works.

If, on the other hand, you ask an image generator for a Rembrandt, you'll get several usable images, with good odds that a few of them will be outright copies, and decent odds that a few will come configured as an etsy or ebay product image despite you not asking for that. And the better the generator is, the better it's going to be at making really good Rembrandt-style paintings - which, ironically, increases the odds of it just copying a real one that appeared many times in its training data.

People try to excuse this with explanations about how it doesn't store the images in its model, which is true - it doesn't. However, a famous painting by any artist - any famous work, really - is going to show up in the training data many, many times, and the more popular the artist, the more times it gets averaged in. If the same piece appears in lots and lots of places, it creates a "rut" in the data, if you will, that the algorithm is likely to strike repeatedly. This is why it's possible to get fully copied artworks out of image generators with the right prompts.
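
To make the "rut" concrete, here's a toy sketch in Python - emphatically not how a diffusion model works, just plain frequency sampling to show how duplication skews whatever distribution gets learned:

    import random
    from collections import Counter

    corpus = ["unique work %d" % i for i in range(1000)]
    corpus += ["famous painting"] * 500   # one work, heavily duplicated

    counts = Counter(corpus)              # a stand-in for "training"
    total = sum(counts.values())

    def sample():
        # draw from the learned (empirical) distribution
        r = random.uniform(0, total)
        for work, c in counts.items():
            r -= c
            if r <= 0:
                return work

    draws = Counter(sample() for _ in range(10_000))
    print(draws["famous painting"] / 10_000)   # ~0.33

One work duplicated 500 times among 1000 unique ones ends up behind roughly a third of all samples. That's the rut.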

replies(2): >>chii+941 >>HanCli+h51
17. chii+x11[view] [source] [discussion] 2025-01-03 15:11:50
>>DrScie+1E
That's not what's happening.

Instead, it's that some people come into your gallery, study the art and its style, and leave with the learned information. They then replicate that style in their own gallery. Of course, none of the images are copies, or would be judged to be copies by a reasonable person.

So now you, the gallery owner, want to forbid just those people who would come to learn the style. But you still want people to come and admire the art, and maybe buy a print.

replies(1): >>DrScie+q21
18. DrScie+q21[view] [source] [discussion] 2025-01-03 15:18:18
>>chii+x11
> Of course, none of the images are copies, or would be judged to be copies by a reasonable person.

That's the fiction of course.

Tell me how something like ChatGPT can claim to return accurate information while at the same time being completely independent of the sources of that information?

In terms of images - copyright isn't only for exact copies. If it were, humans would have been taking the piss by making minor changes for decades.

Sure, you could argue some of it is fair use, with genuinely original content being produced in the process, but I think you are also overlooking an important part of what's considered 'fair': industrialised copying of source material isn't really the same, in fairness terms, as one person drawing inspiration.

Taking the Encyclopaedia Britannica and running it through an algorithm to change the wording, but not the meaning, and selling it on is really not the same as a student reading it and including those facts in their essay - the latter is considered fair use, the former is taking the piss.

replies(1): >>chii+X51
19. chii+941[view] [source] [discussion] 2025-01-03 15:29:43
>>Toucan+tV
> with the right prompts.

that phrase is doing a lot of heavy lifting. Just because you could "get the full copies" with the right prompts doesn't mean the weights and the training are copyright infringement.

I could also get a full copy of any work out of the digits of pi.

The point i would like to emphasize is that using data to train the model is not copyright infringement in and of itself. If you use the resulting model to output a copy of an existing work, then that act constitutes copyright infringement - in exactly the same way that using photoshop to reproduce some work does.

What a lot of anti-ai arguments are trying to achieve is to make the act of training and model-making itself the infringing act, via the claim that the data is being copied while training is happening.

replies(1): >>DrScie+db1
20. HanCli+h51[view] [source] [discussion] 2025-01-03 15:36:47
>>Toucan+tV
We have the problem of too-perfect-recall with humans too -- even beyond artists with (near) photographic memory, there's the more common case of things like reverse-engineering.

At times, developers on projects like WINE and ReactOS use "clean-room" reverse-engineering policies [0], where - if Developer A reads a decompiled version of an undocumented routine in a Windows DLL (in order to figure out what it does) - they are now "contaminated" and not eligible to write the open-source replacement for this DLL, because we cannot trust them not to copy it verbatim (or closely enough to violate copyright).

So we need to introduce a barrier of safety, where Developer A then writes a plaintext translation of the code, describing and documenting its functionality in complete detail. They are then free to pass this to someone else (Developer B) who is now free to implement an open-source replacement for that function -- unburdened by any fear of copyright violation or contamination.

So your comment has me pondering - what would the equivalent look like (mathematically) inside of an LLM? Is there a way to do clean-room reverse-engineering of images, text, videos, etc.? Obviously one couldn't use clean-room training for _everything_ - there must be a shared context of language at some point between the two Developers. But you have me wondering... could one build a system to train an LLM on copyrighted content in a way that doesn't violate copyright?

[0]: https://en.wikipedia.org/wiki/Clean-room_design
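
To make the two-stage split concrete, a toy sketch in Python - the function names and the whole pipeline are hypothetical, and the "spec" here is laughably thin compared to what a real system would need:

    import re
    from collections import Counter

    def developer_a(original: str) -> dict:
        # "Contaminated" stage: reads the original and emits only
        # non-expressive facts about it - never verbatim passages.
        words = re.findall(r"[a-z']+", original.lower())
        return {
            "topic_words": [w for w, _ in Counter(words).most_common(5)],
            "approx_length": len(words),
        }

    def developer_b(spec: dict) -> str:
        # "Clean" stage: never saw the original; writes fresh text
        # from the specification alone.
        return "A passage of roughly %d words about: %s." % (
            spec["approx_length"], ", ".join(spec["topic_words"]))

    original = "The quick brown fox jumps over the lazy dog. The dog sleeps."
    print(developer_b(developer_a(original)))

The hard part, of course, is deciding which features of the original count as uncopyrightable "facts" and which are protected expression - which is exactly the legal question all over again.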

21. chii+X51[view] [source] [discussion] 2025-01-03 15:40:56
>>DrScie+q21
> ChatGPT can claim to return accurate information while at the same time being completely independent of the sources of that information?

why can't that be true? Information is not copyrightable. The expression of information is. If chatGPT extracts information from a source work and represents that information back to you in a form that is not a copy of the original work, then that is completely fine by me. An example would be a recipe.

replies(1): >>DrScie+981
22. DrScie+981[view] [source] [discussion] 2025-01-03 15:55:34
>>chii+X51
So you think taking something like the Encyclopaedia Britannica, running it through a simple rewording algorithm, and selling it on is totally 'fair use'?

Taking all newspaper and proper journalistic output, rewording it automatically, and selling it on is also 'fair use'?

Stand back from the detail (of whether this pixel or word is the same or not) and look at the bigger picture. Are you still telling me that's all fine and dandy?

I think it's obviously not 'fair use'.

It means the people doing the actual hard graft of gathering the news, or writing encyclopaedias or textbooks, won't be able to make a living, so these important activities will cease.

This is exactly the scenario copyright etc exists to stop.

replies(1): >>chii+EA2
23. rpdill+S91[view] [source] [discussion] 2025-01-03 16:09:51
>>TheOth+uu
> Any argument about this is trying to redefine copyright as the right to extract the semantic or cultural value of a document. In reality the definition is already clear - no copying of a document by any means for any purpose without explicit permission.

I've studied copyright for over 20 years as an amateur, and I used to very much think this way.

And then I started reading court decisions about copyright, and suddenly it became extremely clear that it's a very nuanced discussion about whether or not the document can be copied without explicit permission. There are tons of cases where it's perfectly permissible, even if the copyright holder demands that you request permission.

I've covered this in other posts on Hacker News, but it is still my belief that we will ultimately find AI training to be fair use, because it does not materially impact the market for the original work. Perhaps someone could bring a case arguing that it does, but courts have yet to see such a claim asserted convincingly, based on my reading of the cases over the past couple of years.

replies(1): >>Terr_+FW1
24. DrScie+db1[view] [source] [discussion] 2025-01-03 16:20:17
>>chii+941
>The point i would like to emphasize is that using data to train the model is not copyright infringement in and of itself.

Interesting point - though the law can be strange in some cases. For example, in the UK, in court cases where people are effectively being charged for looking at illegal images, the actual crime can be 'making illegal images' - simply because a precedent has been set that, since any OS/browser has to 'copy' the data of an image in order for someone to view it, any defendant is deemed to have copied it.

Here's an example. https://www.bbc.com/news/articles/cgm7dvv128ro

So to ingest something into your training model (i.e. view it), you have by definition had to copy it to your computer.

replies(1): >>xp84+1n4
25. yencab+wq1[view] [source] [discussion] 2025-01-03 17:52:36
>>ehnto+ed
Content is often both publicly available and copyright protected. Paint a mural near a busy street: there's no locked door in that metaphor; the locked door would be a password-protected site.
26. mystif+FC1[view] [source] [discussion] 2025-01-03 19:15:54
>>DrScie+oD
Imagine I have a shit ton of data on the books people read, down to their favorite passage in each chapter.

I feed all of that into an algorithm that extracts the top n% of passages and uses NLP to string them into a semi-coherent new book. No AI or ML, just old-fashioned statistics. Since my new book is composed entirely of passages stolen wholesale from thousands of authors, clearly it's a transformative work that deserves its own copyright, and none of the original authors deserve a dime, right? (/s)

What if I then feed my book through some Markov chains to mix up the wording and phrasing? Is this a new work, or am I still just stealing?

AI is not magic, it does not learn. It is purely statistics extracting the top n% of other people's work.
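
For reference, the Markov step really is that mechanical - a minimal word-level sketch in Python (toy corpus, purely illustrative):

    import random
    from collections import defaultdict

    corpus = ("it was the best of times it was the worst of times "
              "it was the age of wisdom it was the age of foolishness").split()

    # "Train": record which word follows which.
    chain = defaultdict(list)
    for a, b in zip(corpus, corpus[1:]):
        chain[a].append(b)

    # "Generate": random-walk the chain.
    word, out = "it", ["it"]
    for _ in range(15):
        successors = chain.get(word)
        if not successors:
            break
        word = random.choice(successors)
        out.append(word)
    print(" ".join(out))

Every bigram in the output already occurred verbatim in the source text - shuffled phrasing, zero new content.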

27. Terr_+cV1[view] [source] 2025-01-03 21:36:49
>>chii+(OP)
If they aren't a violation of copyright, then I want to see what happens when people start trading around models and "prompts" that describe recently released movies and music well enough to compete with the original.

Not necessarily because I like either "we monetize public work" or "copyright robber-barons", but I'd like at least one of them to clearly lose so that the rest of us have clear and fair rules to work with.

28. Terr_+lV1[view] [source] [discussion] 2025-01-03 21:38:37
>>omnimu+yH
> This whole 'AI learns like a human' framing is a trajectory of thought pushed by AI companies.

My retort to the "it would be legal if a human did it" argument is that if the model gets personhood, then those companies are guilty of enslaving children.

> Compare that to people prompting directly with the name of the artist they want to replicate.

In that case, I would emphasize that the infringement is being done by the model. It's not illegal or infringing to ask for an unlicensed, copyright-infringing work. (Although it might become that way, if big corporations start lobbying for it.)

29. Terr_+6W1[view] [source] [discussion] 2025-01-03 21:46:15
>>entrop+Yj
Not quite: it is (at least in the US) a limited privilege to control copying and reproduction.

If you make a movie poster, and it goes out into the market, and then someone picks it up from a garage sale, copyright still applies, they can't just make tons of duplicates.

But you can't use copyright to force them to display it right side up instead of upside down, to not write on it, to not burn it, and to not make it into a bizarre pseudosexual shrine in their basement.

30. Terr_+FW1[view] [source] [discussion] 2025-01-03 21:50:48
>>rpdill+S91
I assume the emphasis there is on training, whereas it's totally possible to infringe by running the model in certain ways later.
replies(1): >>rpdill+Ce2
31. rpdill+Ce2[view] [source] [discussion] 2025-01-04 00:22:10
>>Terr_+FW1
Agreed! My take is that usage can still infringe if the output produced would otherwise infringe; the fact that AI happens to be the particular tool used to accomplish the infringement is incidental.
32. Kim_Br+Tl2[view] [source] [discussion] 2025-01-04 01:28:07
>>DrScie+oD
> if a human viewed something, memorized it, and reproduced it in a recognisable way so that it was pretty much the same, wouldn't that still breach copyright?

> Surely the same is the case of AI?

That's close to my position.

Also, consider the case where you want to ask an image generator to not infringe copyright by eg saying "make the character look less like Donald Duck". In which case, the image generator still needs to know what Donald Duck looks like!

33. Kim_Br+9m2[view] [source] [discussion] 2025-01-04 01:30:53
>>omnimu+yH
> Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language yourself, there is a high chance it's still copyright infringement.

Actually, if you rewrite it in a different language, you're well on your way to making it an independent expression (though beware Structure, Sequence and Organization - unless you're implementing an API; see Google v. Oracle). Copyright protects specific expressions, not functionality.

> Compare that to people prompting directly with the name of the artist they want to replicate. That is direct copyright infringement in both essence and intention, no matter the resulting image.

As far as I'm aware, an artist's style is not something that is protected by law; copyright protects specific works.

If you did want to protect artistic styles, how would you go about legally defining them?

replies(2): >>omnimu+VU2 >>omnimu+CW2
34. chii+EA2[view] [source] [discussion] 2025-01-04 04:12:06
>>DrScie+981
> Taking all newspaper and proper journalistic output, rewording it automatically, and selling it on is also 'fair use'?

it would be, if the transformation is substantial. If you're just asking for snippets of existing written works, then those snippets are merely derivative works.

For example, if you asked an LLM to summarize the news and stories of 2024, i reckon the output is not infringing, because the informational content of the news is not itself copyrightable - only the article is. A summary, which contains a precis of the information but not the original expression, is surely non-infringing - esp. when any single source is a tiny minority of the input (e.g., chatGPT used millions of sources).

> won't be able to make a living so these important activities will cease.

this is irrelevant as far as i'm concerned. Their being able to make a living or not is orthogonal. If they can't, then they should stop.

replies(1): >>DrScie+ia7
35. omnimu+VU2[view] [source] [discussion] 2025-01-04 09:07:52
>>Kim_Br+9m2
I don't believe a rewrite in a different language is a specific expression.

We will see, because we are well on our way to LLMs being able to translate whole codebases to a different stack without a hitch. If that's OK, then any of the copyleft, open-core, or leaked codebases are up for grabs.

replies(1): >>Kim_Br+6a3
36. omnimu+CW2[view] [source] [discussion] 2025-01-04 09:31:28
>>Kim_Br+9m2
The fact that LLMs can generate any images at all is purely thanks to a database of source images that are copyright protected. It's a form of sophisticated automated photobashing. Photobashing is a gray zone, but often legal because the other artist does the (often original) work.

When you prompt for a Miyazaki image, that image can only exist thanks to his protected work being in the database (where he doesn't want it to be); otherwise the user wouldn't get the Miyazaki image they wanted.

We will see how it all plays out, but I think if Miyazaki took this to court there would be a solid case on the grounds that the resulting images breach the copyright of the source, are not original works, and are created with bad intent that goes against the protections of the original author.

What at least seems to be the current direction is that the resulting images cannot themselves be copyrighted - they fall automatically into the public domain - making them difficult to use commercially.

replies(2): >>Kim_Br+Se3 >>Kim_Br+fi3
37. Kim_Br+6a3[view] [source] [discussion] 2025-01-04 13:16:36
>>omnimu+VU2
A hand rewrite (or any intelligent rewrite) will tend to become unique pretty quickly, especially once you start leaning into the target language's features for improved efficiency. Your Structure and Organization will be different.

If you order an LLM (or a human) to do a straight 1:1 translation, you'll sort of pass one test (it's a completely different language, after all!) but fail to show much difference w.r.t. structure, sequence, or organization. I'm also not entirely sure how good an idea that is technically. If you start iterating on it you can probably get much better results anyway - but then you're doing real creative work!

38. Kim_Br+Se3[view] [source] [discussion] 2025-01-04 14:22:35
>>omnimu+CW2
There's no such database, AFAICT.

If you've ever worked with open-source models (eg one of the Stable Diffusion models or models based on them, using tools such as AUTOMATIC1111 or ComfyUI), you can inspect them yourself and simply see. If you haven't done so already, see if you can figure out the installation instructions for one of the tools and try!
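
If you'd rather not install a whole UI, you can peek at a checkpoint directly. A sketch, assuming you've downloaded a Stable Diffusion checkpoint in safetensors format (the filename here is just an example):

    from safetensors import safe_open

    total = 0
    with safe_open("v1-5-pruned-emaonly.safetensors", framework="pt") as f:
        for name in f.keys():                    # named tensors, e.g. unet blocks
            total += f.get_tensor(name).numel()  # count the weights
    print("%.2f billion parameters" % (total / 1e9))

What's in there is named tensors of weights - on the order of a billion parameters, nowhere near enough to archive the billions of training images.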

Meanwhile ...

Ok, fine, I've heard some crazy compression conspiracy theories, but they're a bit too crazy to be credible.

I've also heard stories about these models being intelligent - a little artist living in your computer. I think that's going a bit too far in another direction.

In reality, I think it's better to install the software and take your time to learn about the way these models are actually built and work.

[ btw: If Miyazaki were to take this to court with the argument you put forward, he wouldn't get very far. "Please remove my images from your systems in whatever form you are holding them". The response for the defense would simply be: "We don't actually have them, and you are quite welcome to inspect all our systems". ]

(Incidentally, I've been here before. I play with synths as a hobby! ;-)

39. Kim_Br+fi3[view] [source] [discussion] 2025-01-04 14:54:58
>>omnimu+CW2
Actually, while I just said "there is no database", maybe you're working from a very different mental model from mine...

What do you mean by "database" in this context? What information do you think is being stored, and how?

replies(1): >>omnimu+dD3
40. omnimu+dD3[view] [source] [discussion] 2025-01-04 17:45:15
>>Kim_Br+fi3
I understand what the model is and how you get to it. I know the training data is not stored. But as far as I understand, the model is closer to a derived intermediary of the training data - like a database index or, as you said, a form of compression.

That's why I deliberately call the training data plus the model "the database": to non-programmers it makes more sense. To me there is an intentional sleight of hand in hiding the fact that the only reason LLMs can work as they do now is the source data. The way it's usually marketed, the model seems like a program that generalised the principles of drawing from looking at other drawings, and that's why it can draw like Miyazaki when it wants to - not that it can draw Miyazaki because it preprocessed every Miyazaki drawing, stemmed patterns out of them, and can mash those together with other patterns (from the database).

That's why I intentionally say "database": to lead these discussions back to what I see as the core of these technologies.

replies(1): >>chii+pv4
41. xp84+1n4[view] [source] [discussion] 2025-01-05 02:05:52
>>DrScie+db1
That seems to be an artifact of the whole copyright framework predating all forms of computing and memory, but if we don't ignore that one, we've all been illegally copying copyrighted text, images and videos into our RAM every time we use the Internet. So I think the courts now basically acknowledge that that doesn't count as a "copy."

*Not a lawyer

replies(1): >>DrScie+je7
42. chii+pv4[view] [source] [discussion] 2025-01-05 04:06:51
>>omnimu+dD3
What you're describing as database would be what i call information.
43. DrScie+ia7[view] [source] [discussion] 2025-01-06 12:09:30
>>chii+EA2
> this is irrelevant as far as i'm concerned. Their being able to make a living or not is orthogonal. If they can't, then they should stop.

It's not orthogonal - it's central. Copyright and IP law isn't some abstract thing - it's law with a purpose: to protect people from having their work ripped off to the point where they can no longer make a living from it.

If journalists can't gather the news, then sure, events still happen, but Google et al won't be able to summarise them, as there will be no reports.

If scientific journals can no longer afford to operate, because nobody needs to subscribe when anybody can get the content free via a rip-off, then there will be no scientific journals left to rip off.

Sure, stealing stuff and selling it on is convenient for both big tech and consumers - but it's not a sustainable economic model.

44. DrScie+je7[view] [source] [discussion] 2025-01-06 12:53:18
>>xp84+1n4
Except I've given you a concrete, real counterexample where they do treat copying into memory as 'making a copy'.