zlacker

[parent] [thread] 131 comments
1. rpcope+(OP)[view] [source] 2025-01-03 04:07:47
> Exploiting user-generated content.

You know, if I've noticed anything in the past couple years, it's that even if you self-host your own site, it's still going to get hoovered up and used/exploited by things like AI training bots. I think between everyone's code getting trained on, even if it's AGPLv3 or something similarly restrictive, and generally everything public on the internet getting "trained" and "transformed" to basically launder it via "AI", I can absolutely see why someone rational would want to share a whole lot less, anywhere, in an open fashion, regardless of where it's hosted.

I'd honestly rather see and think more about how to segment communities locally, and go back to the "fragmented" way things once were. It's easier to want to share with other real people than to inadvertently work for free to enrich companies.

replies(18): >>dend+l >>panark+n >>Camper+42 >>matheu+v2 >>Terr_+Z4 >>dehrma+Aa >>ehnto+qd >>immibi+Rd >>baxtr+2e >>TeMPOr+Gh >>blahbl+gj >>alibar+sj >>cxr+Au >>sneak+yB >>pixelm+M61 >>cousco+6f1 >>foxgla+zn1 >>hulitu+315
2. dend+l[view] [source] 2025-01-03 04:10:51
>>rpcope+(OP)
Nothing to disagree with in this statement, for sure. If it's on the open internet, it will almost surely be used for AI training, consent be damned. But it feels like, even at a rudimentary level, if I post a picture on my site that is then used by a large publisher for ads, I would (at least in theory) have some recourse to pursue the matter and prevent them from using my content.

In contrast, if I uploaded something to a social media site like Instagram, and then Meta "sublicensed" my image to someone else, I wouldn't have much to say there.

Would love someone with actual legal knowledge to chime in here.

replies(1): >>chii+R
3. panark+n[view] [source] 2025-01-03 04:11:01
>>rpcope+(OP)
> how to segment communities locally

So it's not about owning vs. renting property on the internet, it's about controlling the roads that connect the properties so you can keep the world out of your community.

replies(1): >>Nitpic+Yd
◧◩
4. chii+R[view] [source] [discussion] 2025-01-03 04:15:54
>>dend+l
> Meta "sublicensed" my image to someone else, I wouldn't have much to say there.

But you agreed to this when you agreed to the TOS.

> I post a picture on my site that is then used by a large publisher for ads, I would (at least in theory) have some recourse

For which you signed no contract, and therefore their use is a violation of copyright.

But the new AI training methods are currently, at least imho, not a violation of copyright - no more than a human eye viewing it is (which you've implicitly given permission for by putting it up on the internet). On the other hand, if you put it behind a gate (no matter how trivial), then you could at least have legally protected yourself.
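And the gate really can be trivial. A toy Python sketch, with made-up credentials - the point is just that access becomes conditional rather than implicit:

    # Toy sketch: a static file server behind the most trivial of gates
    # (HTTP basic auth). The "guest:guest" credentials are placeholders.
    import base64
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    EXPECTED = "Basic " + base64.b64encode(b"guest:guest").decode()

    class GatedHandler(SimpleHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("Authorization") != EXPECTED:
                # No valid credentials: challenge instead of serving.
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="gate"')
                self.end_headers()
                return
            super().do_GET()

    HTTPServer(("", 8000), GatedHandler).serve_forever()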

replies(6): >>immibi+Ud >>ehnto+5e >>Pittle+tf >>entrop+Pk >>DrScie+fE >>Terr_+3W1
5. Camper+42[view] [source] 2025-01-03 04:29:00
>>rpcope+(OP)
> You know, if I've noticed anything in the past couple years, it's that even if you self-host your own site, it's still going to get hoovered up and used/exploited by things like AI training bots.

So? What do I care? If some stuff I posted to my website (with no requirement for attribution or remuneration, and also no guarantee that the information is true or valid) can improve the AI services that I use, great.

replies(3): >>liontw+R5 >>bulatb+47 >>Anthon+6a
6. matheu+v2[view] [source] 2025-01-03 04:34:09
>>rpcope+(OP)
> I can absolutely see why someone rational would want to share a whole lot less, anywhere, in an open fashion, regardless of where it's hosted.

I've reached the same conclusion.

All data is just bits. Numbers. Once it's out there, trying to control its spread and use is just delusional. People should just stop sharing things publicly. Even things like the AGPLv3 are proving ineffective against this kind of exploitation.

I really didn't expect to live in this "copyright for me, not for thee" world. The same corporations that compare us mere mortals to high seas pirates when we infringe their copyrights are now getting caught shamelessly AI laundering the copyrights of others on an industrial scale.

It's so demoralizing. I feel like giving up and just going private. Problem is I also want to share the things I made. To talk about my projects with real people. Programming is lonely enough as it is. Without sharing I'm not sure what the point even is. I have no idea what I'm supposed to do from now on. I just know I don't want to end up working for free to enrich trillion dollar corporations.

replies(5): >>dend+q3 >>bulatb+s4 >>alison+J8 >>immibi+ne >>pjc50+Nu
◧◩
7. dend+q3[view] [source] [discussion] 2025-01-03 04:43:11
>>matheu+v2
I can relate to the sentiment. For what it's worth, I also know that if someone's personal site/repos/pictures are used to train AI, they have no recourse, short of having TONS of money to go and fight legal battles the way media companies do.

But you know what, I grew up in a family of educators whose whole life mission was to help others by sharing their knowledge. That's what I am doing through my blog. I learned something? Blog about it. I built a reverse-engineered wrapper over some API? Share it openly. For every AI ingestion job over this content there will be a few people who will read my code or blog post and either learn from it, be inspired, ignore it, or unblock themselves from a problem they were trying to solve. I think that makes the effort worth it to me.

For what it's worth, even before AI emerged, I've seen sites that would shamelessly rip off my content and re-publish it on their own domains under a different author. One even tried charging people for it. On several occasions I fought it and won with the help of Google/Bing. Other times, nothing happened. And that's fine. Such is the fate of online content. If my content helped at least one person, it was worth sharing in the open.

◧◩
8. bulatb+s4[view] [source] [discussion] 2025-01-03 04:55:30
>>matheu+v2
Yeah, I hear this. Anything I put online is feeding the machine that will replace me.

Maybe I can carve myself a niche if I can find an audience, and maybe turn that into something kind of reward-shaped, but that's not happening without me feeding the machine. And almost certainly I won't succeed, and I'll just make it harder for myself and everyone like me to succeed in the future.

It seems the only thing to do is do it anyway and try to be unique enough to make it work. And somehow just be fine with pulling up the ladder behind you.

replies(2): >>matheu+W5 >>sneak+WB
9. Terr_+Z4[view] [source] 2025-01-03 05:00:18
>>rpcope+(OP)
IANAL, but lately I've had this quixotic daydream of a combination accept-cookies / agree-to-TOS page, where the Terms of Service say that by proceeding they agree to give the site-owner a perpetual, irrevocable, and royalty-free license to use and re-license any future content they create using any generative AI that was trained on the website contents.

Then you carefully log which LLM user-agents/IPs go past that agreement, along with some very distinctive, secretly crawlable pages whose contents can be distinctively reproduced back out of the model if needed.

Then whenever SomeShittyLLM posts "articles", everybody whose TOS it crawled past gets to duplicate them, ad-free, for free. :P
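The logging half could be a dumb log-scanning script. A toy sketch - the canary path and the user-agent list are hypothetical placeholders:

    # Toy sketch: find AI crawlers that fetched a canary page past the
    # TOS wall, from a standard combined-format access log.
    import json
    import re

    CANARY_PATHS = {"/essays/very-distinctive-canary"}  # hypothetical trap page
    AI_AGENTS = re.compile(r"GPTBot|CCBot|ClaudeBot|Bytespider", re.I)
    LOG_LINE = re.compile(
        r'(\S+) \S+ \S+ \[([^\]]+)\] "\w+ (\S+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"'
    )

    def scan(path="access.log"):
        hits = []
        with open(path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                ip, ts, url, ua = m.groups()
                if url in CANARY_PATHS and AI_AGENTS.search(ua):
                    hits.append({"ip": ip, "time": ts, "url": url, "ua": ua})
        return hits

    if __name__ == "__main__":
        print(json.dumps(scan(), indent=2))

(GPTBot, CCBot, etc. are real crawler user-agents, but the honest crawlers that identify themselves are also exactly the ones you could already block in robots.txt.)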

replies(2): >>spirit+99 >>w4+Qa
◧◩
10. liontw+R5[view] [source] [discussion] 2025-01-03 05:08:34
>>Camper+42
Even if "no attribution etc." is your personal policy, that's not everyone else's.

The end result is that any authors who care about copyright protection will become less accessible. It's a gold rush for AI bots to capture the good will of early internet creators before the well runs dry.

replies(2): >>dend+37 >>Camper+5n1
◧◩◪
11. matheu+W5[view] [source] [discussion] 2025-01-03 05:09:11
>>bulatb+s4
Yeah, I'm trying too. Specifically, the GitHub Sponsors thing.

I'm opposed to advertising and don't want to inflict it on others. So I don't generally advertise my work on sites like this one, I just participate in threads about it whenever I see them.

Somehow people found my projects and posted them here. Just woke up one day and saw I had one sponsor. Not gonna lie, I'm still amazed about it. Not even close to providing for my family despite an incredibly favorable exchange rate, so I can't work full time on my projects. It's still the only thing that gives me hope right now. Really thankful to that person.

> And somehow just be fine with pulling up the ladder behind you.

Do you really think it will come to that? I mean, this AI situation has got to come to a head at some point. We can't have these corporations defending copyright and simultaneously pretending it doesn't exist while exploiting software developers. One of those things has got to go away.

replies(1): >>notpus+ap
◧◩◪
12. dend+37[view] [source] [discussion] 2025-01-03 05:21:08
>>liontw+R5
+1

My content is still MY content, and I'd prefer that if an entity is going to make money off of it directly (i.e., it's not a person learning how to code from something I wrote, but rather a well-funded company pulling my content for its own gain), I at least have some semblance of consent in the matter.

That being said, I think there's no longer any point in crying over spilled milk. LLM technology is out of the bag, and for every company that attempts to manage content ethically (are there any?) there will be ten that disregard any kind of license/copyright notice and pull that content to train their models anyway.

I write because I want to be a better writer, and I enjoy sharing my knowledge with others. That's the motivation. If it helps at least one person, that's a win in my book, especially in the modern internet where there's so much junk scattered around.

replies(1): >>tonyed+063
◧◩
13. bulatb+47[view] [source] [discussion] 2025-01-03 05:21:14
>>Camper+42
Wouldn't you feel just a little bad if you worked really hard on something, gave it out for free in the spirit of sharing, and someone came along and said thanks, loser, and sold it for money? Would you want to go on making it for free for them to sell?
replies(3): >>Captai+kl >>prmous+Jv >>Camper+Tm1
◧◩
14. alison+J8[view] [source] [discussion] 2025-01-03 05:38:53
>>matheu+v2
Who cares? Information wants to be free. You put your stuff out there for free, it's hoovered up and sold back to you by capitalists, that sucks, but you've still made a real contribution to society. Meanwhile a select few will still find your stuff directly. Maybe what you shared will make just one person's life a little bit better, and that was your impact - you made a difference! Capitalists will never have that feeling, because anyone consuming their repackaged content is paying for the privilege - any benefit to society is just an incidental side-effect of their greed. Sucks to be them.

The way I see it, this is exactly what life is about. Do you want to make a positive impact in society? Then share your knowledge, your experiences, your creations. People will try to capitalize on your work, and they might even get rich from it, but oh well. It doesn't take away from your own contribution to the ongoing story of humanity.

I don't have or want kids, but I see my existence in society and free contributions to the "collective consciousness", such as it is, as my legacy. For me that's comforting. I'm choosing to be part of something bigger. If I just disappeared from society and lived like a hermit, or if I buried myself completely in my day job working for capitalists and not producing anything outside of that, I think I'd lose my sense of meaning.

replies(1): >>throwa+1L1
◧◩
15. spirit+99[view] [source] [discussion] 2025-01-03 05:42:04
>>Terr_+Z4
I love this. I did something like that with made-up-Italian-sounding words a while ago (you used to be able to find my site if you looked for FANTACHIAVE).

It's a bit like the fake "trap streets" in map databases.

◧◩
16. Anthon+6a[view] [source] [discussion] 2025-01-03 05:53:54
>>Camper+42
I think the source of the contrary sentiment goes something like this: AI stuff (especially image generation) is competition for artists. They don't much like competition that can easily undercut them on price, so they want to veto it somehow and lean on their go-to of accusing anybody who competes with them of theft.

The problem in this case is that it doesn't matter. The AI stuff is going to exist, and compete with them, whether the AI companies have to pay some pittance for training data or not.

But the chorus is made worse by two major factors.

First, many of the AI companies themselves are closed-source profiteers. "OpenAI" stepping all over themselves to be the opposite of their own name etc. If all the models got trained and then published, people would be much more inclined to say "oh, this is neat, I can use this myself and it knows my own work". But when you have companies hoovering everything up for free and then trying to keep the result proprietary, they look like scumbags and that pisses people off.

Second, then you get other opportunistic scumbags who try to turn that legitimate ire into their own profit by claiming that training for free should be prohibited so that only proprietary models can be created.

Whereas the solution you actually want is that anybody can train a model on public data but then they have to publish the model/weights. Which is probably not going to happen because in practice the law is likely to end up being what favors one of the scumbags.

replies(1): >>dend+Jb
17. dehrma+Aa[view] [source] 2025-01-03 06:02:33
>>rpcope+(OP)
> used/exploited by things like AI training bots

How is this worse than a human reading your blog/code, remembering the key parts of it, and creating something transformative from it?

replies(6): >>moron4+5b >>dend+hc >>sifar+vd >>ykonst+Tk >>mrweas+Qq >>camgun+eZ
◧◩
18. w4+Qa[view] [source] [discussion] 2025-01-03 06:05:37
>>Terr_+Z4
This idea is reminiscent of the opening scene of Accelerando by Charlie Stross:

"Are you saying you taught yourself the language just so you could talk to me?"

"Da, was easy: Spawn billion-node neural network, and download Teletubbies and Sesame Street at maximum speed. Pardon excuse entropy overlay of bad grammar: Am afraid of digital fingerprints steganographically masked into my-our tutorials."

"Uh, I'm not sure I got that. Let me get this straight, you claim to be some kind of AI, working for KGB dot RU, and you're afraid of a copyright infringement lawsuit over your translator semiotics?"

"Am have been badly burned by viral end-user license agreements. Have no desire to experiment with patent shell companies held by Chechen infoterrorists. You are human, you must not worry cereal company repossess your small intestine because digest unlicensed food with it, right?”

- https://www.antipope.org/charlie/blog-static/fiction/acceler...

Amusing to also note that this excerpt predicted the current LLM training methodology quite well, in 2005.

replies(2): >>TeMPOr+Ru >>Terr_+sC2
◧◩
19. moron4+5b[view] [source] [discussion] 2025-01-03 06:07:54
>>dehrma+Aa
Seriously? How is rule utilitarianism different from act utilitarianism?
◧◩◪
20. dend+Jb[view] [source] [discussion] 2025-01-03 06:14:36
>>Anthon+6a
I think that's an overly reductive way of looking at it. Artists are, by definition, creators of art. AI-generated "art" (it's not art at all in my eyes) is effectively a machine-based reproduction of actual art, but it doesn't take the same skill, time, and passion for the craft for a user to generate an output, and it certainly generates large profits for those who created the models.

So, imagine the scenario where you, an artist, trained for years to develop a specific technique and style, only for a massively funded company to swoop in, train a model on your art, make bank off of your skill while you get nothing, and now some rando can also create look-alikes (and also potentially profit from them - I've seen AI-generated images for sale at physical print stores and Etsy that mimic art styles of modern artists), potentially destroying your livelihood. Very little to be happy about here, to be frank.

It's less about competition and more about the ethical way to do it. If another artist learned the same techniques and then managed to produce similar art, do you think there would be just as visceral a reaction to them publishing their art? Likely not, because it still required skill to achieve what they did. Someone with a model and a prompt is nowhere near that same skill level, yet they now get to reap the benefits of the artist's developed craft. Is this "gatekeeping what's art"? I don't think so. Is this fair in any capacity? I don't think so either. Because we're comparing apples to pinecones.

All that being said, I do agree that the ship has sailed - the models are there, the trend of training on art AND written content shared openly will continue, and we're yet to see what the consequences of that will be. Their presence certainly won't stop me from continuously writing, perfecting my craft, and sharing it with the world. My job is to help others with it.

My hunch is that in the near-term we'll see a major devaluing of both written and image material, while a premium will be put on exceptional human skill. That is, would you pay to read a blog post written and thoroughly researched by Molly White (https://mastodon.social/@molly0xfff@hachyderm.io) or Cory Doctorow (https://pluralistic.net/), or some AI slop generated by an automated aggregator? My hunch is you'd pick the former. I know I would. As an anecdotal data point, and speaking just for myself, if I see now that someone uses AI-generated images in their blog post or site, I almost instantly close the tab. Same applies to videos on YouTube that have an AI-generated thumbnail or static art. It somehow carries a very negative connotation to me.

replies(2): >>Anthon+481 >>Camper+wn1
◧◩
21. dend+hc[view] [source] [discussion] 2025-01-03 06:20:33
>>dehrma+Aa
In the grand scheme of things, and at this point, it probably doesn't matter. I know that for me it certainly is not in any shape or form a discouragement from continuing to write on my blog and contributing code to open source communities (my own and others).

But if we're going to dig into this a bit, one person reading my code, internalizing it, processing it themselves, tweaking it and experimenting with it, and then shipping something transformative means that I've enhanced the knowledge of some individual with my work. It's a win. They got my content for free, as I intended it to be, and their life got a tiny bit better because of it (I hope).

The opposite of that is some massively funded company taking my content, training a model off of it, and then reaping profits while the authors don't even get as much as an acknowledgement. You could theoretically argue that in the long run an LLM would likely help other people through my content that it trained on, but ethically this is most definitely a more-than-gray area.

The (good/bad) news is that this ship has sailed and we now need to adjust to this new mode of operation.

replies(1): >>dehrma+Bc
◧◩◪
22. dehrma+Bc[view] [source] [discussion] 2025-01-03 06:24:27
>>dend+hc
> The opposite of that is some massively funded company taking my content, training a model off of it, and then reaping profits while the authors don't even get as much as an acknowledgement.

Taking out the "training a model" part, the same thing could happen with a human at the company.

replies(2): >>dend+Uc >>thieaw+1e
◧◩◪◨
23. dend+Uc[view] [source] [discussion] 2025-01-03 06:29:20
>>dehrma+Bc
Oh, 100%. I mentioned this in another comment (>>42582518 ) - I've dealt with a fair share of stolen content (thankfully nothing too important, just a random blog post here and there), and it definitely stings. The difference is that this is now done at a massive scale.

But again - this doesn't stop me from continuing to write and publish in the open. I am writing for the people reading my content, and as a sounding board for myself. There will always be actors in some shape or form who try to piggyback off that effort, but that's the trade-off of the open web. I am certainly not planning to lock all my writing behind a paywall to stop it.

24. ehnto+qd[view] [source] 2025-01-03 06:36:14
>>rpcope+(OP)
I have decided not to put text online if I feel it has IP or personal ideas in it. There are some exceptions, like posting here, and stuff I want to get out there for commercial reasons, e.g. marketing of services. The one I struggle with is Discord, but I am not too personal on Discord servers, so I suppose I'll just mesh into the soup of barely worthwhile chatter.

I also started self hosting my git repos and knowledge base, both were trivial to set up.

◧◩
25. sifar+vd[view] [source] [discussion] 2025-01-03 06:36:40
>>dehrma+Aa
Scale.
26. immibi+Rd[view] [source] 2025-01-03 06:39:55
>>rpcope+(OP)
For images, there's Nightshade, which imperceptibly alters your images but makes them poison for AI (does anyone understand why?)

I don't know if there's something similar for text. You could try writing nonsense with a color that doesn't contrast with the background.

The evidence Nightshade works is that AI companies want to make it illegal.

replies(2): >>kelsey+qt >>rcxdud+Iz
◧◩◪
27. immibi+Ud[view] [source] [discussion] 2025-01-03 06:40:58
>>chii+R
> but you agreed to this

Yes, that was the point? You agree to this by using Meta. So don't.

◧◩
28. Nitpic+Yd[view] [source] [discussion] 2025-01-03 06:42:12
>>panark+n
Ha, I have the same feeling about the recent good_social_network vs. bad_social_network debate that kinda goes on in the US. Looking from the outside, it always felt like the main problem is control, and wanting more of it. The details, principles, and "politics" don't matter in the grand scheme of things; it's control that people want, even though they paint it differently.

bad_social_network was good 10 years ago, because it was controlled by "a friend of ours". Now it's controlled by someone who's perceived as "a friend of theirs" and it's therefore bad. So the politik aktivists move to good_social_network, and rave about the good there. Echo chambers be damned, we have control. Until the next "friend of theirs" buys it out, and rinse and repeat. So silly.

replies(1): >>frabcu+Cn
◧◩◪◨
29. thieaw+1e[view] [source] [discussion] 2025-01-03 06:42:56
>>dehrma+Bc
This is already a scenario that people generally accept as bad; could you elaborate on the point you are making?
30. baxtr+2e[view] [source] 2025-01-03 06:43:20
>>rpcope+(OP)
Part of me viscerally agrees because large corporations have monetized UGC.

Another part of me though thinks differently. We are a species that builds knowledge from generation to generation. From one person to another. Over years, over centuries.

Philosophically this part tends to think that your thoughts and ideas belong to humanity and thus need to be shared with all of us.

replies(4): >>yowayb+qf >>Salgat+9h >>friend+gk >>yencab+ks1
◧◩◪
31. ehnto+5e[view] [source] [discussion] 2025-01-03 06:43:52
>>chii+R
Strong disagree on the last paragraph. It's data online, your data, and it was used for commercial purposes without your consent.

In fact, I never consented for anyone to access my server. Just because it has an IP address, does not make it a public service.

Obviously in a practical sense that is a silly position to take, and in prior cases there is usually an aggravating factor that got the person charged, e.g. breaking through access controls, violating ToS, or intellectual property violations.

But I don't retract the prior statement. Just because I have an address doesn't mean you can come in through any unlocked doors.

replies(2): >>ahtihn+wg >>yencab+nr1
◧◩
32. immibi+ne[view] [source] [discussion] 2025-01-03 06:46:59
>>matheu+v2
There are two ways we can go from "copyright for me, no copyright for thee"

We can force it to "copyright for me, copyright for thee" by injecting AI poison and by not sharing at all. See Nightshade.

Or we can force it to "no copyright for me, no copyright for thee" by ignoring their copyright just like they ignore ours, and making sure they don't find us. See Anna's Archive.

replies(1): >>notpus+Uo
◧◩
33. yowayb+qf[view] [source] [discussion] 2025-01-03 07:00:43
>>baxtr+2e
Great take. Also agree with parent. I feel like some form of provenance would take us to the next level.
◧◩◪
34. Pittle+tf[view] [source] [discussion] 2025-01-03 07:01:03
>>chii+R
> but you agreed to this, when agreeing to the TOS

The legal definition of agreement means basically zilch.

◧◩◪◨
35. ahtihn+wg[view] [source] [discussion] 2025-01-03 07:14:04
>>ehnto+5e
> In fact, I never consented for anyone to access my server. Just because it has an IP address, does not make it a public service.

If you don't take any steps to make it clear that it's not public, like an auth wall or putting pages on unguessable paths, then it is public, because that is what everyone expects.

Just as with a storefront: if the door is unlocked, you'd expect people to just come in, and no one would take you seriously if you complained that people keep coming in when you haven't somehow made it clear that they're not supposed to.

replies(1): >>DrScie+SE
◧◩
36. Salgat+9h[view] [source] [discussion] 2025-01-03 07:20:58
>>baxtr+2e
There are two decades' worth of countless conversations on Reddit alone that would otherwise be buried into nothingness, but ML has revived all that activity as useful data. ML is definitely a great way to bring back utility from a lot of old and unused data.
replies(2): >>tempes+ir >>Terr_+nD
37. TeMPOr+Gh[view] [source] 2025-01-03 07:27:55
>>rpcope+(OP)
> even if you self-host your own site, it's still going to get hoovered up and used/exploited by things like AI training bots. I think between everyone's code getting trained on, even if it's AGPLv3 or something similarly restrictive, and generally everything public on the internet getting "trained" and "transformed" to basically launder it via "AI", I can absolutely see why someone rational would want to share a whole lot less, anywhere, in an open fashion (...)

> (...) share with other real people than inadvertently working for free to enrich companies.

That attitude, quite commonly expressed on HN these days, strikes me as a peculiar form of selfishness - the same kind we routinely accuse companies of and attribute the sad state of society to.

A person is not entitled to 100% of the value of everything they do, much less to the secondary value it subsequently generates. A person is not entitled to receive rent for any of their ideas just because they wrote them down and put them on display somewhere. Just because they touched something, and it exists, doesn't mean everyone else touching it owes them money.

Society works best when people don't capture all the fruits of their labor for themselves. Conversely, striving to capture 100% (or more) of the value generated is a hallmark of late-stage capitalism and everything that's bad and wrong and Scrooge-y.

Self-censoring on principle because some company (gasp!) will train an LLM on it (gasp!!) and won't share the profit? That's just feeling entitled to way over 100% of the value of one's hypothetical output, and feeling offended that society hasn't already sent advance royalty cheques.

Chill out. No matter what you do, someone else will somehow make money out of it; that's how it's supposed to work - and AI in particular is, for better or worse, one of the most fundamentally transformative things to happen to humanity: somewhere between the Internet and the Industrial Revolution if it's just a bubble that pops, much more if it isn't. Assuming it all doesn't go to shit (let's entertain something other than maximum pessimism for a moment), everyone will benefit much more from it than from whatever they imagine they could get from their Internet comments.

(Speaking of the Industrial Revolution - I can understand this attitude from people who actually earn a living from the kind of IP that AI is trained on, only for it to turn around and compete with them. They're the modern Luddites, and I respect their struggle and that they have a real point. Everyone else, those complaining about "AI theft" the most, especially here? Are not them.)

replies(3): >>johnkl+OY >>rpdill+xe1 >>yokem5+Dj1
38. blahbl+gj[view] [source] 2025-01-03 07:44:29
>>rpcope+(OP)
There's a neat little thing some discords I've seen use, where they honeypot spam bots into a channel-- if someone posts into it, their messages in the last 5 minutes get deleted and their account gets kicked.

Is there a meaningful way to make it so a website shares a resource that automatically updates other sites' blacklists to block those IP addresses? Knowing that you will lose X, but hopefully you'll retain everyone who can read?
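Roughly what I'm imagining, as a toy sketch (the feed URL and its format are made up): each site pulls a community-maintained list of scraper IPs and turns it into web-server deny rules.

    # Toy sketch: pull a shared (hypothetical) blocklist feed and emit
    # nginx "deny" rules from it. Run on a timer, then reload nginx.
    import ipaddress
    import urllib.request

    BLOCKLIST_URL = "https://example.org/scraper-ips.txt"  # hypothetical feed

    def fetch_blocklist(url=BLOCKLIST_URL):
        with urllib.request.urlopen(url, timeout=10) as resp:
            lines = resp.read().decode().splitlines()
        nets = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            try:
                # Accept single IPs or CIDR ranges; skip malformed entries
                # rather than risk denying the wrong hosts.
                nets.append(str(ipaddress.ip_network(line, strict=False)))
            except ValueError:
                pass
        return nets

    if __name__ == "__main__":
        # e.g. redirect into a file that nginx includes:
        #   include /etc/nginx/conf.d/blocklist.conf;
        print("\n".join(f"deny {net};" for net in fetch_blocklist()))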

replies(2): >>notpus+Fm >>atribe+yW
39. alibar+sj[view] [source] 2025-01-03 07:46:40
>>rpcope+(OP)
Based on my experience, I like using AI (GitHub Copilot) for things like answering questions about a language that I could easily verify in the documentation. Basically 'yes/no' questions. To be honest, if I were writing such documentation for a product/feature, I wouldn't mind the AI hoovering it up.

I've found it to be pretty crap at doing things like actual algorithms or explaining 'science' - the kind of interesting work that I find on websites or blogs. It just throws out sensible looking code and nice sounding words that just don't quite work or misses out huge chunks of understanding / reasoning.

Despite not having done it in ages, I enjoy writing and publishing online info that I would have found useful when I was trying to build / learn something. If people want to pay a company to mash that up and serve them garbage instead, then more fool them.

replies(1): >>namari+Nv
◧◩
40. friend+gk[view] [source] [discussion] 2025-01-03 07:53:37
>>baxtr+2e
If you recall high school history, rapid, exponential "progress" happened once knowledge was 1) written down (printing press), 2) archived for the future (libraries), 3) systematized (textbooks/encyclopaedias), and 4) proactively shared (public education), all on a massive scale.

The fact that some knowledge exists and is even accessible does not really matter if it takes a scholar highly trained in a very narrow field to find that piece of information. You need a well-established knowledge creation and distribution funnel in operation for humanity as a whole to reap the benefits of knowledge.

There is undoubtedly a lot of useful knowledge on internet platforms; however, most of it remains unsystematized and largely undiscoverable, meaning that these platforms' contribution to the totality of human knowledge is infinitesimal, and further drowned out by cat and porn videos.

replies(1): >>TeMPOr+7u
◧◩◪
41. entrop+Pk[view] [source] [discussion] 2025-01-03 07:59:58
>>chii+R
>But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it (which you've implicitly given permission to do so, by putting it up on the internet).

I don't understand how that matters. I thought that the whole idea of copyright and licences was that the holder of the rights can decide what is ok to do with the content and what is not. If the holder of the rights does not agree to a certain kind of use, what else is there to discuss?

It sure does not matter if I think that downloading a torrent is no more piracy than borrowing media from a friend.

replies(2): >>chii+Jo >>Terr_+XW1
◧◩
42. ykonst+Tk[view] [source] [discussion] 2025-01-03 08:00:27
>>dehrma+Aa
Scale makes all the difference in the world.
◧◩◪
43. Captai+kl[view] [source] [discussion] 2025-01-03 08:06:38
>>bulatb+47
No, not really? If others can get my stuff for free, then that means that whoever sells it for money must have done something to make it worth money. So they've earned it.
◧◩
44. notpus+Fm[view] [source] [discussion] 2025-01-03 08:26:13
>>blahbl+gj
Most scrapers use residential IPs nowadays. They will just rotate their IP and go on, while the IP you banned would get assigned to an innocent user that won’t be able to access your site now.
◧◩◪
45. frabcu+Cn[view] [source] [discussion] 2025-01-03 08:35:54
>>Nitpic+Yd
One of those social networks has a protocol and lets end users make their own feed algorithm and moderation system.

The other never has.

There can be technical differences between networks as well as social.

◧◩◪◨
46. chii+Jo[view] [source] [discussion] 2025-01-03 08:47:39
>>entrop+Pk
> If the holder of the rights does not agree to a certain kind of use, what else is there to discuss?

The holder of content does not automatically get to prescribe how I use said content, as long as I comply with the copyrights.

The holder does not get to dictate anything beyond that - for example, I can learn from the content. Or I can berate it. Copyright is not a right that covers every single conceivable use - it is a limited set of uses laid out in the law.

So the current arguments center on the fact that it is unknown if existing copyright covers the use of said works in ML training.

replies(2): >>chroma+vs >>TheOth+lv
◧◩◪
47. notpus+Uo[view] [source] [discussion] 2025-01-03 08:48:47
>>immibi+ne
We can also do “copyright for thee, no copyright for me”! It does sound a bit hypocritical, but until we see where the copyright needle goes this might be the safest option.
◧◩◪◨
48. notpus+ap[view] [source] [discussion] 2025-01-03 08:50:49
>>matheu+W5
> So I don't generally advertise my work on sites like this one

Please do — I for one always love to hear about indie projects, if they are relevant to the topic discussed.

◧◩
49. mrweas+Qq[view] [source] [discussion] 2025-01-03 09:08:58
>>dehrma+Aa
Attribution. If you read a book, blog, or code, and others ask where you got your ideas/inspiration, you can refer them back to the original author. This helps people build a reputation. Even if it only happens once in a while, it still helps the original author.

Once an AI has hoovered up your work and regurgitated it as its own, all links back to the original creator are lost.

◧◩◪
50. tempes+ir[view] [source] [discussion] 2025-01-03 09:15:36
>>Salgat+9h
This seems like a reasonable take to me. I wish those downvoting you would explain where they disagree.
◧◩◪◨⬒
51. chroma+vs[view] [source] [discussion] 2025-01-03 09:27:13
>>chii+Jo
Yeah, it is called _copy_ right. The question is whether AI is making obfuscated copies or not.

Interestingly, in German it is not called copyright but Urheberrecht ("author's rights"), so there the word itself implies more.

BTW, at least in Germany you can own image rights to your art piece or building placed in a public space.

◧◩
52. kelsey+qt[view] [source] [discussion] 2025-01-03 09:36:54
>>immibi+Rd
Link to Nightshade: https://nightshade.cs.uchicago.edu/whatis.html

This is fascinating. Would be great to have a web interface artists can use that doesn't require them to install the software locally.

◧◩◪
53. TeMPOr+7u[view] [source] [discussion] 2025-01-03 09:44:01
>>friend+gk
Now we have 5) aggregated and internalized as a whole by computational constructs such as LLMs, which are - 4) - proactively shared (open weights, but also freemium service and dirt-cheap API access to commercial SOTA models), still on a massive scale.

> There is undoubtedly a lot of useful knowledge on internet platforms, however, most of that knowledge remains unsystematized and largely undiscoverable, meaning that contribution to the totality of human knowledge by these platforms is infinitesimal, which is further drowned by cat and porn videos.

Precisely that. Which is why I often argue that for 99%+ of the content in the training data, its marginal contribution to the training process - itself infinitesimal in isolation - is still by far the most value that content will ever bring to the world.

54. cxr+Au[view] [source] 2025-01-03 09:48:42
>>rpcope+(OP)
It'd be great if you folks would stop showing up and derailing the comments with threads like this.
◧◩
55. pjc50+Nu[view] [source] [discussion] 2025-01-03 09:51:17
>>matheu+v2
> I really didn't expect to live in this "copyright for me, not for thee" world

Having been interested in copyright activism for two decades, that's exactly what I expected. Copyright is very much about power, and concentration of power.

◧◩◪
56. TeMPOr+Ru[view] [source] [discussion] 2025-01-03 09:52:24
>>w4+Qa
Also amusing:

> patent shell companies held by Chechen infoterrorists

This perfectly captures what both patent trolls and the MAFIAA look like in my mind.

◧◩◪◨⬒
57. TheOth+lv[view] [source] [discussion] 2025-01-03 09:57:22
>>chii+Jo
Copyright means the holder does automatically get to prescribe how content can be copied. That's literally the definition of copyright.

A typical copyright notice for a book says something like (to paraphrase...) "not to be stored, transmitted, or used by or on any electronic device without explicit permission."

That clearly includes use for training, because you can't train without making a copy, even if the copy is subsequently thrown away.

Any argument about this is trying to redefine copyright as the right to extract the semantic or cultural value of a document. In reality the definition is already clear - no copying of a document by any means for any purpose without explicit permission.

This is even implicitly acknowledged in the CC definitions. CC would be meaningless and pointless without it.

replies(3): >>rcxdud+ty >>chii+mA >>rpdill+Ja1
◧◩◪
58. prmous+Jv[view] [source] [discussion] 2025-01-03 10:04:25
>>bulatb+47
The ones who really lose are the ones who buy their stuff while yours stays free.
◧◩
59. namari+Nv[view] [source] [discussion] 2025-01-03 10:04:51
>>alibar+sj
I argued years ago, based on how LLMs are built, that they would only ever amount to lossy and very memory-inefficient compression algorithms. The whole 'hallucination' framing misses the mark: LLMs are not 'occasionally' wrong or hallucinating; they can only ever return lower-resolution versions of what was in their training data. I was mocked then, but I feel vindicated now.
replies(1): >>richar+wy
◧◩◪◨⬒⬓
60. rcxdud+ty[view] [source] [discussion] 2025-01-03 10:39:43
>>TheOth+lv
This is a particularly extreme interpretation of copyright, and not one that has seen much support in the courts. You can put what you like in a copyright notice or license, but it doesn't mean it'll hold up, and the courts have generally taken a dim view of any argument that relies on the fact that electronic data is technically copied many times just to make it viewable to a user. Copyright is probably better understood as distribution rights.

(Not saying training will necessarily fall in the same boat, just saying that the view 'copying to a screen or over the internet is necessarily a copy for the purposes of copyright' is reductive to the point of being outright incorrect)

◧◩◪
61. richar+wy[view] [source] [discussion] 2025-01-03 10:40:19
>>namari+Nv
They can combine two things in a way that never appeared together in the source material.
replies(1): >>namari+hQ
◧◩
62. rcxdud+Iz[view] [source] [discussion] 2025-01-03 10:54:20
>>immibi+Rd
Nightshade and Glaze are basically adversarial attacks on various commonly used subcomponents of image generators, most notably the CLIP image captioner, which is used both to generate training data and as part of the generation process.

Like most adversarial attacks, they get more perceptible as they try to be robust to more transformations of the data (in practice, both tend to make images look like slightly janky AI art when applied at a level that isn't trivially removable - ironically), and they are specific to the net(s) they are targeting, so it's more of a temporary defense against the current generation than a long-term protection.
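For intuition, the simplest member of this attack family is FGSM: nudge every pixel a tiny step in the direction that most increases the target network's loss. Nightshade's actual method is more elaborate (and targets the text-image association rather than a classifier), but a toy PyTorch sketch of the basic mechanism looks like:

    # Toy FGSM sketch: illustrates adversarial perturbation in general,
    # NOT Nightshade's actual algorithm. Requires torch + torchvision.
    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    def fgsm_perturb(image, label, epsilon=4 / 255):
        # image: (1, 3, H, W) float tensor in [0, 1]; label: (1,) class index
        image = image.clone().requires_grad_(True)
        loss = F.cross_entropy(model(image), label)
        loss.backward()
        # Step each pixel by +/- epsilon along the sign of the loss gradient:
        # a small change to our eyes, a large change to the model.
        return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

The same trade-off shows up even in this toy version: raise epsilon and the perturbation survives more resizing/compression, but it also becomes more visible.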

◧◩◪◨⬒⬓
63. chii+mA[view] [source] [discussion] 2025-01-03 11:03:07
>>TheOth+lv
> That clearly includes use for training, because you can't train without making a copy, even if the copy is subsequently thrown away.

A copy for ingestion purposes, such as viewing in a browser, is not the same as a distribution copy that you make when sending it to another person.

> the right to extract the semantic or cultural value of a document.

This right does not belong to the author - in fact, it is not an explicit right granted by the copyright act. Therefore, the extraction of information from a work is not something the author can (or should) control. Otherwise, how would anyone learn from a textbook, music, or art?

In the future, when the courts finally decide what the limits of ML training are, maybe it will become a new right granted to authors. But it isn't one at the moment.

64. sneak+yB[view] [source] 2025-01-03 11:17:31
>>rpcope+(OP)
Copying data isn’t exploitation.
◧◩◪
65. sneak+WB[view] [source] [discussion] 2025-01-03 11:21:37
>>bulatb+s4
If a machine can replace a creative, the creative isn’t creative and should be replaced.
◧◩◪
66. Terr_+nD[view] [source] [discussion] 2025-01-03 11:38:28
>>Salgat+9h
> revived that activity as useful data

Revived as compressed text associations, it is potentially useful data, but also potentially totally wrong in non-obvious ways. (Or, to riff on Futurama, "The worst kind of incorrect.")

replies(1): >>Salgat+Db1
◧◩◪
67. DrScie+fE[view] [source] [discussion] 2025-01-03 11:46:33
>>chii+R
> But the new AI training methods are currently, at least imho, not a violation of copyright - not any more than a human eye viewing it

Interesting comparison - but if a human viewed something, memorized it, and reproduced it closely enough to be recognisably the same, wouldn't that still breach copyright?

i.e., in the human case it doesn't matter whether it went through an intermediate neural encoding - what matters is whether the output is sufficiently similar to be deemed a copy.

Surely the same is the case of AI?

replies(4): >>omnimu+pI >>Toucan+kW >>mystif+wD1 >>Kim_Br+Km2
◧◩◪◨⬒
68. DrScie+SE[view] [source] [discussion] 2025-01-03 11:52:45
>>ahtihn+wg
Your shop might be open sure - but aren't we talking about people coming in and taking whatever they like for free?

ie if you were an art gallery, the expectation would be people could come in and look, but you don't expect them to come in, photograph everything and then sell prints of everything online.

replies(1): >>chii+o21
◧◩◪◨
69. omnimu+pI[view] [source] [discussion] 2025-01-03 12:28:59
>>DrScie+fE
This whole "AI learns like a human" line is a trajectory of thought pushed by AI companies. They simultaneously try to humanize AI (it learns like a human would) and dehumanize humans (humans are just stochastic parrots anyway). It's a distraction, if anything, if not straight-up anti-human.

But you are right that copyright is complex and in the end decided by humans (often in court). Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language by yourself, there is a high chance it's still copyright infringement.

On the other hand, with images and art it's even more about cultural context. For example, works of pop artists like Andy Warhol are for sure original works (even though some were disputed recently in court, and lost). Nobody considers Andy Warhol's work unoriginal even when it looks very similar to the output it was riffing off, because the essence is different from the original.

Compare that to people prompting directly with the name of the artist they want to replicate. That is direct copyright infringement in both essence and intention, no matter the resulting image. It's also different from a human trying to replicate some artist's style, because humans can't do it 100% even if they want to; there is still a piece of their own "essence". There are many people who try to fake some famous artist's style and sell it as the real thing and simply can't pull it off. That is of course copyright infringement because of the intent, but it's still more original work than anything coming from LLMs.

replies(3): >>DrScie+RR >>Terr_+cW1 >>Kim_Br+0n2
◧◩◪◨
70. namari+hQ[view] [source] [discussion] 2025-01-03 13:39:41
>>richar+wy
YouTube's compression algorithm also produces lots of artifacts that were never filmed by the video producers.
replies(1): >>wizzwi+AY
◧◩◪◨⬒
71. DrScie+RR[view] [source] [discussion] 2025-01-03 13:52:55
>>omnimu+pI
It's both complex and extremely simple for the same reason - it's a human judgement in the end.

Just because you can't define something mathematically doesn't mean it isn't obvious to most people in 99% of cases.

Reminds me of the endless games in tax law/avoidance/evasion and the almost pointless attempt to define things absolutely in words. To be honest, you could simplify the whole thing with a 'taking the piss' test: if the jury thinks you are obviously taking the piss, then you are guilty. And if you whine about the law not being clear, and how it's unfair because you don't know whether or not you are breaking it - well, don't take the piss then; don't pretend you don't know whether something is an aggressive tax dodge or not.

If you create some fake IP, and license it from some shell company in a low tax regime to nuke your profits in the country you are actually doing business in - let's not pretend we all can't see what you doing there - you are taking the piss.

Same goes for what some tech companies are doing right now - every reasonable person can see they are taking the piss - and high paid lawyers arguing technicalities isn't going to change that.

◧◩◪◨
72. Toucan+kW[view] [source] [discussion] 2025-01-03 14:28:28
>>DrScie+fE
The difference is that an image generation algorithm does not consume images the way a human does, nor reproduce them that way. If you show a human several Rembrandts and ask them to duplicate them, you won't get exact copies, no matter how brilliant the human is: the human doesn't know how Rembrandt painted, and especially if you don't permit them to keep references, you won't get the exact painting. You'll get the elements of the original that most stuck out to them, combined with an ethereal but detectable sense of their own tastes leaking through. That's how inspiration works.

If on the other hand you ask an image generator for a Rembrandt, you'll get several usable images, good odds that a few of them will be outright copies, and decent odds that a few will be composited into an Etsy or eBay product image despite you not asking for that. And the better the generator is, the better it's going to be at making really good Rembrandt-style paintings, which, ironically, increases the odds of it just copying a real one that appeared many times in its training data.

People try to excuse this with explanations of how it doesn't store the images in its model, which is true: it doesn't. But if a painting by any artist is famous, it's going to show up in the training data many, many times, and the more popular the artist, the more times it gets averaged in. If the same piece appears in lots and lots of places, it creates a "rut" in the data, if you will, where the algorithm is likely to strike repeatedly. This is why it's possible to get fully copied artworks out of image generators with the right prompts.

replies(2): >>chii+051 >>HanCli+861
◧◩
73. atribe+yW[view] [source] [discussion] 2025-01-03 14:30:43
>>blahbl+gj
I saw a Discord server use this but it never actually caught anything. Turns out all the spammers were just human idiots!
◧◩◪◨⬒
74. wizzwi+AY[view] [source] [discussion] 2025-01-03 14:48:03
>>namari+hQ
And datamoshing lets you produce effects that weren't in the source clips.
◧◩
75. johnkl+OY[view] [source] [discussion] 2025-01-03 14:49:27
>>TeMPOr+Gh
> The society works best when people don't capture all the fruits of their labor for themselves.

Sure, but it sounds like you think people shouldn't be upset about businesses trying to capture all the fruits of people's labor, too.

Capitalism is evil, and thinking that normalizing exploitation is OK is either shortsighted or also evil. Are you simply unaware that this is what's happening and what people are upset about? Have you never thought about it? Or do you want businesses to succeed in exploiting people's work? It sounds like it, because you wrote, "that's how it's supposed to work".

I truly wonder if you're self-aware, or if you just think that you'll one day be on the side of the exploiters.

◧◩
76. camgun+eZ[view] [source] [discussion] 2025-01-03 14:52:00
>>dehrma+Aa
One of the--admittedly many--things that puts me off AI is that the pitch starts off as "you will have abilities you never had before and probably never would have had, be excited", then when critics are like, "woof, it's a little worrying you can <generate a million deepfakes>, <send a million personalized phishing emails>, <scrape a million websites and synthesize new ones>, etc.", the pitch switches to "you could always have done this, calm down".

The whole point of software engineering is to do stuff faster than you could before. It is THE feature. We could already add, we could already FMA, we could already do matrix math, etc. etc. Doing it billions of times faster than we could before at far less energy expenditure--even including what it takes to build and deliver computers--has led to an explosion of productivity, discovery, and prosperity. Scale is the point. It changes everything and we know it; we shouldn't pretend otherwise.

◧◩◪◨⬒⬓
77. chii+o21[view] [source] [discussion] 2025-01-03 15:11:50
>>DrScie+SE
That's not what's happening.

Instead, it's that some people come into your gallery, study the art and its style, and leave with the learned information. They then replicate that style in their own gallery. Of course, none of the images are copies, or would be judged to be copies by a reasonable person.

So now you, the gallery owner, want to forbid entry to just those people who would come to learn the style. But you still want people to come and admire the art, and maybe buy a print.

replies(1): >>DrScie+h31
◧◩◪◨⬒⬓⬔
78. DrScie+h31[view] [source] [discussion] 2025-01-03 15:18:18
>>chii+o21
> Of course, none of the images are copies, or would be judged to be copies by a reasonable person.

That's the fiction of course.

Tell me how something like ChatGPT can simultaneously claim to return accurate information and be completely independent from the sources of that information?

In terms of images - copyright isn't only for exact copies. If it were, then people would have been taking the piss by making minor changes for decades.

Sure, you could argue some of it is fair use, with genuinely original content being produced in the process, but I think you are also overlooking an important part of what's considered 'fair': industrialised copying of source material isn't really the same, in terms of fairness, as one person getting inspiration.

Taking the Encyclopaedia Britannica and running it through an algorithm to change the wording, but not the meaning, and selling it on is really not the same as a student reading it and including those facts in their essay. The latter is considered fair use; the former is taking the piss.

replies(1): >>chii+O61
◧◩◪◨⬒
79. chii+051[view] [source] [discussion] 2025-01-03 15:29:43
>>Toucan+kW
> with the right prompts.

That is doing a lot of heavy lifting. Just because you could "get full copies" with the right prompts doesn't mean the weights and the training are copyright infringement.

I could also get a full copy of any work out of the digits of pi.

The point I would like to emphasize is that using data to train a model is not copyright infringement in and of itself. If you use the resulting model to output a copy of an existing work, then that act constitutes copyright infringement - in the exact same way that using Photoshop to reproduce some work does.

What a lot of anti-AI arguments are trying to achieve is to make the act of training and model-making the infringing act, and the claim is that the data is being copied while training is happening.

replies(1): >>DrScie+4c1
◧◩◪◨⬒
80. HanCli+861[view] [source] [discussion] 2025-01-03 15:36:47
>>Toucan+kW
We have the problem of too-perfect-recall with humans too -- even beyond artists with (near) photographic memory, there's the more common case of things like reverse-engineering.

At times, developers on projects like WINE and ReactOS use "clean-room" reverse-engineering policies [0], where -- if Developer A reads a decompiled version of an undocumented routine in a Windows DLL (in order to figure out what it does), then they are now "contaminated" and not eligible to write the open-source replacement for this DLL, because we cannot trust them to not copy it verbatim (or enough to violate copyright).

So we need to introduce a barrier of safety, where Developer A then writes a plaintext translation of the code, describing and documenting its functionality in complete detail. They are then free to pass this to someone else (Developer B) who is now free to implement an open-source replacement for that function -- unburdened by any fear of copyright violation or contamination.

So your comment has me pondering - what would the equivalent look like (mathematically) inside of an LLM? Is there a way to do clean-room reverse-engineering of images, text, videos, etc.? Obviously one couldn't use clean-room training for _everything_ - there must be a shared context of language at some point between the two developers. But you have me wondering... could one build a system to train an LLM on copyrighted content in a way that doesn't violate copyright?

[0]: https://en.wikipedia.org/wiki/Clean-room_design

81. pixelm+M61[view] [source] 2025-01-03 15:40:53
>>rpcope+(OP)
I think you're right, and I don't think it's just about public content being "exploited" to train AI models and the like. Rather, even before LLMs, there was a growing sense that publishing ideas or essays publicly is "risky" with very little reward for the very real risks.

I wrote about this a little in "The Blog Chill":

https://amontalenti.com/2023/12/28/the-blog-chill

Speaking personally, among my social circle of "normie" college-educated millennials working in fields like finance, sales, hospitality, retail, IT, medicine, civil engineering, and law -- I am one of the few who runs a semi-active personal site. Thinking about it for a moment, out of a group of 50-or-so people like this, spread across several US states, I might be the only one who has a public essay archive or blog. Yet among this same group you'll find Instagram posters, TikTok'ers, and prolific DM authors in more private spaces like WhatsApp and Signal groups. A handful of them have admitted to being lurkers on Reddit or Twitter/X, but not one is a poster.

It isn't just due to a lack of technical ability, although that's a (minor) contributing factor. If that were all, they'd all be publishing to Substack, but they're not. It's that engaging with "the public" via writing is seen as an exhausting proposition at odds with everyday middle class life.

Why? My guesses: a) smartphones aren't designed for writing and editing, hardware-wise; b) long-form writing/editing is hard and most people aren't built for it; c) the dynamics of modern internet aggregation and agglomeration makes it hard to find independent sites/publishers anyway; and d) the risk of your developed view on anything being "out there" (whether professional risk or friendship risk) seems higher than any sort of potential reward.

On the bright side, for people who fancy themselves public intellectuals or public writers, hosting your own censorship-resistant publishing infrastructure has never been easier or cheaper. And for amateur writers like me, I can take advantage of the same.

But I think everyday internet users are falling into a lull of treating the modern internet as little more than a source of short-form video entertainment, streams for music/podcasts, and a personal assistant for the sundries of daily life. Aside from placating boredom, they just use their smartphones to make appointment reminders, send texts to a partner/spouse, place e-commerce orders, and check off family todo lists, etc. I expect LLMs will make this worse as a younger generation may view long-form writing not as a form of expression but instead as a chore to automate away.

◧◩◪◨⬒⬓⬔⧯
82. chii+O61[view] [source] [discussion] 2025-01-03 15:40:56
>>DrScie+h31
> ChatGPT can simultaneously claim to return accurate information while at the same time being completely independent from the sources of the information?

Why can't that be true? Information is not copyrightable; the expression of information is. If ChatGPT extracts information from a source work and represents that information back to you in a form that is not a copy of the original, then this is completely fine to me. An example would be a recipe.

replies(1): >>DrScie+091
◧◩◪◨
83. Anthon+481[view] [source] [discussion] 2025-01-03 15:48:01
>>dend+Jb
> It's less about competition and more about the ethical way to do it. If another artist would learn the same techniques and then managed to produce similar art, do you think there would be just as visceral of a reaction to them publishing their art? Likely not, because it still required skill to achieve what they did.

Now suppose that the other artist studies to learn the techniques -- several of them do -- and then Adobe offers them each two cents and a french fry to train a model on it, which many accept because the alternative is that the model exists anyway and they don't even get the french fry. Is this more ethical somehow? Even if you declined the pittance, you still have to compete with the model. Even if you accept it, it's only a pittance, and you still have to compete with the model. It hasn't improved your situation whatsoever.

> My hunch is that in the near-term we'll see a major devaluing of both written and image material, while a premium will be put on exceptional human skill.

AI slop is in the nature of "80% as good for 20% of the price" except that it's more like 40% as good for 0.0001% of the price. What that's going to do is put any artists below the 40th percentile out of work, make it a lot harder for the ones at the 60th percentile and hardly affect the ones at the 99th percentile at all.

But the other thing it's going to do is cause there to be more "art". A lot of the sites with AI-generated images on them haven't replaced a paid artist, they've replaced a site without images on it. Which isn't necessarily a bad thing.

◧◩◪◨⬒⬓⬔⧯▣
84. DrScie+091[view] [source] [discussion] 2025-01-03 15:55:34
>>chii+O61
So you think taking something like the Encyclopaedia Britannica, running it through a simple rewording algorithm, and selling it on is totally 'fair use'?

Taking all newspaper and proper journalistic output and rewording it automatically and selling it on is also 'fair use'?

Stand back from the detail (of whether this pixel or word is the same or not) and look at the bigger picture. Are you still telling me that's all fine and dandy?

I think it's obviously not 'fair use'.

It means the people doing the actual hard graft of gathering the news or writing encyclopedias or textbooks won't be able to make a living, so these important activities will cease.

This is exactly the scenario copyright (and IP law generally) exists to stop.

replies(1): >>chii+vB2
◧◩◪◨⬒⬓
85. rpdill+Ja1[view] [source] [discussion] 2025-01-03 16:09:51
>>TheOth+lv
> Any argument about this is trying to redefine copyright as the right to extract the semantic or cultural value of a document. In reality the definition is already clear - no copying of a document by any means for any purpose without explicit permission.

I've studied copyright for over 20 years as an amateur, and I used to very much think this way.

And then I started reading court decisions about copyright, and suddenly it became extremely clear that it's a very nuanced discussion about whether or not the document can be copied without explicit permission. There are tons of cases where it's perfectly permissible, even if the copyright holder demands that you request permission.

I've covered this in other posts on Hacker News, but it is still my belief that we will ultimately find AI training to be fair use, because it does not materially impact the market for the original work. Perhaps someone could bring a case arguing that it does, but based on my reading of the decisions over the past couple of years, courts have yet to see such a claim asserted in a convincing way.

replies(1): >>Terr_+wX1
◧◩◪◨
86. Salgat+Db1[view] [source] [discussion] 2025-01-03 16:16:30
>>Terr_+nD
It is used to help train the LLMs on how to "talk" like normal people, even if the topic they're discussing isn't that useful or valuable.
◧◩◪◨⬒⬓
87. DrScie+4c1[view] [source] [discussion] 2025-01-03 16:20:17
>>chii+051
>The point i would like to emphasize is that the using data to train the model is not copyright infringement in and of itself.

Interesting point - though the law can be strange in some cases. For example, in UK court cases where people are effectively being charged with looking at illegal images, the actual crime can be 'making illegal images' - simply because a precedent has been set that, since any OS/browser has to 'copy' the data of an image for someone to view it, the defendant is deemed to have copied it.

Here's an example: https://www.bbc.com/news/articles/cgm7dvv128ro

So for your training model to ingest (view) something, you have by definition had to copy it to your computer.

replies(1): >>xp84+Sn4
◧◩
88. rpdill+xe1[view] [source] [discussion] 2025-01-03 16:34:10
>>TeMPOr+Gh
This post deserves more attention, I think. It's occurred to me as well.

Over the holidays, my father gave my children a book that he had written: a 50-page photo essay titled 'Sharks'. It's an unpublished labor of love that he spent about 500 hours on.

It's a true story centered on Captain Frank Mundus, who operated the Cricket II. He was a renowned shark fisherman and would take people out to fish for enormous sharks. He did this for 40 or 50 years.

An author by the name of Peter Benchley wrote a novel that was heavily inspired by many of Frank's traits, his mannerisms, his approach to shark fishing, the kind of boat he had, the kind of charters he ran. The novel was titled 'Jaws' and received little attention when it was first released. A while after, a director by the name of Steven Spielberg took notice of it and turned it into a multi-million dollar blockbuster movie.

My father was a lawyer whom Frank Mundus consulted, asking: was there any way he could get a payout for being the inspiration for this character?

My family read the book over the holidays, and it was clearly my father's position that Steven Spielberg and Peter Benchley were maybe the sharks that the title of the book was talking about. The idea that they could make $100 million based on the work and life of this captain and give him literally nothing in return, not even attribution, seemed wrong to him.

I was the lone detractor in the room. My take is that Captain Frank Mundus was just living his life. He was doing what he did to make money chartering fishing trips for sharks. He would have done this regardless of whether or not a writer had come along or a movie had come along. What Peter Benchley and Steven Spielberg did is they found value in his work that he didn't know existed and that he wasn't capable of extracting. I think this is generally true of artists. They wander the world and they create art that gives the viewer a new insight into the experiences the artist had. If artists had to give money back to every real-life inspiration, I think the whole system wouldn't work.

I see parallels with the current attitudes toward AI. I think writers are a lot like Captain Mundus. They're living their life, they're writing their stories, or doing their research and publishing, and having people read their works. And copyright is helping them do all this.

AI companies have come along and found value in their work that they didn't know existed and they were never capable of extracting. And that's OK: that's what innovation is, taking the work that others have done and building on it to create something new.

I'm not unequivocally in favor of all applications of AI, but I do think there are tons of places that can be super helpful and we should allow it to be helpful. One example: I'm drafting this on my phone using Futo keyboard entirely with my voice. Extremely useful, but no doubt trained on copyrighted content.

89. cousco+6f1[view] [source] 2025-01-03 16:37:37
>>rpcope+(OP)
The low adoption rate is another big plus of the Gemini protocol and similar solutions for a Javascript-free and open internet.
◧◩
90. yokem5+Dj1[view] [source] [discussion] 2025-01-03 17:06:37
>>TeMPOr+Gh
The dilemma here is that the incentive to capture value for yourself comes from the legitimate fear that someone else will try to capture all the residual value you leave on the table, rather than allowing that value to be socialized in a healthy way. Which means enshittification becomes the default for everyone.
◧◩◪
91. Camper+Tm1[view] [source] [discussion] 2025-01-03 17:25:47
>>bulatb+47
No. If I cared, I wouldn't have posted the information in the first place... or I would have erected a paywall.
◧◩◪
92. Camper+5n1[view] [source] [discussion] 2025-01-03 17:27:06
>>liontw+R5
> Even if no attribution etc is your personal policy that's not everyone else's.

That's up to the courts. As usual, we will all lose if the copyright maximalists win.

replies(2): >>liontw+Q72 >>tonyed+F53
◧◩◪◨
93. Camper+wn1[view] [source] [discussion] 2025-01-03 17:29:26
>>dend+Jb
> AI-generated "art" (it's not art at all in my eyes) is effectively a machine-based reproduction of actual art, but doesn't take the same skill level, time, and passion for the craft for a user to be able to generate an output, and certainly generates large profits for those that created the models.

(Shrug) Artists were wrong when they said the same thing about cameras at the dawn of photography, and they're wrong now.

If you expect to coast through life while everything around you stays the same, neither art nor technology is a great career choice.

replies(1): >>throwa+nM1
94. foxgla+zn1[view] [source] 2025-01-03 17:29:44
>>rpcope+(OP)
What is your purpose in publishing, such that having the content used to train AI is a problem? Are you trying to gatekeep information that isn't even protected by copyright anyway? Are you worried your potential audience will get the same thing (including your personal creativity) from an AI that just copied your work, so they won't recognize your name and develop some brand awareness? Or do you just not like AI and don't want to help it? Maybe you could build your own paywall or other technical access restriction instead of making it freely available? Even just a captcha should block AI training scrapers, shouldn't it?
replies(1): >>asdff+Ln1
◧◩
95. asdff+Ln1[view] [source] [discussion] 2025-01-03 17:30:29
>>foxgla+zn1
No attribution with AI.
replies(1): >>foxgla+ll2
◧◩◪◨
96. yencab+nr1[view] [source] [discussion] 2025-01-03 17:52:36
>>ehnto+5e
Content is often publicly available and copyright protected. Paint a mural near a busy street: no locked door in that metaphor. A locked door would be a password-protected site.
◧◩
97. yencab+ks1[view] [source] [discussion] 2025-01-03 17:58:20
>>baxtr+2e
The compromise that was supposed to be in place was strong, short-term copyright protection, to help the author (a person) financially during their lifetime. That compromise was destroyed by rich people using corporations as owners and extending copyright duration.

https://en.wikipedia.org/wiki/Copyright_Term_Extension_Act

◧◩◪◨
98. mystif+wD1[view] [source] [discussion] 2025-01-03 19:15:54
>>DrScie+fE
Imagine I have a shit ton of data on the books people read, down to their favorite passage in each chapter.

I feed all of that into an algorithm that extracts the top n% of passages and uses NLP to string them into a semi-coherent new book. No AI or ML, just old-fashioned statistics. Since my new book is composed entirely of passages stolen wholesale from thousands of authors, clearly it's a transformative work that deserves its own copyright, and none of the original authors deserve a dime, right? (/s)

What if I then feed my book through some Markov chains to mix up the wording and phrasing? Is this a new work, or am I still just stealing?

AI is not magic; it does not learn. It is purely statistics extracting the top n% of other people's work.
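
To make the Markov-chain step concrete, here's a toy word-level version - pure counting and sampling, with a made-up stand-in corpus:

    # Toy word-level Markov chain: no "learning", just statistics.
    import random
    from collections import defaultdict

    def build_chain(text: str) -> dict:
        chain = defaultdict(list)
        words = text.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)   # record which words follow which
        return chain

    def generate(chain: dict, start: str, n: int = 20) -> str:
        out = [start]
        for _ in range(n):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))  # sample a follower
        return " ".join(out)

    passages = ("passages stolen wholesale from thousands of authors "
                "strung into a semi coherent new book from old passages")
    print(generate(build_chain(passages), "passages"))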

◧◩◪
99. throwa+1L1[view] [source] [discussion] 2025-01-03 20:14:31
>>alison+J8
Everyone who put their stuff out there, on the Internet, has contributed to the AI Leviathan. Maybe the end result will be a utopia, maybe it'll be a dystopia. It's definitely too soon to say that producing content for the AI titans to consume has a positive impact on society.
◧◩◪◨⬒
100. throwa+nM1[view] [source] [discussion] 2025-01-03 20:25:52
>>Camper+wn1
There is no great career choice when AI can do most intellectual work for a fraction of the cost.
◧◩◪
101. Terr_+3W1[view] [source] [discussion] 2025-01-03 21:36:49
>>chii+R
If they aren't a violation of copyright, then I want to see what happens when people are trading around models and "prompts" that describe recently released movies and music well enough to compete with the originals.

Not necessarily because I like either "we monetize public work" or "copyright robber-barons", but I'd like at least one of them to clearly lose so that the rest of us have clear and fair rules to work with.

◧◩◪◨⬒
102. Terr_+cW1[view] [source] [discussion] 2025-01-03 21:38:37
>>omnimu+pI
> This whole AI learns like a human is trajectory of thought pushed by AI companies.

My retort to the "it would be legal if a human did it" argument is that if the model gets personhood, then those companies are guilty of enslaving children.

> Compare that to people prompting directly with the name of an artist they want to replicate.

In that case, I would emphasize that the infringement is being done by the model. It's not illegal or infringing merely to ask for an unlicensed, copyright-infringing work. (Although it might become that way, if big corporations start lobbying for it.)

◧◩◪◨
103. Terr_+XW1[view] [source] [discussion] 2025-01-03 21:46:15
>>entrop+Pk
Not quite: it is (at least in the US) a limited privilege to control copying and reproduction.

If you make a movie poster and it goes out into the market, and then someone picks it up from a garage sale, copyright still applies: they can't just make tons of duplicates.

But you can't use copyright to force them to display it right side up instead of upside down, to not write on it, to not burn it, and to not make it into a bizarre pseudosexual shrine in their basement.

◧◩◪◨⬒⬓⬔
104. Terr_+wX1[view] [source] [discussion] 2025-01-03 21:50:48
>>rpdill+Ja1
I assume the emphasis there is on training, whereas it's totally possible to infringe by running the model in certain ways later.
replies(1): >>rpdill+tf2
◧◩◪◨
105. liontw+Q72[view] [source] [discussion] 2025-01-03 23:11:19
>>Camper+5n1
Last I checked, the creators of a work hold the copyright to it, and that hasn't changed. So no, this is not a new legal question.
replies(1): >>Camper+O82
◧◩◪◨⬒
106. Camper+O82[view] [source] [discussion] 2025-01-03 23:18:00
>>liontw+Q72
That's not how copyright law works.

That's not how anything works.

replies(1): >>liontw+sN2
◧◩◪◨⬒⬓⬔⧯
107. rpdill+tf2[view] [source] [discussion] 2025-01-04 00:22:10
>>Terr_+wX1
Agreed! My take is that usage can still infringe if the output produced would otherwise infringe; the fact that AI is the particular tool used to accomplish the infringement is incidental.
◧◩◪
108. foxgla+ll2[view] [source] [discussion] 2025-01-04 01:15:18
>>asdff+Ln1
But also no direct copying, typically. Would OP really be happy with an AI that reliably rewords everything so that attribution is not required but still reproduces the information?
replies(1): >>rchaud+DE3
◧◩◪◨
109. Kim_Br+Km2[view] [source] [discussion] 2025-01-04 01:28:07
>>DrScie+fE
> as if a human viewed something, memorized it and reproduced in a recognisable way to be pretty much the same, wouldn't that still breach copyright?

> Surely the same is the case of AI?

That's close to my position.

Also, consider the case where you want to ask an image generator not to infringe copyright by, e.g., saying "make the character look less like Donald Duck" - in which case, the image generator still needs to know what Donald Duck looks like!
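
For a concrete version of that: the open-source diffusers library exposes this directly as a negative prompt. A minimal sketch - the checkpoint name is just one common public example - and it only works because the model has some internal representation of the concept it's steering away from:

    # Steering *away* from a concept requires the model to know it.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="a cartoon duck sailor, original character design",
        negative_prompt="Donald Duck",  # needs an internal notion of him
    ).images[0]
    image.save("duck.png")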

◧◩◪◨⬒
110. Kim_Br+0n2[view] [source] [discussion] 2025-01-04 01:30:53
>>omnimu+pI
> Consider how code infringement is not about the code itself but about what it does. If you saw a somewhat original implementation of something and then rewrote it in a different language yourself, there is a high chance it's still copyright infringement.

Actually, if you rewrite it in a different language, you're well on your way to making it an independent expression (though beware Structure, Sequence and Organization, unless you're implementing an API; see Google v. Oracle). Copyright protects specific expressions, not functionality.

> Compare that to people prompting directly with the name of an artist they want to replicate. This is direct copyright infringement in both essence and intention, no matter the resulting image.

As far as I'm aware, an artist's style is not something that is protected by law; copyright protects specific works.

If you did want to protect artistic styles, how would you go about legally defining them?

replies(2): >>omnimu+MV2 >>omnimu+tX2
◧◩◪◨⬒⬓⬔⧯▣▦
111. chii+vB2[view] [source] [discussion] 2025-01-04 04:12:06
>>DrScie+091
> Taking all newspaper and proper journalistic output and rewording it automatically and selling it on is also 'fair use'?

It would be, if the transformation is substantial. If you're just asking for snippets of existing written works, then those snippets are merely derivative works.

For example, if you asked an LLM to summarize the news and stories of 2024, I reckon the output is not infringing, because the informational content of the news is not itself copyrightable - only the article is. A summary, which contains a precis of the information but not the original expression, surely doesn't infringe - especially when it draws on a tiny fraction of any one source (e.g., ChatGPT used millions of sources).

> won't be able to make a living so these important activities will cease.

This is irrelevant as far as I'm concerned. Whether or not they can make a living is orthogonal. If they can't, then they should stop.

replies(1): >>DrScie+9b7
◧◩◪
112. Terr_+sC2[view] [source] [discussion] 2025-01-04 04:22:31
>>w4+Qa
More inspired by the GPL, I think, although the sketch above doesn't force the writer to put things into the public domain.

I'm imagining a separate declaration of: "Content I can sublicense from ShittyNewsLLM--which is everything made by their model--is now public-domain through me until further notice", without any need to identify specific items or rehost it myself.

I suppose the counterstrike would be for them to try to transform their own work and argue what they finally released contains some human spark that wasn't covered by the ToS, in which case there may need to be some "and any derivative work" kinda clause.

I wonder if some organization (similar to the Open Software Foundation) could get some lawyers and web-designers together to craft legally-sound site-design rules and terms-of-service, which anyone could use to protect their own blogs or web-forums.

◧◩◪◨⬒⬓
113. liontw+sN2[view] [source] [discussion] 2025-01-04 07:06:00
>>Camper+O82
Ok. Thanks for your contribution to the discussion.
◧◩◪◨⬒⬓
114. omnimu+MV2[view] [source] [discussion] 2025-01-04 09:07:52
>>Kim_Br+0n2
I don't believe a rewrite in a different language is a specific expression.

We will see, because we are well on our way to LLMs being able to translate whole codebases to a different stack without a hitch. If that's OK, then any of the copyleft, open-core, or leaked codebases are up for grabs.

replies(1): >>Kim_Br+Xa3
◧◩◪◨⬒⬓
115. omnimu+tX2[view] [source] [discussion] 2025-01-04 09:31:28
>>Kim_Br+0n2
The fact that LLMs can generate any images at all is purely thanks to a database of source images that are copyright protected. It's a form of sophisticated, automated photobashing. Photobashing is a gray zone, but often legal because of the other artist doing the (often original) work.

When you prompt for a Miyazaki image, that image can only exist thanks to his protected work being in the database (where he doesn't want it to be); otherwise the user wouldn't get the Miyazaki image they wanted.

We will see how it all plays out, but I think if Miyazaki took this to court there would be a solid case on the grounds that the resulting images breach the copyright of the source, are not original works, and are created with bad intent that goes against the protections of the original author.

What seems to be the current direction, at least, is that the resulting images cannot be copyrighted and fall automatically into the public domain, making them difficult to use commercially.

replies(2): >>Kim_Br+Jf3 >>Kim_Br+6j3
◧◩◪◨
116. tonyed+F53[view] [source] [discussion] 2025-01-04 11:52:49
>>Camper+5n1
To me it looks like individual creators are the ones most likely to lose.

I was watching an interview with John Warnock (one of the founders of Adobe) and he was proud of the fact that the US went from having 25,000 graphic designers to 2,500,000 largely thanks to software his company created.

I do wonder if we are on the verge of reversing that shift.

replies(1): >>Camper+iv3
◧◩◪◨
117. tonyed+063[view] [source] [discussion] 2025-01-04 11:57:42
>>dend+37
Pretty much all of my work has been published on the internet over the last twenty years. Some of it has been commercial, some open source, and some just for myself.

I’m pretty much done with that now, I doubt I will publish anything online again.

◧◩◪◨⬒⬓⬔
118. Kim_Br+Xa3[view] [source] [discussion] 2025-01-04 13:16:36
>>omnimu+MV2
A hand rewrite (or intelligent rewrite in general) will tend to become unique pretty quickly, especially when you start leaning into the target language's features for improved efficiency. Your structure and organization will be different.

If you order an LLM (or a human) to do a straight 1:1 translation, you'll sort of pass one test (it's a completely different language, after all!) but fail to show much difference with respect to structure, sequence or organization. I'm also not entirely sure how good an idea it is technically. If you start iterating on it you can probably get much better results anyway - but then you're doing real creative work!
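
To make that concrete (both snippets in Python, imagining the first as a mechanical transliteration of a C-style routine):

    # A 1:1 transliteration: the original's structure, sequence and
    # organization survive almost unchanged.
    def sum_of_squares_literal(xs):
        total = 0
        i = 0
        while i < len(xs):
            total = total + xs[i] * xs[i]
            i = i + 1
        return total

    # An idiomatic rewrite leaning on the target language's features:
    # same functionality, visibly different structure and expression.
    def sum_of_squares_idiomatic(xs):
        return sum(x * x for x in xs)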

◧◩◪◨⬒⬓⬔
119. Kim_Br+Jf3[view] [source] [discussion] 2025-01-04 14:22:35
>>omnimu+tX2
There's no such database, AFAICT.

If you've ever worked with open source models (eg one of the stable diffusion models or models based on them, using tools such as AUTOMATIC1111 or ComfyUI); you can inspect them yourself and simply see. If you haven't done so already, see if you can figure out the installation instructions for one of the tools and try!

Meanwhile ...

Ok, fine, I've heard some crazy compression conspiracy theories, but they're a bit too crazy to be credible.

I've also heard stories about these models being intelligent - a little artist living in your computer. I think that's going a bit too far in another direction.

In reality, I think it's better to install the software and take your time to learn about the way these models are actually built and work.

[ btw: If Miyazaki were to take this to court with the argument you put forward, he wouldn't get very far. "Please remove my images from your systems in whatever form you are holding them". The response for the defense would simply be: "We don't actually have them, and you are quite welcome to inspect all our systems". ]

(Incidentally, I've been here before. I play with synths as a hobby! ;-)
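
And for a back-of-the-envelope check on the "database" framing (the figures below are approximate public numbers for the Stable Diffusion 1.x era, not exact):

    # Could a diffusion checkpoint "store" its training set?
    checkpoint_bytes = 4e9  # roughly 4 GB for a typical SD 1.x checkpoint
    training_images = 2e9   # order of the LAION subsets involved
    print(checkpoint_bytes / training_images)  # ~2 bytes per image

A couple of bytes per training image is nowhere near enough to hold an image, so whatever the weights encode, it isn't an archive of the inputs.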

◧◩◪◨⬒⬓⬔
120. Kim_Br+6j3[view] [source] [discussion] 2025-01-04 14:54:58
>>omnimu+tX2
Actually, while I just said "there is no database", maybe you're working from a very different mental model from mine...

What do you mean by "Database" in this context? What information do you think is being stored, (and how?)

replies(1): >>omnimu+4E3
◧◩◪◨⬒
121. Camper+iv3[view] [source] [discussion] 2025-01-04 16:40:58
>>tonyed+F53
The question you should be asking is whether we need 2,500,000 graphic designers. Humans have a higher purpose than doing a robot's job.
replies(1): >>tonyed+oP4
◧◩◪◨⬒⬓⬔⧯
122. omnimu+4E3[view] [source] [discussion] 2025-01-04 17:45:15
>>Kim_Br+6j3
I understand what the model is and how you get to it. I know the training data is not stored. But as far as I understand, the model is closer to a derived intermediary of the training data - like a database index or, as you said, a form of compression.

That's why I deliberately tend to call the training data + model "the database": to non-programmers it makes more sense. To me there is an intentional sleight of hand in hiding the fact that the only reason LLMs can work as they do is the source data. The way it's usually marketed, it sounds as if the model is a program that generalised the principles of drawing by looking at other drawings, and that's why it can draw like Miyazaki when it wants to - not that it can draw Miyazaki because it preprocessed every Miyazaki drawing, stemmed patterns out of it, and can mash them together with other patterns (from the database).

That's why I intentionally say "database": to lead these discussions back to what I see as the core of these technologies.

replies(1): >>chii+gw4
◧◩◪◨
123. rchaud+DE3[view] [source] [discussion] 2025-01-04 17:50:45
>>foxgla+ll2
Human educational systems penalize you for not citing sources (i.e. not showing your work, not crediting referenced works). Why should an AI system be exempt?
replies(1): >>foxgla+rH5
◧◩◪◨⬒⬓⬔
124. xp84+Sn4[view] [source] [discussion] 2025-01-05 02:05:52
>>DrScie+4c1
That seems to be an artifact of the whole copyright framework predating all forms of computing and memory - but if we don't ignore that one, we've all been illegally copying copyrighted text, images, and videos into our RAM every time we use the Internet. So I think the courts now basically acknowledge that that doesn't count as a "copy."

*Not a lawyer

replies(1): >>DrScie+af7
◧◩◪◨⬒⬓⬔⧯▣
125. chii+gw4[view] [source] [discussion] 2025-01-05 04:06:51
>>omnimu+4E3
What you're describing as a database is what I would call information.
◧◩◪◨⬒⬓
126. tonyed+oP4[view] [source] [discussion] 2025-01-05 10:21:26
>>Camper+iv3
Humans have a higher purpose than doing whatever job robots can't.
replies(1): >>Camper+0y5
127. hulitu+315[view] [source] 2025-01-05 13:26:00
>>rpcope+(OP)
> your own site, it's still going to get hoovered up and used/exploited by things like AI training bots.

Firewall?

◧◩◪◨⬒⬓⬔
128. Camper+0y5[view] [source] [discussion] 2025-01-05 18:08:34
>>tonyed+oP4
Whatever that purpose is, you're not going to achieve it while doing a job that robots can.
replies(1): >>tonyed+DO5
◧◩◪◨⬒
129. foxgla+rH5[view] [source] [discussion] 2025-01-05 19:20:35
>>rchaud+DE3
Because it's not a student at school, obviously. You didn't cite a source for the claim you just made there, and you didn't need to.
◧◩◪◨⬒⬓⬔⧯
130. tonyed+DO5[view] [source] [discussion] 2025-01-05 20:17:21
>>Camper+0y5
Which is the point I was making in the first place.
◧◩◪◨⬒⬓⬔⧯▣▦▧
131. DrScie+9b7[view] [source] [discussion] 2025-01-06 12:09:30
>>chii+vB2
> This is irrelevant as far as I'm concerned. Whether or not they can make a living is orthogonal. If they can't, then they should stop.

It's not orthogonal - it's central. Copyright and IP law isn't some abstract thing; it's law with a purpose: to protect people from having their work ripped off to the point that they can no longer make a living from it.

If journalists can't gather the news, then sure, events will still happen, but Google et al. won't be able to summarise them, because there will be no reports.

If scientific journals can no longer afford to operate - because nobody needs to subscribe when anybody can get the content free via a rip-off - then there will be no scientific journals left to rip off.

Surely stealing stuff and selling it on is convenient for both big tech and consumers - but it's not a sustainable economic model.

◧◩◪◨⬒⬓⬔⧯
132. DrScie+af7[view] [source] [discussion] 2025-01-06 12:53:18
>>xp84+Sn4
Except I've given you a concrete, real counter-example where they do treat copying into memory as 'making a copy'.