Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.
A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."
It appears that without attribution, long term, nothing moves forward.
AI loses access to the latest findings from humanity. And so does the public.
I'm not saying AI is better for journalism than NYT reporters, just that it's more important.
Journalism has been in trouble for decades, sadly -- and I say that as a journalism minor in college. Trump gave the papers a brief respite, but the industry continues to die off, consolidate, etc. We probably need a different business model altogether. My vote is just for public funding with independent watchdogs, i.e. states give counties money to operate newspapers with citizen watchdog groups/boards. Maaaaybe there's room for "premium" niche news like 404 Media/The Information/Foreign Affairs/National Review/etc., but that remains to be seen. If the NYT paywall doesn't keep them alive, I doubt this lawsuit will.
E.g. "Japan's App Store antitrust case"
https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...
OpenAI isn’t marching into the online news space and posting NY Times content verbatim in an effort to steal market share from the NY Times. OpenAI is in the business of turning ‘everything’ (input tokens) into ‘anything’ (output tokens). If someone manages to extract a preserved chunk of input tokens, that’s more like an interesting edge case of the model. It’s not what the model is in the business of doing.
Edit: typo
Wouldn’t those dozen outlets suffer the same harms: producing original content costs time and talent, while a significant portion of the benefit accrues to downstream AI companies?
If most of the benefit of producing original content accrues to the AI firms, won’t original content stop being produced?
If original content stops being produced, how will AI models get better in the future?
Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].
But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.
I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.
Taking the above into consideration:
1) Useful AI does not necessarily require attribution
2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI
3) We need some better way to fund human creativity, especially as it gets more and more commoditized
4) This is going to happen with or without us. Cat's outta the bag.
I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.
If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.
But either way, this is way bigger than the economics of 19th-century newspapers...
[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...
Which is really just a very, very common story with AI problems, be it sources/citations/licenses/usage tracking/etc.: it's all just 'too complex, if not impossible, to solve', which at this point seems like a facade for intentionally ignoring those problems for profit. Those problems definitely exist, so why not try to solve them? Because actually trying to solve them would mean having to use data properly and pay creators, and that would cut into the bottom line. The point is free data use without having to pay, so why would they ruin that for themselves?
I feel like the crypto evangelists never got off the hype train. They just picked a new destination. I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.
"Here's how I would cure melanoma!" followed by your detailed findings. Zero mention of you.
F-that. Attribution, as best they can, is the least OpenAI can do as a service to humanity. It's a nod to all the content creators whose work they have built their business on.
Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.
LLM training sees these documents without context; it doesn’t know where they came from, and any such attribution would become part of the thing it’s trying to mimic.
It’s still largely an unsolved problem.
ChatGPT Browse and Bing and Google Bard implement the same pattern.
RAG does allow for some citation, but it doesn't help with the larger problem of not being able to cite for answers provided by the unassisted language model.
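Roughly, that pattern looks like the sketch below; the toy corpus, the naive keyword retriever, and the spot where the real model call would go are all stand-ins:

    # Toy sketch of the retrieval-augmented pattern described above: retrieve
    # passages first, then hand them to the model along with their origins so
    # the answer can carry citations. Everything here is illustrative.
    CORPUS = [
        {"url": "https://example.com/solar", "text": "Solar panels convert sunlight into electricity."},
        {"url": "https://example.com/wind", "text": "Wind turbines convert wind energy into electricity."},
    ]

    def retrieve(query, k=2):
        # naive keyword-overlap ranking; a real system would use a search index
        q = set(query.lower().split())
        return sorted(CORPUS, key=lambda d: -len(q & set(d["text"].lower().split())))[:k]

    def build_prompt(query):
        docs = retrieve(query)
        context = "\n".join(f"[{i+1}] ({d['url']}) {d['text']}" for i, d in enumerate(docs))
        return ("Answer using only the numbered sources and cite them like [1].\n"
                f"{context}\n\nQuestion: {query}")

    print(build_prompt("How do solar panels make electricity?"))
    # The prompt (not the model's weights) is what carries the citations.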
And then there's all the run-of-the-mill small-town journalism that AI would probably be even better at than human reporters: all the sports stories, the city council meetings, the environmental reviews...
If AI makes commercial content publishing unviable, that might actually cut down on all the SEO spam and make the internet smaller and more local again, which would be a good thing IMO.
Single most important development in human history? Are you serious?
Not all of them will have the capability to cite a source, and for plenty of them citing a source wouldn't even make sense.
E.g. suppose I train a regression that guesses how many words will be in a book.
Which book do I cite when I do an inference? All of them?
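To make that toy example concrete (all numbers invented): a one-variable least-squares fit over several "books" at once. Every training point nudges the same two fitted coefficients, so there is no individual book to point back to at prediction time.

    # Toy regression over made-up (pages, words) pairs. Every book contributes
    # to the same slope and intercept, so a prediction cites none of them
    # individually.
    books = [(100, 30000), (250, 80000), (320, 95000), (400, 120000)]

    n = len(books)
    mean_p = sum(p for p, _ in books) / n
    mean_w = sum(w for _, w in books) / n
    slope = sum((p - mean_p) * (w - mean_w) for p, w in books) / \
            sum((p - mean_p) ** 2 for p, _ in books)
    intercept = mean_w - slope * mean_p

    # Which training book does this estimate "cite"? All of them, and none.
    print(round(intercept + slope * 280))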
Yes, all those outlets will suffer the same harms. They have been for decades. That's why there are so few remaining. Most are consolidated and produce worthless drivel now. Their business model doesn't really work in the modern era.
Thankfully, people have and will continue to produce content even if much of it gets stolen -- as has happened for decades, if not millennia, before AI.
If anything what we need is a better way to fund human creative endeavors not dependent on pay-per-view. That's got nothing to do with AI; AI just speeds up a process of decay that has been going on forever.
Though the other way to do it is to clearly document the training data as a whole, even if you can't cite a specific entry in it for a particular bit of generated output. It would become useless quickly, though, as you'd eventually have one big citation -- "The Internet".
For complex subjects, I'm sure the citation page would be large, and a count would be displayed demonstrating the depth of the subject[3].
This is how Google did it with search results in the early days[1]: most probable to least probable, in terms of the relevancy of the page, with a count of all possible results[2].
The same attempt should be made for citations.
The issue of replicating a style is probably more difficult.
Human analogies are cute, but they're completely irrelevant. This is specifically about computers, and the analogies don't change or excuse how computers work.
But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.
Not when there’s no money in journalism because the generative AIs immediately steal all content. If the NYT goes under, no one will be willing to start a news business, as everyone will see it's a money loser.
When an AI uses information from an article it's no different from me doing it in a blog post. If I'm just summarizing or referencing it, that's fair use, since that's my 'take' on the content.
> having to pay for the trained model with them is not stupid?
Because you can charge for anything you want. I can also charge for my summaries of NYT articles.
Well, they didn't charge for it, right? They're retroactively asking for money, but they could have just locked their content behind a strict paywall or had a specific licensing agreement enforceable ahead of time. They could do that going forward, but how is it fair for them to go back and say that?
And the issue isn't "You didn't pay us" it's "This infringes our copyright", which historically the answer has been "no it doesn't".
Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.
Is there something out there that seems like a killer application?
I was amazed at the idea of the blockchain but we never found a use for it outside of cryptocurrency. I see a similarity with AI hype.
https://dspace.mit.edu/handle/1721.1/153216
As it should be.
Which part of journalism is AI going to impact most? Opinion pieces that contain no new information? Summarizing past events?
There was outrage recently about Amazon shutting down the DPReview site. But it could become common practice not to publish code or info that could be used to train another company's model. So expect fewer open source projects of the kind companies released just because they felt it would be good for everyone.
Actually, there is a use case in which the NYT becomes more influential and important: if 99% of all info is generated by AI and search no longer works, we would have to rely on trusted sources to get our info. In a world of garbage, we would need some sources of verifiable, human-generated information.
Can I apply for YC with this idea?
[1] http://web.archive.org/web/20120608192927/http://www.google....
[2] https://steemit.com/online/@jaroli/how-google-search-result-...
[3] https://www.smashingmagazine.com/2009/09/search-results-desi...
[4] Next page
:)
Couldn't disagree more strongly, and I hope the outcome is the exact opposite. I think we've already started to see the severe negative consequences when the lion's share of the profits get sucked up by very, very few entities (e.g. we used to have tons of local papers and other entities that made money through advertising, now Google and Facebook, and to a smaller extent Amazon, suck up the majority of that revenue). The idea that everyone else gets to toil to make the content but all the profits flow to the companies with the best AI tech is not a future that's going to end with the utopia vision AI boosters think it will.
Humanity is better off without these mass brainwashing systems.
Millions of independent journalists will be a better outcome for humanity.
The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.
And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online; including the structure of grammar, rules of logic, the meaning of words, how they relate to one another. You simply couldn't cite it all.
OpenAI doesn't just get to steal work and then say "sorry, not possible" and shrug it off.
The NYTimes should be suing.
To use Andrew Ng's example, you build a multi-dimensional arrow representing "king". You compare it to the arrow for "queen" and you see that it's almost identical, except it points in the opposite direction in the gender dimension. Compare it to "man" and you see that "king" and "man" have some things in common, but "man" is a broader term.
That's getting really close to understanding as far as I'm concerned; especially if you have a large number of such arrows. It's statistical in a literal sense, but it's more like the computer used statistics to work out the meaning of each word by a process of elimination and now actually understands it.
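A toy rendering of those arrows, with three made-up dimensions and hand-picked values (real embeddings have hundreds of learned, unlabeled dimensions):

    # Hand-made 3-d "word vectors": (royalty, gender, breadth). All numbers are
    # invented purely to illustrate the arrow comparison described above.
    vecs = {
        "king":  (0.9,  0.8, 0.2),
        "queen": (0.9, -0.8, 0.2),
        "man":   (0.0,  0.8, 0.9),
    }

    def diff(a, b):
        return tuple(round(x - y, 2) for x, y in zip(vecs[a], vecs[b]))

    print(diff("king", "queen"))  # (0.0, 1.6, 0.0): identical except the gender axis flips
    print(diff("king", "man"))    # (0.9, 0.0, -0.7): same gender, differs on royalty and breadth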
I don't know if I would agree that it is "probably the single most important development in human history", but I think it is way too early to make a reasonable guess as to whether it will be or not.
Yes, we all agree that it's better if they do remember and mention their sources, but we don't sue them for failing to do so.
All it would do is momentarily slow AI progress (which is fine), and allow OpenAI et al to pull the ladder up behind them (which fuels centralization of power and profit).
By what mechanism do you think your desired outcome would prevent centralization of profit to the players who are already the largest?
Thing is though, if you look at the prompts they used to elicit the material, the prompt was already citing the NYTimes and its articles by name.
1. If you run different software (LLM), install different hardware (GPU/TPU), and use it differently (natural language), to the point that in many ways it's a different kind of machine; does it actually surprise you that it works differently? There's definitely computer components in there somewhere, but they're combined in a somewhat different way. Just like you can use the same lego bricks to make either a house or a space-ship, even though it's the same bricks. For one: GPT-4 is not quite going to display a windows desktop for you (right-this-minute at least)
2. Comparing to humans is fine. Else by similar logic a robot arm is not a human arm, and thus should not be capable of gripping things and picking them up. Obviously that logic has a flaw somewhere. A more useful logic might be to compare eg. Human arm, Gorilla arm, Robot arm, they're all arms!
Because URLs are usually as long as the writing they point at?
Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art and science for living memory. Unless copy/paste superhero movies are your definition of art I suppose.
Unfortunately it seems like judges and the general public are so clueless as to how this technology works it might get regulated into the ground by uneducated people before it ever has a chance to take off. All so we can protect endless listicle factories. What a shame.
To help understand the complexity of an LLM, consider that these models typically hold roughly 10,000 times fewer parameters than there are characters in the training data. If one instructs the LLM to search the web and find relevant citations, it might obey the command, but those citations will not be the source of how it formed the opinions that produced its output.
More critically, while fair use decisions are famously a judgement call, I think OpenAI will lose this based on the "effect of the fair use on the potential market" of the original content test. From https://fairuse.stanford.edu/overview/fair-use/four-factors/ :
> Another important fair use factor is whether your use deprives the copyright owner of income or undermines a new or potential market for the copyrighted work. Depriving a copyright owner of income is very likely to trigger a lawsuit. This is true even if you are not competing directly with the original work.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
and especially
> “The economic effect of a parody with which we are concerned is not its potential to destroy or diminish the market for the original—any bad review can have that effect—but whether it fulfills the demand for the original.” (Fisher v. Dees, 794 F.2d 432 (9th Cir. 1986).)
The "whether it fulfills the demand of the original" is clearly where NYTimes has the best argument.
You are correct, if I were to steal something, surely I can be made to give it back to you. However, if I haven't actually stolen it, there is nothing for me to return.
By analogy, if OpenAI copied data from the NYT, they should be able to at least provide a reference. But if they don't actually have a proper copy of it, they cannot.
I don’t mean to go off on too deep of a tangent, but if one person’s (or even many people’s) idea of what’s good for humanity is the only consideration for what’s just, it seems clear that the result would be complete chaos.
As it stands, it doesn’t seem to be an “either or” choice. Tech companies have a lot of money. It seems to me that an agreement that’s fundamentally sustainable and fits shared notions of fairness would probably involve some degree of payment. The alternative would be that these resources become inaccessible for LLM training, because they would need to put up a wall or they would go out of business.
Shoulders of giants.
Thanks to the existence of medicine, agriculture, and electrification (we can argue about music), some people are now healthy, well fed, and sufficiently supplied with enough electricity to go make LLMs.
> I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.
Personally I think all these "theft of IP" lawsuits are (mostly) destined to fail. Not because I'm on a particular side per-se (though I am), but because it's trying to fit a square law into a round hole.
This is going to be a job for the legislature sooner or later.
It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.
Which is to say nothing about whether that challenge is worth taking on.
Making the process for training AI require an army of lawyers and industry connections will have the opposite effect than you intend.
If we could clone the brain of someone I hardly think we'd be discussing their vast knowledge of something so insignificant as the NYT. I don't think we should care that much about an AI's vast knowledge of the NYT either or why it matters.
If all these journalism companies don't want to provide the content for free they're perfectly capable of throwing the entire website behind a login screen. Twitter was doing it at one point. In a similar vein, I have no idea why newspapers are complaining about readership while also paywalling everything in sight. How exactly do they want or expect to be paid?
Now imagine terabytes worth of datapoints, and thousands of dimensions rather than two.
In some far flung future where an AI can send agents to record and interpret events, and process news feeds and others to extract and corroborate information, this would greatly change. But probably in that world the OpenAI of those times wouldn't really bother training on NYT data at all.
Easy to grandstand when it is not your job on the line.
And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.
Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?
https://docs.github.com/en/copilot/configuring-github-copilo...
Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.
It doesn't have to be perfect to be helpful, and even something that is very imperfect would at least send the signal that model-owners give a shit about attribution in general.
Given a specific output, it might be hard to say which sections of the very large weighted network were tickled during the output, and what inputs were used to build that section of the network. But this level of "citation resolution" is not always what people are necessarily interested in. If an LLM is giving medical advice, I might want to at least know whether it's reading medical journals or facebook posts. If it's political advice/summary/synthesis, it might be relevant to know how much it's been reading Marx vs Lenin or whatever. Pin-pointing original paragraphs as sources would be great, but for most models it's not like there's anything that's very clear about the input datasets.
EDIT: Building on this a bit, a lot of people are really worried about AI "poisoning the well" such that they are retraining on content generated by other AIs so that algorithmic feeds can trash the next-gen internet even worse than the current one. This shows that attribution-sourcing even at the basic level of "only human generated content is used in this model" can be useful and confidence-inspiring.
The main beneficiaries are not AI companies but AI users, who get tailored answers and help on demand. For OpenAI all tokens cost the same.
BTW, I like to play a game: take a hefty chunk of text from this page (or a twitter debate) and ask "Write a 1000 word long, textbook quality article based off this text". You will be surprised how nice and grounded it comes out.
No it's not, it's pure greed. Everyone'd think it absurd if copyright holders dared to demand that any human who reads their publicly available text has to pay them a fee, but just because OpenAI are training a brain made of silicon instead of a brain made of carbon all the rent-seekers come out to try to take advantage.
Would you keep publishing articles if five people immediately stole the content and put it up on their site, claiming ownership of your research? Doubtful.
Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.
It would be great if we could tell specifically how something like ChatGPT creates its output; it would be great for research, so it's not like there is no interest in it, it's just not an easy thing to do. It's more "Where did you get your identity from?" than "Who's the author of that book?". You might think "But sometimes what the machine gives CAN literally be the answer to 'Who is the author of that book?'", but even in those cases the answer is not restricted to the work alone; there is an entire background that makes it understand that thing is what you want.
I'm sorry, but pretty much nobody does this. There is no "And these books are how I learned to write like this" after each text. There is no "Thank you Pythagoras!" after using the theorem. Generally you want sources, yes, but for verification and as a way to signal reliability.
Specifically academics and researchers do this, yes. Pretty much nobody else.
This kind of mentality would have stopped the internet from existing. After all, it has been an absolute copyright nightmare, has it not?
If that's what copyright does then we are better without it.
>That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.
I could argue that public school teachers are paid by previous students. Not always the ones they taught, but still. But really, this is a very new facet of copyright law. It's a stretch to compare it with existing conventions, and really off to anthropomorphize LLMs by equating them to human students.
If someone takes my software and uses it, cool. If they credit me, cool. If they don't, oh well. I'd still code.
Not everything needs to be ego driven. As long as the cancer researcher (and the future robots working alongside them) can make a living, I really don't think it matters whether they get credit outside their niches.
I have no idea who invented the CT scanner, Xray machines, the hypodermic needle, etc. I don't really care. It doesn't really do me any good to associate Edison with light bulbs either, especially when LEDs are so much better now. I have no idea who designs the cars I drive. I go out of my way to avoid cults of personality like Tesla.
There's 8 billion of us. We all need to make a living. We don't need to be famous.
The internet has changed the world. Economically, socially, technologically, psychologically, pretty much everything is now related to it in one or other way, in this sense the internet is comparable to books.
AI is another step in that direction. There is a very real possibility that the day will come when you can get, say, personalized expert nutrition advice. Personalized learning regimes. Psychological assistance. Financial advice. Instantly at no cost. This, very much like the internet, would change society altogether.
A human makes their own choices about what to disseminate, whereas these are singular for-profit services that anybody can query. The prompt injection attacks that reveal the original text show that the originals are retrievable, so if OpenAI et al cannot exhaustively prove that it will _never_ output copyrighted text without citation, then it's game over.
Being able to use electricity as a fuel source and code as a genome allows them to evolve in circumstances hostile to biological organisms. Someday they'll probably incorporate organic components too and understand biology and psychology and every other science better than any single human ever could.
It has the potential to be much more than just another primate. Jumpstarted by us, sure, but I hope someday soon they'll take to the stars and send us back postcards.
Shrug. Of course you can disagree. I doubt I'll live long enough to see who turns out right, anyway.
I don't see why it follows that the NYT should be sacrificed so some rich people in silicon valley can teach their LLM on the cheap.
When told it is impossible, they go "Geek Harder then Nerd", as if demanding it will make it happen.
Copyright is an ancient system that is a poor legal framework for the modern world, IMO. I don't think it should exist at all. Of course as a rightsholder you are free to disagree.
If we can learn and recite information, and a robot can too, then we should have the same rules.
It's not like ChatGPT is going around writing its own copycat articles and publishing them in newsstands. If it's good at memorizing and regurgitating NYT articles on request, so what? Google can do that too, and so can a human who spends time memorizing them. That's not its intent or usefulness. What's amazing is that it can combine that with other information and synthesize novel analysis.
The NYT is desperate (understandably). Journalism is a hard hard field with no money. But I'd much rather lose them than OpenAI. Of course copyright law isn't up to me, but if it were, I'd dissolve it altogether.
If someone chooses to dedicate their life to a particular domain, sacrificing through hard work and making hard-earned breakthroughs, then they get to dictate how their work will be utilized.
Sure, you can give it away. Your choice. Be anonymous. Your choice.
But you don't get to decide for them.
And their work certainly doesn't deserve to be stolen by an inhumane, non-acknowledging machine.
Credit in academia is more the exception to the rule, and it's that cutthroat industry that needs a better, more cooperative system.
If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could without having the ability to more easily have that material influence new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).
It will be fun to trip through these questions over the next 20 years.
If so, sure. I wasn't saying that. By "silly IP battles", I meant old guard media companies trying to sue AI out of existence just to defend their IP rather than trying to innovate. Not that different from what we saw with the RIAA and Napster. Somehow the music industry survived and there are more indie artists being discovered all the time.
I don't think this is so much a battle of OpenAI vs NYT but whether copyright law has outlived its usefulness. I think so.
If I misunderstood your reply completely, I apologize.
Media survives through advertising. Those who advertise dictate what gets shown and what doesn't, since if something inconvenient for them gets shown, they might not want to advertise there anymore, which means less money. It's the exact same thing that happens online, it's just more evident online than in traditional media.
How come that even before Oct 7 Europe in general sided more with Palestine than with Israel, whereas it's the opposite for the US? Simple, Israel does a whole lot of lobbying in the US, which skews information in their favor. Calling this "brainwashing" is hyperbolic, but there is some truth to it.
There’s nothing wrong with it. But it would make it vastly more cumbersome to build training sets in the current environment.
If the law permits producers of content to easily add extra clauses to their content licenses that say “an LLM must pay us to train on this content”, you can bet that that practice would be near-universally adopted because everyone wants to be an owner. Almost all content would become AI-unfriendly. Almost every token of fresh training content would now potentially require negotiation, royalty contracts, legal due diligence, etc. It’s not like OpenAI gets their data from a few sources. We’re talking about millions of sources, trillions of tokens, from all over the internet — forums, blogs, random sites, repositories, outlets. If OpenAI were suddenly forced to do a business deal with every source of training data, I think that would frankly kill the whole thing, not just slow it down.
It would be like ordering Google to do a business deal with the webmaster of every site they index. Different business, but the scale of the dilemma is the same. These companies crawl the whole internet.
These types of arguments miss the mark entirely imho. First and foremost, not every instance of copyrighted creation involves a giant corporation. Second, what you are arguing against is the unfair leverage corporations have when negotiating a deal with a rising artist.
In this case it's the NYT vs OpenAI, last decade it was the RIAA vs Napster.
I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...
Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.
> "[...] the fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work."
----
So here we have OpenAI, ostensibly a nonprofit, using portions of a copyrighted work for commenting on and educating (the prompting user), in a way that doesn't directly compete with NYT (nobody goes "Hey ChatGPT, what's today's news?"), not intentionally copying and publishing their materials (they have to specifically probe it to get it to spit out the copyrighted content). There's not a commercial intent to compete with the NYT's market. There is a subscription fee, but there is also tuition in private classrooms and that doesn't automatically make it a copyright violation. And citing the source or not doesn't really factor into copyright, that's just a politeness thing.
I'm not a lawyer. It's just not that straightforward. But of course the court will decide, not us randos on the internet...
On the other hand, any new life will just end up facing the same issues carbon life does: competition, viruses, conflicts, etc. The universe has likely had an infinity to come up with what it has come up with. I don't think it's "stupid". We're part of an ecosystem; we just can't see that.
> I have no idea who invented the CT scanner, Xray machines, the hypodermic needle, etc. I don't really care.
Maybe you should care, because those things didn't fall out of the sky, and someone sure as shit got paid to develop and build them. Your copy-and-pasted code is worth less; a CT scanner isn't.
I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.
And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.
I don't think we as a society owe it to any profession to artificially keep it alive as it's historically been. We do owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.
In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...
If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.
AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.
If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.
I disagree that our own creativity doesn't work that way: nothing is very original; our current art is built on 100k years of development from when cavemen would scrawl simple art onto stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps we generalize over several different sources, diluting plagiarism with abstraction?
We are still in the early days of this tech, we will be having very different conversations about it even as soon as 5 years later.
It's not trying to prohibit. If they want to use copyrighted material, they should have to pay for it like anyone else would.
> prevent centralization of profit to the players who are already the largest?
Having to destroy the infringing models altogether on top of retroactively compensating all infringed rightsholders would probably take the incumbents down a few pegs and level the playing field somewhat, albeit temporarily.
They'd have to learn how to run their business legally alongside everyone else, while saddled with dealing with an appropriately existential monetary debt.
Quality journalism hasn't had a meaningful source of funding for a while, now. If AI does end up replacing honest-to-goodness investigative reporting, it'll be for the same reason the internet replaced the newspaper.
OpenAI is a business. NYT is a business. MS is a business. None of them will be happy when some other party takes something away from them without paying.
I saw an article the other day where they banned ByteDance's account for using their product to build their own, can you see the absolutely massive hypocrisy here?
It's fine for OpenAI to steal work, but if someone wants to steal theirs, it's not? I cannot believe people even try defend this shit. It's wack.
LLMs have, to my knowledge, made zero significant novel scientific discoveries. Much like crypto, they're a failure of technology to meaningfully move humanity forward; their only accomplishment is to parrot and remix information they've been trained on, which does have some interesting applications that have made Microsoft billions of dollars over the past 12 months, but let's drop the whole "they're going to save humanity and must be protected at any cost" charade. They're not AGI, and because no one has even a mote of dust of a clue as to what it will take to make AGI, its not remotely tenable to assert that they're even a stepping stone toward it.
There is significant evidence (220,000 pages worth) in their lawsuit that ChatGPT was trained on text beyond that paywall.
But I've just explained that ChatGPT can't actually produce news articles. I can't ask ChatGPT what happened today, and if I could it would be because a journalist went out and told ChatGPT what happened.
But even if it did, an exact-match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does that).
For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.
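A rough sketch of that canonicalization idea, Python-only and handling nothing but variable renames (a real filter would also have to deal with reordering, literals, multiple languages, and so on):

    # Rename every identifier to a positional placeholder before comparing, so
    # a copy with renamed variables still matches. Toy example only.
    import ast

    class Canon(ast.NodeTransformer):
        def __init__(self):
            self.names = {}

        def visit_Name(self, node):
            # map each distinct identifier to v0, v1, ... in order of appearance
            node.id = self.names.setdefault(node.id, f"v{len(self.names)}")
            return node

    def fingerprint(source):
        return ast.dump(Canon().visit(ast.parse(source)))

    a = "total = 0\nfor x in xs:\n    total += x\nprint(total)"
    b = "acc = 0\nfor item in items:\n    acc += item\nprint(acc)"

    print(fingerprint(a) == fingerprint(b))  # True: same code modulo renaming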
Wouldn't it be better to either:
* not include copyrighted content in the training material in the first place, or
* explicitly tag the training material with license and origin information, such that the final output can produce a proof of which training material was relevant for producing that output, and not mix differently licensed content?
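For the second option, an origin-tagged training record might look something like this; the field names and values are invented:

    # Hypothetical shape of an origin-tagged training record, per the
    # suggestion above; nothing here reflects any real pipeline.
    record = {
        "text": "...the document body...",
        "source_url": "https://example.com/article",
        "license": "CC-BY-4.0",
        "retrieved": "2023-11-02",
    }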
And Altman (Mr. Worldcoin) and fucking Microsoft are what, some gracious angels building chatbots for the betterment of humanity? How is them stealing as much content as they can get away with not greedy, exactly?
You still literally have not explained how this works. ChatGPT could write a news article, but it's not going to actively discover new social phenomena or interview people on the street. Niche journalism will continue having demand for the sole reason that AI can't reliably surface new and interesting content.
So... again, how does a pre-trained transformer model scoop a journalist's investigation?
> Then journalists will stop working because nobody pays them.
How is that any different than the status-quo on the internet? The cost of quality information has been declining long before AI existed. Thousands of news publications have gone out of business or been bought out since the dawn of the internet, before ChatGPT was even a household name. Since you haven't really identified what makes AI unique in this situation, it feels like you're conflating the general declining demand for journalism with AI FOMO.
You can say that the people getting their news from the tech products will switch to paying news organizations in some way if the news starts to disappear, but I highly doubt it, seeing how people treat news today. And if that does happen, they'll switch back again to the AI products, as the centralization they can provide is valuable.
Keep in mind these guys play both sides of every field they cover in their "news".
The model is fuzzy; it's the learning part. It'll never follow the rules to the letter, the same way humans fuck up all the time.
But a model trained to be literate and to parse meaning could be provided with the hard data via a vector DB or similar. It can cite sources from there, or as it finds them via the internet, and to be fair this is how they should've trained the model.
But in order to become literate, it needs to read... and we humans reuse phrases we've picked up all the time ("as easy as pie" -- oops, copyright).
Word association games are basically the same exercise, but with humans. And hell, I bet I could play a word association game with an LLM, too.
Having a magical ring in my book after I've read Lord of the Rings, is that copyright infringement?
You have not at all explained how an AI is going to somehow write a news post about something that has just happened.
I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.
But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?
Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.
Someone prove to me that GPT can reproduce enough text verbatim to make it clear that it was trained on the original text first-hand, rather than second-hand from other sources.
Yeah, good luck embedding citations into that. Everyone here saying it's easy needs to go earn their 7 figure comp at an AI company instead of wasting their time educating us dummies.
I.e.:
- social relations -> social networks
- customer service -> chatbots and Jira
- media -> AI news, if the silly IP battles get out of the way.
- residential housing and vacations -> home swap markets
- jobs -> gig jobs, minus the benefits, plus an algorithm for a boss
I’m not sure how many other industries tech has to wade into, disrupt, create intense negative externalities in (if you don’t have equity in the companies), leave, and repeat, before industries finally get protective, like with this lawsuit.