zlacker

[parent] [thread] 5 comments
1. pavlov+(OP)[view] [source] 2023-12-27 14:27:43
The NYT publishes about 200 pieces of journalism every day (according to their own website), and it was founded in 1851. That makes for a lot of articles.
replies(3): >>engine+a1 >>midasu+qb >>naltro+3h
2. engine+a1[view] [source] 2023-12-27 14:34:48
>>pavlov+(OP)
(2023 - 1851) * 365 * 200 = 12,556,000

Yep, so a few million ripped-off articles is plausible.
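
A quick sanity check in Python (a rough sketch of the same back-of-the-envelope math, assuming a flat 200 articles/day and ignoring leap days):

    # Back-of-the-envelope NYT article count: constant 200 articles/day since 1851.
    years = 2023 - 1851           # 172 years
    articles = years * 365 * 200  # ignores leap days
    print(articles)               # -> 12556000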

replies(1): >>zozbot+L1
3. zozbot+L1[view] [source] [discussion] 2023-12-27 14:37:47
>>engine+a1
Everything from 1851 to 1927 ought to be in the public domain, though. If the goal of training an AI is just "to mimic a style" there are absolutely humongous amounts of text that are totally free of any copyright restrictions.
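
The 1927 cutoff falls out of the 95-years-from-publication term that covers pre-1978 US works. A minimal check in Python (simplified to whole years; real terms run to the end of the calendar year):

    # Pre-1978 US works: copyright runs 95 years from publication.
    def published_work_is_pd(publication_year: int, year: int = 2023) -> bool:
        return year - publication_year > 95

    print(published_work_is_pd(1927))  # True  -> public domain
    print(published_work_is_pd(1928))  # False -> still copyrighted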
replies(1): >>hnarn+C2
4. hnarn+C2[view] [source] [discussion] 2023-12-27 14:42:43
>>zozbot+L1
Yes, there are large amounts of public-domain text available, but does anyone believe such a restriction was actually imposed when feeding the models?
5. midasu+qb[view] [source] 2023-12-27 15:31:00
>>pavlov+(OP)
The first 75+ years are no longer in copyright, so it's certainly possible to train on thousands, maybe millions, of NYT articles without concern.
6. naltro+3h[view] [source] 2023-12-27 16:03:52
>>pavlov+(OP)
Copyright in the US persists for 70 years after the author's death.

So everything written by an author who died in 1953 or earlier (2023 - 70 = 1953) would now be out of copyright.

Conversely, if the author of an article published in 1950 is still alive, the work is still copyrighted.
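
A minimal sketch of that life-plus-70 rule in Python (simplified: real terms run to the end of the calendar year, and pre-1978 works usually follow publication-based terms instead):

    # US "life + 70" rule for works by individual authors (simplified).
    def out_of_copyright(author_death_year: int, year: int = 2023) -> bool:
        return year - author_death_year >= 70

    print(out_of_copyright(1953))  # True  -> out of copyright
    print(out_of_copyright(1954))  # False -> still copyrighted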
