Using example sentences from copyrighted sources

Mateon1 · July 13, 2019, 8:12pm

Hi, I have a lot of real-world text data (hundreds of gigabytes) scraped from a range of sources: websites (articles and comments), books (both copyrighted and public domain), and fanfiction.

I’d like to know how I can contribute using this data. I can easily grep these corpora for thousands of real-world examples of various mistakes, but I’m not sure whether quoting sentences out of context from these sources could cause copyright-related problems.

Recently I reported a GitHub issue regarding the AT_IN_THE_KITCHEN collocation rule. The specific example I mentioned was fixed, but a simple grep of my fanfiction dataset turned up many other false positives and incorrect suggestions involving this rule. The easiest approach for me would be to take the first hundred hits for that rule, group them into sets such as “rule worked correctly”, “incorrect suggestion”, and “false positive”, upload the sentences to a pastebin site, and link to it from a forum post or GitHub issue.
I’m worried that doing this could cause copyright problems, though, which is why I’m asking here first.
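
For reference, the workflow I have in mind could look roughly like the sketch below. This is only an illustration under assumptions: the function names (`first_hits`, `group_hits`, `format_report`) are hypothetical, the regular expression passed to `first_hits` is a crude stand-in for a grep and not the actual logic of the rule, and the category labels would be assigned by hand after reading each hit.

```python
import re
from collections import defaultdict
from itertools import islice

# The three buckets described in the post.
CATEGORIES = ("rule worked correctly", "incorrect suggestion", "false positive")


def first_hits(lines, pattern, limit=100):
    """Return the first `limit` lines matching `pattern` (case-insensitive).

    `pattern` is a plain regex used like grep; it only approximates
    what a real grammar rule would flag.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matches = (line.strip() for line in lines if regex.search(line))
    return list(islice(matches, limit))


def group_hits(labelled_hits):
    """Group (sentence, category) pairs by their manually assigned category."""
    groups = defaultdict(list)
    for sentence, category in labelled_hits:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        groups[category].append(sentence)
    return groups


def format_report(groups):
    """Render the groups as plain text suitable for a paste or issue comment."""
    parts = []
    for category in CATEGORIES:
        sentences = groups.get(category, [])
        parts.append(f"== {category} ({len(sentences)}) ==")
        parts.extend(f"- {sentence}" for sentence in sentences)
        parts.append("")  # blank line between sections
    return "\n".join(parts)
```

Usage would be along the lines of opening a corpus file, calling `first_hits(file_handle, pattern)`, labelling each returned sentence by hand, and passing the resulting pairs through `group_hits` and `format_report` before pasting the text somewhere.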

dnaber · July 13, 2019, 8:13pm

Hi, help like that would be very welcome. I can’t comment on the copyright issue, but if you think it might be a problem, you could protect the data with a password so that only LT contributors working on that issue get access.