Hi, I have a lot of real-world text data (hundreds of gigabytes) scraped from sources such as websites (articles and comments), books (both copyrighted and public domain), and fanfiction.
I’d like to know how I can contribute using this data. I can easily grep these corpora for thousands of real-world examples of various mistakes, but I’m not sure whether quoting sentences out of context from these sources could cause copyright-related problems.
I recently filed a GitHub issue about the AT_IN_THE_KITCHEN collocation rule. The specific example I mentioned was fixed, but a simple grep of my fanfiction dataset turned up many other false positives and incorrect suggestions involving this rule. The easiest thing for me would be to take the first hundred hits for that rule, group them into sets like “rule worked correctly”, “incorrect suggestion”, and “false positive”, upload these sentences to a pastebin site, and link to it from a forum post or GitHub issue.
I’m worried that doing this could cause copyright problems, though, which is why I’m asking here first.
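To make the plan concrete, here is a rough sketch in Python of the grouping step I have in mind. It assumes I have already labeled each hit by hand; the function name `build_report` and the plain-text layout are just things I made up for illustration, not anything taken from the project.

```python
from collections import defaultdict

# The three sets described above; the order here is the order in the report.
LABELS = ("rule worked correctly", "incorrect suggestion", "false positive")


def build_report(rule_id, labeled_hits, limit=100):
    """Group the first `limit` (sentence, label) pairs by label.

    `labeled_hits` is a list of (sentence, label) tuples that were
    labeled manually. Returns a plain-text report suitable for pasting.
    """
    groups = defaultdict(list)
    for sentence, label in labeled_hits[:limit]:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        groups[label].append(sentence.strip())

    lines = [f"Rule: {rule_id}"]
    for label in LABELS:
        lines.append("")
        lines.append(f"== {label} ({len(groups[label])}) ==")
        lines.extend(groups[label])
    return "\n".join(lines)
```

The resulting text is what I would upload to the pastebin, so the amount of quoted material is limited to at most `limit` individual sentences per rule.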