idanm
(Idan Morad)
October 2, 2016, 7:44am
1
Hi,
I created the method public boolean isCompoundWord(String word) using JLanguageTool. First I deactivate every rule that isn't related to spell checking, then I split the word at every character position and check whether the resulting parts are spelling mistakes. For a text of 9 sentences, this method takes much longer than just checking all the sentences together. Is there a simpler way to implement this function, or a way to use the JLanguageTool English dictionary directly and check whether a word exists?
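To make the question concrete, here is a minimal sketch of the split-and-look-up idea under a loud assumption: DICTIONARY is a hypothetical in-memory word list, not the LanguageTool dictionary, and CompoundSplitSketch is an illustrative name. It only shows that each candidate split can be answered by two set lookups instead of a full text check.

```java
import java.util.Locale;
import java.util.Set;
import java.util.stream.IntStream;

// Sketch only: DICTIONARY is a hypothetical stand-in, not LanguageTool's dictionary.
final class CompoundSplitSketch {
    private static final Set<String> DICTIONARY = Set.of("sun", "flower", "book", "shelf");

    // True if the word can be cut into two parts that are both dictionary words.
    static boolean isCompoundWord(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        return IntStream.range(1, lower.length())
                .anyMatch(i -> DICTIONARY.contains(lower.substring(0, i))
                        && DICTIONARY.contains(lower.substring(i)));
    }

    public static void main(String[] args) {
        System.out.println(isCompoundWord("sunflower"));
        System.out.println(isCompoundWord("sunflowers"));
    }
}
```

With a set lookup, a word of length n costs at most 2(n - 1) lookups, independent of any rule setup.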
dnaber
(Daniel Naber)
October 2, 2016, 8:15am
2
How exactly did you implement isCompoundWord? Could you post the code?
idanm
(Idan Morad)
October 5, 2016, 7:27am
3
import java.io.IOException;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import com.google.common.base.Strings; // assumed: Guava; the imports were not part of the post
import org.languagetool.JLanguageTool;
import org.languagetool.Language;
import org.languagetool.rules.Rule;
import org.languagetool.rules.RuleMatch;
import org.languagetool.rules.spelling.SpellingCheckRule;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

// UrlsCleaner, Names, WordHelper, Acronym and HtmlTagsRemoval are project classes not shown here.
@Component
public class SpellChecker
{
    protected final static Logger logger = LoggerFactory.getLogger(SpellChecker.class);

    private final Language language;
    private final UrlsCleaner urlsCleaner;
    private final Names names;
    private final WordHelper wordHelper;
    private final Acronym acronym;
    private final HtmlTagsRemoval htmlTagsRemoval;
    private final List<String> preliminaryWordsToIgnore;
    private Predicate<Rule> rulePredicate;

    public SpellChecker(Language language, UrlsCleaner urlsCleaner, Names names, WordHelper wordHelper,
                        Acronym acronym, HtmlTagsRemoval htmlTagsRemoval, List<String> wordsToIgnore,
                        Predicate<Rule> rulePredicate)
    {
        this.language = language;
        this.urlsCleaner = urlsCleaner;
        this.names = names;
        this.wordHelper = wordHelper;
        this.acronym = acronym;
        this.htmlTagsRemoval = htmlTagsRemoval;
        this.rulePredicate = rulePredicate;
        this.preliminaryWordsToIgnore = wordsToIgnore;
    }

    public SpellChecker(Language language, UrlsCleaner urlsCleaner, Names names, WordHelper wordHelper,
                        Acronym acronym, HtmlTagsRemoval htmlTagsRemoval, List<String> wordsToIgnore)
    {
        this(language, urlsCleaner, names, wordHelper, acronym, htmlTagsRemoval, wordsToIgnore, rule -> false);
    }

    /**
     * Proofreads the given plain text and returns the mistakes found.
     *
     * @param text the text to check.
     * @param wordsToIgnore list of words to exclude from the spelling check (usually names). Can be empty.
     * @return list of words with spelling or grammar mistakes.
     */
    public List<String> checkPlainText(String text, List<String> wordsToIgnore)
    {
        // A new JLanguageTool is created and configured on every call.
        JLanguageTool languageTool = new JLanguageTool(language);
        addAcceptedTerms(languageTool, wordsToIgnore);
        deactivateRulesByPredicate(languageTool, rulePredicate);
        List<RuleMatch> matches = new ArrayList<>();
        try
        {
            matches = languageTool.check(text);
        }
        catch (IOException e)
        {
            logger.error(MessageFormat.format("Couldn't parse text:"
                    + System.lineSeparator() + "{0}"
                    + System.lineSeparator(), text), e);
        }
        List<String> listOfGrammarMistakes = matches.stream()
                .filter(match -> !(match.getRule() instanceof SpellingCheckRule))
                .map(match -> text.substring(match.getFromPos(), match.getToPos()))
                .collect(Collectors.toList());
        List<String> potentialSpellingMistakes = matches.stream()
                .filter(match -> match.getRule() instanceof SpellingCheckRule)
                .map(match -> text.substring(match.getFromPos(), match.getToPos()))
                .collect(Collectors.toList());
        listOfGrammarMistakes.addAll(cleanSpellingMistakes(potentialSpellingMistakes));
        return listOfGrammarMistakes;
    }

    private void deactivateRulesByPredicate(JLanguageTool languageTool, Predicate<Rule> rulePredicate)
    {
        languageTool.getAllActiveRules().stream()
                .filter(rulePredicate)
                .map(Rule::getId)
                .forEach(languageTool::disableRule);
    }

    private void addAcceptedTerms(JLanguageTool languageTool, List<String> wordsToIgnore)
    {
        List<String> fullDictionaryToIgnore = new ArrayList<>(preliminaryWordsToIgnore);
        fullDictionaryToIgnore.addAll(wordsToIgnore);
        if (!fullDictionaryToIgnore.isEmpty())
        {
            languageTool.getAllActiveRules()
                    .stream()
                    .filter(rule -> rule instanceof SpellingCheckRule)
                    .forEach(rule -> ((SpellingCheckRule) rule).acceptPhrases(fullDictionaryToIgnore));
        }
    }

    private List<String> cleanSpellingMistakes(List<String> listOfPossibleSpellingMistakes)
    {
        return listOfPossibleSpellingMistakes.stream()
                .filter(word -> !wordHelper.containNumbers(word))
                .filter(word -> !names.isNameOrPlace(word))
                .filter(word -> !acronym.isAcronym(word))
                .filter(word -> !urlsCleaner.isURL(word))
                .filter(word -> !urlsCleaner.isEmail(word))
                .filter(word -> !word.contains("."))
                .collect(Collectors.toList());
    }

    /**
     * Determines whether a given word is a compound word.
     *
     * @param word the word to check.
     * @return true if the given word is a compound word; false otherwise.
     */
    public boolean isCompoundWord(String word)
    {
        if (Strings.isNullOrEmpty(word))
        {
            return false;
        }
        String wordLower = word.toLowerCase();
        // Try every split position; each candidate triggers a full checkPlainText call.
        return IntStream.range(1, wordLower.length())
                .mapToObj(index -> wordLower.substring(0, index) + " " + wordLower.substring(index))
                .anyMatch(string -> checkPlainText(string, Collections.emptyList()).isEmpty());
    }
}
dnaber
(Daniel Naber)
October 5, 2016, 7:31am
4
Your checkPlainText method re-creates JLanguageTool every time, so this won't be very fast. Try creating and setting it up only once.
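As a rough illustration of "create and set up only once", the sketch below keeps the expensive object in a field and reuses it across calls. CostlyChecker is a hypothetical stand-in for a configured JLanguageTool, and ReuseSketch is an illustrative name; neither is LanguageTool API. The counter exists only to show how many instances get built.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ReuseSketch {
    // Hypothetical stand-in for an object that is expensive to build and configure.
    static final class CostlyChecker {
        static final AtomicInteger CREATED = new AtomicInteger();

        CostlyChecker() {
            CREATED.incrementAndGet(); // count constructions for the demo
        }

        boolean accepts(String text) {
            return !text.isEmpty();
        }
    }

    // Built once per ReuseSketch, then reused by every check() call.
    private final CostlyChecker checker = new CostlyChecker();

    boolean check(String text) {
        return checker.accepts(text);
    }

    public static void main(String[] args) {
        ReuseSketch sketch = new ReuseSketch();
        for (int i = 0; i < 1000; i++) {
            sketch.check("word" + i);
        }
        System.out.println("instances created: " + CostlyChecker.CREATED.get());
    }
}
```

This version is only safe if the wrapped object may be shared by all callers, which matters for multi-threaded use.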
idanm
(Idan Morad)
October 6, 2016, 6:01am
5
I need it to be thread-safe, which is why JLanguageTool is re-created every time.
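One common way to reconcile thread safety with reuse is to keep one instance per thread rather than one per call. As far as I know, the JLanguageTool Javadoc describes the class as not thread-safe and suggests one instance per thread. The sketch below shows that pattern with ThreadLocal; CostlyChecker is again a hypothetical stand-in for a configured JLanguageTool, and PerThreadSketch and instancesCreatedBy are illustrative names.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PerThreadSketch {
    // Hypothetical stand-in for a non-thread-safe object that is expensive to build.
    static final class CostlyChecker {
        static final AtomicInteger CREATED = new AtomicInteger();

        CostlyChecker() {
            CREATED.incrementAndGet(); // count constructions for the demo
        }

        boolean accepts(String text) {
            return !text.isEmpty();
        }
    }

    // Each thread lazily builds its own instance on first use and keeps it afterwards.
    private static final ThreadLocal<CostlyChecker> CHECKER =
            ThreadLocal.withInitial(CostlyChecker::new);

    static boolean check(String text) {
        return CHECKER.get().accepts(text);
    }

    // Runs 100 checks on each of threadCount new threads and reports how many instances were built.
    static int instancesCreatedBy(int threadCount) {
        int before = CostlyChecker.CREATED.get();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    check("word" + i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return CostlyChecker.CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("instances created: " + instancesCreatedBy(4));
    }
}
```

The cost of construction is then paid once per worker thread instead of once per candidate split. In a thread pool the instances live as long as the pool's threads, so per-call settings such as extra accepted words would need separate handling.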
idanm
(Idan Morad)
October 13, 2016, 7:05am
6
Is there a workaround for my problem?