
isCompoundWord using JLanguageTool


(Idan Morad) #1

Hi,

I created the method:
public boolean isCompoundWord(String word)
using JLanguageTool. First I deactivate every rule that isn't related to spell checking, then I split the word at every character position and check whether the two resulting parts contain no spelling mistake. For a text of 9 sentences, running this method takes far longer than just checking all the sentences together. Is there a simpler way to implement this function, or a way to use only the JLanguageTool English dictionary to check whether a word exists?


(Daniel Naber) #2

How exactly did you implement isCompoundWord? Could you post the code?


(Idan Morad) #3
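
// Imports inferred from usage (assumed: slf4j logging, Guava Strings, Spring @Component).
import java.io.IOException;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import com.google.common.base.Strings;
import org.languagetool.JLanguageTool;
import org.languagetool.Language;
import org.languagetool.rules.Rule;
import org.languagetool.rules.RuleMatch;
import org.languagetool.rules.spelling.SpellingCheckRule;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

// UrlsCleaner, Names, WordHelper, Acronym and HtmlTagsRemoval are project classes not shown here.
@Component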
public class SpellChecker
{
    protected final static Logger logger = LoggerFactory.getLogger(SpellChecker.class);
    private final Language language;
    private final UrlsCleaner urlsCleaner;
    private final Names names;
    private final WordHelper wordHelper;
    private final Acronym acronym;
    private final HtmlTagsRemoval htmlTagsRemoval;
    private final List<String> preliminaryWordsToIgnore;
    private Predicate<Rule> rulePredicate;

    public SpellChecker(Language language, UrlsCleaner urlsCleaner, Names names, WordHelper wordHelper,
                        Acronym acronym, HtmlTagsRemoval htmlTagsRemoval, List<String> wordsToIgnore,
                        Predicate<Rule> rulePredicate)
    {
        this.language = language;
        this.urlsCleaner = urlsCleaner;
        this.names = names;
        this.wordHelper = wordHelper;
        this.acronym = acronym;
        this.htmlTagsRemoval = htmlTagsRemoval;
        this.rulePredicate = rulePredicate;
        this.preliminaryWordsToIgnore = wordsToIgnore;
    }

    public SpellChecker(Language language, UrlsCleaner urlsCleaner, Names names, WordHelper wordHelper,
                        Acronym acronym, HtmlTagsRemoval htmlTagsRemoval, List<String> wordsToIgnore)
    {
        this(language, urlsCleaner, names, wordHelper, acronym, htmlTagsRemoval, wordsToIgnore, rule -> false);
    }

    /**
     * Proofreads the given plain text and returns the mistakes found.
     *
     * @param text          the text to check.
     * @param wordsToIgnore array of words to ignore from spelling check (usually names). can be empty.
     * @return list of words with spelling or grammar mistakes.
     */
    public List<String> checkPlainText(String text, List<String> wordsToIgnore)
    {
        JLanguageTool languageTool = new JLanguageTool(language);

        addAcceptedTerms(languageTool, wordsToIgnore);

        deactivateRulesByPredicate(languageTool, rulePredicate);

        List<RuleMatch> matches = new ArrayList<>();
        try
        {
            matches = languageTool.check(text);
        }
        catch (IOException e)
        {
            logger.error(MessageFormat.format("Couldn't parse text:"
                            + System.lineSeparator() + "{0}"
                            + System.lineSeparator(), text)
                    , e);
        }

        List<String> listOfGrammarMistakes = matches.stream()
                .filter(match -> !(match.getRule() instanceof SpellingCheckRule))
                .map(match -> text.substring(match.getFromPos(), match.getToPos()))
                .collect(Collectors.toList());

        List<String> potentialSpellingMistakes = matches.stream()
                .filter(match -> match.getRule() instanceof SpellingCheckRule)
                .map(match -> text.substring(match.getFromPos(), match.getToPos()))
                .collect(Collectors.toList());

        listOfGrammarMistakes.addAll(cleanSpellingMistakes(potentialSpellingMistakes));

        return listOfGrammarMistakes;
    }

    private void deactivateRulesByPredicate(JLanguageTool languageTool, Predicate<Rule> rulePredicate)
    {
        languageTool.getAllActiveRules().stream()
                .filter(rulePredicate)
                .map(Rule::getId)
                .forEach(languageTool::disableRule);
    }

    private void addAcceptedTerms(JLanguageTool languageTool, List<String> wordsToIgnore)
    {
        List<String> fullDictionaryToIgnore = new ArrayList<>(preliminaryWordsToIgnore);
        fullDictionaryToIgnore.addAll(wordsToIgnore);

        if (!fullDictionaryToIgnore.isEmpty())
        {
            languageTool.getAllActiveRules()
                    .stream()
                    .filter(rule -> rule instanceof SpellingCheckRule)
                    .forEach(rule -> ((SpellingCheckRule) rule).acceptPhrases(fullDictionaryToIgnore));
        }
    }

    private List<String> cleanSpellingMistakes(List<String> listOfPossibleSpellingMistakes)
    {
        return listOfPossibleSpellingMistakes.stream()
                .filter(word -> !wordHelper.containNumbers(word))
                .filter(word -> !names.isNameOrPlace(word))
                .filter(word -> !acronym.isAcronym(word))
                .filter(word -> !urlsCleaner.isURL(word))
                .filter(word -> !urlsCleaner.isEmail(word))
                .filter(word -> !word.contains("."))
                .collect(Collectors.toList());
    }

    /**
     * Determine if a given word is compound or not.
     *
     * @param word the word to check if it's a compound word or not.
     * @return true if the given word is a compound word; false otherwise.
     */
    public boolean isCompoundWord(String word)
    {
        if (Strings.isNullOrEmpty(word))
        {
            return false;
        }

        String wordLower = word.toLowerCase();

        return IntStream.range(1, wordLower.length())
                .mapToObj(index -> wordLower.substring(0, index) + " " + wordLower.substring(index, wordLower.length()))
                .anyMatch(string -> checkPlainText(string, Collections.emptyList()).isEmpty());
    }
}

(Daniel Naber) #4

Your checkPlainText method re-creates JLanguageTool every time, so this won't be very fast. Try creating and setting it up only once.
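
A minimal sketch of that idea: the expensive setup happens once, and the split loop only performs a cheap lookup per candidate split. `CompoundWordSketch` and the `isKnownWord` predicate are hypothetical names, not part of LanguageTool. In practice the predicate would wrap a spell check that was configured a single time, which is an assumption about how it would be wired up.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.IntStream;

// Hypothetical helper: the word lookup is injected once instead of being rebuilt per call.
final class CompoundWordSketch {
    private final Predicate<String> isKnownWord;

    CompoundWordSketch(Predicate<String> isKnownWord) {
        this.isKnownWord = isKnownWord;
    }

    // True if the word can be split into two non-empty parts that are both known words.
    boolean isCompoundWord(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        return IntStream.range(1, lower.length())
                .anyMatch(i -> isKnownWord.test(lower.substring(0, i))
                        && isKnownWord.test(lower.substring(i)));
    }
}
```

With a plain `Set<String>` as the dictionary, this could be constructed as `new CompoundWordSketch(words::contains)`. The split itself is unchanged from the posted code; only the cost of each lookup differs.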


(Idan Morad) #5

I need it to be thread-safe; this is why JLanguageTool is re-created every time.


(Idan Morad) #6

Is there a workaround for my problem?
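
One possible workaround is to keep one instance per thread rather than one per call, so that each thread pays the setup cost only once and no instance is shared between threads. The sketch below shows the pattern with `ThreadLocal`. `PerThread` is a hypothetical wrapper, and the supplier stands in for whatever creates and configures the checker (for example, the `new JLanguageTool(language)` call plus the rule deactivation from `checkPlainText`).

```java
import java.util.function.Supplier;

// Hypothetical wrapper: lazily creates one instance per thread and reuses it on later calls.
final class PerThread<T> {
    private final ThreadLocal<T> instances;

    PerThread(Supplier<? extends T> factory) {
        // The factory runs at most once per thread, on that thread's first get().
        this.instances = ThreadLocal.withInitial(factory);
    }

    T get() {
        return instances.get();
    }
}
```

One caveat with reuse: words accepted per call via `addAcceptedTerms` would presumably stay on the reused instance, so the per-call ignore list may need to be handled separately, for instance by filtering the matches afterwards.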