Processing of multiple custom rules

pgrier · June 11, 2024, 8:22pm

I am new to using LanguageTool and have been enjoying the experience while developing a custom style guide. I have encountered an issue with my custom rules concerning the spacing around en dashes.

I have created two separate rules: one to ensure a space after an en dash and another to ensure a space before it. My expectation is that both rules would trigger independently based on their conditions. However, I am observing that only one rule triggers at a time depending on the sequence of checks.

Here are the rules for reference:

<rule id="NO_SPACE_AFTER_ENDASH" name="Space After En Dash">
    <pattern>
        <token spaceafter="no" min="0">–</token>
    </pattern>
    <message>Insert a space after the en dash.</message>
</rule>

<rule id="NO_SPACE_BEFORE_ENDASH" name="Space Before En Dash">
    <pattern>
        <token spacebefore="no" min="0">–</token>
    </pattern>
    <message>Insert a space before the en dash.</message>
</rule>

For example, the sentence “The years of World War II were 1939–1945” has no spaces around the en dash. I would expect both rules to trigger, indicating that a space is needed both before and after the en dash. However, only the first rule triggers. The second triggers only after I manually add a leading space to the sentence and run the check again.

Could you please advise if there is a way to configure these rules so that both conditions are checked and reported simultaneously? This would streamline the editing process significantly. I have already rewritten the rule as one, but I am more concerned with better understanding the processing. Thanks for any help.

dnaber · June 11, 2024, 8:24pm

Do they both underline the en dash? Then one rule will be filtered out, as most user interfaces cannot show two underlines at the same place. The only solution is, I think, to write a single rule instead.
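To illustrate the idea of a single check outside of LT's rule syntax, here is a minimal Python sketch. It is not a LanguageTool rule and does not use any LT API; the function name `check_en_dash_spacing` and its return shape are made up for this example. It reports one result per en dash and lists which side(s) lack a space, so both problems surface in a single pass.

```python
EN_DASH = "\u2013"  # the en dash character


def check_en_dash_spacing(text):
    """Return a list of (offset, missing_sides) for each badly spaced en dash.

    Hypothetical helper for illustration only; not part of LanguageTool.
    """
    issues = []
    for i, ch in enumerate(text):
        if ch != EN_DASH:
            continue
        missing = []
        # A dash at the very start or end of the text is not flagged on that side.
        if i > 0 and not text[i - 1].isspace():
            missing.append("before")
        if i + 1 < len(text) and not text[i + 1].isspace():
            missing.append("after")
        if missing:
            issues.append((i, missing))
    return issues


for offset, sides in check_en_dash_spacing("The years were 1939\u20131945"):
    print(f"En dash at offset {offset}: insert a space {' and '.join(sides)} it.")
```

Because there is only one result per dash, nothing competes for the same position in the text.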

pgrier · June 11, 2024, 8:40pm

I am not sure what you mean by underline. The rule should identify that “1939–1945” needs a space before and after the en dash (the – character used to show ranges in numbers and dates); it should be “1939 – 1945”. I have already solved this rule issue; I am more concerned with why both rules don’t trigger a match during processing.

dnaber · June 11, 2024, 8:54pm

Each rule underlines the part of the text that it considers incorrect (<marker> to </marker> in the rule). It sounds like one of your rules might underline 1939– while the other underlines –1945 and that will cause one match to be removed internally by LT, as – could be underlined by both rules, and LT doesn’t support that.

pgrier · June 11, 2024, 9:28pm

Sorry if I’m confusing the matter; as I said, I am new to LanguageTool. I am using the tool to generate a list of issues with the provided text. I am not using <marker>. When I call the tool, the output is just a list of issues like this:

Issue: NO_SPACE_BEFORE_ENDASH - Insert a space before the en dash.
Context: The years of World War II were 1939–1945

If I add a space in front of the en dash and run the check again, it gives this:

Issue: NO_SPACE_AFTER_ENDASH - Insert a space after the en dash.
Context: The years of World War II were 1939 –1945

I would have expected the check to match both rules and report both issues on the first check. In case it helps, this is the basic code I am using:

import language_tool_python

# Connect to a LanguageTool server that is already running locally
tool = language_tool_python.LanguageTool('en-US', remote_server='http://localhost:8081')
text = "The years of World War II were 1939 –1945"
tool.language = 'en'
matches = tool.check(text)
# Print the rule ID, message, and context of every reported match
for match in matches:
    print(f"Issue: {match.ruleId} - {match.message}")
    print(f"Context: {match.context}\n")
tool.close()

dnaber · June 11, 2024, 9:35pm

Both rules match the dash. As I said, LT internally filters matches that match the same place in the text, the dash in this case, to avoid duplicates.
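As a rough mental model only, the filtering can be pictured like the Python sketch below. This is not LT's actual code: the `Match` class, the `drop_overlapping` function, and the "keep the first one seen" policy are all assumptions made for illustration, and LT's real logic for choosing which match survives may differ.

```python
from dataclasses import dataclass


@dataclass
class Match:
    rule_id: str
    start: int  # offset of the first underlined character
    end: int    # offset just past the last underlined character


def drop_overlapping(matches):
    """Keep a match only if it does not overlap the previously kept one.

    Toy model for illustration; not LanguageTool's implementation.
    """
    kept = []
    for m in sorted(matches, key=lambda m: (m.start, m.end)):
        if kept and m.start < kept[-1].end:
            continue  # same span as an earlier match, so it is dropped
        kept.append(m)
    return kept


text = "The years of World War II were 1939\u20131945"
dash = text.index("\u2013")

# Without <marker>, each rule's match covers its whole pattern: just the dash.
matches = [
    Match("NO_SPACE_AFTER_ENDASH", dash, dash + 1),
    Match("NO_SPACE_BEFORE_ENDASH", dash, dash + 1),
]
survivors = drop_overlapping(matches)
print([m.rule_id for m in survivors])
```

Both matches cover the same single character, so only one survives. Which one survives in this toy version depends purely on input order and says nothing about how LT itself picks.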

pgrier · June 12, 2024, 7:43am

I appreciate the insights; they help answer some other questions I have. Just to clarify: the internal LT processor does not take spacebefore="no" and spaceafter="no" into account when deciding whether a rule is a duplicate? And there is no way to force a rule to be processed regardless of whether it is a duplicate?

dnaber · June 12, 2024, 7:53am

No.

Not when using the http server. The command-line version of LT will show both matches.