Is it possible to extend the LT model to include new NLP constructs, which we can reference in XML?

jonathonherbert · April 21, 2021, 9:57am

As an example, we have a rule where it would be useful to match on the animacy of a noun, to determine whether we should abbreviate large quantities – for example, 1 million people and 1m albums sold are both correct.

Is there a way for us to extend LanguageTool to insert this data as something we could pattern match on – adding a new property to tokens, for example, so we can use it in rules? In this case, instead of having to maintain a list of animate things inside the rule, as in

    <pattern>
        <marker><token regexp="yes" skip="1">(\d[\d.]*)m</token></marker>
        <token regexp="yes">(adults|Americans|animals|cats ... long list of animate things here)</token>
    </pattern>
    <message>million: in copy use m for sums of money, units or inanimate objects, otherwise million<suggestion><match no="1" regexp_match="(\d[\d.]*)m" regexp_replace="$1"/> million</suggestion></message>

we could add our custom annotations and write something like

    <pattern>
        <marker><token regexp="yes" skip="1">(\d[\d.]*)m</token></marker>
        <token custom-namespace_is-animate="true"></token>
    </pattern>
    <message>million: in copy use m for sums of money, units or inanimate objects, otherwise million<suggestion><match no="1" regexp_match="(\d[\d.]*)m" regexp_replace="$1"/> million</suggestion></message>

Alternatively, we could work towards contributing something to the core of LanguageTool, if the maintainers felt that was useful – but it would be extra processing effort that many users would likely not benefit from, which makes it feel like a natural candidate for an interface and a user-supplied extension.

dnaber · April 22, 2021, 4:20pm

Hi Jonathon, the POS tags in LT are basically just strings. You can modify these strings or add new tags by using the disambiguator, as described in "Developing a Disambiguator" on dev.languagetool.org.
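
As a rough illustration of the disambiguator approach applied to the animacy case, here is a minimal sketch. The tag name `ANIMATE`, the rule id and name, and the short word list are all assumptions made up for this example, and the exact `<disambig>` syntax should be checked against the disambiguator documentation before relying on it.

    <!-- disambiguation.xml: add a hypothetical ANIMATE tag to selected nouns.
         The tag name, rule id and word list are illustrative assumptions. -->
    <rule id="TAG_ANIMATE_NOUNS" name="Tag animate nouns (illustrative)">
        <pattern>
            <token regexp="yes">adults|Americans|animals|cats</token>
        </pattern>
        <disambig action="add"><wd pos="ANIMATE"/></disambig>
    </rule>

    <!-- grammar.xml: match on the added tag instead of repeating the list -->
    <pattern>
        <marker><token regexp="yes" skip="1">(\d[\d.]*)m</token></marker>
        <token postag="ANIMATE"/>
    </pattern>

The word list still has to be maintained somewhere, but with this arrangement it would live in one disambiguation rule, and any number of grammar rules could refer to it through `postag`.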