Add words to vocabulary at run-time

benjaminvdb · March 16, 2018, 11:16am

I’d like the users of my application to add words to the Dutch vocabulary at run-time through an HTTP POST request to the LT server. Adding words to the ignored word list was pretty easy, but adding them to the vocabulary (so I can provide replacement suggestions) is giving me headaches.

As a proof-of-concept I’ve hacked my way around the LanguageTool codebase to get this to work. I’ve noticed that the vocabulary is compiled into a binary format from the spelling.txt file that’s included in the language module for Dutch.

I’ve replaced loading spelling.txt from the resource path with a regular file on my local filesystem and I’ve added a call to LanguageToolHttpHandler that allows me to update this file. (I know this is hacky, but it seemed like an easy way to verify this idea.) While LT is able to load and use the wordlist this way, new words that are added to it using the added call to LanguageToolHttpHandler are not picked up. That is, LT does loop over the updated list, but it seems to ignore the new words.
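One way to check from the outside whether a newly added word is picked up is to send a sentence containing it to the server's /v2/check endpoint and see whether the word is still marked. The following is a minimal sketch, not part of the hack described above. It assumes a local LT server on port 8081, and the helper names (build_check_request, flagged_words, check) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local LT server listening on port 8081.
LT_CHECK_URL = "http://localhost:8081/v2/check"


def build_check_request(text, language="nl"):
    """Build (but do not send) a POST request for the /v2/check endpoint."""
    data = urllib.parse.urlencode({"language": language, "text": text}).encode("utf-8")
    return urllib.request.Request(LT_CHECK_URL, data=data, method="POST")


def flagged_words(text, response):
    """Return the substrings of `text` that were marked, using each match's offset and length."""
    return [
        text[m["offset"]:m["offset"] + m["length"]]
        for m in response.get("matches", [])
    ]


def check(text, language="nl"):
    """Send the text to the server and return the decoded JSON response."""
    with urllib.request.urlopen(build_check_request(text, language)) as resp:
        return json.load(resp)
```

With a server running, flagged_words(text, check(text)) lists the marked spans; if the new word still shows up there after the word list was updated, the server has not picked it up.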

I’ve checked whether there is some kind of caching involved that keeps me from getting the desired behavior. I’ve removed the cache in CachingWordListLoader, and simply force reloading from the filesystem on each call to loadWordsFromPath(). I’ve also disabled JLanguageTool’s internal ResultCache and the LoadingCache named dictCache in MorfologikMultiSpeller, but I’m still not getting the desired behavior.

I think LT is written in a way that doesn’t allow this because of performance implications. However, if we assume that performance isn’t an issue for me, I would be very happy to get some pointers that would help me get this implemented. Thanks!

dnaber · March 16, 2018, 12:53pm

Are you working in a public fork? If so, you could maybe post its URL here. Just in case someone wants to have a closer look.

Ruud_Baars · March 17, 2018, 6:37am

What you are proposing can be very dangerous. The speller is designed to only accept words that are correct according to the official spelling by the Taalunie. Adding just any word will reduce spelling precision a lot.

Apart from the technical stuff, what are you trying to achieve for the user?

benjaminvdb · March 19, 2018, 12:25pm

Hi Ruud! Thanks for your answer.

Our instance of LanguageTool is operating in a medical environment with its own jargon that LT doesn’t know about. We’d like our colleagues to add technical words to the vocabulary. These words will be instantly available, but also manually corrected by us in batches of 100 words or so.

Personally I think this isn’t too dangerous and something to at least try out. An unmodified LT just won’t do in our scenario.

Ruud_Baars · March 19, 2018, 12:31pm

Medical jargon would be a good addition to the words list. It could be filtered from our available corpus as well, then judged (would only need some words as a ‘seed’). It would even be possible to collect the words from documents in the organization. I could help your organization with that for a reasonable fee.

Ruud_Baars · March 20, 2018, 9:52am

Anyway, since feedback about Dutch LT is as good as zero, I would appreciate any feedback there is, the more specific the better.
Valuable feedback could be:

the added words (after check)
the rules switched off by users and the count of it
any user feedback

It does not have to be on the forum. Feedback could also be sent directly to info@taaltik.nl

dnaber · March 20, 2018, 10:39am

Here are the most ignored rules for Dutch:

+-----+------------------------------------+
| ct  | rule_id                            |
+-----+------------------------------------+
| 127 | LANG_WOORD                         |
| 127 | VAAG                               |
|  82 | TOO_LONG_SENTENCE                  |
|  69 | EINDE_ZIN_ONVERWACHT               |
|  66 | WIJ_ZIJ_MIJ                        |
|  53 | GETALLEN_0_20_OF_ROND              |
|  51 | OVERDRIJVING                       |
|  49 | KOMMA_HL                           |
|  39 | EN_EN_EN                           |
|  34 | OPTIONAL_HYPHEN                    |
|  31 | LANG_HL                            |
|  29 | MEER_DAN_50_WOORDEN                |
|  27 | KORT_2                             |
|  26 | XXX_DING                           |
|  26 | TE_VREEMD                          |
|  22 | NL_SIMPLE_REPLACE                  |
|  20 | DEELTEKEN                          |
|  19 | UPPERCASE_SENTENCE_START           |
|  18 | OP_HET_GEBIED_VAN                  |
|  18 | NL_PREFERRED_WORD_RULE_INTERNAL    |
|  18 | KOMMA_ONTBR                        |
|  16 | UNPAIRED_BRACKETS                  |
|  16 | ERVARING                           |
|  15 | KORT_1                             |
|  13 | AAN_DE_HAND_VAN                    |
|  13 | WAT_BETREFT                        |
|  12 | CREËREN                            |
|  12 | OP_BASIS_VAN                       |
|  12 | DUBBEL_WOORD                       |
|  11 | ONTKEN_2                           |
|  10 | OPTIMAAL                           |
|  10 | ERG_LANG_WOORD                     |
|  10 | AANGEZIEN                          |
|   9 | KOMMA_DAT                          |
+-----+------------------------------------+

Ruud_Baars · March 20, 2018, 10:53am

The rules seem to be okay. But it is sometimes a very strict approach. I could move those rules to a category ‘check as strict as you can’ and set that to off by default. But that would not be improving the checks.

I could also split up some of the rules, to see more exactly what is considered ‘unnecessary’ as a warning.

By the way, when the user clicks to disable the rule, what exactly is disabled? The category, the rule group, or the rule?

The marked area which caused the user to switch the rule off could be of help. As far as I can see, some rules were switched off that seem to be working great. Maybe the user clicks the rule away on some exception that was not found before…

benjaminvdb · March 20, 2018, 12:46pm

I’m not entirely sure how this is relevant to my question as I’m just trying to figure out how to replicate the behavior of adding words to spelling.txt, but then at run-time. I don’t think any rules need to be disabled.

benjaminvdb · March 22, 2018, 2:42pm

I ended up with the most hacky solution of the century, but it works. I’ve written a thin wrapper in Python that launches multiple instances of the LanguageTool HTTPServer and exposes exactly one at any given time. If words are added through the ‘add’ endpoint the wrapper exposes (which writes to the appropriate spelling.txt), the wrapper temporarily points to another instance while the old instance is restarted with the updated vocabulary.

This requires no code modifications in LanguageTool and seems to work well.
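
The wrapper itself was not posted, so the following is only a rough sketch of the idea under stated assumptions: two instances, a word list at a path of your choosing, and the made-up names add_word, launch_lt, stop_lt and Switcher. The java command line follows the usual way of starting the LT HTTP server; the jar location is an assumption about the local setup. The HTTP front end that forwards requests to the active port is left out.

```python
import subprocess
import threading
from pathlib import Path


def add_word(wordlist: Path, word: str) -> bool:
    """Append `word` to the word list unless it is already there. Returns True if added."""
    word = word.strip()
    text = wordlist.read_text(encoding="utf-8") if wordlist.exists() else ""
    if not word or word in text.splitlines():
        return False
    with wordlist.open("a", encoding="utf-8") as f:
        # Make sure the new word starts on its own line.
        f.write(("" if not text or text.endswith("\n") else "\n") + word + "\n")
    return True


def launch_lt(port):
    """Start one LT HTTP server (assumes the jar is in the working directory)."""
    return subprocess.Popen(["java", "-cp", "languagetool-server.jar",
                             "org.languagetool.server.HTTPServer", "--port", str(port)])


def stop_lt(proc):
    """Stop a server started by launch_lt and wait for it to exit."""
    proc.terminate()
    proc.wait()


class Switcher:
    """Keeps two server instances alive and exposes the port of exactly one."""

    def __init__(self, ports, start=launch_lt, stop=stop_lt):
        self.start, self.stop = start, stop
        self.handles = {port: start(port) for port in ports}
        self.active = ports[0]
        self.lock = threading.Lock()

    def _restart(self, port):
        self.stop(self.handles[port])
        # A real wrapper should also wait here until the new server answers requests.
        self.handles[port] = self.start(port)

    def reload(self):
        """Restart both instances one after the other so one is always reachable."""
        with self.lock:  # only one reload at a time
            old = self.active
            standby = next(p for p in self.handles if p != old)
            self.active = standby   # traffic moves to the standby (old vocabulary)
            self._restart(old)      # the old instance reads the word list again on startup
            self.active = old       # back to the freshly restarted instance
            self._restart(standby)  # bring the standby up to date as well
```

An ‘add’ handler would then call add_word(...) and, if it returned True, reload(). Restarting the standby afterwards is an assumption of this sketch, made so that the next switch-over does not fall back to a stale vocabulary.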