Detecting words with only one (any) postag

Ruud_Baars · September 4, 2017, 10:35am

To try to improve the disambiguator, I am looking for a way to find tokens that have only one postag, no matter which one. Is there a good trick for that?

dnaber · September 4, 2017, 6:21pm

You mean inside disambiguation.xml, or do you need a list of those words in a text file?

Ruud_Baars · September 4, 2017, 7:00pm

I need to detect them inside a rule.

dnaber · September 4, 2017, 7:35pm

Please try Tips and Tricks - LanguageTool Wiki

Ruud_Baars · September 5, 2017, 4:50am

That is an option when the postag can be specified… In my case the tag does not matter, as long as it is the only one.

If it cannot be done, I will try outside of LT.
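
A minimal sketch of the outside-LT route, assuming the dictionary can be dumped as tab-separated lines of word form, lemma and postag. That file layout, the function name `words_with_one_tag` and the sample tags are assumptions for illustration, not something confirmed in this thread:

```python
from collections import defaultdict


def words_with_one_tag(lines):
    """Return the word forms that occur with exactly one distinct postag.

    Assumes each line looks like "wordform<TAB>lemma<TAB>postag".
    """
    tags_by_word = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, _lemma, tag = parts
        tags_by_word[word].add(tag)
    return sorted(word for word, tags in tags_by_word.items() if len(tags) == 1)


# Hypothetical sample data: "loop" has two tags, "huis" has one.
sample = [
    "huis\thuis\tZNW",
    "loop\tloop\tZNW",
    "loop\tlopen\tWKW",
]
single_tag_words = words_with_one_tag(sample)
```

The resulting list could then be pasted into a rule as a word list, at the cost of regenerating it whenever the dictionary changes.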

Jan_Schreiber · September 5, 2017, 1:37pm

I could not test it, but I would expect the following to work, assuming all tags consist of exactly three uppercase letters without special characters:
<token postag="[A-Z]{3}" postag_regexp="yes"/>
EDIT: .{1,3} might work better.

SkyCharger001 · September 5, 2017, 4:18pm

I’ve been dealing with something similar in a (potential) video game, where I don’t bother with inter-filter code when only one filter is in use.

My current solution is a variable that tracks the number of filters in use; the code switches to a secondary routine when one or zero filters are in use.

In this case you could call it postagCount.

Ruud_Baars · September 5, 2017, 4:58pm

I am one of the few maintainers without Java skills. I just edit the rule files.
The bit of programming I do is not in the LT ecosystem, but outside it, in PHP.

SkyCharger001 · September 5, 2017, 5:46pm

I was hoping my idea could add an extra handle you could use.
While it would increase the amount of data that needs to be stored per token,
it would allow you to use <token postagcount="1"> in a rule.

Another option would be to make literal usage of postag1 a magic ‘number’ pointing to the first postag of a token.

Ruud_Baars · September 5, 2017, 5:56pm

Is postag1 really a special keyword? Never expected that, even though there is text about it in the wiki. The other special tags are specifically uppercased…

I will give it a try!

SkyCharger001 · September 5, 2017, 6:03pm

Not that I know of; I meant it as a future possibility.

Jan_Schreiber · September 5, 2017, 6:11pm

@Ruud_Baars, have you tried my suggestion above? That would probably be the easiest way.

Ruud_Baars · September 5, 2017, 7:27pm

The tags are not of a fixed size. And I don’t understand the trick. I don’t get why it should catch any tag that is the only one for that word. But I could try tomorrow.

Jan_Schreiber · September 5, 2017, 9:09pm

Okay, that’s a problem. But you can still use the fact (if it is a fact) that one-tag words do not have a tag separator. Assuming that all tags consist of alphanumeric characters only and that the tag separator is not alphanumeric (such as a colon), the following regex should work:
<token postag="^\w+$" postag_regexp="yes"/>
Basically you can replace \w with any regular expression that matches only characters that can occur in a tag.
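
To make the idea concrete, here is a small sketch of how the two regexes behave on plain strings. It only illustrates the patterns themselves: the colon separator, the sample tags and the helper name `has_single_tag` are assumptions, and whether LT actually matches the pattern against all of a token's tags joined into one string is not confirmed in this thread.

```python
import re

# Pattern from the post above: one run of word characters, no separator.
SINGLE_TAG = re.compile(r"^\w+$")
# Earlier suggestion: exactly three uppercase letters.
FIXED_WIDTH = re.compile(r"[A-Z]{3}")


def has_single_tag(tag_string: str) -> bool:
    """True if the string contains no separator, i.e. looks like one tag.

    fullmatch is used so that the whole string must fit the pattern.
    """
    return SINGLE_TAG.fullmatch(tag_string) is not None


# Hypothetical tag strings, with ":" as the assumed separator.
one_tag = has_single_tag("ZNW")        # a single tag
two_tags = has_single_tag("ZNW:WKW")   # two tags joined by ":"
long_tag_fixed = FIXED_WIDTH.fullmatch("ZNWX") is not None  # four letters
```

Under these assumptions the `\w+` pattern accepts a tag of any length but rejects anything containing the separator, while the fixed-width pattern rejects tags that are not exactly three letters, which is why it does not fit tags of varying size.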