Detecting words with only one (any) postag

(Ruud Baars) #1

To try to improve the disambiguator, I am looking for a way to find tokens that have only one postag, no matter which one. Is there a good trick for that?

(Daniel Naber) #2

You mean inside disambiguation.xml, or do you need a list of those words in a text file?

(Ruud Baars) #3

I need to detect them inside a rule.

(Daniel Naber) #4

Please try

(Ruud Baars) #5

That is an option when the postag can be specified... In my case the tag does not matter, as long as it is the only one.

If it cannot be done, I will try outside of LT.

(Jan Schreiber) #6

I could not test it, but I would expect the following to work, assuming all tags consist of exactly three uppercase letters without special characters:
<token postag="[A-Z]{3}" postag_regexp="yes"/>
EDIT: .{1,3} might work better.

(Lodewijk Arie van Brienen) #7

I've been dealing with something similar in a (potential) video game, where I don't bother with inter-filter code when only one filter is in use.

My current solution is a variable that tracks the number of filters in use; the code switches to a secondary routine when one or zero filters are active.

In this case you could call it postagCount.

(Ruud Baars) #8

I am one of the few maintainers without Java skills. I just edit the rule files.
The bit of programming I do is not in the LT ecosystem, but outside in php.

(Lodewijk Arie van Brienen) #9

I was hoping my idea could add an extra handle you could use.
While it would increase the amount of data stored per token,
it would allow you to use <token postagcount="1"> in a rule.

Another option would be to make the literal postag1 a magic 'number' pointing to the first postag of a token.
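
Until something like postagcount exists, the counting idea can be tried outside LT. The following is only a minimal sketch in Python: the input format (a list of word/tag-list pairs), the function name tokens_with_single_tag and the sample tags are all made up for illustration and are not LanguageTool output.

```python
from typing import Iterable, List, Tuple


def tokens_with_single_tag(tagged: Iterable[Tuple[str, List[str]]]) -> List[str]:
    """Return the words that carry exactly one distinct postag.

    `tagged` is a hypothetical structure: (word, [tag, tag, ...]) pairs,
    e.g. produced by whatever script exports the tagger dictionary.
    """
    # set() collapses duplicate readings that share the same tag
    return [word for word, tags in tagged if len(set(tags)) == 1]


# Made-up sample data, not real tagger output
sample = [
    ("walk", ["NN", "VB"]),  # two tags: ambiguous
    ("the", ["DT"]),         # one tag: unambiguous
    ("cat", ["NN", "NN"]),   # duplicate readings, still one distinct tag
]
print(tokens_with_single_tag(sample))
```

The tag itself is never inspected, only the number of distinct tags, which is what the question asks for.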

(Ruud Baars) #10

Is postag1 really a special keyword? I never expected that, even though there is text about it in the wiki. The other special tags are specifically uppercased...

I will give it a try!

(Lodewijk Arie van Brienen) #11

Not that I know of; I meant it as a future possibility.

(Jan Schreiber) #12

@Ruud_Baars, have you tried my suggestion above? That would probably be the easiest way.

(Ruud Baars) #13

The tags are not of a fixed size. And I don't understand the trick: I don't get why it should catch any tag that is the only one for that word. But I could try tomorrow.

(Jan Schreiber) #14

Okay, that's a problem. But you still can use the fact (if it is a fact) that one-tag words do not have a tag separator. Assuming that all tags consist of alphanumeric chars only and that the tag separator is not alphanumeric (such as a colon), the following regex should work:
<token postag="^\w+$" postag_regexp="yes"/>
Basically you can replace \w with any regular expression that matches only characters that can occur in a tag.
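
The regex logic itself can be checked outside LT with a small Python sketch. It relies on the same assumption as above: that all tags of a token reach the regex as one string joined by a non-alphanumeric separator (a colon here). Whether LanguageTool actually matches the regex against such a joined string, rather than against each reading separately, is not confirmed in this thread, so this only illustrates the pattern. The function name has_single_tag and the sample tags are made up.

```python
import re

# Anchored pattern: the whole string must consist of word characters only,
# so any separator (assumed here to be ':') makes the match fail.
SINGLE_TAG = re.compile(r"^\w+$")


def has_single_tag(tag_string: str) -> bool:
    """True if the (hypothetical) joined tag string contains no separator."""
    return SINGLE_TAG.match(tag_string) is not None


print(has_single_tag("NN"))     # one tag, no separator
print(has_single_tag("NN:VB"))  # two tags joined by ':'
```

Note that \w also accepts digits and underscores; if tags can contain other characters (for example a hyphen), the character class has to be widened accordingly.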