Encoding Font-Information or a specific Token-Type for Rule

Stefan_Falk · September 23, 2015, 1:37pm

Okay that title is kind of vague but what I want is the following:

I know that V_{BAT}^{2} (LaTeX-style) is a declared symbol in a table. I store those symbols in a Map called declaredSymbols.

Now, in my rule I would like to check if someone uses a symbol in a sentence e.g.

“Apply power V_{BAT}^{2} to pin VBAT2”

if that symbol does exist in declaredSymbols.

My problem at the moment is that I am not sure how I could use LanguageTool in order to accomplish that. I have two ideas:

Idea 1)

I can create the string “Apply power V BAT 2 to pin VBAT2” and apply POS-Tags to the tokens that describe their relation

“V” - DECORATED_TOKEN, DECORATIONS {[from=a, to=b], [from=c, to=d] }
“BAT” - DECORATION_TOKEN, SUBSCRIPT, [of_token [from=x, to=y]]
“2” - DECORATION_TOKEN, SUPERSCRIPT, [of_token [from=x, to=y]]

but that can get messy, e.g. if decorations can also appear on the left side, as in ²V³. Besides that, it is a misuse of POS tags.

Idea 2)

I can produce LaTeX-like strings that have escape characters e.g.
“Apply power $$$V_{BAT}^{2}$$$ to pin VBAT2”

and let Symbol produce such a string e.g.

if (analyzedToken.getToken().equals(declaredSymbols.get(key).toLaTeXSyntax())) { /* … */ }

but I am not sure if that is a good idea either. I would have to modify the document text, and I am not sure whether LanguageTool will give me an AnalyzedToken that contains exactly “$$$V_{BAT}^{2}$$$”.

So… is there a way to do this with LanguageTool, or will I have to go down a completely different road?

Thank you for any help!

Best regards, Stefan.

dnaber · September 23, 2015, 1:59pm

The AnalyzedSentence object you get in the match() method also has a getText() method to get the original text. I wonder if you could just get the sentence text and search for a regular expression. That seems easier than dealing with tokens in this case.
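
A minimal sketch of that regular-expression approach, written as plain Java so it runs without LanguageTool on the classpath. In a rule, the sentenceText argument would be the string returned by getText() inside match(). The class name SymbolScanner, the method findUndeclared and the pattern itself are assumptions made for illustration: the pattern only recognizes a base name followed by one or more _{...} or ^{...} decorations, and declaredSymbols is simplified to a Set of LaTeX-style strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolScanner {

    // Hypothetical pattern: letters followed by at least one _{...} or ^{...} group.
    private static final Pattern SYMBOL =
        Pattern.compile("[A-Za-z]+(?:[_^]\\{[^{}]*\\})+");

    /** Returns every LaTeX-style symbol in the text that is not in declaredSymbols. */
    static List<String> findUndeclared(String sentenceText, Set<String> declaredSymbols) {
        List<String> undeclared = new ArrayList<>();
        Matcher m = SYMBOL.matcher(sentenceText);
        while (m.find()) {
            if (!declaredSymbols.contains(m.group())) {
                undeclared.add(m.group());
            }
        }
        return undeclared;
    }

    public static void main(String[] args) {
        Set<String> declared = Set.of("V_{BAT}^{2}");
        // Declared symbol: nothing is reported, prints []
        System.out.println(findUndeclared("Apply power V_{BAT}^{2} to pin VBAT2", declared));
        // Undeclared symbol: prints [I_{OUT}]
        System.out.println(findUndeclared("Apply power I_{OUT} to pin VBAT2", declared));
    }
}
```

Plain names such as VBAT2 are not matched because they carry no decoration. Left-side decorations like ²V³ would need a different pattern.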

Stefan_Falk · September 25, 2015, 10:33am

Yes, that would be Idea 2). To make sure I do not confuse symbols with other names, I’d have to use some escape characters. That would mean that I create my sentences like this:

“Apply power $$$V_{BAT}$$$ to pin $$$VBAT$$$ of the RISC controller.”

instead of

“Apply power V BAT to pin VBAT of the RISC controller.”

I can do that because in my previous step every single Token is represented as a complex object that contains its formatting information e.g.

V [id=#1; decorated_by=[#2]]
BAT [id=#2; type=subscript]
VBAT [id=#3]
RISC [id=#4]

which would allow me to do that and where “decorated_by” would mean that the produced string should be escaped with $$$…$$$ (or something else).
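
A rough sketch of how such token objects could be turned into the escaped sentence. The Token class, its field names and the EscapedSentenceBuilder class are hypothetical stand-ins for the complex objects described above, not an existing API. The sketch assumes that only tokens with a non-empty decorated_by list are escaped, that a decoration's type is either "subscript" or "superscript", and that tokens are joined with single spaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EscapedSentenceBuilder {

    // Hypothetical token model; type is null for tokens that are not decorations.
    static final class Token {
        final int id;
        final String text;
        final String type;
        final List<Integer> decoratedBy;

        Token(int id, String text, String type, List<Integer> decoratedBy) {
            this.id = id;
            this.text = text;
            this.type = type;
            this.decoratedBy = decoratedBy;
        }
    }

    static final String ESC = "$$$";

    static String build(List<Token> tokens) {
        // Index tokens by id so decorations can be looked up from their base token.
        Map<Integer, Token> byId = new HashMap<>();
        for (Token t : tokens) {
            byId.put(t.id, t);
        }
        List<String> parts = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type != null) {
                continue; // decorations are emitted together with their base token
            }
            if (t.decoratedBy.isEmpty()) {
                parts.add(t.text); // plain token, left unescaped
                continue;
            }
            StringBuilder sb = new StringBuilder(ESC).append(t.text);
            for (int decorationId : t.decoratedBy) {
                Token d = byId.get(decorationId);
                if (d == null) {
                    continue; // unknown id: skip rather than fail
                }
                sb.append("superscript".equals(d.type) ? "^{" : "_{")
                  .append(d.text)
                  .append('}');
            }
            parts.add(sb.append(ESC).toString());
        }
        return String.join(" ", parts);
    }
}
```

With V [id=#1; decorated_by=[#2]] and BAT [id=#2; type=subscript], the token V is emitted as $$$V_{BAT}$$$, while an undecorated token such as VBAT is passed through unchanged under this assumption.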