Not identified words


(Marco A.G.Pinto) #1

Hello!

Tested with the Portuguese speller only so far:

Using LibreOffice 6.2 beta 1, the words which are in the speller but which don’t have morphologic information appear underlined in blue.

When we right-click on them it says they are unidentified.

Could the underline have a different colour, such as pink?

This would help with the task of adding all the unidentified words to added.txt.

Thank you!

Kind regards,


(Yakov) #2

Yes, it’s a new feature of LibreOffice 6.2.


(Marco A.G.Pinto) #3

Yakov, what I meant was “could someone implement it into LanguageTool so that I can test full documents looking for words without morphological data?”

I tried to find it in the settings but failed.

There are colours there for a lot of analyses, but I didn’t find one for this.

:frowning:

Thanks!


(Yakov) #4

Marco,
You can use a rule like this for that:

        <rulegroup default="on" id="Unknown_words1" name="Unknown_words1">
            <rule>
                <!-- Skip all-caps tokens such as acronyms -->
                <antipattern>
                    <marker>
                        <token case_sensitive="yes" regexp="yes">[A-Z][A-Z]+</token>
                    </marker>
                </antipattern>
                <!-- Skip untagged words directly followed by a period -->
                <antipattern>
                    <token postag="UNKNOWN" regexp="yes">[a-z]+</token>
                    <token>.</token>
                </antipattern>
                <!-- Flag a word without a postag when any other token follows it -->
                <pattern>
                    <marker>
                        <token postag="UNKNOWN" regexp="yes">[a-z]+</token>
                    </marker>
                    <token></token>
                </pattern>
                <message>Unknown word, missing postag</message>
                <short>Unknown word</short>
                <example correction="">This is <marker>teeeeeest</marker>.</example>
            </rule>
        </rulegroup>

You only need to specify in the regexp which characters your words use.


(Marco A.G.Pinto) #5

@Yakov

It appears in blue, and only a right-click shows that it is unidentified.

The whole idea was also to get a setting in LT to change the colour of the underline for this case.

For example, if I could change it to pink, I would know just by looking at the document that I needed to add postags (valid words without morphological information).


(Ruud Baars) #6

It is possible to take the speller and postag data and list the words that are in the speller but not in the postag database. Word frequencies from a corpus could be added. Wouldn’t that be easier?


(Yakov) #7

You can put this rule in a separate category.
Then, in the user interface, you can set the pink colour for that category
(Tools-LanguageTool-Options-Underline Color of Category, then set the colour for this category).
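
As a rough sketch of what a separate category could look like in the rule file: the category `id` and `name`, the rule `id`, and the shortened pattern below are placeholders made up for illustration, not names that exist in LanguageTool. The example sentence is in English, as in post #4, and would need replacing with one that suits the target language. The `default="off"` attribute is optional and only keeps the category disabled until it is switched on in the Options.

```xml
<!-- Hypothetical category: id and name are placeholders, pick your own. -->
<category id="UNKNOWN_WORDS_DEV" name="Unknown words (development)" default="off">
    <!-- Simplified version of the rule from post #4, without the antipatterns. -->
    <rule id="UNKNOWN_WORD_NO_POSTAG" name="Unknown word, missing postag">
        <pattern>
            <token postag="UNKNOWN" regexp="yes">[a-z]+</token>
        </pattern>
        <message>Unknown word, missing postag</message>
        <short>Unknown word</short>
        <example correction="">This <marker>teeeeeest</marker> is unknown.</example>
    </rule>
</category>
```

The category name given here should then be the one that shows up in the underline colour settings, so the pink colour applies only to these matches.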


(Tiago F. Santos) #8

@marcoagpinto
Portuguese has a category named ‘Desenvolvimento’, which I added some time ago and which is responsible for those detections. You can find it in the Grammar tab inside the Options, which I believe you did, since that category is disabled by default.
If you wish to develop the tagger, you can use the standalone tool and change the colour of that category to whatever you wish, also in the Options, in a tab conveniently named ‘Underline Color/Cor do Sublinhado’.


(Marco A.G.Pinto) #9

@tiagosantos

It is working!!!

On Monday I will try to dedicate as much time as possible to add postags!

Thank you!

:slight_smile:


(Marco A.G.Pinto) #10

Could you e-mail it to me in .txt format so that I can work on it in the next three months?

Thanks!


(Ruud Baars) #11

I don’t have it lying around, but I will start making this.


(Ruud Baars) #12

By the way, this is about pt_PT, right? Doing it for all those variants is quite a bit of work…


(Marco A.G.Pinto) #13

Yes, pt_PT :slight_smile:


(Ruud Baars) #14

I am almost done. To my surprise, the pt Hunspell dictionary is very tolerant. First of all, it allows - to break a word, so every word consisting of valid parts separated by - is accepted.
So you will find words like --a and a-- accepted as valid.
I used the words found in my collection of PT texts for the frequencies, and I unmunched the Hunspell dictionary to get the maximum number of accepted words.
I dumped the postag dictionary and used that to check whether each word had a postag.

If you send me an email at info at taaltik.nl I will return the results zipped; it is about 200 MB of words and frequency numbers.


(Ruud Baars) #15

You can download it here: taaltik.xs4all.nl/POR/pt_accepted_no_tag.zip
Please let me know once you have downloaded it; I will remove it then.


(Marco A.G.Pinto) #16

Hello!

I have just downloaded it.

Thanks!


(Ruud Baars) #17

By the way… don’t be surprised if there are some incorrect words in it. I used hunspell -G to list the correct words; it has a bug that also lists parts of hyphenated words, where the word is correct as a whole but the part on its own is not.
If you want those gone, you can run hunspell -L -d pt_PT on the list to find them and remove them.


(Marco A.G.Pinto) #18

@tiagosantos

Hello!

The following symbols are reported as unidentified words in Portuguese, at least in LibreOffice; I also noticed the last one in MS Word:
∑ ≠ →

How do I add them?

Thanks!


(Tiago F. Santos) #19

@marcoagpinto

Have you already finished the postag task? I have only seen a dozen new postags added in a handful of commits.


(Marco A.G.Pinto) #20

I have been doing it very slowly, I know :frowning:

So many things going on at the same time.