Not identified words

marcoagpinto · December 14, 2018, 9:09am

Hello!

Tested with the Portuguese speller only so far:

Using LibreOffice 6.2 beta 1, the words which are in the speller but which don’t have morphologic information appear underlined in blue.

When we right-click on them it says they are unidentified.

Could the underline have a different colour, such as pink?

This would help with the task of adding all the unidentified words to added.txt.

Thank you!

Kind regards,

Yakov · December 14, 2018, 12:35pm

Yes, it’s a new feature of LibreOffice 6.2.

marcoagpinto · December 14, 2018, 12:45pm

Yakov, what I meant was “could someone implement it into LanguageTool so that I can test full documents looking for words without morphological data?”

I tried to find it in the settings but failed.

There are colours there for many kinds of analysis, but I didn’t find one for this case.

Thanks!

Yakov · December 14, 2018, 1:51pm

Marco,
You can use a rule like this for that:

        <rulegroup default="on" id="Unknown_words1" name="Unknown_words1">
            <rule>
                <antipattern>
                    <marker>
                        <token case_sensitive="yes" regexp="yes">[A-Z][A-Z]+</token>
                    </marker>
                </antipattern>
                <antipattern>
                    <token postag="UNKNOWN" regexp="yes">[a-z]+</token>
                    <token>.</token>
                </antipattern>
                <pattern>
                    <marker>
                        <token postag="UNKNOWN" regexp="yes">[a-z]+</token>
                    </marker>
                    <token></token>
                </pattern>
                <message>Unknown word, missing postag</message>
                <short>Unknown word</short>
                <example correction="">This is <marker>teeeeeest</marker>.</example>
            </rule>
        </rulegroup>

You only need to adjust the regexp to the characters your language uses.

marcoagpinto · December 14, 2018, 2:18pm

@Yakov

It appears in blue, and only a right-click shows that the word is unidentified.

The whole idea was to also get a setting in LT to change the colour of the underline for this case.

For example, if I could change it to pink, I would know just by looking at the document which words still need postags (valid words without morphological information).

Ruud_Baars · December 14, 2018, 4:31pm

It is possible to get the speller and postag data and list the words that are in the speller but not in the postag database. Word frequencies could be added from a corpus. Wouldn’t that be easier?
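
The comparison described here can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function name `untagged_by_frequency` and the toy word lists are invented for the example, and the real speller, postag, and corpus data would first have to be exported to plain word lists.

```python
from collections import Counter
from typing import Iterable


def untagged_by_frequency(
    speller_words: Iterable[str],
    tagged_words: Iterable[str],
    corpus_tokens: Iterable[str],
) -> list[tuple[str, int]]:
    """Return (word, corpus count) pairs for speller words that have no postag.

    The result is sorted by descending corpus count; ties are sorted alphabetically.
    """
    # Words the speller accepts but the postag dictionary does not know.
    untagged = set(speller_words) - set(tagged_words)
    # Count only corpus tokens that belong to the untagged set.
    counts = Counter(token for token in corpus_tokens if token in untagged)
    # Words that never occur in the corpus get a count of 0.
    return sorted(((w, counts[w]) for w in untagged), key=lambda p: (-p[1], p[0]))


# Toy data, invented for illustration only.
speller = ["casa", "gato", "teste", "livro"]
tagged = ["casa", "gato"]
corpus = "o teste do livro e o teste da casa".split()

for word, count in untagged_by_frequency(speller, tagged, corpus):
    print(f"{word}\t{count}")
```

Sorting by frequency puts the words that matter most for real texts at the top of the to-do list.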

Yakov · December 14, 2018, 4:43pm

You can put this rule in a separate category.
Then, in the user interface, you can set pink as the underline colour for that category
(Tools > LanguageTool > Options > Underline Color of Category, then set the colour for this category).

tiagosantos · December 14, 2018, 10:03pm

@marcoagpinto
Portuguese has a category named ‘Desenvolvimento’, which I added some time ago and which is responsible for those detections. You can find it in the Grammar tab inside the Options, which I believe you already did, since that category is disabled by default.
If you wish to develop the tagger, you can use the standalone tool and change the colour of that category to whatever you wish, also in the Options, inside a tab conveniently named ‘Underline Color/Cor do Sublinhado’.

marcoagpinto · December 14, 2018, 10:17pm

@tiagosantos

It is working!!!

On Monday I will try to dedicate as much time as possible to add postags!

Thank you!

marcoagpinto · December 20, 2018, 11:36pm

Could you e-mail it to me in .txt format so that I can work on it in the next three months?

Thanks!

Ruud_Baars · December 21, 2018, 7:30am

I don’t have it lying around, but I will start making this.

Ruud_Baars · December 21, 2018, 7:33am

By the way, this is about pt_PT, right? Doing it for all those variants is quite a bit of work…

marcoagpinto · December 21, 2018, 8:22am

Yes, pt_PT

Ruud_Baars · December 21, 2018, 8:48am

I am almost done. To my surprise, the pt Hunspell dictionary is very tolerant. For a start, it allows - to break a word, so every word consisting of valid parts separated by - is accepted.
So you will find words like --a and a-- accepted as valid.
I used words found in my collection of PT texts for the frequencies; I unmunched the Hunspell dictionary to get the maximum number of tolerated words.
I dumped the postag dictionary and used that to check whether the words had a postag.
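
To make the hyphen tolerance concrete, here is a toy model in Python. It is not Hunspell's actual algorithm, only a sketch of the behaviour described above; the function name and the tiny lexicon are invented, and the choice to ignore empty parts is an assumption inferred from the --a and a-- examples.

```python
def accepted_with_hyphen_breaks(word: str, lexicon: set[str]) -> bool:
    """Toy model: accept a word if every non-empty part between hyphens is in the lexicon."""
    # Assumption: empty parts (from leading, trailing, or doubled hyphens) are ignored.
    parts = [part for part in word.split("-") if part]
    # A string made only of hyphens has no parts and is rejected.
    return bool(parts) and all(part in lexicon for part in parts)


# Tiny lexicon, invented for illustration only.
lexicon = {"a", "guarda", "chuva"}

for candidate in ["guarda-chuva", "--a", "a--", "guarda-xyz"]:
    print(candidate, accepted_with_hyphen_breaks(candidate, lexicon))
```

Under this model every combination of valid parts is accepted, which is why the unmunched list grows so large.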

If you send me an email at info at taaltik.nl I will return the results zipped; it is about 200 MB of words and frequency numbers.

Ruud_Baars · December 21, 2018, 11:18am

You can download it here: taaltik.xs4all.nl/POR/pt_accepted_no_tag.zip
Please let me know when you have; I will remove it then.

marcoagpinto · December 21, 2018, 1:42pm

Hello!

I have just downloaded it.

Thanks!

Ruud_Baars · December 22, 2018, 12:06pm

By the way… don’t be surprised if there are some incorrect words in it. I used Hunspell -G to list the correct words; it has a bug that also lists the parts of hyphenated words that are correct as a whole, even when the part on its own is not.
If you want those gone, you can run Hunspell -L -d pt_PT on the list to filter them out.
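
Once the second Hunspell pass has produced the rejected entries, dropping them from the list is a simple filter. The sketch below assumes one word per line in both inputs and uses in-memory lists instead of real files; the function name `drop_rejected` and the sample words are hypothetical.

```python
from typing import Iterable


def drop_rejected(candidates: Iterable[str], rejected: Iterable[str]) -> list[str]:
    """Return the candidate words, in order, that do not appear among the rejected ones."""
    # Normalise the rejected lines (strip newlines and spaces, skip blank lines).
    rejected_set = {line.strip() for line in rejected if line.strip()}
    cleaned = (line.strip() for line in candidates)
    return [word for word in cleaned if word and word not in rejected_set]


# Sample data, invented for illustration only.
candidates = ["casa\n", "guarda-chuva\n", "xyz\n", "\n"]
rejected = ["xyz\n"]  # stands in for the output of the second Hunspell run

print(drop_rejected(candidates, rejected))
```

If the list also carries frequency numbers on each line, the word would have to be split off before the comparison.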

marcoagpinto · January 10, 2019, 4:33pm

@tiagosantos

Hello!

The following symbols are reported as unidentified words in Portuguese, at least in LibreOffice, and I noticed the last one in MS Word as well:
∑ ≠ →

How do I add them?

Thanks!

tiagosantos · January 10, 2019, 8:11pm

@marcoagpinto

Have you already finished the postag task? I have only seen a dozen new postags added in a handful of commits.

marcoagpinto · January 10, 2019, 8:22pm

I have been doing it very slowly, I know.

So many things going on at the same time.