
[gl] Getting started writing rules for Galician


(Manuel Souto Pico) #1

Good day everyone:

I've seen for some time that Galician is looking for a maintainer, so I have decided to start writing rules for Galician, which is my native tongue. I think I have grasped the basic concepts, and before I continue reading the documentation, here come my first two rules:

<rule id="FAI_FRIO_OU_CALOR" name="fai frío/calor » vai frío/calor"> <!-- it's cold/warm -->
    <pattern>
        <token regexp='yes'>Fai(che)?</token>
        <token regexp='yes'>(frío|calor)</token>
    </pattern>
    <message>O frío (ou a calor) non se fai que xa vén feito.</message>
    <example correction='Vai frío'><marker>Fai frío</marker></example>
    <example>Vai frío</example>
    <example correction='Vai calor'><marker>Fai calor</marker></example>
    <example>Vai calor</example>
</rule>

<rule id="A_EFECTOS_DE" name="a efectos de » para os efectos de"> <!-- for the purpose of -->
    <pattern>
        <token regexp='yes'>(A(os)?|Ós)</token>
        <token>efectos</token>
        <token regexp='yes'>d(e|[oa]s?)</token>
    </pattern>
    <message>A expresión correcta é "para os efectos de"</message>
    <url>https://gl.wikipedia.org/wiki/Wikipedia:Erros_de_ortograf%C3%ADa_e_desviaci%C3%B3ns</url>
    <short>a efectos de » para os efectos de</short>
    <example correction="para os efectos de"><marker>Aos efectos da</marker> telefonía é coma un único país con código internacional.</example>
    <example>Para os efectos de</example>
    <example correction="para os efectos de"><marker>A efectos da</marker> telefonía é coma un único país con código internacional.</example>
    <example correction="para os efectos de"><marker>Ós efectos da</marker> telefonía é coma un único país con código internacional.</example>
</rule>

I am sending them both for the admin to update the official grammar.xml file (under <category id="CAT8" name="Fraseoloxía"><rulegroup id="LOCUCIÓNS" name="locucións e frases feitas">), and also to get feedback and to have a basis to ask a couple of basic questions.

Q1:

In the first rule I would have liked to create an optional token to account for the expression with article and without article:

Vai frío
Vai un frío
Vai calor
Vai unha calor

I have tried both

<token regexp='yes'>Fai(che)?</token>
<token regexp='yes'>(un(ha)? )?</token>
<token regexp='yes'>(frío|calor)</token>

and

<token regexp='yes'>Fai(che)?</token>
<token regexp='yes'>(un(ha)? )?(frío|calor)</token>

but neither of them matches the pattern with articles. Is there a way to match an optional token other than creating another rule with one more token?

Q2:

In the second rule, I wrote <token regexp='yes'>d(e|[oa]s?)</token> to match both "de" (preposition) and "do/da/dos/das" (preposition + article in its four inflections combining number and gender). Is there a way to indicate a lemma (e.g. "de") rather than a form? That way, the preposition "de" would be matched both when it's contracted and when it's not.

If the answer to these questions is in the rest of the documentation that I still haven't read, I'll be grateful if you could simply point me to the relevant section.

Thanks a lot for your help.
Manuel


(Tiago F. Santos) #2

Thank you for your interest. I had that announcement up for some time, but it was removed after years of neglect. I ended up taking over maintenance for this release, after significant work had been done on that module. You can check the work that has already been done on:

Having more hands on deck is always better. As a non-native speaker, I will require URLs for confirmation. Please add one to every rule you create.

For these rules, https://dubidasdogalego.wordpress.com/2017/03/23/hai-fai-vai/
will work, but I bet you can find better sources.
I will add them to the build when you update them with a URL.

Q1: an optional token can be matched with:

<token min='0' regexp='yes'>un(ha)?</token>

Q2: matching by lemma is possible using postags, but the postags of contractions are not the easiest to work with. See:

http://wiki.languagetool.org/development-overview#toc0


(Manuel Souto Pico) #3

Hi Tiago,

Here comes my edited first rule including a URL to the relevant entry in the official prescriptive dictionary:

    <rule id="FAI_FRIO_OU_CALOR" name="fai frío/calor > vai frío/calor">
        <pattern>
            <token regexp='yes'>Fai(che)?</token>
            <token min='0' regexp='yes'>un(ha)?</token>
            <token regexp='yes'>(frío|calor)</token>
        </pattern>
        <message>O frío (ou a calor) non se fai que xa vén feito.</message>
        <url>http://academia.gal/dicionario/-/termo/busca/ir</url>
        <example correction='Vai frío'><marker>Fai frío</marker></example>
        <example>Vai frío</example>
        <example correction='Vai calor'><marker>Fai calor</marker></example>
        <example>Vai calor</example>
    </rule>

Please note that in many cases there might not be a URL that I can provide, and I might have to rely on a paper reference such as a grammar, which might not be available online. In that case, is there an element that I can use? <ref>?

The second rule can stay as it is. I understand I could use <token postag="S.*" postag_regexp="yes" /> to match a preposition but it's not a particular POS that I would like to match, but a particular lemma. It's okay to leave the regex instead.

Is there a more efficient way to add rules to the official grammar.xml in my language? Perhaps sending the file to you by email or uploading it somewhere?

Cheers, Manuel


(Tiago F. Santos) #4

Hi Manuel,

That was very fast. Thank you!
academia.gal is indeed a great choice, and it has a reference for this example.

One thing that I have been doing in Portuguese and Galician is to always tag rules with a comment.
For example:

+  <rulegroup id="COLLOCATION_ERRORS_BOKOMARU" name="Common colocation errors">
+    <!-- Created by Nicholas Walker (Bokumaru), 2017-11-14 -->
+    <!-- https://forum.languagetool.org/t/en-english-collocation-rules-to-contribute-to-lt/2318 -->

Academic paper references can be placed in the comments, instead of a URL, in addition to the credits.
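
A minimal sketch of what such a comment block might look like on one of the rules in this thread. The date and the bibliographic fields are placeholders, not real data; the comment layout simply mirrors the example above and is not a required format.

```xml
<rule id="FAI_FRIO_OU_CALOR" name="fai frío/calor » vai frío/calor">
    <!-- Created by Manuel Souto Pico, YYYY-MM-DD (placeholder date) -->
    <!-- Reference: AUTHOR, TITLE, PUBLISHER, YEAR, p. NN (placeholder for a printed source) -->
    <!-- pattern, message and examples go here, unchanged -->
</rule>
```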

There is also the inflected='yes' attribute, which you can use to match lemmas, but look at how these forms are tagged:

de[de/NCMS000, de/SPS00]
da[de/SPS00:DA]
do[de/SPS00:DA]
dos[de/SPS00:DA]

The postags are incomplete. The DA (determiner) part has neither gender nor number information. If you do not require it for that rule, you may wish to try it.
The alternative is to rebuild the POS dictionary. I haven't done it yet, but it is on my TODO list. If the LT Galician dictionary is close enough to vanilla Freeling, this might have been already added.
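
As a hedged sketch of the inflected='yes' option applied to A_EFECTOS_DE: the third token below matches on the lemma "de" instead of a regexp over surface forms. It assumes "das" is tagged with the lemma "de" like the forms listed above (it is not in that list), and it would also match the noun reading de/NCMS000, so it is untested and may need a postag restriction.

```xml
<pattern>
    <token regexp='yes'>(A(os)?|Ós)</token>
    <token>efectos</token>
    <!-- Matches any token whose lemma is "de": de, da, do, dos (per the tagger output above) -->
    <token inflected='yes'>de</token>
</pattern>
```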

There is a more efficient way to add rules: set up a git account, fork LanguageTool and commit to your fork.
Then you can submit pull requests to the main project. This allows better version control and, after review, your commits will be integrated in the main branch.

If you can do it this way, I will wait for that pull request. If setting up git is too much trouble, you can post the rules here on the forum.

Welcome to the project. I look forward to seeing more commits.

Cheers,

Tiago

P.S. - Please change the topic to [gl] Galician discussion thread, or something similar. Over time, it becomes hard to track things over several different threads, and relevant input gets lost.


(Tiago F. Santos) #5

Hi Manuel,

A few moments ago I pushed a new rule that covers the majority of the cases reported on:

Any rule that is a simple replacement can be added there. After tonight's regression tests, I will trim it down, to remove false positives.
If you are building rules from there, I would suggest focusing on the lines that are commented out (they start with a #). Those rules require context logic to be applied, so XML rules will have to be created to cover those cases.
For the remaining cases, it is easier and faster to just enumerate options, as long as the list of combinations is not too big, e.g. :

a efectos da=para os efectos da
a efectos do=para os efectos do
a efectos das=para os efectos das
a efectos dos=para os efectos dos
ós efectos da=para os efectos da
ós efectos do=para os efectos do
ós efectos das=para os efectos das
ós efectos dos=para os efectos dos
aos efectos da=para os efectos da
aos efectos do=para os efectos do
aos efectos das=para os efectos das
aos efectos dos=para os efectos dos


(Manuel Souto Pico) #6

I will add comments to my rules as you suggest to include credits and references.

I think I need to learn more to understand what you are saying about postags and the dictionary. I will get back to this when I finish reading the documentation.

I have a git account, but I'm not too familiar with forks, branches and pull requests. I'll shout if I get stuck.

Topic changed.

Cheers, Manuel


(Manuel Souto Pico) #7

Sorry, I need clarification about two things.

  • You suggest enumerating options. Do you mean in the XML rule? Or where?
  • You mention focusing on rules that are commented out with #, but I don't see them. Where are they?

Thanks! Manuel


(Tiago F. Santos) #8

In the XML, regexps are better, just as you did.
But the rule I pushed yesterday allows you to enumerate options while giving the appropriate suggestion for each case. That lowers the barrier for creating new rules. You can use the other text files for more specific messages. At the moment, you can add barbarisms, redundancies, wordiness or Spanish terms this way.

FAI_FRIO_OU_CALOR can still be added, since it is more specific. Learn how to use inflections and generalize the 1st token with 'facer'.
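
A rough sketch of what generalizing the first token could look like. It assumes the Galician tagger assigns the lemma "facer" to the forms you want to catch; whether a form with a clitic such as "Faiche" is tagged that way is untested here.

```xml
<pattern>
    <!-- inflected='yes' matches on the lemma, so any tagged form of "facer" qualifies -->
    <token inflected='yes'>facer</token>
    <token min='0' regexp='yes'>un(ha)?</token>
    <token regexp='yes'>(frío|calor)</token>
</pattern>
```

With the verb generalized like this, a fixed "Vai ..." correction no longer fits every match, so the suggestion has to follow the matched verb form as well.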

<message>O frío (ou a calor) non se fai que xa vén feito.</message>

This is a barbarism typical of Portuguese people speaking Galician, and the sarcasm in the message may rub the user the wrong way.
It also needs a suggestion. Read the section about <suggestion> to know how to do it.

<example correction='Vai frío'><marker>Fai frío</marker></example>
<example correction='Vai calor'><marker>Fai calor</marker></example>

Also, this only compiles if the proper suggestion is given. In this case you need dynamic suggestions.
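
A minimal sketch of a dynamic suggestion, under stated assumptions: \2 is a backreference to the second matched token, so the suggestion reuses whichever of "frío" or "calor" was found. To keep the token numbering simple, this version handles only the plain form "fai" and leaves out the optional article. The message wording is only a neutral placeholder, and the examples have not been run against the test suite.

```xml
<rule id="FAI_FRIO_OU_CALOR" name="fai frío/calor » vai frío/calor">
    <pattern>
        <token>fai</token>
        <token regexp='yes'>frío|calor</token>
    </pattern>
    <!-- Placeholder wording; \2 copies the second matched token into the suggestion -->
    <message>Quería dicir <suggestion>vai \2</suggestion>?</message>
    <url>http://academia.gal/dicionario/-/termo/busca/ir</url>
    <!-- The correction attribute has to equal the generated suggestion -->
    <example correction='vai frío'>Hoxe <marker>fai frío</marker>.</example>
    <example>Hoxe vai frío.</example>
</rule>
```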

A_EFECTOS_DE is now fully covered by GL_WIKIPEDIA_COMMON_ERRORS.
Using <short> is a good practice, but it should be a summary of the message. This is achieved using suggestions.