Common foreign phrases

Ruud_Baars · April 5, 2018, 8:05am

Dutch is under very high pressure from English, unfortunately. Marking all the incorporated English words as spelling errors is irritating. However, not pointing out the often unnecessary foreign phrases is not good either.
These phrases are prone to cause problems in tagging, thus creating false positives.
Of course, word groups can be immunized in the disambiguator for spelling, rules can be created to warn about unnecessary foreign phrases, and words could be tagged as ‘foreign’ as well.

Did anyone else think of an easy and structural way to handle this?

Mike_Unwalla · July 25, 2018, 8:18am

Hi @Ruud_Baars, sorry for the late reply. Here’s a suggestion for a structured method (but I don’t know whether it is easy).

For each language, add a text file ‘foreign-phrases.txt’.

Put the foreign phrases (one-word or multi-word) into the file.

Add a rule (possibly in Java) ‘Do not ignore foreign phrases’:

If the rule is selected (the default), LT shows a spelling error (possibly a different colour to the standard colour for spelling errors) and gives a message ‘do not use this foreign phrase’.
If the user deselects the rule, LT ignores the spelling of foreign phrases and does not give a message.

Optional enhancement: use a tab-separated file. The message can then use the additional information:

foreign phrase, source language
foreign phrase, source language, possible non-foreign alternatives

A structured method to deal with foreign phrases would be useful for English.

Ruud_Baars · July 25, 2018, 9:36am

Something like that. But so far, we will have to do without a Java programmer. Until there is one, we will have to cope with XML solutions.

Mike_Unwalla · July 25, 2018, 11:07am

In disambiguation.xml, add each term. Apply a postag to each word of each term. (English has a postag FW. If there is no equivalent in the Dutch postags, you could use any convenient name.) If you have a list of terms, you could quickly use regular expressions to create the XML.

In disambiguation.xml, make a rule to ignore the spelling of words that have the foreign word postag.

In grammar.xml, make a rule that finds all the phrases that have the foreign word postag.

I did a test with English:

<rule id="FOREIGN_PERCENTUM" name="Foreign phrase: percentum">
  <pattern>
      <token>percentum</token>
  </pattern>
  <disambig action="replace"><wd pos="FOREIGN"/></disambig>
</rule>
<rule id="FOREIGN_DE_MINIMIS" name="Foreign phrase: de minimis">
  <pattern>
      <token>de</token>
      <token>minimis</token>
  </pattern>
  <disambig action="replace"><wd pos="FOREIGN"/><wd pos="FOREIGN"/></disambig>
</rule>
<rule id="FOREIGN_IPSO_FACTO" name="Foreign phrase: ipso facto">
  <pattern>
      <token>ipso</token>
      <token>facto</token>
  </pattern>
  <disambig action="replace"><wd pos="FOREIGN"/><wd pos="FOREIGN"/></disambig>
</rule>
<rule id="IGNORE_SPELLING_OF_FOREIGN_PHRASES" name="Ignore spelling of foreign phrases">
    <pattern>
        <token postag="FOREIGN"/>
    </pattern>
    <disambig action="ignore_spelling"/>
</rule>
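
If you have a list of terms, per-phrase rules in the same layout as the FOREIGN_... rules above could be generated with a small script instead of being typed by hand. This is only a sketch: the function name is hypothetical, the rule id is derived from the phrase with a regular expression, and splitting on whitespace is assumed to match how LT tokenizes the phrase (which may not hold for phrases with hyphens or apostrophes).

```python
import re
from xml.sax.saxutils import escape, quoteattr


def make_disambiguation_rule(phrase, postag="FOREIGN"):
    """Return one disambiguation rule that tags every word of `phrase`."""
    words = phrase.split()  # assumption: one LT token per whitespace-separated word
    rule_id = "FOREIGN_" + re.sub(r"\W+", "_", phrase.strip()).upper()
    lines = [
        f"<rule id={quoteattr(rule_id)} name={quoteattr('Foreign phrase: ' + phrase)}>",
        "  <pattern>",
    ]
    lines += [f"      <token>{escape(word)}</token>" for word in words]
    lines.append("  </pattern>")
    # One <wd> element per token in the pattern
    wds = "".join(f"<wd pos={quoteattr(postag)}/>" for _ in words)
    lines.append(f'  <disambig action="replace">{wds}</disambig>')
    lines.append("</rule>")
    return "\n".join(lines)


for term in ["percentum", "de minimis", "ipso facto"]:
    print(make_disambiguation_rule(term))
```

The IGNORE_SPELLING_OF_FOREIGN_PHRASES rule does not depend on the term list, so it still has to be added once by hand.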

I could not make one grammar rule that finds one-word and multi-word terms. This rule finds 2-word terms.

<rule id="FOREIGN_PHRASES_2_WORD" name="Foreign phrases">
  <pattern>
    <marker>
      <token postag="FOREIGN"/>
      <token postag="FOREIGN"/>
    </marker>
    <token><exception postag="FOREIGN"/></token><!-- Found by a rule for 3-word foreign phrases -->
  </pattern>
  <message>The phrase '\1 \2' is a foreign phrase. Use simpler English words.</message>
  <short>Foreign phrases</short>
  <example>Did you see <marker>Ipso</marker> yesterday?</example>
  <example>The <marker>facto</marker> the matter is that...</example>
  <example>... but <marker>ipso the facto</marker>, we must...</example>
  <example>What does '<marker>ibidem</marker>' mean? [This example shows that you might want to ignore quoted foreign phrases.]</example>
  <example type="incorrect">A lawyer who challenges well-established models is not <marker>ipso facto</marker> mistaken.</example>
</rule>

This is not a perfect solution, but it decreases the manual effort, because for each new foreign phrase, you need only to add the phrase in disambiguation.xml.

Other options: for one-word terms, you need only 1 disambiguation rule. Or, you could put the terms into added.txt.
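
For the single-rule option, one way to do it would be a token with regexp="yes" that lists all one-word terms as alternatives. The sketch below builds such a rule from a word list. The rule id and the function name are made up, and re.escape produces Python-style escaping, which I assume is acceptable to LT's Java regular expressions for ordinary words.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def make_one_word_rule(words, postag="FOREIGN"):
    """Return a single disambiguation rule covering all one-word terms."""
    if any(len(word.split()) != 1 for word in words):
        raise ValueError("only one-word terms are supported")
    # Sort and de-duplicate so the generated rule is stable between runs
    alternation = "|".join(re.escape(word) for word in sorted(set(words)))
    return "\n".join([
        '<rule id="FOREIGN_ONE_WORD_TERMS" name="Foreign one-word terms">',
        "  <pattern>",
        f'      <token regexp="yes">{escape(alternation)}</token>',
        "  </pattern>",
        f'  <disambig action="replace"><wd pos={quoteattr(postag)}/></disambig>',
        "</rule>",
    ])


print(make_one_word_rule(["percentum", "ibidem"]))
```

Multi-word terms would still need one rule each, because the number of tokens in the pattern differs.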