
Excluding URLs (with/out http)


(JensP) #1

Hi,

I am currently developing my first rules. One of them should skip spell checking for URLs that do not start with http. I tried to turn my working regex into a pattern:

<rule name="Ignorieren von Internetadressen" id="IGNORE_URLS">

  <pattern case_sensitive="yes">
    <!-- @see http://regexr.com/39a8d -->
    <token regexp="yes">https?\:\/\/|\w*</token>
    <token>.</token>
    <token regexp="yes">[^\/\s]+</token>
    <token regexp="yes">\/.*?</token>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

But it does not match. What am I doing wrong? Is the "." token the problem?

Is there an existing rule for URLs with an http prefix? They do not seem to be spell-checked, but I could not find such a rule.

Kind regards


(Daniel Naber) #2

URLs are ignored in SpellingCheckRule.isUrl(), so your disambiguator rule should not be needed. This is only true for "proper" URLs that start with http, https, or ftp, though.


(JensP) #3

Ah OK, but we need a way to exclude URLs without "http://". Nothing special, just a simple way to ignore notations like subdomain.domain.tld. How can we achieve that?


(Daniel Naber) #4

Maybe something like this (not tested):

<rule name="Ignorieren von Internetadressen" id="IGNORE_URLS">
  <pattern>
    <token />
    <token spacebefore="no">.</token>
    <token />
    <token spacebefore="no">.</token>
    <token regexp="yes" spacebefore="no">(org|com|net|de)</token>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

(JensP) #5

Thank you, it works as expected, though I modified the last token:

<rule name="Ignorieren von Internetadressen" id="IGNORE_URLS">
    <!-- @see http://regexr.com/39a8d -->
    <pattern>
        <token />
        <token spacebefore="no">.</token>
        <token />
        <token spacebefore="no">.</token>
        <token regexp="yes" spacebefore="no">[a-zA-Z]{2,}</token>
    </pattern>
    <disambig action="ignore_spelling"/>
</rule>

Now it should match the most obvious cases (we do not need a complete 100% solution here).
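
For anyone wondering why the pattern from post #1 never matched while the one from post #5 does: judging by the rules in this thread, each <token> element is compared against exactly one token of the tokenized text, so a regex written for the whole URL cannot succeed when it is split across tokens. The Python sketch below is only a rough simulation of that idea and is not LanguageTool code. The names tokenize and find_matches are hypothetical, and the tokenizer (words and single punctuation marks) is a simplifying assumption.

```python
import re

def tokenize(text):
    """Split text into (token, space_before) pairs.

    Assumption: words and single punctuation marks are separate tokens.
    This only approximates what LanguageTool's tokenizer does.
    """
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        space_before = m.start() > 0 and text[m.start() - 1].isspace()
        tokens.append((m.group(), space_before))
    return tokens

def find_matches(text, pattern):
    """Slide the pattern over the token list; each entry must match one whole token.

    pattern is a list of (regex or None, no_space_before) pairs, where None
    stands for an empty <token /> that accepts any token.
    """
    tokens = tokenize(text)
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(
            (regex is None or re.fullmatch(regex, tok))
            and not (no_space and space_before)
            for (tok, space_before), (regex, no_space) in zip(window, pattern)
        ):
            hits.append("".join((" " if sb else "") + t for t, sb in window).strip())
    return hits

# Pattern from post #1: the last entry expects a token starting with "/".
ORIGINAL_PATTERN = [
    (r"https?\:\/\/|\w*", False),
    (r"\.", False),
    (r"[^\/\s]+", False),
    (r"\/.*?", False),
]

# Pattern from post #5: one entry per token, dots must follow without a space.
FIXED_PATTERN = [
    (None, False),
    (r"\.", True),
    (None, False),
    (r"\.", True),
    (r"[a-zA-Z]{2,}", True),
]

text = "Visit subdomain.domain.tld today"
print(find_matches(text, ORIGINAL_PATTERN))  # → []
print(find_matches(text, FIXED_PATTERN))     # → ['subdomain.domain.tld']
```

In this simulation the original pattern fails because its fourth entry requires a token that begins with a slash, which subdomain.domain.tld does not contain. The five-entry pattern lines up one-to-one with the tokens subdomain, ".", domain, "." and tld.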