
XML Regular expressions doubt


#1

Hi

Can anyone point me to where I can find more about XML regular expressions? I’ve read the http://wiki.languagetool.org/development-overview#toc15 page, but I still have doubts about the use of the characters \ * [ and {. I don’t understand what they mean.


(Mike Unwalla) #2

I recommend https://www.regular-expressions.info.


(Robin van der Vliet) #3

I also really like regex101.com; it’s a useful tool for constructing regular expressions.


#4

Thanks!


(Benedict Holland) #5

There isn’t such a thing as “XML regex”. Regular expressions are a normalized set of rules that you can use to parse text.

I really love the python explanation found at

https://docs.python.org/3.4/library/re.html

I will offer a small bit of advice. In general, they are quite difficult to use and way overkill in most situations. Be prepared for many unexpected errors where the regex dutifully matches a line that you never thought about matching.

I also have to ask: XML is a markup language, and you shouldn’t parse XML with a regex. Are you parsing expressions within the XML, for example checking whether “Bill” appears in a title tag?


(Tiago F. Santos) #6

@bholland

While it should have been said “the regular expressions used in the XML files that LanguageTool uses”, everyone able to provide a useful answer was able to understand what was meant. And they did provide an answer.

If you want specifics, the type of regexps used should be the one that Java uses, given that is the main coding language. For specifics, vide:

Getting into these specifications probably won’t be needed anyway, since the bread and butter of this tool only requires basic regexp knowledge that is common to all engines.
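As a rough illustration of that basic knowledge, here is a small sketch of the four characters the original question asks about (\ * [ and {). It is written in Python rather than Java only because it is quick to try in an interpreter; for simple cases like these the two engines should behave the same. The helper name and the sample strings are made up for the example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the WHOLE text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# \  escapes a metacharacter, or starts a shorthand such as \d (a digit).
print(matches(r"3\.14", "3.14"))     # True: \. is a literal dot
print(matches(r"3\.14", "3x14"))     # False

# *  means "zero or more of the previous item".
print(matches(r"ab*", "a"))          # True: zero b's
print(matches(r"ab*", "abbb"))       # True: three b's

# [ ]  is a character class: exactly one character from the set.
print(matches(r"gr[ae]y", "grey"))   # True
print(matches(r"gr[ae]y", "graey"))  # False: only one of a/e is allowed

# { }  is an explicit repetition count: {n} or {min,max}.
print(matches(r"\d{4}", "2024"))     # True: exactly four digits
print(matches(r"\d{2,3}", "2024"))   # False: four digits is too many
```

The helper uses fullmatch so that a pattern has to cover the whole string, which makes the effect of each character easier to see.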

@Ferch42 Regexps are not that hard and they are the way to go, if you intend to do actual rules and not just rigid string identifiers. At first, complex regexps will probably match more than you want, but you can always correct them later or add antipatterns (either in the regexp itself or using LT metalanguage).
Most of all, create things and worry about the problems and errors later.
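A minimal sketch of that "match too much first, tighten later" loop, with an invented sentence and invented patterns. The negative lookahead in the last pattern stands in for an exclusion written in the regexp itself; LT's own XML antipatterns are not shown here. Python is used only for convenience, and Java's engine also supports \b and (?!...).

```python
import re

def find_starts(pattern: str, text: str) -> list[int]:
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "The cat sat. We concatenate strings. Buy cat food."

# First attempt: also hits the "cat" hidden inside "concatenate".
naive = find_starts(r"cat", text)

# Correction 1: word boundaries keep only the standalone word.
bounded = find_starts(r"\bcat\b", text)

# Correction 2: a negative lookahead excludes "cat food" as well.
excluded = find_starts(r"\bcat\b(?! food)", text)

print(len(naive), len(bounded), len(excluded))  # 3 2 1
```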


(Benedict Holland) #7

Right. At the cost of being obtuse, I think I understood what the OP meant but wanted to clarify as I know of no XML implementation of regex and actually I can’t think of a situation where I would want to mix regex and XML. I am simply asking because perhaps there is an alternative that the OP might consider rather than going down this path that might be more appropriate. I don’t know, hence the question.

Python regex is particularly excellent, even if it isn’t in Java. It will at least give a solid foundation to someone who is approaching a regular expression for the first time. It is also really nice since you can see what the results are at every single step very quickly using the interpreter, and you don’t need to worry so much about types and all of the overhead that a Java example will bring. Not to mention, regex in just about any language is going to be the same, or should be. That would be the whole point of regex, wouldn’t it? The RE module provides a brief and excellent synopsis of the features and functions that I would expect any language to implement. I am also biased since this is how I learned how to use regex and simply offer it as an excellent and clear resource. I am sure there are others.

Unfortunately, I have to absolutely disagree with your last point. If you worry about problems early on in the process, they are much easier to fix than if you leave them for later. If all the OP wanted to do was find specific tags, I wouldn’t use regex. Throughout my career, I have probably used regex hundreds of times, and in every single case there were situations where something went wrong. Regular expressions are very rigid, unreadable, and can be extremely complex. If the user needs language support other than English, things that sound simple turn out to be more difficult. None of this is impossible or can’t be overcome, but to state offhandedly that a matching error is easy to debug is simply wrong. Some (in my opinion most) matching errors are difficult to track down and even harder to fix. This is particularly true with freeform text and if you are new to regular expressions and don’t know what to expect.
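To give one concrete case of the non-English problem, here is a small sketch with a made-up helper and a single sample word. It also shows a spot where engines are not quite the same: in Python 3, \w matches accented letters by default, whereas Java's \w is, as far as I know, ASCII-only unless the UNICODE_CHARACTER_CLASS flag is set. The re.ASCII flag is used below only to imitate that ASCII-only behaviour.

```python
import re

def is_word(pattern: str, token: str, flags: int = 0) -> bool:
    """Return True if the whole token matches the pattern."""
    return re.fullmatch(pattern, token, flags) is not None

cafe = "caf\u00e9"  # "café", written with a precomposed é

print(is_word(r"[a-z]+", "cafe"))       # True: plain ASCII word
print(is_word(r"[a-z]+", cafe))         # False: é is outside a-z
print(is_word(r"\w+", cafe))            # True: \w is Unicode-aware here
print(is_word(r"\w+", cafe, re.ASCII))  # False: ASCII-only \w rejects é
```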

I didn’t mean to come off badly so I am sorry that I did.