Hi, LanguageTool might have the chance to work with computer science students on a 3-month (not quite full-time) project. The students would work in small groups, implementing some LT-related task we provide. I'm looking for ideas for what this task could look like. The students are in their 3rd year and have done projects like these before. As these are CS students, the task will need to be more advanced than "implement some XML rules". Please post here (or mail me privately) if you have ideas.
The first thing that comes to mind is rule efficiency. Have them analyze existing rules (and the functions that interpret them) and refactor them to use less processing time.
- Create a Java interface so that FreeLing chunkers can be used in LanguageTool. These chunkers are easy to read and create, and the benefit is obvious for any language covered by FreeLing.
It would allow writing chunkers easily, just like with the XML rules.
It would be usable immediately for the Iberian languages (5 languages), and other languages could have a decent chunker within weeks.
- Improve the Option GUI in order to allow togglable rule sets and underlying interfaces.
This is useful for agreement rules, formality levels, style guide in use, etc.
The issue with FreeLing is that it's published under AGPL, so LT (under LGPL) cannot use their code.
My proposal: Create a more powerful version of the rule "Wrong word in context".
- Automate the extraction of context words from a corpus (excluding stopwords).
- Automate the testing. (This could be used to apply some basic machine learning concepts, like train/validation/test sets...)
- Test and weigh a few parameters: number of context words, distance between context words and the dubious word, lemma vs. word form...
For some kinds of confusion I am pretty sure this is the right approach. But I don't know how many cases it will solve in different languages.
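To make the first step a bit more concrete, here is a minimal sketch of how context words could be collected from a corpus. Everything in it is an assumption for illustration: the class name `ContextWordExtractor`, the naive tokenizer, and the stopword set passed in by the caller are hypothetical and not part of LanguageTool. It works on word forms only; lemma-based matching and weighting by distance would be the parameters the students test.

```java
import java.util.*;

/** Hypothetical helper (not LanguageTool code): counts context words around a target word. */
final class ContextWordExtractor {

    /**
     * Returns how often each non-stopword occurs within `window` tokens
     * of `target` in the given sentences (case-insensitive, word forms only).
     */
    static Map<String, Integer> extract(List<String> sentences, String target,
                                        Set<String> stopwords, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // Naive tokenization: lowercase, then split on anything that is not a letter.
            String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    String word = tokens[j];
                    // Skip the target itself, empty tokens and stopwords.
                    if (j == i || word.isEmpty() || stopwords.contains(word)) {
                        continue;
                    }
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Running this once per word of a confusion pair on a training part of the corpus, and evaluating the resulting word lists on a held-out part, would give the train/validation/test setup mentioned above.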
Also, could they make a tool that analyzes a rule and suggests how to refactor the code? (Then, we would have a tool to help us to do that job.)
I meant dynamic linking, similar to the POS dictionaries and other external components.
If even that is impossible due to licensing conflicts, creating a metalanguage for chunkers that could be written in plain text (or very simplified code) would be equally great.
Maybe improving the <phrase/> tag system, by moving it outside grammar.xml/disambiguation.xml, giving it an independent process with higher priority, and working out any kinks that might appear in the process.
I also find @jaumeortola's idea very useful, if well implemented.
Perhaps they could perform black-box reverse-engineering on FreeLing (which, in theory, doesn't raise any license conflicts).
A way to keep metadata about each rule and get information about a rule: its history, who changed what, when the change was made, and why the change was made. Refer to http://forum.languagetool.org/t/history-of-a-rule-or-a-rulegroup/1252.
Correct some of the bugs that are listed on https://github.com/languagetool-org/languagetool/issues.
GUI: Add a flag so the user can choose to show non-printing characters (examples: space, CR, LF, non-breaking space).
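As a rough illustration of what such a flag could do behind the scenes, here is a minimal sketch that replaces the characters listed above with visible symbols. The class name `VisibleWhitespace` and the choice of replacement symbols are my own assumptions, not existing LanguageTool code; a real GUI would more likely paint the symbols on top of the text rather than change the text itself.

```java
/** Hypothetical helper (not LanguageTool code): makes non-printing characters visible. */
final class VisibleWhitespace {

    static String reveal(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case ' ':
                    sb.append('\u00B7');              // middle dot for a plain space
                    break;
                case '\u00A0':
                    sb.append('\u237D');              // shouldered open box for a non-breaking space
                    break;
                case '\r':
                    sb.append('\u240D');              // control picture for CR
                    break;
                case '\n':
                    sb.append('\u240A').append('\n'); // control picture for LF, keeping the line break
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```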
Although it has been out of LT's scope so far:
How about providing support for some markup languages (HTML, LaTeX)?
Related to this might be the integration in some popular WebEditors such as CKEditor or TinyMCE.
A way to find errors in 'correct' text.
When we write a rule, we can make sure that its correct examples do not contain text that causes other rules to give warnings. But we continue to develop new rules, so some of the existing correct examples can later contain text that triggers an error message. (I think that someone made a similar comment on the forum some months ago, but I cannot find the comment.)
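The check itself could be as simple as running every rule over every other rule's correct examples. Below is a minimal sketch of that idea. `SimpleRule` is a hypothetical stand-in (an id, a matcher and a list of correct examples), not LanguageTool's actual rule class; a real version would load the rules and examples from the XML files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch (not LanguageTool code): finds 'correct' examples flagged by other rules. */
final class ExampleCrossCheck {

    /** Stand-in for a rule: an id, a matcher, and the rule's own 'correct' examples. */
    static final class SimpleRule {
        final String id;
        final Predicate<String> matcher;
        final List<String> correctExamples;

        SimpleRule(String id, Predicate<String> matcher, List<String> correctExamples) {
            this.id = id;
            this.matcher = matcher;
            this.correctExamples = correctExamples;
        }
    }

    /** Runs every rule over the correct examples of all other rules and reports the hits. */
    static List<String> findConflicts(List<SimpleRule> rules) {
        List<String> conflicts = new ArrayList<>();
        for (SimpleRule owner : rules) {
            for (String example : owner.correctExamples) {
                for (SimpleRule other : rules) {
                    if (other != owner && other.matcher.test(example)) {
                        conflicts.add(owner.id + " example flagged by " + other.id + ": " + example);
                    }
                }
            }
        }
        return conflicts;
    }
}
```

Run as part of the regular tests, this would report a conflict as soon as a newly added rule matches an older example.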
On a related note: maybe they could help with hunting down 'incorrect' examples in 'style' rules that violate the keywords at the heart of the issue.
For example, the Dutch 'toentertijd'-rule (which seems to be based on a single non-credible source) gives as an 'incorrect' example "A toentertijd B", which is a full violation of the word's meaning as it's only meant to apply a time-frame to adjectives/adverbs/facts.
Better handling of multi-part words.
E.g. El Salvador was not recognised, but I could not recommend it to Add, because it's a two-part word.
I think we have some issues with markup in the TinyMCE component.
They could also work on upgrading our TinyMCE to version 4.
Thanks for all your ideas. The ideas were good, but I still couldn't find one that would be a good fit from both our and the students' point of view for a 3-month project. Anyway, this was not their only project and I'll stay in touch with the university. If you have more ideas, please post them.
I think it is possible to add it by manually modifying the suggestion URL:
One big topic could be the handling of multi-language texts or inserts from other languages.
Happy Birthday wünsche ich Dir!
Gestern habe ich mir Dirty Dancing angesehen.
Wollen wir und den neuen Tom Cruise Film im Kino anshen - Mission impossible?
And simply ignore text that has been set to have 'No Language'. (At the very least, it would allow you to write wonky text of a justified nature, without being continuously distracted by a myriad of error reports.)
EG1: The game-host asked Joseph "Joseph! Wat klopt niet aan de volgende zin? Vanwege de continu veranderden wegwerkzaamheden is de heer meervoudich vertraagd."
This Dutch sentence has both a full typo and a wrong-context word, but it is technically NOT incorrect, as it's a grammar/spelling puzzle.
EG2: "What are they're here for?" Was his concussion-confirming reply
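To sketch how ignoring 'No Language' text could work: if the editor hands over the text as segments that each carry a language code, the checker only needs to drop the segments marked as having no language and group the rest per language. The names `LanguageSegments` and `Segment` are hypothetical, and using the code `zxx` (the ISO 639 code for "no linguistic content") as the 'No Language' marker is an assumption on my side, not something LT does today.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not LanguageTool code): selects the text that should be checked. */
final class LanguageSegments {

    /** Assumed marker for 'No Language': the ISO 639 code for "no linguistic content". */
    static final String NO_LANGUAGE = "zxx";

    /** A piece of text together with the language code the user assigned to it. */
    static final class Segment {
        final String text;
        final String lang;

        Segment(String text, String lang) {
            this.text = text;
            this.lang = lang;
        }
    }

    /** Groups checkable text by language code, dropping segments marked 'No Language'. */
    static Map<String, List<String>> toCheck(List<Segment> segments) {
        Map<String, List<String>> byLanguage = new LinkedHashMap<>();
        for (Segment segment : segments) {
            if (NO_LANGUAGE.equals(segment.lang)) {
                continue; // the user asked for this text to be left alone
            }
            byLanguage.computeIfAbsent(segment.lang, k -> new ArrayList<>()).add(segment.text);
        }
        return byLanguage;
    }
}
```

In EG1, the English frame would go to the English rules, the Dutch question to the Dutch rules, and the deliberately broken puzzle sentence would be marked 'No Language' and skipped.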