LT and GSoC 2018 - looking for students

dnaber · February 27, 2018, 3:42pm

Welcome to LT! Neither Chinese nor Japanese is maintained at the moment. For Chinese, we got some contributions lately (they will be part of LT 4.1), so this would be a good time to revive support for it. I think the best approach is to start using LT (on the homepage, via its add-ons etc.) and see if you can come up with ideas on how to improve its support for Chinese. Our wiki also has some ideas that require machine learning. When using LT, just consider that we try to find errors on all levels: spelling, style, grammar, semantics. Even though I don’t speak Chinese or Japanese, I guess there are a lot of error detection rules that can be implemented on all levels. Just keep in mind that for GSoC, some “real” coding is required, not just writing XML rules.

Nico · March 18, 2018, 10:03pm

Hi,

I was studying Grammatical Error Correction systems from the ACL archive in preparation for a PhD project when I noticed that they don’t make use of any knowledge about discourse structure, that is, the level of linguistic description that goes beyond the single sentence.
This means that neither rule-based systems nor statistical ones encode long-distance relationships, discourse relations, or connectives.
As a consequence, many mistakes are not identified by the program, notably mistakes in the use of verb tense and mood.
To give an example, as far as I know, even the best state-of-the-art systems don’t recognize that a sentence like:
“He came back from London when he will receive a call”
contains a mistake.
Similarly, a sentence like
“He thought that he never does that”
is marked as correct, even though it would trigger a correction from a native speaker of English.
I found similar issues with Italian. (e.g. sentences with unidentified mistakes: “Quando ritornò da Londra, Luca mangia una mela.” Correct: mangerà; e.g. “Mi chiedo perché tu non sapresti mai quello che serva.” Correct: sappia.)
The reason seems to be that correctors are not tuned to recognize elements of discourse structure such as the so-called “consecutio temporum” (the ordered sequence of verb tense and mood across clauses) or the role of connectives, even though both are covered at a very general level in many grammar textbooks.
I successfully tested a basic example of an XML rule that catches this kind of pattern:

    <rule id="WHEN_generalized" name="Past+When+Fut">
        <pattern>
            <or>
                <token postag="VBD"></token>    <!-- verb in the past tense -->
                <token postag="VBN"></token>    <!-- past participle -->
            </or>
            <token>when</token>                 <!-- when -->
            <token postag="PRP"></token>        <!-- personal pronoun -->
            <marker>
                <token>will</token>
                <token postag="VB"></token>     <!-- future -->
            </marker>
        </pattern>
        <message>change verb to the past tense</message>
        <example correction='arrived'>I had already gone when he <marker>will arrive</marker>.</example>
    </rule>

I wonder whether this kind of work could be done as a GSoC project. It would require a systematic survey of linguistic theory on discourse and a codification of the rules in a formal framework, such as the XML rule format above. I’m also quite familiar with Java.
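As a rough illustration of how the same check might look in plain Java, here is a minimal sketch of the pattern from the XML rule above. It does not use the LanguageTool API: the `Token` record, the `WhenFutureCheck` class and the `findMatches` method are hypothetical names made up for this example, and the POS tags are assumed to be supplied already by some tagger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WhenFutureCheck {

    // Hypothetical stand-in for a POS-tagged token; not a LanguageTool class.
    record Token(String word, String pos) {}

    // Tags treated as "past" in the XML rule: past tense and past participle.
    private static final Set<String> PAST_TAGS = Set.of("VBD", "VBN");

    /** Returns the index of each "will" that starts a flagged "will + VB" span. */
    static List<Integer> findMatches(List<Token> tokens) {
        List<Integer> hits = new ArrayList<>();
        // The pattern spans five adjacent tokens:
        // past verb, "when", personal pronoun, "will", base-form verb.
        for (int i = 0; i + 4 < tokens.size(); i++) {
            boolean pastVerb = PAST_TAGS.contains(tokens.get(i).pos());
            boolean when = tokens.get(i + 1).word().equalsIgnoreCase("when");
            boolean pronoun = "PRP".equals(tokens.get(i + 2).pos());
            boolean will = tokens.get(i + 3).word().equalsIgnoreCase("will");
            boolean baseVerb = "VB".equals(tokens.get(i + 4).pos());
            if (pastVerb && when && pronoun && will && baseVerb) {
                hits.add(i + 3); // position of "will", the start of the marked span
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "I had already gone when he will arrive", tagged by hand for the example.
        List<Token> sentence = List.of(
            new Token("I", "PRP"), new Token("had", "VBD"),
            new Token("already", "RB"), new Token("gone", "VBN"),
            new Token("when", "WRB"), new Token("he", "PRP"),
            new Token("will", "MD"), new Token("arrive", "VB"));
        System.out.println(findMatches(sentence)); // prints [6]
    }
}
```

Like the XML rule, this sketch only fires when the past verb directly precedes "when", so a sentence such as “He came back from London when he will receive a call” would not be caught without allowing intervening tokens.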

dnaber · March 19, 2018, 8:32am

Yes, I think this could be a GSoC project. It’s hard for me to tell whether this is enough for 3 months. Like any proposal, yours would have to include a rather detailed road map.