English suit/suite

I’ve been looking to see if I can come up with a rule that can distinguish between the uses of suit/suite.
However, it’s tricky, since the rules depend on the type of object that follows the word.
For example, for suit I have…

<rule id="SUIT_OF_CARDS" name="suit of cards">
            <pattern>
                <marker>
                    <token>suite</token>
                </marker>
                <token>of</token>
                <token>playing</token>
                <token>cards</token>
            </pattern>
            <message>Did you mean <suggestion>suit</suggestion> of playing cards?</message>
            <url>http://www.oxforddictionaries.com/definition/english/suit</url>
            <!-- suit means a costume, a set of garments, a claim in court, or a set of playing cards bearing the same mark -->
            <example type="correct"><marker>suit</marker> of playing cards</example>
            <example type="incorrect"><marker>suite</marker> of playing cards</example>
        </rule>

    
<rule id="SUIT_OF" name="suit of">
            <pattern>
                <marker>
                    <token>suite</token>
                </marker>
                <token>of</token>
                <token regexp="yes">cards|armour|sails</token>
            </pattern>
            <message>Did you mean <suggestion>suit</suggestion> of <match no="3"/>?</message>
            <url>http://www.oxforddictionaries.com/definition/english/suit</url>
            <!-- suit means a costume, a set of garments, a claim in court, or a set of playing cards bearing the same mark -->
            <example type="correct"><marker>suit</marker> of armour</example>
            <example type="incorrect"><marker>suite</marker> of armour</example>
        </rule>

So as far as I can see, this involves identifying every possible object that can be associated with the word suit.
That seems like an impractical solution.

… and likewise for “suite”

<rule id="SUITE_OF" name="suite of" >    
            <pattern>
                <marker>
                    <token>suit</token>
                </marker>
                <token>of</token>
                <token regexp="yes">software|rooms|protocols</token>
            </pattern>
            <message>Did you mean <suggestion>suite</suggestion> of <match no="3"/>?</message>
            <url>http://www.oxforddictionaries.com/definition/english/suite</url>
            <!-- suite means a musical composition, a staff of attendants, or a set of things that form a unit -->
            <example type="correct"><marker>suite</marker> of software</example>
            <example type="incorrect"><marker>suit</marker> of software</example>
        </rule>

The only easy one was lawsuit…

<rule id="LAWSUIT" name="lawsuit" >    
            <pattern>
                <marker>
                    <token>law</token>
                    <token inflected="yes" regexp="yes">suit|suite</token>
                </marker>
            </pattern>
            <message>Did you mean <suggestion>law<match no="2"/></suggestion>?</message>
            <example type="correct">The plaintiff brought a <marker>lawsuit</marker> against the defendant</example>
            <example type="incorrect">The plaintiff brought a <marker>law suit</marker> against the defendant</example>
        </rule>

Any suggestions of an alternative technique which could be used?

I expect a number of other words have a similar issue, e.g. effect/affect or resent/recent. Has anyone else encountered this problem before?

The only solution I can think of is an internet phrase-search tool which can report the frequency of different chunks of text on the web or in a database. I believe other grammar tools may use this technique.
I’ve looked at using the Google Web Search API, but this seems to be deprecated now.

Thanks

For English, an easy way to get frequency data is the Google Web 1T 5-Gram Database - just search for “xyz ” if you’re interested in which words follow xyz, or “ xyz” if you’re interested in which words precede it: http://corpora.linguistik.uni-erlangen.de/demos/cgi-bin/Web1T5/Web1T5_colloc.perl (taken from Tips and Tricks - LanguageTool Wiki)

One can also try using the data directly, without writing rules, as documented at Finding errors using Big Data - LanguageTool Wiki. However, the data is large and it still requires adjustment to get good quality results.

Please let me know when you think your rules are ready for inclusion.

Thanks for the info, looks interesting. Will investigate further at some point.
3 GB is a lot to download; is there an API for accessing this data remotely?

One solution I tried was to write a prototype routine for the LanguageTool standalone utility that uses the Google Books API. This method does not require an API key, but only returns limited results.

public void phraseSearchBooks() {
    // See "Using the API" in the Google Books API documentation.
    // Requires java.net.URL, java.net.URLEncoder, java.io.*,
    // javax.swing.JOptionPane, and Gson (com.google.gson.*).
    String google = "https://www.googleapis.com/books/v1/volumes?maxResults=1&q=";
    String charset = "UTF-8";
    String selectedText = ltSupport.getTextComponent().getSelectedText();
    if (selectedText != null && selectedText.length() > 0) {
        String search = "\"" + selectedText + "\"";
        try {
            URL url = new URL(google + URLEncoder.encode(search, charset));
            Reader reader = new InputStreamReader(url.openStream(), charset);

            JsonParser parser = new JsonParser();
            JsonObject obj = parser.parse(reader).getAsJsonObject();

            JsonElement resultCount = obj.get("totalItems");
            if (resultCount != null) {
                String countMessage = resultCount.getAsString();
                System.out.println(countMessage);
                JOptionPane.showMessageDialog(null,
                        "Found " + countMessage + " book matches for " + search);
                return;
            }
        } catch (MalformedURLException | UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            Tools.showError(e);
        }
        JOptionPane.showMessageDialog(null, "no results found");
    }
}

Another method I found was to run a SQL query on the Google N-Gram Web API, which allows downloading the results as CSV, e.g.
SELECT ngram, FIRST(first), FIRST(second), FIRST(third), SUM(cell.match_count)
FROM [publicdata:samples.trigrams]
WHERE first = "suit" AND second = "of"
GROUP EACH BY ngram ORDER BY f3_ DESC

As for the rules: the “Lawsuit” rule is complete, but for suit/suite I’m not sure the XML approach will work, so I might investigate the LanguageTool ngram error detection when I find enough disk space.

I don’t know of any other APIs. But this SQL query looks promising: with it, it should be possible to write a tool that takes a word pair and automatically produces two rules by checking which context words appear with only one of the pair. For example, “internet add” might have almost zero occurrences, while “internet ad” has a lot, so the pattern “internet add” might be a useful rule.
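A minimal sketch of how such a tool might work, assuming the bigram counts have already been fetched (the class name, method, and counts below are made up for illustration, not real n-gram data):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RuleCandidateFinder {

    // Given counts of "word1 context" and "word2 context" bigrams, return the
    // contexts where word1 dominates almost completely, mapping the likely
    // error phrase ("word2 context") to the suggested replacement.
    static Map<String, String> candidatePatterns(
            String word1, Map<String, Long> counts1,
            String word2, Map<String, Long> counts2,
            double minRatio) {
        Map<String, String> patterns = new LinkedHashMap<>();
        for (String context : counts1.keySet()) {
            long c1 = counts1.getOrDefault(context, 0L);
            long c2 = counts2.getOrDefault(context, 0L);
            // "word2 context" is rare compared to "word1 context":
            // flag it as a candidate error pattern
            if (c1 >= minRatio * Math.max(c2, 1)) {
                patterns.put(word2 + " " + context, word1 + " " + context);
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the "ad"/"add" pair
        Map<String, Long> ad = new LinkedHashMap<>();
        ad.put("revenue", 120000L);
        Map<String, Long> add = new LinkedHashMap<>();
        add.put("revenue", 3L);
        System.out.println(candidatePatterns("ad", ad, "add", add, 1000));
        // prints {add revenue=ad revenue}
    }
}
```

A real version would pull the counts from a trigram query like the one above, and would also want an absolute frequency threshold so that rare-but-valid phrases don’t turn into false alarms.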

I’ve added the lawsuit rule, thanks. I modified the message like this:

<message>Did you mean <suggestion>law<match no="2" regexp_match="suite" regexp_replace="suit" /></suggestion>?</message>

This way, for “law suite”, “lawsuit” is suggested and I think that’s correct, isn’t it (unlike “lawsuite”)?

Regards
Daniel

Hi Peter,

I guess this refers to Google BigQuery? Did you try these queries? It would be interesting to see how fast they are.

Here are some links, mostly for myself (if I find time to take a look at this):

https://groups.google.com/forum/#!topic/bigquery-discuss/OT_W0ayVSvg

Yes, Google BigQuery.
I’ve found a little Python project on GitHub which could be used to get some timing information.

Another interesting site is Microsoft Web N-gram Services:
http://weblm.research.microsoft.com/

Hi, I have now downloaded the ngram database and enabled the languagemodel option.
With a small edit to the homophones.txt file (I changed line 782 from “suite, sweet” to “suite, sweet, suit”), LanguageTool seems to identify the incorrect use of suite/suit in these sentences:

They have a suite of armour.
They have a suit of software.
I was in a hotel suit.

It also statistically identifies errors with “whether” and “affect” in the sentence “The whether had an affect on me”.
Note the issue with “affect” was also identified by the “affect vs effect” rule.

For some reason, when I added “affected, effected” to the homophones.txt file, they didn’t seem to get processed by the match() method of ConfusionProbabilityRule.
Have you any idea why certain words are not passed to the method?

I noted at the top of homophones.txt it states

Note that entries from this file might be ignored by the ConfusionSetLoader, according to data in homophonedb-info.txt.

I can’t find the file homophonedb-info.txt, so I assume it is the file homophone-info.txt.
The homophone-info.txt file looks a bit cryptic; does this file need editing as well?

Looking at the source code, the score() method of ConfusionProbabilityRule might be a place where I could graft some test code to see whether accessing a remote n-gram web API is a practical alternative to having a large local database.
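As a first experiment, the idea could be sketched like this; getCount() is just a stub standing in for whatever remote n-gram lookup turns out to be practical, and the counts are invented:

```java
public class PhrasePicker {

    // Stub: in a real experiment this would query a remote n-gram API
    // (and should cache results, since score() runs for every match).
    // The counts below are made up for illustration.
    static long getCount(String phrase) {
        switch (phrase) {
            case "suit of armour":  return 25000;
            case "suite of armour": return 12;
            default:                return 0;
        }
    }

    // Return whichever variant of the phrase is more frequent in the corpus.
    static String likelier(String a, String b) {
        return getCount(a) >= getCount(b) ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(likelier("suit of armour", "suite of armour"));
        // prints suit of armour
    }
}
```

Network latency would be the obvious concern here: one HTTP round trip per candidate phrase would make checking a long text very slow, so batching or caching lookups would probably be essential.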
Thanks
Peter

Hi Peter,

I’ve improved the comments in homophones-info.txt. The last column is the percentage of errors and if it’s > 10.0 the word is ignored by default, as we assume it would create too many false alarms.
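For illustration, that filtering could be sketched roughly like this, assuming a simplified line format where the error percentage is the last whitespace-separated column (the real homophones-info.txt syntax may differ):

```java
import java.util.ArrayList;
import java.util.List;

public class HomophoneFilter {

    // Keep only entries whose error percentage (assumed to be the last
    // column) is at or below the threshold; entries above it would create
    // too many false alarms and are ignored by default.
    static List<String> acceptedWords(List<String> lines, double maxErrorPercent) {
        List<String> accepted = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue;  // skip comments
            String[] cols = line.trim().split("\\s+");
            double errorPercent = Double.parseDouble(cols[cols.length - 1]);
            if (errorPercent <= maxErrorPercent) {
                accepted.add(cols[0]);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "suit 123 456 2.5",     // hypothetical data
                "their 999 888 14.2");  // > 10.0: ignored by default
        System.out.println(acceptedWords(lines, 10.0));
        // prints [suit]
    }
}
```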

It’s possible to make the index of ngrams smaller, as it only needs to know about the homophones (and not about all words), but it would still be a few hundred megabytes.

Regards
Daniel

Thanks. If I add “affected” and “effected” to homophones-info.txt, it also detects the error in the sentence “I was effected by what he said to me”.

It’s interesting, since I think “suite” and “suit” are not homophones, but rather a pair of commonly confused words.
The alphaDictionary page “Often Confused False Cognates (Words) in English” identifies 250 commonly confused words, which it classifies as false cognates. Is “false cognates” the correct term?
Thanks

I just named the file “homophones” because most of the pairs will be homophones. I haven’t often heard “false cognates” before.