Style Tools Contribution

troy · August 21, 2015, 11:20pm

Hi there,

I would like to contribute all of the stuff I’ve been working on. I like this project and would like to see some more style features added to the program.

I have reverse engineered the algorithm for the Writer’s Diet and the Hemingway App, both great programs by the way. I have created my own versions of these programs. The Writer’s Diet tests for flabby writing and highlights words that could be trimmed down. The Hemingway app underlines sentences that are too flabby. Both of these tools are very useful for writers, whether they be fiction writers, academic writers or even just someone writing an email. I’ve created my own versions of the programs in Java, Python, Android APK and eLisp.

The Writer’s Diet tests for overuse of ‘problematic words’. For example, the word industrialization may be better written as industrialize. One factor in the test looks for words like this – nominalized verbs – by looking for words that end in -ion, using regex.

http://www.writersdiet.com/WT.php
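
The "-ion" check described above can be sketched roughly as follows. This is a minimal illustration, not the Writer’s Diet algorithm itself: the class name `NominalizationFinder`, the exact pattern, and the rule that at least two letters must precede the suffix are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: flags candidate nominalizations by their "-ion" suffix. */
class NominalizationFinder {
    // Whole words ending in "ion" or "ions". Requiring two or more letters
    // before the suffix is an assumption that skips short words like "lion".
    private static final Pattern ION_WORD =
            Pattern.compile("\\b[a-z]{2,}ions?\\b", Pattern.CASE_INSENSITIVE);

    /** Returns every word in the text that matches the suffix pattern, in order. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = ION_WORD.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find("The industrialization of the nation scared the lion."));
    }
}
```

A suffix test like this also flags ordinary nouns such as "nation", so a real check would probably need an exception list or a ratio threshold rather than treating every hit as a problem.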

The Hemingway App is another interesting algorithm. It most likely uses one of the Flesch–Kincaid readability formulas (the scores I quote here are on the Flesch Reading Ease scale, where a higher score means easier reading), but applies it to sentences. There are many readability algorithms on the net, but it was quite ingenious to apply the algorithm to sentences instead of whole documents. I made my own mirror version of this program and tested it with students. I found that fiction should score above 60. For academic writers (even at PhD level), however, getting the score above 40 is already a good effort. This is because academic writers have to use longer words than fiction writers do.

http://www.hemingwayapp.com/
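
Here is a small sketch of how a sentence-level score could be computed, assuming the Flesch Reading Ease formula (206.835 − 1.015 × words per sentence − 84.6 × syllables per word) with the sentence count fixed at one. Whether the Hemingway App actually works this way is not confirmed. The class and method names are hypothetical, and the word and syllable counts are taken as inputs.

```java
/** Hypothetical helper: Flesch Reading Ease applied to one sentence at a time. */
class SentenceReadability {
    /**
     * Scores a single sentence. With exactly one sentence, the
     * "words per sentence" term is simply the word count.
     */
    static double readingEase(int wordCount, int syllableCount) {
        if (wordCount <= 0) {
            throw new IllegalArgumentException("sentence has no words");
        }
        double wordsPerSentence = wordCount;
        double syllablesPerWord = (double) syllableCount / wordCount;
        return 206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;
    }

    public static void main(String[] args) {
        // "The cat sat on the mat." has 6 words and 6 syllables.
        System.out.println(readingEase(6, 6));
        // A 30-word sentence averaging 2 syllables per word scores far lower.
        System.out.println(readingEase(30, 60));
    }
}
```

Long sentences and long words both pull the score down, which is why a single sprawling sentence can score low even in an otherwise readable document.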

Further on the readability scores, I have developed a better way to calculate them. The algorithm needs syllable counts for words to calculate the score. I noticed early on in my experiments that many readability tests vary in their results; in other words, they lack reliability. After a bit of digging, I found that this happens because there isn’t an accurate way to count syllables. Some algorithms estimate syllables by counting the number of vowels in a word. This doesn’t work very well because many English words end in a vowel (amongst other problems, like vowel blends). I fixed this problem by finding a pronunciation dictionary and counting the vowels in each word’s ‘pronunciation phrase’. This works because pronunciation phrases don’t use vowel blends or end with blends. I then created a dictionary of syllable counts, which I use for lookups. I fall back to the ‘count vowels’ method when a word is not in my syllable dictionary. There will always be errors, but at least I have eliminated a fair chunk of them. I can quantify the effectiveness of my new method by benchmarking the different methods of counting syllables.
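
The dictionary-with-fallback idea can be sketched like this. It is only an illustration under stated assumptions: the class `SyllableCounter` is hypothetical, the dictionary is passed in as a plain map rather than loaded from a pronunciation dictionary, and the fallback counts runs of consecutive vowels (treating "y" as a vowel), which is one common variant of the "count vowels" method.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: dictionary lookup first, vowel counting as a fallback. */
class SyllableCounter {
    private final Map<String, Integer> dictionary;

    SyllableCounter(Map<String, Integer> dictionary) {
        this.dictionary = dictionary;
    }

    /** Returns the dictionary count if the word is known, else a vowel-based estimate. */
    int count(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Integer known = dictionary.get(key);
        return known != null ? known : countVowelGroups(key);
    }

    /** Fallback estimate: number of runs of consecutive vowels, never less than 1. */
    static int countVowelGroups(String lowerCaseWord) {
        int groups = 0;
        boolean inGroup = false;
        for (char c : lowerCaseWord.toCharArray()) {
            boolean vowel = "aeiouy".indexOf(c) >= 0;
            if (vowel && !inGroup) {
                groups++;
            }
            inGroup = vowel;
        }
        return Math.max(groups, 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("make", 1); // the fallback alone would say 2 because of the final "e"
        SyllableCounter counter = new SyllableCounter(dictionary);
        System.out.println(counter.count("make"));
        System.out.println(counter.count("beautiful"));
    }
}
```

The word "make" shows why the dictionary matters: the vowel-based estimate counts the final "e" as a second syllable, while the dictionary entry overrides it.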

I program using object-oriented principles, so I can create a separate Java class for finding flabby sentences and for finding flabby words. These could be attached to the LanguageTool library with ease (especially since LanguageTool is so well designed). However, I do think it would be good if the program could use different colours for the underlining, especially for identifying flabby words. The flabby words algorithm identifies 5 different types of flabby words; it helps a writer to be able to see these simultaneously when fixing them. As for finding flabby sentences, I would prefer to have at least 2 different coloured lines: red for very flabby sentences and maybe orange for ‘just a bit flabby’ sentences. We could even use orange underlines for sentences with a readability score under 60, and red for sentences under 40. This way the system is useful for both fiction writers and academic writers. (I noticed a lot of students felt downhearted when their entire document was coloured red [when red would be applied for anything under 60]).
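
As a rough sketch, the two-colour scheme proposed above (orange under 60, red under 40) comes down to a pair of threshold checks. The class and enum names here are hypothetical and are not part of LanguageTool.

```java
/** Hypothetical helper: maps a sentence readability score to an underline colour. */
class FlabbinessLevel {
    enum Underline { NONE, ORANGE, RED }

    /** Red for scores under 40, orange for scores under 60, otherwise no underline. */
    static Underline classify(double readabilityScore) {
        if (readabilityScore < 40.0) {
            return Underline.RED;
        }
        if (readabilityScore < 60.0) {
            return Underline.ORANGE;
        }
        return Underline.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify(35.0));
        System.out.println(classify(50.0));
        System.out.println(classify(72.5));
    }
}
```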

As for future projects, I’m currently working on creating a cliche database, so that the grammar checker can identify them. It’s okay to use cliches now and then, but it would be useful to know when you use them too much.

As of yesterday and today, I’ve been working on a paragraph checking program. The program checks whether the writer has created paragraphs in the correct places. I’m using factor analysis to create this script. This program will basically tell you whether you should split a paragraph or join two paragraphs together. I started designing this class because a PhD supervisor suggested it would help him mark PhD submissions. He said that he often has to suggest that students restructure their paragraphs. Apart from this, I’ve also been working on some buffer objects to speed up processing times and improve the big-O complexity of my algorithms. I’ve been introducing threading to my programs to increase speed as well. I’ve also been making some structural changes to my programs to make them faster. For example, I reduced my syllable dictionary from 100,000 entries to 50,000 entries because half of it could be calculated with a basic vowel counting method. I will speed this process up further by splitting the dictionary up by letter; that way, when I need to retrieve a word, I only need to look in the section for its first letter.
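
One possible reading of the "split the dictionary by letter" idea is sketched below. The class `BucketedSyllableDictionary` is hypothetical, and storing each per-letter section as an in-memory map is an assumption; the actual dictionary may be stored quite differently (for example as separate files per letter).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: syllable dictionary partitioned by the word's first letter. */
class BucketedSyllableDictionary {
    // One inner map per first letter, created on demand.
    private final Map<Character, Map<String, Integer>> buckets = new HashMap<>();

    void put(String word, int syllables) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("empty word");
        }
        buckets.computeIfAbsent(word.charAt(0), letter -> new HashMap<>())
               .put(word, syllables);
    }

    /** Returns the stored syllable count, or -1 when the word is unknown. */
    int lookup(String word) {
        if (word.isEmpty()) {
            return -1;
        }
        Map<String, Integer> bucket = buckets.get(word.charAt(0));
        if (bucket == null) {
            return -1;
        }
        return bucket.getOrDefault(word, -1);
    }

    public static void main(String[] args) {
        BucketedSyllableDictionary dictionary = new BucketedSyllableDictionary();
        dictionary.put("apple", 2);
        dictionary.put("banana", 3);
        System.out.println(dictionary.lookup("apple"));
        System.out.println(dictionary.lookup("cherry"));
    }
}
```

Partitioning mainly pays off when each section is a list or file that would otherwise be scanned in full. A single `HashMap` over the whole dictionary already gives roughly constant-time lookups, so it would be worth benchmarking both before committing to the split.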

What are your thoughts on these possible additions? Is it possible to add different coloured lines for different checks?

Troy

PS. Here are some screenshots of my work:

dnaber · August 22, 2015, 2:29pm

Troy,

thanks for your message. It would be great to have your software or parts of it as contributions for LanguageTool. In the future, we could have more than the two colors we have now (one for spelling, one for everything else). What LT cannot do yet is complain about complete sentences. Highlighting the sentence for style problems could hide more important grammar errors in the sentence. I’m not sure yet how to solve that.

Regards
Daniel

troy · August 22, 2015, 9:02pm

Hi Daniel,

Cool, I’ll make some Java classes that can easily plug in to the program. I successfully packaged the GitHub project yesterday, so I should be able to test and tinker.

I do like the idea of having multiple colours for underlining. However, it would be a problem, as you said, if style underlines obfuscated grammar underlines. Maybe we could create a priority system that automatically gives grammar and spelling mistakes precedence. Style checks are a lower priority because it is up to the user as to which words to keep or get rid of. Style checking is different to grammar checking in that it’s about optimizing your writing rather than simply eliminating all mistakes.
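
A minimal sketch of such a priority system: keep every spelling and grammar match, and keep a style match only when it overlaps none of them. The `Match` and `Kind` types are hypothetical stand-ins invented for this example and are not LanguageTool’s own classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lets grammar/spelling underlines take precedence over style. */
class UnderlinePriority {
    enum Kind { SPELLING, GRAMMAR, STYLE }

    /** A highlighted character range; start is inclusive, end is exclusive. */
    static final class Match {
        final int start;
        final int end;
        final Kind kind;

        Match(int start, int end, Kind kind) {
            this.start = start;
            this.end = end;
            this.kind = kind;
        }

        boolean overlaps(Match other) {
            return start < other.end && other.start < end;
        }
    }

    /** Drops any style match that overlaps a spelling or grammar match. */
    static List<Match> resolve(List<Match> matches) {
        List<Match> kept = new ArrayList<>();
        for (Match match : matches) {
            if (match.kind != Kind.STYLE || !overlapsHigherPriority(match, matches)) {
                kept.add(match);
            }
        }
        return kept;
    }

    private static boolean overlapsHigherPriority(Match style, List<Match> all) {
        for (Match other : all) {
            if (other.kind != Kind.STYLE && style.overlaps(other)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(5, 10, Kind.GRAMMAR));
        matches.add(new Match(0, 40, Kind.STYLE));   // overlaps the grammar match
        matches.add(new Match(50, 60, Kind.STYLE));  // stands alone
        System.out.println(resolve(matches).size());
    }
}
```

Hiding the style match entirely is the simplest policy. A gentler variant could keep it and only suppress its underline until the grammar error is fixed.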

I was also thinking that I could merge the style checks in with the grammar checks. This may be ideal in the short term because it won’t require different underlining colours to be created. The sentence readability app would be the easiest to merge. I could make the program underline a sentence only when its readability score is under 30. This way it will detect flabby sentences whether the writing is academic or fiction. It will also ensure that the program is not overloaded by having to underline everything in the document (something that I noticed happens often with academic students). This would be useful for a writer because they could quickly identify a sentence that may need to be split in half or changed. I could also make it underline just the first word of a flabby sentence. This way there is less chance of the underline conflicting with a grammar/spelling underline.

The Writer’s Diet-like system is a bit more tricky because it underlines words everywhere. It’s more about minimizing problematic words than eliminating all of them. Yet I could merge this algorithm in with the grammar checker too. For example, I could underline problematic words in blue when two or more of them appear in the same sentence. Each factor of the Writer’s Diet has a different threshold. For instance, prepositions (green words) have a higher threshold than junk words (purple words). Take the junk word ‘that’: linguists often suggest minimal use of this word. Some say you should try not to use it more than twice in a paragraph. So I could detect whether words like this are used twice in a sentence and underline them if so. Problematic prepositions, on the other hand, could have a threshold of 4 or 5 in a sentence depending on their ratio.
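
The per-sentence threshold idea can be sketched as below. The junk-word threshold of 2 follows the description above; the class name, the tokenizer, and the small sample preposition list are assumptions for illustration only and are not taken from the Writer’s Diet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: flags a sentence when a word category reaches its threshold. */
class OveruseCheck {
    static final Set<String> JUNK_WORDS =
            new HashSet<>(Arrays.asList("it", "this", "that", "there"));
    // Illustrative sample only; a real preposition list would be much longer.
    static final Set<String> PREPOSITIONS =
            new HashSet<>(Arrays.asList("of", "in", "on", "at", "by", "for", "with", "to"));

    static final int JUNK_THRESHOLD = 2;
    static final int PREPOSITION_THRESHOLD = 4;

    /** Counts how many tokens of the sentence belong to the given word set. */
    static int countOccurrences(String sentence, Set<String> words) {
        int count = 0;
        // Split on anything that is not a lowercase letter or an apostrophe.
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (words.contains(token)) {
                count++;
            }
        }
        return count;
    }

    static boolean isOverused(String sentence, Set<String> words, int threshold) {
        return countOccurrences(sentence, words) >= threshold;
    }

    public static void main(String[] args) {
        String sentence = "I think that the book that you read is good.";
        Set<String> onlyThat = new HashSet<>(Arrays.asList("that"));
        System.out.println(isOverused(sentence, onlyThat, JUNK_THRESHOLD));
        System.out.println(isOverused(sentence, PREPOSITIONS, PREPOSITION_THRESHOLD));
    }
}
```

Passing a one-word set checks a single word such as ‘that’, while passing a whole category applies the threshold to the category as a group.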

Thus, both algorithms could be merged with the existing system and still be very useful for writers. These changes may actually be better in some ways because they don’t assume that the user knows what the annotations mean. A user may not know how to decipher the syntax highlighting of the Writer’s Diet-like system. So, it may be clearer if the program just underlines a few overused problematic words (e.g. using “that” twice in a sentence) rather than expecting the user to decipher the Writer’s Diet-like syntax.

I’m happy to experiment with all of these ideas and present my findings. I think I’ll work on this project a lot because I really want these style checks in LibreOffice for my own use.

Troy

dnaber · August 22, 2015, 9:50pm

Hi Troy,

great, I’m looking forward to your findings. I think in a first step the rules can just annotate in the way that makes most sense, e.g. sentence-level rules would annotate the whole sentence. It is then the responsibility of the user interface to interpret that and show it to the user in a sensible way. Note that we have several user interfaces, and for some of them we have almost no control over the way errors are shown: LibreOffice/OpenOffice integration (no control), LT stand-alone (full control, but difficult to make changes to), languagetool.org (like stand-alone).

We’ll need to invest quite some work in the user interface anyway, as LT’s focus currently is finding errors, and not so much improving a text. With your new rules this will change a bit.

Regards
Daniel

Mike_Unwalla · August 26, 2015, 9:15am

Hi Troy,

I like your enhancements to LT.

I was also thinking that I could merge the style checks in with the grammar checks. This may be ideal in the short term because it won’t require different underlining colours to be created.

As an alternative to underlines, possibly you could use highlight colours. Refer to Development Overview - LanguageTool Wiki and ‘errorColors’ in https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-standalone/CHANGES.txt.

Here is an example:

Say, with the junk word ‘that’, linguists often suggest minimal use of this word.

Your junk is my gold. Refer to these documents:
Global English Style Guide (http://support.sas.com/publishing/authors/kohl.html)
Simplified Technical English specification (ASD-STE100 Downloads)
Improving Translatability and Readability with Syntactic Cues (https://www.oasis-open.org/committees/download.php/35862/kohl1999.pdf)

Best regards,

Mike Unwalla

troy · August 26, 2015, 10:12am

That looks great, they would be perfect.

I will take a look at your documents.

Also, here is a playlist of a Stanford writing course. The content is mostly about style:
https://www.youtube.com/playlist?list=PLUk4uy2jPpXVGXqVhgs352q6jOdI608Qg

There is nothing wrong with the word ‘that’, but it can be overused. This is where a style checker will differ from a grammar/spelling checker: it’s not about eliminating words but rather about reducing problematic ratios of certain words. The word ‘that’ often creates a sub-layer in a sentence that makes a sentence just a little bit more difficult to read.

Helen Sword, New Zealand linguist and author of the Writer’s Diet, writes a lot about junk words (it, this, that, there). In her book she writes:

“And what’s wrong with that? When used as a determiner (‘that girl’, ‘that hat’), nothing at all. However, in its grammatical function as a relative pronoun, ‘that’ often encourages writers to overload their sentences with subordinate clauses, driving nouns and verbs apart in the process.” (Stylish Academic Writing, pg. 58).

Helen gives an example of a problematic passage:

“In a series of important papers, John Broome has argued that the only sense of ‘should’ at work here is the one that we use in saying what there is most reason, or decisive reason, to do and that the apparent contradiction in the example is removed when we make appropriate distinctions of scope [Philosophy]”

Can you see how the word ‘that’ disrupts the style of this text? This is what Helen tries to quantify and remove with her Writer’s Diet. The word ‘that’, along with the words ‘it’, ‘this’, and ‘there’, should not be removed. Instead, they should be reduced.

Troy

dnaber · September 20, 2015, 9:37am

Hi Troy, are you making progress with this? Can we help somehow?

Regards
Daniel

troy · September 20, 2015, 8:26pm

I will create a basic Java class for both the WritersDiet and Lawrence Apps and post them later today. All I have to do is clean up the methods, add comments and clean up the process of loading the syllables. I will also include testing classes to illustrate how the objects can be used.

I will try to make the code as neat as possible, so that you can work on it as well. The algorithms are really simple, so you should be able to add stuff to them and adapt them for whatever is needed.

(I’ve been a bit distracted creating a writing buffer to speed up syntax highlighting. So far, I’ve created a way that analyses one word at a time based on cursor position. I want to create a writing buffer that is fast with either 10 words or 10,000 words).

Be back soon,

Troy

troy · September 22, 2015, 12:03pm

Okay, I’ve created a basic Lawrence class and some examples of how to use it.

I will put the code here, but it needs the syllable dictionary to work. I will also create a github page for it and deploy a JAR file for easy re-use.

Here's the UML:

Here's the GitHub page containing the code:

https://github.com/troywatson/Lawrence-Style-Checker

Cheers,

Troy

(PS, I’ll work on the WritersDiet class now).

dnaber · October 4, 2015, 9:42am

Are you planning to add a web demo so one can try the algorithm? By the way, in case we want to include this into LanguageTool, the license would need to be changed, as we cannot use GPL code.

troy · October 8, 2015, 9:10am

I can figure out how to make a web demo. It should be easy with Java. I’ve released JARs on the GitHub page, but I need to release a new one, because I’ve just fixed a few bugs that were lowering the accuracy of the syllable counter.

I’m okay with using a more compatible license. I could use a custom-made Creative Commons license. What requirements does the LanguageTool project have (e.g. rights to edit code, sell products made from code, etc.)? I’m pretty much happy for anyone to use, change and deploy my code for whatever purpose. But I would like my name on the code so people can track where it came from. That’s pretty much the only thing I want in my licenses.

btw, I’ve got to clean up my Writer’s Diet object and release it on GitHub soon. I’ll try to get onto that asap.

dnaber · October 8, 2015, 9:31am

For its dependencies, LT needs a license that’s compatible with its own license (LGPL), e.g. LGPL, Apache License, or MIT.

troy · October 8, 2015, 10:16am

That should be fine, I’ll switch to one of those licenses.

dnaber · November 27, 2015, 8:03am

Hi Troy, did you make progress with the web demo?

Regards
Daniel

troy · November 29, 2015, 8:30pm

Hey Daniel,

sorry for the delay. I just finished off semester 2 at uni and had a few last assignments/exams to finish up.

I had a look at this yesterday, and was trying to figure out a way to upload an applet somewhere. I thought I could upload the JAR to my Google Drive and run it from my Blogger site. I might follow the way this person did it:
http://www.dreamincode.net/forums/topic/209148-embed-java-applet-in-blogger-blog-posts/

This would be great if it works.

Troy

dnaber · November 29, 2015, 10:27pm

Java applets have some issues in general: most people today have disabled Java inside the browser due to security issues. So a real web-based solution without the need for Java would be better.

troy · December 18, 2015, 5:13pm

I uploaded a workable Writer’s Diet library in Java. I’m going to call it the Style App to distinguish it from the Writer’s Diet program. Here’s the link to the GitHub page:

https://github.com/troywatson/StyleApp-Style-Checker-Java-Write-like-a-Pro

I’m now going to work on making a web example of both the LawrenceApp and the StyleApp.

Troy