Hi there,
I would like to contribute all of the stuff I’ve been working on. I like this project and would like to see some more style features added to the program.
I have reverse engineered the algorithms behind the Writer’s Diet and the Hemingway App, both great programs by the way, and written my own versions of them in Java, Python, Android (APK) and Emacs Lisp. The Writer’s Diet tests for flabby writing and highlights words that could be trimmed down. The Hemingway App underlines sentences that are too flabby. Both tools are very useful for writers, whether they are fiction writers, academic writers or just someone writing an email.
The Writer’s Diet tests for overuse of ‘problematic words’. For example, the word ‘industrialization’ may be better written as ‘industrialize’. One factor in the test looks for words like this (nominalized verbs) by using a regex to find words that end in -ion.
http://www.writersdiet.com/WT.php
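Here is a minimal Python sketch of the -ion check described above. The pattern and the function name `find_ion_words` are my own illustration, not the Writer’s Diet’s actual code, and I am assuming that the plural -ions should be caught as well.

```python
import re

# Assumption: only "-ion" and its plural "-ions" are checked here.
NOMINALIZATION = re.compile(r"\b\w+ions?\b", re.IGNORECASE)

def find_ion_words(text):
    """Return (start, end, word) for every word ending in -ion or -ions."""
    return [(m.start(), m.end(), m.group()) for m in NOMINALIZATION.finditer(text)]

if __name__ == "__main__":
    for start, end, word in find_ion_words("The industrialization of the region was quick."):
        print(start, end, word)
```

A pattern this simple also flags words that are not nominalized verbs at all (‘region’, ‘lion’, ‘million’), so a real check would probably need an exception list or a minimum word length.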
The Hemingway App uses another interesting algorithm. It most likely uses the Flesch–Kincaid readability formula, but applies it to individual sentences. There are many readability algorithms on the net, but it was quite ingenious to apply one to sentences instead of whole documents. I made my own version of this program and tested it with students. I found that for fiction, a readability score should be above 60. For academic writers (even at PhD level), however, getting the readability score above 40 is a good effort. This is because academic writers have to use longer words than fiction writers do.
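To make the per-sentence idea concrete, here is a rough Python sketch. I am assuming the score in question is the Flesch Reading Ease variant (higher means easier), since the 60 and 40 thresholds fit that scale. The sentence splitter and the syllable counter are crude stand-ins, and all function names are hypothetical.

```python
import re

def naive_syllables(word):
    # Rough stand-in: count runs of consecutive vowels, with a minimum of 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def sentence_reading_ease(sentence):
    """Flesch Reading Ease for one sentence (sentence count fixed at 1)."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)
    if not words:
        return None
    syllables = sum(naive_syllables(w) for w in words)
    # 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)
    return 206.835 - 1.015 * len(words) - 84.6 * (syllables / len(words))

def score_sentences(text):
    """Split on ., ! or ? followed by whitespace and score each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(s, sentence_reading_ease(s)) for s in sentences]

if __name__ == "__main__":
    for sentence, score in score_sentences("The cat sat. It ran!"):
        print(round(score, 1), sentence)
```

Because a single sentence is scored, very short sentences can go above 100 and long, polysyllabic ones can go below 0, so the thresholds matter more than the absolute number.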
Further on the readability scores, I have developed a better way to calculate them. The algorithm needs ‘syllable counts’ of words to calculate the score. I noticed early on in my experiments that many readability tests vary in their results; in other words, they lack reliability. After a bit of digging, I found out that this happens because there isn’t an accurate way to count syllables. Some algorithms just estimate the syllable count by counting the number of vowels in a word. This doesn’t work very well because many English words end in a vowel (amongst other problems, like vowel blends). I fixed this problem by finding a pronunciation dictionary and counting the vowels in each word’s ‘pronunciation phrase’. This worked because pronunciation phrases don’t use vowel blends or end with blends. I then built a dictionary of syllable counts from this, which I use to look up syllables. I fall back on the ‘count vowels method’ when a word is not in my syllable dictionary. There will always be errors, but at least I eliminated a fair chunk of them. I can quantify the effectiveness of my new method by benchmarking the different methods of counting syllables.
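A small Python sketch of the dictionary-plus-fallback approach described above. The three entries are a hypothetical excerpt written in a CMU-dictionary-style notation, where every vowel sound ends in a stress digit; that notation is my assumption, as is every name in the code.

```python
import re

# Hypothetical excerpt in CMU-dictionary-style notation (an assumption):
# each vowel sound carries a stress digit (0, 1 or 2), consonants do not.
PRONUNCIATIONS = {
    "table": "T EY1 B AH0 L",
    "make": "M EY1 K",
    "beautiful": "B Y UW1 T AH0 F AH0 L",
}

def syllables_from_pronunciation(phones):
    """Count the vowel sounds, i.e. the symbols that end in a stress digit."""
    return sum(1 for p in phones.split() if p[-1].isdigit())

# Precomputed syllable dictionary built from the pronunciations.
SYLLABLES = {w: syllables_from_pronunciation(p) for w, p in PRONUNCIATIONS.items()}

def count_vowel_groups(word):
    """Fallback: count runs of written vowels, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    """Use the dictionary when possible, otherwise fall back to vowel counting."""
    return SYLLABLES.get(word.lower(), count_vowel_groups(word))

if __name__ == "__main__":
    for w in ("make", "table", "beautiful", "cat"):
        print(w, count_syllables(w), count_vowel_groups(w))
```

The word ‘make’ shows the difference: the written-vowel fallback counts two groups, while the pronunciation gives one syllable.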
I program using object-oriented principles, so I can create one separate Java class for finding flabby sentences and another for finding flabby words. These could be attached to the LanguageTool library with ease (especially since LanguageTool is so well designed). However, I do think it would be good if the program could use different colours for the underlining, especially for identifying flabby words. The flabby words algorithm identifies five different types of flabby words, and it helps a writer to see all of them at once when fixing these problems. As for finding flabby sentences, I would prefer to have at least two different coloured lines: red for very flabby sentences and maybe orange for ‘just a bit flabby’ sentences. We could even use orange underlines for sentences with a readability score under 60, and red for sentences under 40. This way the system is useful for both fiction writers and academic writers. (I noticed that a lot of students felt downhearted when their entire document was coloured red, which happened when red was applied to anything under 60.)
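The two-colour scheme could be as simple as the following Python sketch. The function name and parameters are hypothetical, and this does not touch any real LanguageTool API; it only shows the threshold logic.

```python
def underline_colour(score, orange_below=60.0, red_below=40.0):
    """Map a sentence readability score to an underline colour, or None."""
    if score < red_below:
        return "red"
    if score < orange_below:
        return "orange"
    return None  # readable enough: no underline

if __name__ == "__main__":
    for score in (25.0, 40.0, 59.9, 72.3):
        print(score, underline_colour(score))
```

Keeping the thresholds as parameters would let a fiction writer and an academic writer use different settings.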
As for future projects, I’m currently working on creating a cliche database, so that the grammar checker can identify them. It’s okay to use cliches now and then, but it would be useful to know when you use them too much.
Over the last two days, I’ve been working on a paragraph checking program. The program checks whether the writer has broken paragraphs in the correct places. I’m using factor analysis to create this script. Basically, it will tell you whether you should split a paragraph or join two paragraphs together. I started designing this class because a PhD supervisor suggested it would help him mark PhD submissions. He said that he often has to tell students to restructure their paragraphs. Apart from this, I’ve also been working on some buffer objects to speed up processing and to improve the big-O complexity of my algorithms. I’ve been introducing threading to my programs to increase speed as well, and making some structural changes to make them faster. For example, I reduced my syllable dictionary from 100,000 entries to 50,000 entries because half of it could be calculated with a basic vowel counting method. I will speed this process up further by splitting the dictionary up by letter, so that when I need to retrieve a word, I only need to look in the section for its first letter.
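Here is a rough Python sketch of the two dictionary changes mentioned above: dropping entries that the vowel-count fallback already gets right, and splitting what remains by first letter. All names and the sample data are hypothetical. Note that in a language with built-in hash maps, whether the first-letter split actually beats a single map would need measuring.

```python
import re
from collections import defaultdict

def count_vowel_groups(word):
    """Fallback: count runs of written vowels, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def prune(syllable_dict):
    """Keep only words whose stored count differs from the fallback's answer."""
    return {w: n for w, n in syllable_dict.items() if count_vowel_groups(w) != n}

def bucket_by_first_letter(syllable_dict):
    """Group entries into one small dictionary per first letter.

    Assumes every key is a non-empty lowercase word.
    """
    buckets = defaultdict(dict)
    for word, n in syllable_dict.items():
        buckets[word[0]][word] = n
    return buckets

def lookup(buckets, word):
    """Look in the bucket for the word's first letter, else use the fallback."""
    word = word.lower()
    if not word:
        return 0
    return buckets.get(word[0], {}).get(word, count_vowel_groups(word))

if __name__ == "__main__":
    full = {"cat": 1, "make": 1, "table": 2, "banana": 3}  # made-up sample
    buckets = bucket_by_first_letter(prune(full))
    print(dict(buckets))
    print(lookup(buckets, "make"), lookup(buckets, "banana"))
```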
What are your thoughts on these possible additions? Is it possible to add different coloured lines for different checks?
Troy
PS. Here are some screenshots of my work: