
Use a modern framework for embedded HTTP server discussion


(Oleg) #1

Continuing the discussion from Spellchecker improvement discussion.

Are the requests and their execution time collected somewhere (in server logs maybe)?
That could help to reproduce the peak load moments to explore the behavior in detail.
There’s also an idea: learn to predict the request execution time (hello, ML) and then optimize the order in which requests are executed.
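
To make the ordering idea a bit more concrete, here is a minimal sketch: fit a straight line from logged input length to execution time, then serve queued requests with the shortest predicted time first. The linear model, the sample log values, and the names `fit_line` and `order_by_predicted_time` are assumptions made purely for illustration; nothing here comes from LT's actual logs or code.

```python
from statistics import mean

def fit_line(lengths, times):
    """Least-squares fit of time = slope * length + intercept."""
    mx, my = mean(lengths), mean(times)
    var = sum((x - mx) ** 2 for x in lengths)
    if var == 0:
        # All logged inputs have the same length: fall back to the mean time.
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(lengths, times)) / var
    return slope, my - slope * mx

def order_by_predicted_time(pending_lengths, slope, intercept):
    """Return indices of pending requests, shortest predicted time first."""
    predicted = [slope * n + intercept for n in pending_lengths]
    return sorted(range(len(pending_lengths)), key=lambda i: predicted[i])

# Hypothetical log sample: input length (characters) and execution time (seconds).
log_lengths = [100, 200, 400, 800]
log_times = [0.02, 0.04, 0.08, 0.16]

slope, intercept = fit_line(log_lengths, log_times)
# Three queued requests with 500, 50 and 300 characters of input.
order = order_by_predicted_time([500, 50, 300], slope, intercept)
```

Shortest-first ordering can starve long requests under sustained load, so a real queue would probably also need an age limit or similar safeguard.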


(Daniel Naber) #2

Yes, input length, language, and execution time are logged.

Maybe. I’m not sure it would help; we don’t have hundreds of requests per second per machine.


(Oleg) #3

So there are several machines, each one receives fewer than a hundred requests per second, and queries are still being rejected sometimes?


(Oleg) #4

I think dealing with overload is a concern only on languagetool.org’s server side and is not needed by consumers who download the LT-server for local use. I suggest removing the request queuing from the languagetool-server code and implementing it separately, only in LT’s deployment, via a proxy tool (nginx, haproxy, or some smart hand-coded thing). The proxy could also deal with http/https handling, ddos, balancing between nodes, etc.

That would remove most of the low-level code from languagetool-server and keep only the API part, if we use springrest or springboot.

What do you think about that approach?


(Daniel Naber) #5

No, they are slow sometimes, e.g. 1-2 seconds when <0.1 seconds would be okay. There are several machines, but this also happens on a single machine.


(Daniel Naber) #6

We don’t have access to the proxy / load balancer, it’s a standard load balancer running in the cloud.


(Oleg) #7

And are the instances running the LT-server accessible? We could do message queuing via a separate tool at the instance level.


(Daniel Naber) #8

I’m not sure what you mean by “accessible” - the servers are accessible via ssh.


(Oleg) #9

That’s exactly what I meant. I’m going to load-test the original LT-server implementation along with the sparkjava and springboot proxy-driven implementations.
Now I’m not sure whether I need the data containing errors. Maybe I’ll generate some errors artificially, but either way I’m interested in looking at the <original sentence, corrected sentence> pairs collected by LT. Is it possible for LT to share that data?


(Daniel Naber) #10

Even though the users allow us to store their data, it might still contain personal information, so I cannot share it (or only examples which I have checked).


(Oleg) #11

So I’ll proceed with generating the errors artificially.


(Andriy) #12

We could also look into a reactive approach/frameworks; it has been getting very popular lately with the move toward serverless. I believe the Spring Framework has some support for this too.


(Oleg) #13

Sounds reasonable to me, I’ll give it a try. Thanks for the suggestion :slight_smile:


(Daniel Naber) #14

It turns out that the “peak” situations are less bad than I thought; it was at least partially a problem with the measurement. I used curl with the https URL, but the time this command takes includes the HTTPS setup time. A simple workaround is to use curl url1 url2, with url1 and url2 being on the same server. This way, the SSL overhead only occurs once (i.e., one needs to ignore the time for the first URL).
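
The same measurement idea can be sketched in a script: send several requests over one connection and discard the first timing, since that one includes the connection and TLS setup. This is only an illustration using Python's standard `http.client`; the function names, host, and paths are hypothetical, and it assumes the server keeps the connection alive between requests.

```python
import http.client
import time

def timed_requests(conn, paths):
    """Send GET requests over one connection; return elapsed seconds per request."""
    timings = []
    for path in paths:
        start = time.perf_counter()
        conn.request("GET", path)
        # Read the whole body so the connection can be reused for the next request.
        conn.getresponse().read()
        timings.append(time.perf_counter() - start)
    return timings

def warm_timings(timings):
    """Drop the first measurement, which includes connection and TLS setup."""
    return timings[1:]

# Usage (hypothetical host and paths, not executed here):
# conn = http.client.HTTPSConnection("example.org")
# print(warm_timings(timed_requests(conn, ["/first", "/second", "/third"])))
```

`HTTPSConnection` opens the socket lazily on the first `request()` call, which is why the setup cost lands in the first timing only.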

Anyway, I still consider switching to a lightweight framework a goal.


(Oleg) #15

I’m planning to compare the framework implementations using plain http calls, because I don’t think https support is the bottleneck.
I’m now finishing my JMeter test plan and will then compare SpringBoot and SparkJava performance under http load (the reactive approach mentioned by @arysin is in development); the test plan could easily be extended to support both http and https.
To simplify and clarify the tests, I use the default server configs, and the LT calls are unauthorized. Is it OK to load-test with these simplifications, or would you suggest some specific config settings?


(Daniel Naber) #16

I think it’s okay for now. How large is the input? This has a large influence on response time. I think the input length should be random, with a distribution we can find out from our log files.
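
As a rough sketch of what "random length with a distribution from the logs" could look like: draw lengths from the logged values themselves (an empirical distribution) and cut test inputs of those lengths from some corpus text. The logged lengths, the corpus sentence, and the function names below are made up for illustration; real values would come from the log files.

```python
import random

def sample_lengths(logged_lengths, n, seed=None):
    """Draw n input lengths from the empirical distribution of logged lengths."""
    rng = random.Random(seed)  # a seed makes the test run repeatable
    return [rng.choice(logged_lengths) for _ in range(n)]

def make_input(corpus, length):
    """Cut a test input of the requested length from a non-empty corpus text,
    repeating the corpus if it is shorter than the requested length."""
    repeats = length // len(corpus) + 1
    return (corpus * repeats)[:length]

# Hypothetical lengths (in characters) as they might appear in a log file.
logged = [20, 20, 35, 120, 800]
lengths = sample_lengths(logged, 10, seed=1)
inputs = [make_input("This is an example sentence. ", n) for n in lengths]
```

Because values that occur more often in the log are drawn more often, the generated load follows the logged length distribution without having to fit a parametric model.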


(Oleg) #17

I think it’s convenient to take the input from an input file or a set of files; that’s the easiest way to control the input size and other params.


(Oleg) #18

BTW, weren’t you planning to replace maven with gradle? The latter is more human-readable, etc.


(Daniel Naber) #19

Yes, but I got stuck because of the complexity (LibreOffice add-on, stand-alone, command-line, …). If you want to work on that, it might be a nice part of a GSoC project.


(Oleg) #20

I see the maven-site-plugin in the plugins list; is it used anywhere in the project? I can’t find any reference to it in either the utility .sh scripts or the documentation.