Thursday, November 21, 2019

ECS, CircleCI, NLB and gRPC

I finished a migration from Heroku to ECS using CircleCI orbs for AWS.
I detail the aspects of the migration and the peculiarities of the project.



I knew of CircleCI back in 2016. I just knew that it was a more accessible tool to configure than Jenkins, and the project I was working on already had it.
Back then, I did not pay much attention to it since my main task was developing.
Cut to mid-2019: the company I'm working at switched from Jenkins to CircleCI. They did not want to spend resources, time, or anything. They want to "have it working without spending time fixing it."
They moved to CircleCI, and everything works without the problems we had with Jenkins.
This new project I was working on was deploying to Heroku with a bash script.
We opted to use ECS, and we found that CircleCI had orb support for it.
The documentation for the ECS orb is not clear at all unless you spend time learning the terminology of CircleCI orbs.
There is no clear differentiation between the job and the workflow, and I spent days confused.



I already talked about ECS; three years later, and I'm back with it. Things are pretty much the same as before. We now have Fargate, which I have not used, but the documentation on crucial aspects like blue/green deployment is tailored towards it, which sucks, but it is what it is.
Things are the same: you create a cluster, you create services, and you place tasks on them.
The new things I'm using this year are:
  1. Blue/green deployment (soon).
  2. NLB for gRPC (more on it in the gRPC section).
What I do have to recommend: if you are planning on expensive IO operations and "t2.small" is the best machine you can get, then stay away from ECS. You are going to have downtime because the IO consumes the CPU credits fast.
We've got a situation with a monolithic application that we won't spend time refactoring. Still, the section of the code that works with PDFs does a lot of IO; some parts abuse IO because of the library they use, and some parts abuse IO in the monolithic application itself.



Again with gRPC. In the same situation, I learned a bit more about how to debug the server internally and how to force channel closure.
The "options" parameter for the server and for the channel on the client receives a list of tuples. It lacks documentation in Python, and you need to dig deep into the source to figure out how things work.
Options you can use are also missing from the Python docs, so I ended up reading Go examples that gave me an idea.
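For the record, this is what the tuple format looks like on the client side. The option names below are standard gRPC core channel-argument keys; the values are just examples of mine:

```python
import grpc

# Channel arguments are passed as a list of (name, value) tuples.
# The names are gRPC core channel-arg keys, barely surfaced in the Python docs.
options = [
    ("grpc.keepalive_time_ms", 30_000),      # ping the server to detect dead connections
    ("grpc.keepalive_timeout_ms", 10_000),   # how long to wait for the ping ack
    ("grpc.max_receive_message_length", 16 * 1024 * 1024),  # raise the 4 MB default
]

channel = grpc.insecure_channel("localhost:50051", options=options)
# ... create stubs and make calls here ...
channel.close()  # force the channel closed instead of relying on garbage collection
```

Calling channel.close() explicitly is the part I was missing; letting the garbage collector tear the channel down made the closure timing unpredictable.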



I'm writing this document on 11/21/19; NLB does indeed work with the ECS EC2 deployment type without any problem.

The crux of the problem


I am using ECS (EC2) with a network load balancer to get the gRPC server working.
We are using an internal NLB because we are not exposing gRPC to the outside world; this is just for our microservices.
The main problem is that the NLB does not balance across the machines behind it. Once the NLB opens a connection, it keeps reusing it, no matter whether you close the channel on the client.
That cascades into the problem that if I create an auto-scaling group because the serving EC2 instance is degrading, I have no way to force rebalancing without downtime, which indeed sucks.
In theory, gRPC also offers load balancing at the client level, but since I'm deploying on ECS, I don't have a good way to pin the server's IP, since the IP will change after the first deploy.
I don't know; perhaps I could opt for a strategy like placing an Elastic IP and finding a way to articulate this in ECS?
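One mitigation I want to try — an assumption on my part, not something I've verified against the NLB yet — is setting a maximum connection age on the server, so long-lived connections get recycled and clients reconnect, giving the NLB a chance to pick a different target. A minimal Python sketch:

```python
from concurrent import futures
import grpc

# Assumption: forcing the server to close connections after a maximum age
# makes clients reconnect through the NLB, which then gets a chance to
# route the new connection to a different target.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=[
        ("grpc.max_connection_age_ms", 60_000),        # recycle connections every minute
        ("grpc.max_connection_age_grace_ms", 10_000),  # let in-flight RPCs finish first
    ],
)
port = server.add_insecure_port("[::]:0")  # port 0: let the OS pick one for this sketch
server.start()
server.stop(grace=None)
```

In production the server would listen on a fixed container port behind the internal NLB; whether this actually makes the NLB redistribute load is exactly what I need to test.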

I will write my findings if I ever find a solution to this problem.

Sources and links used during this research.

Wednesday, November 6, 2019

Reliable queue

During research for my current project, I found an exciting topic.
Instead of going with the usual solutions such as Celery, RQ, RabbitMQ, or ZeroMQ, I discovered a post about reliable queues using just redis.

The solution itself is quite straightforward.
You push incoming tasks to a redis list with [L|R]PUSH, then [B]RPOPLPUSH to retrieve each task and atomically put it in a processing list.
N workers consume that list, and you obtain your desired result.
The steps of the worker are simple. It has two steps:
set a lock (SETEX) and run the task.
There is a monitor that evaluates the locks, and if a task does not finish, it pushes the task again to be rerun.
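To make the moving parts concrete, here is a minimal in-process simulation of the pattern. Plain Python structures stand in for redis (a deque for the list, a dict of expiry timestamps for the SETEX locks), so the names and TTLs are illustrative only:

```python
import time
from collections import deque

pending = deque()   # stands in for the redis list fed by LPUSH
processing = []     # stands in for the processing list (RPOPLPUSH target)
locks = {}          # task_id -> expiry timestamp, stands in for SETEX keys

def push(task_id):
    pending.appendleft(task_id)  # LPUSH

def worker_step(lock_ttl=5.0):
    """Pop a task, lock it, run it. Mirrors [B]RPOPLPUSH + SETEX."""
    if not pending:
        return None
    task_id = pending.pop()                  # the RPOP side of RPOPLPUSH
    processing.append(task_id)               # ...and the LPUSH side
    locks[task_id] = time.time() + lock_ttl  # SETEX lock with a TTL
    # ... run the actual task here ...
    processing.remove(task_id)               # LREM on success
    del locks[task_id]
    return task_id

def monitor_step():
    """Re-queue tasks whose lock expired (worker crashed or stalled)."""
    now = time.time()
    for task_id in list(processing):
        if locks.get(task_id, 0) < now:
            processing.remove(task_id)
            locks.pop(task_id, None)
            push(task_id)  # rerun it
```

The real thing swaps these structures for redis-py calls, which is what makes the pop-and-move atomic across N workers.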

This pattern provides several good points:
  • I can scale redis.
  • I can put up to n workers to consume.
  • The monitor can take care of tasks that do not complete or that expire.
  • Messages can indicate how long they should take.
  • If a worker crashes, the monitor can put the task in the queue again.
  • Tasks are in memory.
The bad points I foresee:
  • If the monitor crashes and does not catch up with redis, it is a potential disaster.
  • If redis crashes, everything is lost. I need to work on this point.

Besides those points, there is a fundamental constraint.
I'm running again on ECS, and the largest instances my client is allowing me to use are t2.small instances.
At one point, I was allowed to add a1.xlarge or a1.large instances, but the cost of adapting to ARM exceeds the time I can spend on this.

Over the sprint, this concept got pushed to another sprint, but I'm spending my free time writing an implementation of this excellent solution.

Perhaps I'm talking ahead of time, but I'm excited to learn something new.

I'm writing this proof of concept here. Perhaps I'm overly optimistic and do not see the future problems, such as how redis scaling works, how to synchronize the tasks well, or how to weigh I/O against execution time, which is an important part of keeping things oiled.

The other fundamental part of my research involves gRPC. The workers use a gRPC client to place a message, so execution time seems trivial: we receive all the data and call the client.

To create random blocking time, I went with the classic Fibonacci calculation. During this implementation, I found the closed-form formula for Fibonacci (Binet's formula); it caught my attention, and I got a bit sidetracked in my implementation.
I never had much capacity for maths, I recognize that I'm not good at it, but the closed formula for Fibonacci caught my attention; as an alternative to recursion it is fantastic, but I'm noticing that if we calculate the Fibonacci of 100, the result from the closed formula differs.
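The divergence makes sense once you look at the arithmetic: the closed formula runs in floating point, and a double only carries about 15-16 significant digits, while fib(100) has 21. A quick sketch comparing the two (the function names are mine):

```python
import math

def fib_iter(n: int) -> int:
    """Exact Fibonacci via iteration (Python ints are arbitrary precision)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_binet(n: int) -> int:
    """Binet's closed formula: fib(n) = round(phi**n / sqrt(5))."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))

# Small n: both agree, e.g. fib(30) = 832040 either way.
# Large n: fib_iter(100) = 354224848179261915075 (21 digits),
# which a 64-bit float cannot represent, so fib_binet(100) drifts off.
```

So the closed formula is a neat shortcut for small n, but the iterative version is the one to trust once the numbers outgrow a double.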

Either way, this is my weekend project, among others.
I doubt that anyone is reading this, but well, if this works, this is a log of operations.