Saturday, August 13, 2016

Microservices

Intro

And we are back.

After 10 years, I'm still working in software development.


What I'm working on.

Well, after what I believe was a failed contract, my company decided that I should be researching microservices, AWS (ECS, EC2, SES, SQS, SNS) and Elastic (soon to be replaced by Mongo).

What is each acronym that I'm using there?

AWS: Amazon Web Services. Yeah, I'm still on AWS. I had been using it at my previous employer (a company in Argentina that had a contract with a client in the States), where I had some sort of devops / developer position.

ECS: This is the Amazon service that lets you run Docker containers on a cluster of EC2 instances.

Before giving you my explanation: without this article by Yevgeniy Brikman, I could never have set up my Docker tasks in ECS.

Give that link a try first; it is the best explanation I found, way better than the Amazon documentation.

The main problem that I see with the Amazon guide is that it isn't an ordered tutorial.
You have all the documentation, but it doesn't follow a proper order.

The first thing you will need if you are going to use ECS is an EC2 instance of the type that is optimized for it, for example amzn-ami-2016.03.d-amazon-ecs-optimized.
Amazon does cover the container instances in the documentation, but not as the first thing:
they start with creating private Docker registries, which you won't need until much later on.
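
Once you have an ECS-optimized instance registered in a cluster, the basic flow is: register a task definition, then run it. Here is a rough boto3 sketch of those two calls; the cluster name, family, image and ports are made up for illustration, not from a real setup.

import boto3

ecs = boto3.client('ecs')

# Describe the container(s) that make up the task
ecs.register_task_definition(
    family='users-service',
    containerDefinitions=[{
        'name': 'users-service',
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/users-service:latest',
        'cpu': 256,
        'memory': 512,
        'portMappings': [{'containerPort': 5000, 'hostPort': 5000}],
        'essential': True,
    }])

# Run one instance of the latest revision of that task on the cluster
ecs.run_task(cluster='microservices', taskDefinition='users-service', count=1)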

SES: Amazon Simple Email Service, used to send emails.
SQS: Amazon Simple Queue Service. A simple queue, self-explanatory.
SNS: Amazon Simple Notification Service. This is used to deliver notifications (publish / subscribe).

Together, SQS + SNS are roughly like RabbitMQ, except that Amazon does all the heavy lifting for you. It does have some peculiarities, though. Because of the high replication, and depending on how you configure your SQS queue, one or more workers may receive the same message published by SNS more than once. The documentation states this: your workers must handle the case where a message was already processed (i.e., be idempotent). You can also control how long a consumed message stays hidden from other consumers through a configurable visibility timeout.
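
As a minimal sketch of that flow with boto3 (the topic ARN, queue name and the in-memory dedup set below are placeholders, not from a real setup):

import json
import boto3

sns = boto3.client('sns')
sqs = boto3.client('sqs')

# Publisher side: push an event to a topic
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:user-events',
    Message=json.dumps({'event': 'user_registered', 'user_id': 42}))

# Worker side: poll the queue that is subscribed to that topic
queue_url = sqs.get_queue_url(QueueName='user-events-worker')['QueueUrl']
processed = set()  # in real life, a persistent store shared by the workers

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,      # long polling
        VisibilityTimeout=30)    # seconds the message stays hidden from other consumers
    for msg in resp.get('Messages', []):
        if msg['MessageId'] not in processed:  # guard against duplicate delivery
            processed.add(msg['MessageId'])
            envelope = json.loads(msg['Body'])  # SNS envelope; the payload is in envelope['Message']
            # ... do the actual work here ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])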

The other highlight is going to be SES.
SES is like Mandrill (now part of MailChimp). The main thing is that when you start using the service you are in sandbox mode: you need to verify an email address or domain, and you will only be able to send emails to verified recipients.
You need to file a request with Amazon so they allow you to deliver emails outside the sandbox.
The most important part here is that they ask you how you handle bounces and complaints.
You can create an SNS topic that will receive the bounces / complaints / deliveries, and hook it up to an SQS queue that a consumer reads from.
All that, by clicking with the mouse.
It seems pretty straightforward, but they don't spell it out on the page.
The main idea is that if you receive a complaint or a bounce, you take that address out of circulation (i.e., stop trying to deliver emails to that recipient).
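
A rough sketch of that consumer; the queue name and suppress_recipient are hypothetical, and the parsing assumes the notification JSON follows the documented SES format (notificationType plus a bounce or complaint block):

import json
import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='ses-bounces-complaints')['QueueUrl']

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get('Messages', []):
    envelope = json.loads(msg['Body'])               # SNS envelope
    notification = json.loads(envelope['Message'])   # SES notification
    if notification['notificationType'] == 'Bounce':
        addresses = [r['emailAddress'] for r in notification['bounce']['bouncedRecipients']]
    elif notification['notificationType'] == 'Complaint':
        addresses = [r['emailAddress'] for r in notification['complaint']['complainedRecipients']]
    else:
        addresses = []
    for address in addresses:
        suppress_recipient(address)  # hypothetical: mark the address so we never email it again
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])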

With which language?

Python 3.5.1 (yeah, I decided to use something newer)
I'm using uwsgi / flask, with the following extensions (used across some of the microservices)

  • flask-jwt
  • alembic
  • flask-sqlalchemy
  • greenlet
  • boto3
  • flasgger
  • swagger-ui
Flask-JWT: an extension to create JWT tokens, which in my case are consumed by a JS client.
The extension has some gotchas; you need to read the code if you want to change things.
With some monkey-patching and inheritance you can alter the JWT functionality completely, but you need to spend some time reading.
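
For reference, a minimal Flask-JWT setup looks roughly like this (the in-memory user store and the secret are placeholders):

from flask import Flask
from flask_jwt import JWT, jwt_required, current_identity

app = Flask(__name__)
app.config['SECRET_KEY'] = 'change-me'

users = {'john': {'id': 1, 'password': 'pass'}}  # placeholder user store

class User(object):
    def __init__(self, id):
        self.id = id

def authenticate(username, password):
    user = users.get(username)
    if user and user['password'] == password:
        return User(user['id'])

def identity(payload):
    return payload['identity']  # the user id that was stored in the token

jwt = JWT(app, authenticate, identity)  # exposes POST /auth out of the box

@app.route('/protected')
@jwt_required()
def protected():
    return 'hello %s' % current_identity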

flasgger: I love flasgger. I was using flask-swagger, but to be honest, the YAML docstrings under the function names make the whole thing illegible. Flasgger lets you point to a separate file instead. If you don't know what this is: it is used to write the documentation that Swagger-UI renders.
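
A small sketch of what I mean by pointing to a file (the YAML path and the endpoint are just examples):

from flask import Flask, jsonify
from flasgger import Swagger, swag_from

app = Flask(__name__)
Swagger(app)

@app.route('/users/<int:user_id>')
@swag_from('docs/get_user.yml')  # the swagger spec lives in a YAML file, not in the docstring
def get_user(user_id):
    return jsonify({'id': user_id})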

Swagger-UI: the only complex thing I needed to do here was to add support for the JWT token. Since some endpoints are protected by JWT tokens, I didn't have a way to push the token in there.
I eventually found a way.

Authentication.

I was able to handle this by using JWT. When you are developing microservices and you have to manage authentication, there are multiple things to take into account.
Disclaimer: this is not a definitive guide. At the moment I'm experimenting. So far this is my best solution, but I believe there are other solutions out there.
If you split a user service and a permission service, moving that token around becomes tricky, and it is sometimes difficult to reason about.
The idea is that you pass the token around until it expires.
No token, no service. Simple.
Obtaining the permissions of the entity after logging in is a must, and I think a caching strategy will be needed there.
A simple example: I log in, I obtain my permissions and store them under some key, like user_id: permissions, in Redis.
On the next request, I ask Redis for that user's permissions.
When for some reason I decide to revoke a permission, I delete the Redis record, and well, the permissions will need to be fetched again.
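
A minimal sketch of that caching idea; fetch_permissions_from_service and the key naming are assumptions, not a finished design:

import json
import redis

cache = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

def get_permissions(user_id):
    key = 'permissions:%s' % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # hypothetical call to the permission microservice
    permissions = fetch_permissions_from_service(user_id)
    cache.setex(key, 300, json.dumps(permissions))  # keep it for 5 minutes
    return permissions

def revoke_permissions(user_id):
    # drop the cached entry; the next request fetches fresh permissions again
    cache.delete('permissions:%s' % user_id)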

This is a work in progress, which I may extend much further later on.

Thursday, May 26, 2016

GAE myth?

Disclaimer

I wrote this article back in 2013.
I deleted my old blog, and I found a copy of this post on the blog of the cooperative I used to work at when I wrote it.
Bear in mind that after 3 years, Google App Engine is no longer necessarily what I describe here.

I'm still working with App Engine using Python, doing scraping of the same sites, but in this post, I'll write about what happened since my last post.

One of the things I managed to do was to download PDFs using Python Mechanize, urllib and urllib2.
You may be pondering, well, that is not something too complex... as a matter of fact, it was, and it was terribly annoying and time consuming.
The site I was scraping serves PDF files, but the files aren't anchors that you can simply click and download.
The site uses sessions, yes, like in PHP or any other language. So I do not actually have a "link" per se; I have an input of type image that hits an endpoint with javascript. That endpoint refreshes the server session, and when the response comes back we are redirected (with javascript too) to another page, which is the landing page of the site I'm scraping. The landing page uses my session to detect that I requested a PDF file, and there the server magically gives me the file.
Written like that, it doesn't sound complex, but you have to take into account that:
  • I can't use Javascript in Mechanize.
  • The only Javascript libraries for Python in GAE (such as Python Spidermonkey) don't seem to help much.
  • I can't use Selenium, because that won't run in GAE, and the client that hired me specifically wants to run this in GAE.
So, after a couple of days (I think it took me 2 days to discover how the site worked, using Firebug and analyzing the requests) I came up with this:
# Snippet from the scraper class; parser, api, job_date, main_site, download_url,
# search_input_name, api_input_name and time_before_opening_page come from the
# surrounding method and configuration, and time / urllib are imported at module level.
browser = self.mechanize_browser
browser._factory.encoding = 'utf-8'
browser.select_form(nr=0)
browser[api_input_name] = api
response = browser.submit(name=search_input_name)
filename = parser.determine_position(response.read(), job_date)
if len(filename) > 0:
    browser.select_form(nr=0)
    # Create a custom request
    data = self.create_custom_api_download_request(api, browser, filename)
    browser.select_form(nr=0)
    # Prepare their ASPSESSION and simulate a submit,
    # which guarantees a fresh session for the next GET request
    browser.open(main_site, data)
    time.sleep(time_before_opening_page)
    # Now, we tell their server that we will do a GET;
    # this allows us to get the stream
    stream = browser.open(download_url)
    pdf_stream = stream.read()

def create_custom_api_download_request(self, api, browser, event_argument):
    """
    Create a custom request using urllib and return
    the encoded request parameters.
    The keys __VIEWSTATE and __EVENTVALIDATION
    are tracking values that the server sends back
    on every page. They change per request.
    @var api: String The api of the well
    @var browser: Mechanize.browser
    @var event_argument: The filename
    @return: urllib.urlencode'd string
    """
    if browser.form is None:
        raise Exception('Form is NONE')
    api_input_name = self.config[self.config_key]['api_input']
    custom_post_params = \
        self.config[self.config_key]['download_post_params']
    payload = {}
    for key in custom_post_params:
        payload[key] = custom_post_params[key]
    payload['__EVENTVALIDATION'] = browser.form['__EVENTVALIDATION']
    payload['__VIEWSTATE'] = browser.form['__VIEWSTATE']
    payload['__EVENTARGUMENT'] = event_argument
    payload[api_input_name] = api
    # urllib.urlencode (Python 2) builds the form-encoded POST body
    return urllib.urlencode(payload)

A couple of notes

Using a custom factory setting for mechanize was required: since we were reading a raw PDF stream, the default factory (i.e., the parser that mechanize uses to read the data, such as BeautifulSoup) was having a problem with it.
So, setting browser._factory.encoding = 'utf-8' solved that problem.
Regarding the method determine_position: don't pay too much attention to it, it is just part of the site's business logic. Let's just say that the method locates the PDF "link" in a results table, since a search can return multiple results.
Then we create a custom request using urllib; that is the method create_custom_api_download_request.
With that custom request we feed our mechanize browser instance, and again, more of the site's complexities show up.
If I didn't put that sleep in, I was hitting the site too fast and getting bad responses, so I used a sleep to buy some time. After that, we just use the open method with our custom request, but pointing to the landing page, and voilà, I get the PDF.

Downsides of doing this

Well, without taking into account that the whole flow is terribly complex (and I'm only writing about one specific thing I do), using GAE for this kind of task doesn't seem like a very good idea.

GAE MYTH?

Well, now for the main thing.
My client is really focused on, and interested in, using only GAE for this complex scraping app. He pointed me toward using "tasks", the push tasks, because you can configure the rate of execution, blah blah blah. Our most important task is PDF scraping, which I do with PdfMiner.
The thing is, this is an automated application, and even a custom task configuration won't help: the work is too "heavy" for GAE, it depletes the resources really fast. By that I mean that if you have a $2 budget, you will have to come up with a very good rate configuration.
PdfMiner is the only good library that can actually give me results in XML, which I can then parse using lxml.
The PDF files that I read are complex tables converted to PDF from Microsoft Excel. It was a really complex task to figure out how they are laid out, but my client provided me with a sample for the first section of the PDF, and I worked out the second part myself.
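
To give an idea of the XML part, this is roughly how you get pdfminer's layout XML into lxml; a sketch with a relatively recent pdfminer API, and the element names are the ones its XMLConverter emits:

from io import BytesIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from lxml import etree

def pdf_to_xml(pdf_bytes):
    """Render a PDF with pdfminer's XMLConverter and return an lxml tree."""
    output = BytesIO()
    manager = PDFResourceManager()
    device = XMLConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, device)
    for page in PDFPage.get_pages(BytesIO(pdf_bytes)):
        interpreter.process_page(page)
    device.close()
    return etree.fromstring(output.getvalue())

# Every character ends up in a <text> node with bounding-box coordinates,
# which is what makes it possible to reconstruct the Excel-style table cells.
tree = pdf_to_xml(pdf_stream)
characters = [t.text for t in tree.iter('text')]
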
I can process 10 PDFs per minute; any rate higher than that (i.e., 20 tasks per minute, or 20 tasks per second) ends up with the queue dropping tasks because the instance can't keep up, and my budget gets depleted faster.
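
For reference, the rate cap itself lives in queue.yaml and the tasks get enqueued from the app; a rough sketch, where the queue name and handler URL are made up:

# queue.yaml -- a push queue capped at 10 tasks per minute:
#
#   queue:
#   - name: pdf-scraping
#     rate: 10/m
#     bucket_size: 10
#
# Enqueueing one scraping task from the application code:
from google.appengine.api import taskqueue

taskqueue.add(
    queue_name='pdf-scraping',
    url='/tasks/scrape_pdf',               # hypothetical handler that runs PdfMiner
    params={'filename': 'daily_report.pdf'})
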
See, I believe that if you are going to use something as experimental as GAE, you should first spend a lot of time researching, not just throw your cash at it and expect immediate results.
So, even though I got a budget increase of 5 bucks, I still can't get 24 real hours of uptime. The instance is now heavily focused on processing PDFs, but if I enable everything the application should be doing, $5 isn't enough! I managed to run on $2 for around 10 real hours, but again, the only thing the application could do was scrape 10 PDFs per minute and, every 15 minutes, send HRD results to Fusion Tables (that is complex too).
When I say "real hours", I mean wall-clock hours: App Engine will report something like 68 hours of uptime, but those correspond to roughly 10 real hours.

CONCLUSION

Before jumping into something experimental, research, research, and research even more.
Don't demand from your developer/researcher (who happens to be me, wearing both hats) 100% accurate results in less than one day.
Second, and this doesn't apply only to GAE: if you are going to build an application, please, for the love of god, *HAVE AN IDEA FIRST*. Coding a super complex thing without any kind of plan and then changing it really fast won't work.
For example, let's say that I worked from March till July; then, in less than 4 hours, you decide to alter your flow, have 21k records migrated, including something as experimental as PDF scraping, and have it all working perfectly. Bitching at me and telling me "you are frustrated" because I had a bug (such as iterating over a list while popping elements, which I didn't notice), when you throw 10,000 really "important" tasks all at the same time and expect the whole thing to complete, free of bugs, with *NO TESTING*, is something that is not likely to work out, but well.
Sorry, needed to vent...
The conclusion is: before jumping on the GAE wagon, research a lot, and I can't stress the "a lot" part enough.
I don't blame GAE for this; I think it is a great thing from Google, but you have to use the right tool, and it happens that GAE is not the right tool when you don't have any plan and expect it to adapt magically to your needs... read the fine manual.