The world is not perfect

The world is not perfect. At any moment something can go wrong. Fortunately, most of us do not launch rockets into space or build aircraft.

Modern people depend on the apps on their phones, and our job is to make sure that at any time, under any circumstances, they can open one and look at funny cat pictures as usual.

People are not perfect. We constantly make mistakes. We make typos, forget things, or give in to laziness. A person can simply get drunk or get hit by a car.

Hardware is not perfect. Hard drives die. Data centers lose their links. Processors overheat and power grids fail.

Software is not perfect. Memory leaks. Connections drop. Replicas break and data vanishes into oblivion.

Shit happens, as our overseas friends say. What can we do about it? The answer is banal in its simplicity: nothing. We can test forever, spin up a ton of environments, copy production, and keep hundreds of thousands of redundant servers, but none of it will save us: the world is not perfect.

The only right decision here is to accept it. Accept the world for what it is and minimize the losses. Every time you set up a new service, remember that it will break at the most inopportune moment.

It will break for sure. You will make a mistake. Hardware will fail. The cluster will fall apart. And by the laws of this imperfect world, it will happen exactly when you least expect it.

What do most of us do to fool everyone (including ourselves)? We set up alerts. We write clever metrics, collect logs, and create alerts: thousands, hundreds of thousands of alerts. Our mailboxes overflow. Our phones are flooded with SMS messages and calls. We seat entire floors of people in front of charts. And when we lose access to the service yet again, the post-mortem begins: what did we forget to monitor?

All of this is only a semblance of reliability. No amount of alerts, metrics, and monitoring will help.

Today they called you, and you fixed the service, and no one noticed that anything was broken. Tomorrow you go to the mountains. The day after tomorrow you are out drinking. People are not perfect. Luckily, we are engineers: we live in an imperfect world and learn to beat it.

So why do we have to wake up at night, or read mail in the morning instead of drinking coffee? Why should a business depend on one person and that person's health? Why? I don't understand.

I only know that it is impossible to live like this, and I don't want to live like this. The answer is simple: Automate It (yes, with capital letters). We don't need alerts and calls at night. We need automatic reactions to those messages. We must be sure that the system can fix itself. The system must be flexible and able to change.

Unfortunately, we don't have a smart AI. Fortunately, all of our problems can be formalized.

I have no silver bullet, but I have a Proof of Concept for AWS.

AWS Lambda


Serverless: first of all, what is not running cannot break.
Event based: an event arrives, gets processed, and the function shuts down.
Runs on the JVM, which means we can use all the experience of the Java world (and that I can use Clojure).
3rd party: no need to monitor or maintain AWS Lambda itself.

The pipeline looks as follows:

Event -> SNS Topic -> AWS Lambda -> Reaction

By the way, an SNS topic can have multiple endpoints. So it is trivial to add an email subscription and receive the same notification. You can also extend the Lambda function and make the notifications much more useful: for example, send alerts together with graphs, or add SMS delivery.
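To make the pipeline more concrete, here is a minimal sketch of what the Lambda receives from an SNS topic once the JSON has been parsed. The field values and the `sns-message` helper are invented for illustration; only the general `Records` / `Sns` / `Message` nesting is taken from the SNS notification format.

```clojure
(require '[clojure.walk :as walk])

;; Trimmed SNS notification as it looks right after JSON parsing (string keys).
;; All values are made up for illustration.
(def sample-event
  {"Records" [{"EventSource" "aws:sns"
               "Sns" {"Type"    "Notification"
                      "Subject" "ALARM: example-elb-unhealthy-hosts"
                      "Message" "{\"AlarmName\":\"example-elb-unhealthy-hosts\"}"}}]})

(defn sns-message
  "Hypothetical helper: returns the Message string of the first SNS record
  in an event whose keys have already been keywordized."
  [event]
  (get-in event [:Records 0 :Sns :Message]))

;; keywordize-keys turns the string keys into keywords at every nesting level
(sns-message (walk/keywordize-keys sample-event))
```

Since `Records` is an array, the index `0` comes after `:Records` in the `get-in` path. The message body itself is usually another JSON string, so a richer reaction would have to parse it a second time.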

The complete example of one Lambda function can be found at: github.com/lowl4tency/aws-lambda-example
The Lambda function kills all nodes in the ELB that are not in the InService state.
Parsing


In this example we are going to kill all nodes that are not in the InService state. By the way, the whole Lambda function takes ~50 lines of code in one file, so it is easy to support and easy to get into.

Any Clojure project begins with project.clj

I used the official Java SDK and the lovely Amazonica library, which is a wrapper for this SDK. To avoid dragging in a lot of excess, we exclude the parts of the SDK that we don't need.

;; Excerpt from the :dependencies vector in project.clj
[amazonica "0.3.52" :exclusions [com.amazonaws/aws-java-sdk]]
[com.amazonaws/aws-java-sdk-core "1.10.62"]
[com.amazonaws/aws-lambda-java-core "1.1.0"]
[com.amazonaws/aws-java-sdk-elasticloadbalancing "1.11.26"
 :exclusions [joda-time]]
[com.amazonaws/aws-java-sdk-ec2 "1.10.62"
 :exclusions [joda-time]]
[com.amazonaws/aws-lambda-java-events "1.1.0"
 :exclusions [com.amazonaws/aws-java-sdk-dynamodb
              com.amazonaws/aws-java-sdk-kinesis
              com.amazonaws/aws-java-sdk-cognitoidentity
              com.amazonaws/aws-java-sdk-sns
              com.amazonaws/aws-java-sdk-s3]]] ; the last bracket closes :dependencies

For greater flexibility, each Lambda function uses a configuration file in plain edn. To be able to handle events, we need to slightly change the namespace declaration.

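(ns aws-lambda-example.core
  (:gen-class :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler])
  (:require [amazonica.aws.ec2 :as ec2] [amazonica.aws.elasticloadbalancing :as elb] [clojure.data.json :as json]
            [clojure.edn :as edn] [clojure.java.io :as io] [clojure.pprint :refer [pprint]] [clojure.walk :as walk])) ; aliases inferred from the calls used in this article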

The entry point. It reads the input event, processes it with handle-event, and writes the result to the output stream as JSON.

(defn -handleRequest
  "Parser of input and generator of JSON output"
  [this is os context]
  (let [w (io/writer os)]
    (-> (io/reader is)       ; input stream with the raw event
        json/read            ; parse JSON into Clojure data
        walk/keywordize-keys ; string keys -> keywords
        handle-event
        (json/write w))      ; serialize the result back to JSON
    (.flush w)))

Workhorse:

(defn handle-event [event]
  (let [instances (get-elb-instances-status
                   (:load-balancer-name
                    (edn/read-string (slurp (io/resource "config.edn")))))
        unhealthy (unhealthy-elb-instances instances)]
    (when (seq unhealthy)
      (pprint "The next instances are unhealthy: ")
      (pprint unhealthy)
      (ec2/terminate-instances :instance-ids unhealthy))
    ;; Records is an array, so the index follows the :Records key
    {:message (get-in event [:Records 0 :Sns :Message])
     :elb-instance-ids (mapv :instance-id instances)}))



We get the list of nodes in the ELB and filter it by status. All nodes in the InService state are removed from the list; the rest are terminated.

Everything we print with pprint goes to CloudWatch logs, which can be useful for debugging. Since the Lambda is not constantly running and there is no way to connect to its REPL, this can be quite handy.
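The handler reads the ELB name from config.edn on the classpath. As a rough sketch of what that file could contain and how the name is pulled out of it, assuming a made-up load balancer name and a hypothetical `load-balancer-name` helper that takes the file contents as a string:

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical contents of resources/config.edn; the ELB name is invented.
(def sample-config-text
  "{:load-balancer-name \"example-elb\"}")

(defn load-balancer-name
  "Parses edn text and returns the configured ELB name (nil if absent)."
  [config-text]
  (:load-balancer-name (edn/read-string config-text)))

(load-balancer-name sample-config-text)
;; => "example-elb"
```

Keeping the name in edn means the same jar can be rebuilt for another ELB by changing one file, without touching the code.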

  {:message (get-in event [:Records 0 :Sns :Message])
   :elb-instance-ids (mapv :instance-id instances)}))

At this point, the whole structure that we build and return from this function is written out as JSON, and we can see it as the execution result in the Lambda web interface.

In the unhealthy-elb-instances function we filter our list and get the instance-id only for those nodes that the ELB considers unhealthy. We take the list of instances and filter it by state.

(defn unhealthy-elb-instances [instances-status]
  (->> instances-status
       (remove #(= (:state %) "InService"))
       (map :instance-id)))
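Because the filter is a pure function, it can be tried at the REPL without touching AWS. A small sketch with made-up instance ids; the function is repeated here only so the snippet runs on its own:

```clojure
;; Same filtering logic as in the article, repeated so the snippet is self-contained.
(defn unhealthy-elb-instances [instances-status]
  (->> instances-status
       (remove #(= (:state %) "InService"))
       (map :instance-id)))

;; Invented sample data in the {:instance-id ... :state ...} shape used above.
(def sample-statuses
  [{:instance-id "i-0aaa" :state "InService"}
   {:instance-id "i-0bbb" :state "OutOfService"}
   {:instance-id "i-0ccc" :state "Unknown"}])

(unhealthy-elb-instances sample-statuses)
;; => ("i-0bbb" "i-0ccc")
```

Note that anything other than "InService" counts as unhealthy here, including "Unknown", so the reaction is deliberately aggressive.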

The get-elb-instances-status function calls the API method and gets a list of all nodes, with their status, for one particular ELB.

(defn get-elb-instances-status [elb-name]
  (->> (elb/describe-instance-health :load-balancer-name elb-name)
       :instance-states
       (map get-health-status)))

For convenience, we drop everything extra and build a list with only the information we care about: the instance-id and the state of each instance.

(defn get-health-status [instance]
  {:instance-id (:instance-id instance)
   :state (:state instance)})
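To illustrate the trimming, here is a sketch applied to a made-up entry. The extra `:reason-code` and `:description` keys are an assumption about what one element of `:instance-states` might carry; only `:instance-id` and `:state` are confirmed by the code in this article.

```clojure
;; Repeated from the article so the snippet runs on its own.
(defn get-health-status [instance]
  {:instance-id (:instance-id instance)
   :state (:state instance)})

;; Invented entry; the extra keys are assumed, not taken from the article.
(def sample-instance-state
  {:instance-id "i-0bbb"
   :state "OutOfService"
   :reason-code "Instance"
   :description "Instance has failed health checks."})

(get-health-status sample-instance-state)
;; => {:instance-id "i-0bbb", :state "OutOfService"}

;; select-keys yields the same map whenever both keys are present
(select-keys sample-instance-state [:instance-id :state])
```

The explicit map has one small advantage over `select-keys`: a missing key shows up as `nil` instead of silently disappearing from the result.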

And that's all: 50 lines that let you sleep through the night and go to the mountains in peace.

Deployment


For ease of deployment I use a simple bash script:

#!/bin/bash

# AWS Lambda loader: uploads the standalone jar as a new function

aws lambda create-function --debug \
  --function-name example \
  --handler aws-lambda-example.core \
  --runtime java8 \
  --memory-size 256 \
  --timeout 59 \
  --role arn:aws:iam::611066707117:role/lambda_exec_role \
  --zip-file fileb://./target/aws-lambda-example-0.1.0-SNAPSHOT-standalone.jar

Set up an alert and attach it to an SNS topic. Attach the Lambda to the SNS topic as an endpoint. Then calmly go to the mountains, or get hit by a car.
By the way, thanks to this flexibility you can program any behavior of the system, reacting not only to system metrics but also to business metrics.

Thank you.
Article based on information from habrahabr.ru
