OpenShift V3 Persistent Storage Nagios Plugin



At the time of writing, OpenShift V3 comes with poor monitoring capabilities. The built-in monitoring only covers memory, CPU and network metrics, and it does not even support alerting! The finest time range it offers is the last hour. So you have to build your own monitoring if you want to keep a close eye on your services running on OpenShift.

I wrote a Nagios plugin to monitor the persistent storage usage. Here are a few things you need to know:

– Pods come and go, and volumes change, so the plugin discovers the volumes of the given project by itself. This means you only need to provide the project name when you configure it in Nagios.

– The volume checks are per project, not per volume. If any volume in the project is over 80%/90% usage, Nagios will trigger a warning/critical alert.
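The per-project threshold logic can be sketched in a few lines of Python. This is only an illustration, not the plugin's actual code: the function name `evaluate_volumes` and the sample volume names are made up, and the usage numbers would really come from the plugin's own discovery step. The exit codes are the standard Nagios ones.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_volumes(usage_by_volume, warn=80.0, crit=90.0):
    """Return (exit_code, message) for a dict of {volume_name: percent_used}.

    Hypothetical helper: the worst volume decides the state of the whole project.
    """
    if not usage_by_volume:
        return UNKNOWN, "UNKNOWN - no volumes discovered"

    details = ", ".join(
        f"{name}={pct:.0f}%" for name, pct in sorted(usage_by_volume.items())
    )
    worst_name, worst_pct = max(usage_by_volume.items(), key=lambda kv: kv[1])

    if worst_pct > crit:
        return CRITICAL, f"CRITICAL - {worst_name} is over {crit:.0f}% ({details})"
    if worst_pct > warn:
        return WARNING, f"WARNING - {worst_name} is over {warn:.0f}% ({details})"
    return OK, f"OK - all volumes at or below {warn:.0f}% ({details})"


# Made-up sample data for one project.
code, message = evaluate_volumes({"data": 42.0, "logs": 85.5})
print(code, message)
```

A real plugin would print the message and call `sys.exit(code)` so that Nagios picks up the state.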

sample config:
define service{
        use                     generic-service
        service_description     Openshift Volume Checks
        servicegroups           openshift-site-checks
        check_command           check-openshift-pv-size!my-site-project
        contact_groups          devops
}

Great AWS Trusted Advisor



I have to say AWS Trusted Advisor is a great tool! AWS keeps improving it by adding more useful new checks. Here is one that I got this morning:


I set up health checks for some new records, but forgot to decrease the TTL to a low value (it is 300 seconds by default). Now Trusted Advisor reminds me that it is better to set a value lower than 60 seconds, so that the old DNS records expire quickly. How sweet is that 🙂
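If you want to catch this yourself before Trusted Advisor does, the same idea is easy to script. Below is a minimal sketch, assuming the record sets are in the shape returned by Route 53's `list_resource_record_sets` call (`Name`, `Type`, `TTL`, `HealthCheckId` keys). The function name and the 60-second threshold parameter are my own choices for illustration.

```python
def records_needing_lower_ttl(record_sets, max_ttl=60):
    """Return (name, type, ttl) for health-checked records whose TTL exceeds max_ttl."""
    flagged = []
    for record in record_sets:
        ttl = record.get("TTL")
        # Alias records have no TTL of their own, and records without a
        # health check are not affected by failover, so both are skipped.
        if ttl is None or "HealthCheckId" not in record:
            continue
        if ttl > max_ttl:
            flagged.append((record["Name"], record["Type"], ttl))
    return flagged


# With boto3 the input could be fetched like this (the zone ID is a placeholder):
#   import boto3
#   zone = boto3.client("route53").list_resource_record_sets(HostedZoneId="ZXXXXXXXXXXXXX")
#   print(records_needing_lower_ttl(zone["ResourceRecordSets"]))
```

Note that `list_resource_record_sets` is paginated, so a large zone needs more than one call.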

Fault Tolerant VPN Solution on AWS



I worked with a project team to help them improve their current VPN infrastructure on AWS. They have 3 VPN EC2 instances; let’s call them VPN01, VPN02 and VPN03. They all run OpenVPN Access Server. VPN01 and VPN02 each have a 10-concurrent-session license and sit in availability zones a and b respectively. VPN03 only has the complimentary 2-concurrent-session license and sits in availability zone c (it is mostly for emergency use, e.g. when both AZ-a and AZ-b go down). There is a DNS round-robin setup and all three instances share the same configuration, so end users can dial in to any of them. Here are the configuration files:


They had just renewed the license, so I had to stick with the current license-based AMI. Otherwise I would have used the hourly-billed OpenVPN AMI with an ELB and an Auto Scaling group. As VPN01 and VPN02 have more licensed sessions, the solution needs to send most users to those two instances. And if the VPN service is not working properly on one instance, the solution needs to divert users to a healthy instance.

With the requirements in mind, here is my design:


I guess the architecture diagram is self-explanatory. Below is a brief description of how I implemented it:

  1. Set up weighted DNS CNAME records for vpn01.mydomain.local (weight 45), vpn02.mydomain.local (weight 45) and vpn03.mydomain.local (weight 10). So vpn01 and vpn02 each get about 45% of the traffic, and only about 10% goes to vpn03.
  2. Set up a DNS health check for each of vpn(01|02|03).mydomain.local. As OpenVPN Access Server is an SSL VPN, we only need to monitor port 443.
  3. Create a new SNS topic. Let’s name it vpn_healthcheck.
  4. Set the alarm notification target of each health check to the new SNS topic, so a notification is sent to SNS if the health check fails.
  5. Let’s work on the Lambda function. First, you need to set up a role that allows the function to perform the start or reboot operations. Here is some sample code. Second, set up a Lambda function with an SNS trigger. I use Python, and here is the source code.
  6. Go back to the SNS topic created in step 3 and subscribe your email to it. Subscribe the Lambda function as well.
  7. Testing time: stop the openvpnas service on one of the VPN instances and wait 1-2 minutes; the instance will be rebooted by the Lambda function. Then check the Lambda function log.
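To give an idea of what the SNS-triggered function from step 5 does, here is a minimal sketch. It is not the code from my repo. It assumes the health check alarms are CloudWatch alarms publishing to the SNS topic, and the alarm names, instance IDs and the `ALARM_TO_INSTANCE` mapping are all made up for illustration. The `ec2` parameter only exists so the logic can be tried without AWS access.

```python
import json

# Hypothetical mapping from CloudWatch alarm name to the EC2 instance behind it.
ALARM_TO_INSTANCE = {
    "vpn01-healthcheck": "i-0123456789abcdef0",
    "vpn02-healthcheck": "i-0fedcba9876543210",
}


def recover_instance(ec2, instance_id):
    """Start a stopped instance, reboot a running one, otherwise do nothing."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    state = reservations[0]["Instances"][0]["State"]["Name"]
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
        return "started"
    if state == "running":
        ec2.reboot_instances(InstanceIds=[instance_id])
        return "rebooted"
    return "skipped"  # e.g. pending or stopping: leave it alone


def lambda_handler(event, context, ec2=None):
    if ec2 is None:
        import boto3  # imported lazily so the sketch also runs without boto3 installed
        ec2 = boto3.client("ec2")

    actions = {}
    for record in event["Records"]:
        # The SNS message body is the CloudWatch alarm serialized as JSON.
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue
        instance_id = ALARM_TO_INSTANCE.get(alarm.get("AlarmName"))
        if instance_id is None:
            print("No instance mapped to alarm", alarm.get("AlarmName"))
            continue
        actions[instance_id] = recover_instance(ec2, instance_id)
        print(instance_id, actions[instance_id])
    return actions
```

The role from step 5 has to allow `ec2:DescribeInstances`, `ec2:StartInstances` and `ec2:RebootInstances` for a function like this to work.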

Hope you find it useful. All the sample code can be found in my GitHub repo.




I have received good feedback since I shared SSSG-Ninja in the Akamai community, so I decided to share another useful tool that I built a while ago.

Akamai-Bot is a Hubot-based automation bot that lets users perform some daily Akamai tasks simply by chatting.

Here are some examples. If you are interested, here is the git repo. A Docker image is available too.