Datadog

Goals

Have full visibility into the product and see where issues occur
Be able to 'point the finger' at a vendor if their API is causing issues

Technology used

Project Breakdown

The client was building a new site and wanted the ability to monitor all external API calls made by their codebase.

My team and I were pushed towards Datadog because we had used it before and knew it would excel here.

What I did

I personally was responsible for:

Setting up the account
Integrating with the customer's AWS account
Deciding on what resources to ingest
Setting up alerts
Installing agent on servers
Setting up agent plugins
Setting up logging
Reducing logging cost
Communicating with API Vendors to get access
Setting up Vendor synthetic API Monitoring in Datadog

Issues I had to overcome

Server Monitoring

Whilst setting up metric-based alerting, there is an urge to ingest as many metrics and data points as possible. The same can be said for the logging side of things.

As anyone who has used a cloud service knows, they are usually pay-for-what-you-use, and ours was no exception.

I was tasked with saving the client some money on Datadog charges, and did so by:

Creating a filter on the Apache logs we ingested
- 10% of HTTP 200 responses were ingested
- 100% of errors were ingested
Creating a filter on the CloudFront events being ingested
Removing non-critical services from Datadog
Reducing polling frequency on non-critical infrastructure
Starting the process of re-negotiating the contract to save around $4.8k a year.
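
To illustrate the Apache log filter described above, here is a minimal Python sketch of the sampling decision. It is not the Datadog configuration we used, only the logic behind it. The function name, the regex, and the choice to keep every line that is not a 200 or an error are my assumptions for the example.

```python
import random
import re

# Matches the status code that follows the quoted request in an Apache
# access log line, e.g. ... "GET /index.html HTTP/1.1" 200 2326
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def should_ingest(log_line, success_sample_rate=0.10, rng=random.random):
    """Return True if the log line should be forwarded for ingestion.

    Hypothetical helper: errors are always kept, HTTP 200 lines are sampled.
    """
    match = STATUS_RE.search(log_line)
    if match is None:
        # Assumption: keep unparseable lines so nothing is silently lost.
        return True

    status = int(match.group(1))
    if status >= 400:
        # 100% of errors are ingested.
        return True
    if status == 200:
        # Roughly 10% of successful requests are ingested.
        return rng() < success_sample_rate
    # Assumption: other codes (redirects and so on) are kept as-is.
    return True
```

Passing `rng` in as a parameter keeps the sampling testable, since a fixed value can stand in for the random draw.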

API Monitoring

Some vendors use rolling credentials, where the credentials expire and you're expected to write a credential helper into your code. Sadly this was not really an option, as it would have required running something like a Lambda function once every 24 hours to update the Datadog Terraform code.

By speaking with the vendor, we were able to get them to generate a key for us that would not expire (with read-only access), so we were able to monitor the endpoints our client had requested.
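
For a rough idea of what a synthetic API check does with such a key, here is a small Python sketch. It is not Datadog's implementation. The function name, the `Authorization: Bearer` header scheme, and the latency threshold are assumptions for illustration; a real vendor may expect the key in a different header.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, api_key, timeout_s=5.0, max_latency_ms=2000,
                   opener=urllib.request.urlopen):
    """Probe one vendor endpoint and report status and latency.

    Hypothetical helper: `opener` is injectable so the check can be
    exercised without touching the network.
    """
    request = urllib.request.Request(
        url,
        # Assumption: the vendor accepts the read-only key as a bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
    )
    started = time.monotonic()
    try:
        with opener(request, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The vendor answered, but with an error status (4xx/5xx).
        status = exc.code
    except OSError as exc:
        # DNS failure, refused connection, timeout: no HTTP status at all.
        return {"ok": False, "status": None, "latency_ms": None,
                "error": str(exc)}

    latency_ms = (time.monotonic() - started) * 1000
    return {
        "ok": status == 200 and latency_ms <= max_latency_ms,
        "status": status,
        "latency_ms": latency_ms,
        "error": None,
    }
```

A result with `ok` set to `False` and a vendor-side status code is the kind of evidence that lets you point at the vendor rather than your own code.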

Repeatable code

One of my colleagues designed the Terraform code so that monitors could be built simply and quickly within the Datadog dashboard, once in a test environment and then again in production.
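
The real implementation was Terraform, which I am not reproducing here. As a sketch of the same idea in Python, one monitor definition can be parameterised by environment and stamped out for test and production. The function names and the example service are hypothetical, and the dictionary is only loosely shaped like a Datadog monitor payload, so check the field names against the API reference before relying on them.

```python
def build_monitor(service, metric, threshold, env):
    """Build one metric monitor definition for a single environment.

    Hypothetical helper: the keys loosely mirror a Datadog monitor payload.
    """
    return {
        "name": f"[{env}] {service}: {metric} above {threshold}",
        "type": "metric alert",
        # Doubled braces produce the literal {...} scope in the query.
        "query": (
            f"avg(last_5m):avg:{metric}"
            f"{{env:{env},service:{service}}} > {threshold}"
        ),
        "message": f"{metric} for {service} in {env} exceeded {threshold}.",
        "tags": [f"env:{env}", f"service:{service}"],
    }


def build_for_environments(service, metric, threshold, envs=("test", "prod")):
    """Stamp out the same monitor once per environment, test first."""
    return [build_monitor(service, metric, threshold, env) for env in envs]


if __name__ == "__main__":
    for monitor in build_for_environments("checkout", "system.cpu.user", 90):
        print(monitor["name"], "->", monitor["query"])
```

Keeping the environment as the only thing that varies means whatever was validated in test is exactly what gets built in production.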

All the Datadog configurations were eventually ported over to Jinja2 templates to be used with Ansible for fast and easy building of client servers. However, we never got round to fully implementing this, as the PMO team prioritised other projects under the client.