Datadog

Goals

  • Have full visibility into the product and see where issues occurred
  • Be able to 'point the finger' at a vendor if their API was causing issues

Technology used

Terraform, GitHub, Linux, Confluence, Jira, Datadog

Project Breakdown

The client was building a new site and wanted the ability to monitor all external API calls made by their codebase.

My team and I were pushed towards Datadog, as we had used it previously and knew it would excel here.

What I did

I was personally responsible for:

  • Setting up the account
  • Integrating with the customer's AWS account
  • Deciding on what resources to ingest
  • Setting up alerts
  • Installing the agent on servers
  • Setting up agent plugins
  • Setting up logging
  • Reducing logging cost
  • Communicating with API vendors to get access
  • Setting up vendor synthetic API monitoring in Datadog


Issues I had to overcome

Server Monitoring

When you're setting up metric-based alerting, there is an urge to ingest as many metrics and data points as possible. The same can be said for logging.

As anyone using a cloud service knows, they are usually pay-for-what-you-use, and ours was.

I was tasked with saving the client some money on Datadog charges, and was able to do so by:

  • Creating a filter on the Apache logs we ingested
    • 10% of HTTP 200 responses were ingested
    • 100% of errors were ingested
  • Creating a filter on the CloudFront events being ingested
  • Removing non-critical services from Datadog
  • Reducing polling frequency on non-critical infrastructure
  • Starting the process of renegotiating the contract to save around $4.8k a year
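
To illustrate the Apache sampling rule from the list above, here is a minimal Python sketch. It is not the Datadog filter configuration we used; the function name `should_ingest`, the `request_id` field, and the handling of non-200, non-error statuses are assumptions made purely for illustration.

```python
import hashlib


def should_ingest(status: int, request_id: str, ok_sample_percent: int = 10) -> bool:
    """Decide whether one access-log line should be ingested.

    Hypothetical helper: keeps every error and a fixed percentage of 200s.
    """
    # Errors (4xx and 5xx) are always kept.
    if status >= 400:
        return True

    # HTTP 200 lines are sampled deterministically: hashing the request id
    # means the same line always gets the same decision.
    if status == 200:
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < ok_sample_percent

    # Assumption: anything else (e.g. redirects) is kept. The rule above
    # only covers 200s and errors.
    return True
```

Sampling on a hash rather than a random number keeps the decision repeatable, which makes it easier to reason about which lines went missing when debugging later.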


API Monitoring

Some vendors use rolling credentials, which expire, so you're expected to write a credential helper into your code. Sadly, this was not really an option, as it would have required running something like a Lambda function once every 24 hours to update the Datadog Terraform code.

By speaking with the vendor, we were able to get them to generate a key for us that would not expire (with read-only access), so we were able to monitor our client's requested endpoints.
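
For comparison, the credential-helper pattern that vendors expect in application code looks roughly like the sketch below. This is only an illustration of the general idea, not any vendor's SDK: `Credential`, `RollingCredential`, and the `fetch` callback are hypothetical names, and the five-minute refresh margin is an arbitrary assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credential:
    token: str
    expires_at: float  # Unix timestamp in seconds


class RollingCredential:
    """Caches a credential and fetches a new one shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Credential],  # hypothetical call to the vendor's auth endpoint
        refresh_margin: float = 300.0,    # assumed: refresh 5 minutes early
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._current: Optional[Credential] = None

    def token(self) -> str:
        """Return a valid token, refreshing it first if it is about to expire."""
        now = self._clock()
        if self._current is None or now >= self._current.expires_at - self._margin:
            self._current = self._fetch()
        return self._current.token
```

This works well inside a long-running application, but a monitor defined in Terraform has nowhere to run such a helper, which is why the non-expiring read-only key was the simpler route for us.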


Repeatable code

One of my colleagues designed the Terraform code so that we could simply and quickly build monitors within the Datadog dashboard, once in a test environment and then again in production.

All the Datadog configurations were eventually ported over to Jinja2 templates to be used with Ansible for fast and easy building of client servers. However, we never got round to fully implementing this, as the PMO team prioritized other projects under the client.
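
The "define once, build per environment" idea can be sketched in a few lines of Python. This is not my colleague's Terraform code, and the query string is a made-up placeholder rather than real Datadog query syntax; `MonitorTemplate`, `Monitor`, and `render` are hypothetical names used only to show the shape of the approach.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MonitorTemplate:
    name: str
    query: str  # contains a "$ENV" placeholder (illustrative, not Datadog syntax)


@dataclass(frozen=True)
class Monitor:
    name: str
    query: str
    env: str


def render(templates: List[MonitorTemplate], env: str) -> List[Monitor]:
    """Stamp out one concrete monitor per template for the given environment."""
    return [
        Monitor(
            name=f"[{env}] {t.name}",
            query=t.query.replace("$ENV", env),
            env=env,
        )
        for t in templates
    ]


# The same templates are rendered first for test, then for production.
templates = [MonitorTemplate("Vendor API latency", "vendor_api.latency env:$ENV > 2000")]
test_monitors = render(templates, "test")
prod_monitors = render(templates, "production")
```

Keeping the environment as the only variable means whatever was validated in test is what gets built in production.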