Akamai BigQuery Integration
Google Cloud Platform (GCP) BigQuery is a serverless, columnar data warehouse that lets you analyze data without managing the underlying infrastructure. It also lets you visualize your data with an integrated tool called Data Studio.
You can now integrate Akamai DataStream with BigQuery to find meaningful insights, use familiar SQL, and take advantage of a pay-as-you-go model.
Note: You can integrate raw logs and aggregated metrics streams with BigQuery. In this example, we’ll integrate a raw logs stream that pushes data to BigQuery.
There are five steps to integrate DataStream with BigQuery:
Get started with DataStream
Set up your API client
Set up a GCP account
Integrate DataStream with BigQuery
Make a DataStream API call
1. Get started with DataStream
In DataStream, configure a raw logs stream and choose your data sets. For example, you can select Request Header Data to choose the headers that you want to receive when calling the API, such as Authorization, Range, or Accept-Encoding.
You can also choose a sample rate. Unless you have a reason not to, select 100% to capture all the traffic that hits your site.
For details, see the get started section.
2. Set up an API client for the DataStream API
To integrate DataStream with BigQuery, you need an API client with at least read-only access to the DataStream API. To create an API client, navigate to the Identity and Access Management page in Control Center.
Once you’re on the page, create an API client.
Provide a name for your client and grant it access to the DataStream Pull API.
Finally, create credentials for your API client. You can see these credentials only once, so be sure to download or copy them. You’ll paste them into the .edgerc file on your compute engine later.
3. Set up a GCP account
Create a new project and set up the following resources:
Two Cloud Storage buckets: one to store the logs, and one to store the cloud function script.
A Compute Engine instance to call the DataStream API and copy the responses to Cloud Storage.
A BigQuery dataset for your logs. You will add a table later.
Once you are done, go to APIs & Services in your Google Cloud Platform project and enable the following APIs: Cloud Functions, BigQuery, and Cloud Storage.
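Assuming the names used later in this article (project akamai-206503, buckets akamai-datastream and akamai-script-cloudfunction, dataset datastream_logs), the setup can be sketched from the command line like this; substitute your own names:

```shell
# Buckets for the raw logs and for the cloud function script
gsutil mb gs://akamai-datastream
gsutil mb gs://akamai-script-cloudfunction

# Dataset that will hold the logs table
bq mk --dataset akamai-206503:datastream_logs

# Enable the required APIs
gcloud services enable cloudfunctions.googleapis.com bigquery.googleapis.com storage-component.googleapis.com
```

These commands run against your own GCP project, so they require an authenticated gcloud session.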
4. Integrate DataStream with BigQuery
Compute engine setup
SSH into the compute engine that you previously set up. Then, install the Google Cloud SDK.
For more details, see https://cloud.google.com/sdk/install
Install the Akamai API clients. Copy the previously created Akamai credentials and paste them into the .edgerc file. For more details, see the Akamai EdgeGrid documentation.
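The .edgerc file is a plain INI-style file. Here is a minimal sketch with placeholder values, using a section name that matches the -a datastream-pull-api: flag used in the API calls later:

```
[datastream-pull-api]
client_secret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
host = akab-xxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxx.luna.akamaiapis.net
access_token = akab-xxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxx
client_token = akab-xxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxx
```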
Next, grant the compute engine access to GCP resources such as Cloud Storage, BigQuery, and the cloud function. For more details, see the GCP access control documentation.
BigQuery table setup
First, you need to get the DataStream schema. You’ll find it in the DataStream API documentation.
Next, prepare a BigQuery schema that matches the DataStream schema. Note that the schema has a lot of nested records.
Then, use the prepared schema to create a table in BigQuery. This command uses the schema file called schema.json to create a table called edgescapedemo:
bq mk --table akamai-206503:datastream_logs.edgescapedemo ./schema.json
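To illustrate the nesting, here is a heavily truncated sketch of what schema.json might look like. It keeps only the fields used in the example query in step 5; the real DataStream schema defines many more:

```json
[
  {"name": "data", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "message", "type": "RECORD", "fields": [
      {"name": "reqPath", "type": "STRING"}
    ]},
    {"name": "netPerf", "type": "RECORD", "fields": [
      {"name": "downloadTime", "type": "STRING"}
    ]}
  ]}
]
```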
Cloud function setup
You also need to write a cloud function. Cloud Functions is GCP’s serverless compute product. For more details, see https://cloud.google.com/functions/
A cloud function acts on triggers. Here, the trigger that we use is google.storage.object.finalize: as soon as something is uploaded to Cloud Storage, the trigger fires and the function runs.
Once you’ve prepared the cloud function, you can deploy it with this command:
gcloud beta functions deploy datastream-cloud-function --trigger-resource=akamai-datastream --trigger-event google.storage.object.finalize --source=. --stage-bucket=gs://akamai-script-cloudfunction --entry-point=jsonLoad
5. Make a DataStream API call
Now that all the pieces are in place, you can start your API call script and push the DataStream JSON response file to Cloud Storage. Once the file is uploaded, the finalize trigger activates the cloud function, which loads the data into your BigQuery table.
Here is the flow:
1. Make a call to the DataStream API from the compute engine. This can be a cron job:
http --auth-type edgegrid -a datastream-pull-api: ":/datastream-pull-api/v1/streams/851/raw-logs?start=2018-10-30T06:30:00Z&end=2019-10-23T06:40:00Z&page=0&size=100"
2. Push the output to the bucket for DataStream logs:
gsutil cp output.json gs://akamai-datastream
As soon as the file is in the bucket, it activates the cloud function. Looking at the cloud function logs, you can verify that it completed successfully.
You can return the logs with this command:
gcloud beta functions logs read datastream-cloud-function
3. Once it’s done, you can open the BigQuery interface and query the table. You’ll see something similar to this:
Segment number vs download time
In this example, let’s consider a customer who uses DataStream to ingest logs every 5 minutes. The customer has received performance complaints, or has statistics showing an increase in load times, such as page load times. The customer can quickly run a query to see the load times for all objects or scripts on their page.
The customer could use BigQuery to get file types together with their download times. This example SQL query returns the paths and download times of .ts and .m3u8 files:
SELECT d.message.reqPath, CAST(d.netPerf.downloadTime AS INT64) AS dtime
FROM `akamai-206503.datastream_logs.edgescapedemo`,
UNNEST(data) AS d
WHERE d.message.reqPath LIKE "%.ts" OR d.message.reqPath LIKE "%.m3u8"
ORDER BY dtime DESC
We can easily point out the files that take longer to download. Then, we can investigate further and make specific queries about the title, allowing the customer to identify the root cause of the problem.
What’s more, Google Data Studio is an integrated feature, making it easy to visualize any query or table in a dashboard or report. Let’s look at this table. With one click, you can convert it into a graph.
Using an aggregated metrics stream, you can also find out if the number of errors has increased. Aggregated data streams retrieve real-time counts of 4xx and 5xx HTTP error occurrences.
Here is the data stream call:
http --auth-type edgegrid -a datastream-pull-api: ":/datastream-pull-api/v1/streams/1201/aggregate-logs?start=2019-02-15T09:19:37Z&end=2019-02-17T10:40:00Z&aggregateMetric=2xx%2C3xx%2C4xx%2C5xx&page=0&size=100"
You can then ingest the returned JSON file into BigQuery and visualize the errors as a time series.
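As a sketch, assuming the aggregated logs land in a table called aggregated_demo with hypothetical columns window_start, errors_4xx, and errors_5xx (the real column names depend on the schema you define for the aggregated data), a time-series query could look like this:

```sql
-- Hypothetical table and column names; adjust to your aggregated-logs schema.
SELECT window_start,
       SUM(errors_4xx + errors_5xx) AS total_errors
FROM `akamai-206503.datastream_logs.aggregated_demo`
GROUP BY window_start
ORDER BY window_start
```

The result is one row per time window, which Data Studio can plot directly as a time series.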