Customer Metrics Integration (CMI)

AuraDB Virtual Dedicated Cloud AuraDS Enterprise AuraDB Business Critical

An application performance monitoring system can be configured to fetch metrics of AuraDB instances of types:

AuraDB Virtual Dedicated Cloud
AuraDS Enterprise
AuraDB Business Critical

This gives users access to their Neo4j Aura instance metric data for monitoring purposes.

Analyzing the metrics data allows users to:

Optimize their Neo4j load
Adjust Aura instance sizing
Set up notifications

Process overview

Detailed steps

Log in to Aura as project admin.
Make sure there is a dedicated Aura user to use for fetching metrics. You can either:
- Create a new user:
  1. In "User Management" of Neo4j Aura, invite a new user, selecting "Metrics Integration Reader" as a role.
  2. Follow the invitation link and log in to Neo4j Aura.
  3. Confirm the project membership.
- Or you can find an existing user in "User Management" and change its role to "Metrics Integration Reader"
  
  Capabilities of users with the role "Metrics Integration Reader" are limited to fetching the metrics and getting a read-only view of the project.
Ensure you are logged in to Aura as the user selected in the previous step. In "Account Details", create new Aura API credentials. Save client secret.
Configure the APM system to fetch metrics from the URL(s) or configuration templates shown in "Metrics Integration" of Neo4j Aura. Use oauth2 type of authentication specifying the Client ID and Client Secret created in the previous step. See examples for Prometheus and Grafana and Datadog below.
Use the APM system to create visualizations, dashboards, and alarms based on Neo4j metrics.

Security

Metrics for a Neo4j Aura instance are only returned if all the following are true:

Authorization header of the metrics request contains a valid token.
The token was issued for an Aura user with "Metrics Integration Reader" role.
Project has instances of types Enterprise (Virtual Dedicated Cloud) or Business Critical.
The specified instance belongs to the specified project.

The legacy term Enterprise is still used within the codebase and API. However, in the Aura console and documentation, the AuraDB Enterprise project type is now known as AuraDB Virtual Dedicated Cloud.

Revoke access to metrics

To revoke a user’s access to metrics of a specific project, remove the user from that project in "User Management". After that, the user still exists but its connection to the project is removed.

The revocation described takes effect after the authorization caches expire, which takes approximately 5 minutes. It results in HTTP 401 being returned, along with the message User doesn’t have access to Metrics resources. However, if you remove only the Aura API credentials used to retrieve metrics, the revocation will take effect only after the tokens issued with these credentials expire, as no new token can be issued anymore. Currently used token expiration time is 1 hour.

Metric labels

Depending on the metric, the following labels are applied:

aggregation: the aggregation used to calculate the metric value, set on every metric. Since the Neo4j instance is deployed as a Neo4j cluster, aggregations are performed to combine values from all relevant cluster nodes. The following aggregations are used: MIN, MAX, AVG and SUM.
instance_id: the Aura instance ID the metric is reported for, set on every metric.
database: the name of the Neo4j database the metric is reported for. Set to neo4j by default.

Example

# HELP neo4j_database_count_node The total number of nodes in the database.
# TYPE neo4j_database_count_node gauge
neo4j_database_count_node{aggregation="MAX",database="neo4j",instance_id="78e7c3e0"} 778114.000000 1711462853000

Looking up metric name in Neo4j Aura Advanced Metrics

In Neo4j Aura Advanced Metrics, it is possible to find out the metric name that corresponds to the chart, by using the chart menu item "Metrics Integration" as shown.

Metric scrape interval

Recommended scrape interval for metrics is in the range of 30 seconds up to 2 minutes, depending on requirements. The metrics endpoint caches metrics for 30 seconds.

Example using Prometheus

Install Prometheus

One way is to get a tarball from https://prometheus.io/docs/prometheus/latest/installation/

Configure Prometheus

To monitor one or more instances, add a section to the Prometheus configuration file prometheus.yml.

Copy the configuration section proposed in Metrics Integration, as shown.

Replace the placeholders <AURA_CLIENT_ID> and <AURA_CLIENT_SECRET> with corresponding values created in the previous step.

For details, see Prometheus configuration reference.

Start Prometheus

./prometheus --config.file=prometheus.yml

Test that metrics are fetched

Open http://localhost:9090 and enter a metric name or expression in the search field (ex. neo4j_aura_cpu_usage).

Use Grafana

Install and configure Grafana, adding the endpoint of the Prometheus instance configured in the previous step as a data source. You can create visualizations, dashboards, and alarms based on Neo4j metrics.

Example using Datadog

Configure an endpoint with token authentication

Edit /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml as follows:

Replace the placeholders <ENDPOINT_URL>, <AURA_CLIENT_ID> and <AURA_CLIENT_SECRET> with corresponding values from the previous steps.

/etc/datadog-agent/conf.d/openmetrics.d/conf.yaml

init_config:
instances:
  - openmetrics_endpoint: <ENDPOINT_URL>
    metrics:
      - neo4j_.*
    auth_token:
      reader:
        type: oauth
        url: https://api.neo4j.io/oauth/token
        client_id: <AURA_CLIENT_ID>
        client_secret: <AURA_CLIENT_SECRET>
      writer:
        type: header
        name: Authorization
        value: "Bearer <TOKEN>"

For details, see Datadog Agent documentation and configuration reference.

Test that metrics are fetched

sudo systemctl restart datadog-agent
Watch /var/log/datadog/* to see if fetching metrics happens or if there are warnings regarding parsing the configuration.
Check in Datadog metric explorer to see if metrics appear (after a couple of minutes).

Programmatic support

Aura API for Metrics Integration

Aura API supports fetching metrics integration endpoints using:
- endpoint /tenants/{tenantId}/metrics-integration (for project metrics)
- JSON property metrics_integration_url as part of /instances/{instanceId} response (for instance metrics)
Reference: Aura API Specification

Project replaces Tenant in the console UI and documentation. However, in the API, tenant remains the nomenclature.

Aura CLI for Metrics Integration

Aura CLI has a subcommand for tenants command to fetch project metrics endpoint:

aura projects get-metrics-integration --tenant-id <YOUR_PROJECT_ID>

# example output
{
  endpoint: "https://customer-metrics-api.neo4j.io/api/v1/<YOUR_PROJECT_ID>/metrics"
}

# extract endpoint
aura projects get-metrics-integration --project-id <YOUR_PROJECT_ID> | jq '.endpoint'

For instance metrics endpoint, Aura CLI instances get command JSON output includes a new property metrics_integration_url:

aura instances get --instance-id <YOUR_INSTANCE_ID>

# example output
{
    "id": "id",
    "name": "Production",
    "status": "running",
    "tenant_id": "YOUR_PROJECT_ID",
    "cloud_provider": "gcp",
    "connection_url": "YOUR_CONNECTION_URL",
    "metrics_integration_url": "https://customer-metrics-api.neo4j.io/api/v1/<YOUR_PROJECT_ID>/<YOUR_INSTANCE_ID>/metrics",
    "region": "europe-west1",
    "type": "enterprise-db",
    "memory": "8GB",
    "storage": "16GB"
  }

# extract endpoint
aura instances get --instance-id <YOUR_INSTANCE_ID> | jq '.metrics_integration_url'

Reference: Aura CLI cheatsheet

Metrics granularity

The metrics returned by the integration endpoint are grouped based on the labels provided: aggregation, instance_id, and database.

An Aura instance typically runs on multiple servers to achieve availability and workload scalability. These servers are deployed across different Cloud Provider availability zones in the user-selected region.

Metrics Integration supports a more granular view of the Aura instance metrics with additional data points for availability zone & instance mode combinations. This view can be enabled on demand.

Contact Customer Support to enable more granular metrics of instances for your project.

There may be a delay in more granular metrics being available when a new Aura instance is created. This is because of the way 'availability zone' data is collected.

Example metric data points

neo4j_aura_cpu_usage{aggregation="MAX",instance_id="a59d71ae",availability_zone="eu-west-1a",instance_mode="PRIMARY"} 0.025457 1724245310000
neo4j_aura_cpu_usage{aggregation="MAX",instance_id="a59d71ae",availability_zone="eu-west-1b",instance_mode="PRIMARY"} 0.047088 1724245310000
neo4j_aura_cpu_usage{aggregation="MAX",instance_id="a59d71ae",availability_zone="eu-west-1c",instance_mode="PRIMARY"} 0.021874 1724245310000

Additional metric labels

availability_zone - User selected Cloud provider zone.
instance_mode - PRIMARY based on user selected workload requirement of reads and writes. (Minimum 3 primaries per instance)

Usage

The following is an example of gaining more insights into your Aura instance CPU usage for capacity planning:

Example PromQL query to plot

max by(availability_zone) (neo4j_aura_cpu_usage{instance_mode="PRIMARY"}) / sum by(availability_zone) (neo4j_aura_cpu_limit{instance_mode="PRIMARY"})

Figure 1. Chart shows CPU usage of primaries by availability zone

Metric definitions

Out of Memory Errors
Metric name	`neo4j_aura_out_of_memory_errors_total`
Description	The total number of Out of Memory errors for the instance. Consider increasing the size of the instance if any OOM errors.
Metric type	Counter
Default aggregation	`SUM`

CPU Available
Metric name	`neo4j_aura_cpu_limit`
Description	The total CPU cores assigned to the instance nodes.
Metric type	Gauge
Default aggregation	`MAX`

CPU Usage
Metric name	`neo4j_aura_cpu_usage`
Description	CPU usage (cores). CPU is used for planning and serving queries. If this metric is constantly spiking or at its limits, consider increasing the size of your instance.
Metric type	Gauge
Default aggregation	`MAX`

Storage Total
Metric name	`neo4j_aura_storage_limit`
Description	The total disk storage assigned to the instance.
Metric type	Gauge
Default aggregation	`MAX`

Heap Used
Metric name	`neo4j_dbms_vm_heap_used_ratio`
Description	The percentage of configured heap memory in use. The heap space is used for query execution, transaction state, management of the graph etc. The size needed for the heap is very dependent on the nature of the usage of Neo4j. For example, long-running queries, or very complicated queries, are likely to require a larger heap than simpler queries. To improve performance, the heap should be large enough to sustain concurrent operations. This value should not exceed 80% for long periods, short spikes can be normal. In case of performance issues, you may have to tune your queries and monitor their memory usage, to determine whether the heap needs to be increased. If the workload of Neo4j and performance of queries indicates that more heap space is required, consider increasing the size of your instance. This helps avoid unwanted pauses for garbage collection.
Metric type	Gauge
Default aggregation	`MAX`

Page Cache Hit Ratio (per minute)
Metric name	`neo4j_dbms_page_cache_hit_ratio_per_minute`
Description	The percentage of times data required during query execution was found in memory vs needing to be read from disk. Ideally the whole graph should fit into memory, and this should consistently be between 98% and 100%. If this value is consistently or significantly under 100%, check the page cache usage ratio to see if the graph is too large to fit into memory. A high amount of insert or update activity on a graph can also cause this value to change.
Metric type	Gauge
Default aggregation	`AVG`

Page Cache Usage Ratio
Metric name	`neo4j_dbms_page_cache_usage_ratio`
Description	The percentage of the allocated page cache in use. If this is close to or at 100%, then it is likely that the hit ratio will start dropping, and you should consider increasing the size of your instance so that more memory is available for the page cache.
Metric type	Gauge
Default aggregation	`MIN`

Bolt Connections Running
Metric name	`neo4j_dbms_bolt_connections_running`
Description	The total number of Bolt connections that are currently executing Cypher transactions and returning results. This is a set of snapshots over time and may appear to spike if workloads are all completed quickly.
Metric type	Gauge
Default aggregation	`MAX`

Bolt Connections Idle
Metric name	`neo4j_dbms_bolt_connections_idle`
Description	The total number of Bolt connections that are connected to the Aura database but not currently executing Cypher or returning results.
Metric type	Gauge
Default aggregation	`MAX`

Bolt Connections Closed
Metric name	`neo4j_dbms_bolt_connections_closed_total`
Description	The total number of Bolt connections closed since startup. This includes both properly and abnormally ended connections. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Bolt Connections Opened
Metric name	`neo4j_dbms_bolt_connections_opened_total`
Description	The total number of Bolt connections opened since startup. This includes both successful and failed connections. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Garbage Collection Young Generation
Metric name	`neo4j_dbms_vm_gc_time_g1_young_generation_total`
Description	Shows the total time since startup spent clearing up heap space for short lived objects. Young garbage collections typically complete quickly, and the Aura instance waits while the garbage collector is run. High values indicate that the instance is running low on memory for the workload and you should consider increasing the size of your instance.
Metric type	Counter
Default aggregation	`MAX`

Garbage Collection Old Generation
Metric name	`neo4j_dbms_vm_gc_time_g1_old_generation_total`
Description	Shows the total time since startup spent clearing up heap space for long-lived objects. Old garbage collections can take time to complete, and the Aura instance waits while the garbage collector is run. High values indicate that there are long-running processes or queries that could be optimized, or that your instance is running low on CPU or memory for the workload and you should consider reviewing these metrics and possibly increasing the size of your instance.
Metric type	Counter
Default aggregation	`MAX`

Replan Events
Metric name	`neo4j_database_cypher_replan_events_total`
Description	The total number of times Cypher has replanned a query since the server started. If this spikes or is increasing, check that the queries executed are using parameters correctly. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Active Read Transactions
Metric name	`neo4j_database_transaction_active_read`
Description	The number of currently active read transactions.
Metric type	Gauge
Default aggregation	`MAX`

Active Write Transactions
Metric name	`neo4j_database_transaction_active_write`
Description	The number of active write transactions.
Metric type	Gauge
Default aggregation	`MAX`

Committed Transactions
Metric name	`neo4j_database_transaction_committed_total`
Description	The total number of committed transactions since the server was started. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Peak Concurrent Transactions
Metric name	`neo4j_database_transaction_peak_concurrent_total`
Description	The highest number of concurrent transactions detected since the server started. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Transaction Rollbacks
Metric name	`neo4j_database_transaction_rollbacks_total`
Description	The total number of rolled-back transactions. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Checkpoint Events
Metric name	`neo4j_database_check_point_events_total`
Description	The total number of checkpoint events executed since the server started. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Checkpoint Events Cumulative Time
Metric name	`neo4j_database_check_point_total_time_total`
Description	The total time in milliseconds spent in checkpointing since the server started. This value may drop if background maintenance is performed by Aura.
Metric type	Counter
Default aggregation	`MAX`

Last Checkpoint Duration
Metric name	`neo4j_database_check_point_duration`
Description	The duration of the last checkpoint event. Checkpoints should typically take several seconds to several minutes. Values over 30 minutes warrant investigation.
Metric type	Gauge
Default aggregation	`MAX`

Relationships
Metric name	`neo4j_database_count_relationship`
Description	The total number of relationships in the database.
Metric type	Gauge
Default aggregation	`MAX`

Nodes
Metric name	`neo4j_database_count_node`
Description	The total number of nodes in the database.
Metric type	Gauge
Default aggregation	`MAX`

Store Size Database
Metric name	`neo4j_database_store_size_database`
Description	Amount of disk space reserved to store user database data, in bytes. Ideally, the database should all fit into memory (page cache) for the best performance. Keep an eye on this metric to make sure you have enough storage for today and for future growth. Check this metric with page cache usage to see if the data is too large for the memory and consider increasing the size of your instance in this case.
Metric type	Gauge
Default aggregation	`MAX`

Page Cache Evictions
Metric name	`neo4j_dbms_page_cache_evictions_total`
Description	The number of times data in memory is being replaced in total. A spike can mean your workload is exceeding the instance’s available memory, and you may notice a degradation in performance or query execution errors. Consider increasing the size of your instance to improve performance if this metric remains high.
Metric type	Counter
Default aggregation	`MAX`

Successful Query Executions
Metric name	`neo4j_db_query_execution_success_total`
Description	The total number of successful queries executed on this database.
Metric type	Counter
Default aggregation	`SUM`

Query Execution Failures
Metric name	`neo4j_db_query_execution_failure_total`
Description	The total number of failed queries executed on this database.
Metric type	Counter
Default aggregation	`SUM`

Query Latency 99th Percentile
Metric name	`neo4j_db_query_execution_internal_latency_q99`
Description	The query execution time in milliseconds where 99% of queries executed faster than the reported time.
Metric type	Gauge
Default aggregation	`MAX`

Query Latency 75th Percentile
Metric name	`neo4j_db_query_execution_internal_latency_q75`
Description	The query execution time in milliseconds where 75% of queries executed faster than the reported time.
Metric type	Gauge
Default aggregation	`MAX`

Query Latency 50th Percentile
Metric name	`neo4j_db_query_execution_internal_latency_q50`
Description	The query execution time in milliseconds where 50% of queries executed faster than the reported time. This also corresponds to the median of the query execution time.
Metric type	Gauge
Default aggregation	`MAX`

Last Committed Transaction ID
Metric name	`neo4j_database_transaction_last_committed_tx_id_total`
Description	The id of the last committed transaction. Track this for primary cluster members of your Aura instance. It should show overlapping, ever-increasing lines and if one of the lines levels off or falls behind, it is clear that this cluster member is no longer replicating data, and action is needed to rectify the situation.
Metric type	Counter
Default aggregation	`MAX`

Cluster Leader (only included if high granularity is turned on)
Metric name	neo4j_cluster_raft_is_leader
Description	Is this server the leader? Track this for each rafted member in the cluster. It reports 0 if it is not the leader and 1 if it is the leader. The sum of all of these should always be 1. However, there are transient periods in which the sum can be more than 1 because more than one member thinks it is the leader. Action may be needed if the metric shows 0 for more than 30 seconds.
Metric type	Gauge
Default aggregation	`MAX`