Successful cluster administration can be very difficult without a real-time view of the state of the cluster. Solr itself does not provide aggregated views of its state or any historical usage data, both of which are necessary to understand how the service is used and how it is performing. Knowing the throughput and capacities not only helps detect errors and troubleshoot issues, but is also useful for capacity planning.

Questions may arise, such as:

  • What is the size of my cluster and each collection? How fast does it grow?
  • What is the query rate on my cluster and collections?
  • How many documents do I have in each collection?
  • What is the performance of my indexers?
  • Are my shards balanced?

Answering questions like these requires detailed and historical collection of metrics.

With Cloudera Manager, users have been able to deploy Solr services on CDH and monitor their health since Solr was first integrated. However, the initial monitoring capabilities did not fully answer the above questions in large Solr cluster deployments, which often serve multi-tenant applications under an SLA on CDH and Cloudera Search.

In this post we present the new and improved capabilities available in Cloudera Manager 5.12 to monitor and troubleshoot Cloudera Search clusters, beyond just server health. We will demonstrate how to access existing charts and set up dashboards and alerts. But first, let's review the existing capabilities in Cloudera Manager (CM) that collect rich metrics and let you create ad-hoc dynamic dashboards.

Metrics in Cloudera Manager

Cloudera Manager continuously monitors and collects usage and performance metrics from Solr (and other services running on the shared-storage cluster). The collected metrics are accessible through the Chart Builder feature in Cloudera Manager, where you can build charts and create alerts based on them. Cloudera Manager provides predefined charts with a handful of essential metrics about the cluster's health that will be demonstrated later in this blog post.

The metrics are collected (and documented) at the service, server, shard, and replica levels, depending on the nature of the metric. For example, the JVM heap size is a server-level metric, whereas the query request rate is measured at the core/replica level.

Visualizing metrics

Cloudera Manager already supports creation of ad-hoc queries on collected metrics. The syntax of the query language is SQL-like, making it easy to learn to run custom queries.

We can run custom queries by selecting Charts > Chart Builder from the Cloudera Manager menu. In the Chart Builder interface we enter the query. For example, we can enter:

select select_requests_rate

This query shows the historical request rate of every replica of every collection on every Solr service that is being managed. We can filter these statistics to a specific service or collection:

select select_requests_rate where serviceName='SOLR-1'

Or

select select_requests_rate where solrCollectionName='collection1'

The filters will select only those replicas that belong to the given service or collection. However, if we want to see an aggregated total of the request rates, we need to use a different approach.

Cloudera Manager creates artificial aggregated metrics for your convenience. The aggregated metrics are summaries of metrics over a certain grouping. For example, the select_requests_rate metric is aggregated into total_select_requests_rate_across_solr_replicas, a sum of select_requests_rate over a shard, collection, or service. We can select the desired aggregation by filtering metrics by category. The example below returns the aggregated select_requests_rate for each shard within the given collection:

select total_select_requests_rate_across_solr_replicas where solrCollectionName='collection1' and category='SOLR_SHARD'

The following query shows the total select_requests_rate for each collection:

select total_select_requests_rate_across_solr_replicas where category='SOLR_COLLECTION'

We can also get the sum of all select_requests_rate values for the whole service using this query:

select total_select_requests_rate_across_solr_replicas where category='SERVICE'

By using the category filter, we can specify the aggregation level for the metrics. You may want to experiment with other metrics listed in the documentation to find the ones that fit your specific needs.
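To make the aggregation levels concrete, here is a minimal Python sketch of what summing a per-replica metric by category amounts to. It is an illustration only, not how Cloudera Manager computes its aggregates: the sample data, the `total_rate` helper, and the single service name are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-replica samples: (collection, shard, replica, select_requests_rate).
# Names and values are illustrative, not taken from a real cluster.
replica_rates = [
    ("collection1", "shard1", "replica1", 4.0),
    ("collection1", "shard1", "replica2", 6.0),
    ("collection1", "shard2", "replica1", 5.0),
    ("collection2", "shard1", "replica1", 2.5),
]

def total_rate(samples, category):
    """Sum per-replica rates at the given aggregation level."""
    totals = defaultdict(float)
    for collection, shard, _replica, rate in samples:
        if category == "SOLR_SHARD":
            key = (collection, shard)
        elif category == "SOLR_COLLECTION":
            key = collection
        elif category == "SERVICE":
            key = "SOLR-1"  # assumes all replicas belong to one service
        else:
            raise ValueError(f"unknown category: {category}")
        totals[key] += rate
    return dict(totals)

per_shard = total_rate(replica_rates, "SOLR_SHARD")
per_collection = total_rate(replica_rates, "SOLR_COLLECTION")
per_service = total_rate(replica_rates, "SERVICE")
```

Each category produces one time series per group, which is why the SOLR_SHARD query returns more lines on the chart than the SERVICE query.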

You can learn more about tsquery from the documentation.
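The same tsquery statements can also be sent to the Cloudera Manager REST API, which exposes a time-series endpoint that takes the query as a parameter. The sketch below only builds the request URL and does not contact a server. The host name, the port, and the API version (v17 is assumed here for Cloudera Manager 5.12) are assumptions to verify against your own deployment, and `timeseries_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def timeseries_url(base_url, tsquery, api_version="v17"):
    """Build a Cloudera Manager time-series API URL for a tsquery.

    api_version is an assumption; check which version your server supports.
    """
    params = urlencode({"query": tsquery})
    return f"{base_url.rstrip('/')}/api/{api_version}/timeseries?{params}"

url = timeseries_url(
    "http://cm-host.example.com:7180",  # hypothetical Cloudera Manager host
    "select total_select_requests_rate_across_solr_replicas "
    "where category='SOLR_COLLECTION'",
)
print(url)
```

Fetching the URL requires authentication against Cloudera Manager, which is left out of this sketch.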

Predefined charts for Solr

In Cloudera Manager 5.12, we introduce a set of new and improved charts for monitoring Solr services. The Solr service status page in this release contains 8 essential charts:

  • Request Rate: These three charts are summaries and statistical distributions of the request rates for three request types, including select and update.
  • Average Response Time: These three charts display the distribution of average response times for the same three request types.
  • Index Size: The aggregated index size of the cluster, along with the distribution of index sizes among all cores.
  • Total Documents: The aggregated number of documents, along with the distribution of document counts among all cores.

Figure 1: New charts on Solr service status page in Cloudera Manager 5.12

These 8 new charts help administrators quickly get an overview of the cluster performance.

Collection Statistics Page

The Collection Statistics page under the Solr service shows more detailed metrics about the cluster and collections. Similarly to the new service status page charts, request rates, average response times, and index sizes are shown at the collection level. These detailed charts are helpful for monitoring the performance and usage of each collection.

In Cloudera Manager 5.12, this page got a minor facelift. The histograms, which showed only the current state, have been replaced with historical charts that help visualize changes over time.

Figure 2: Improved Collection Statistics page summary in Cloudera Manager 5.12

We can also see collection-level statistics. By selecting a specific collection from the side menu, we can access a more detailed view of that collection.

The collection view shows charts for the selected collection only. Index size and document count are displayed at the shard level alongside the totals. This page also has a Cache Hit Ratio chart showing the historical cache efficiency of the document cache, field value cache, filter cache, and query result cache.

Figure 3: Improved Collection Statistics page collection view in Cloudera Manager 5.12

Monitoring and Troubleshooting

Let's take a look at a few scenarios where we can troubleshoot or detect unexpected usage using the charts and metrics introduced above.

Sudden request rate change

A significant drop in the total request rate could indicate a malfunctioning client application, network issue, or misconfiguration. A sudden drop in the update request rate can indicate an indexer job error. We can use the improved Collection Statistics page to see which collection has a recent decrease in request rate.
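As a rough illustration of what "a sudden drop" can mean in practice, the following Python sketch compares the latest request rate against the mean of the preceding samples. The `sudden_drop` helper, the window size, and the 50% threshold are assumptions chosen for the example; they are not Cloudera Manager settings.

```python
def sudden_drop(rates, window=5, threshold=0.5):
    """Return True if the latest rate fell below `threshold` times the
    mean of the `window` samples that precede it."""
    if len(rates) < window + 1:
        return False  # not enough history to compare against
    baseline = sum(rates[-window - 1:-1]) / window
    return baseline > 0 and rates[-1] < threshold * baseline

# Hypothetical update request rates (requests per second) for one collection.
steady = [100, 102, 98, 101, 99, 95]
dropped = [100, 102, 98, 101, 99, 30]
print(sudden_drop(steady), sudden_drop(dropped))
```

The right window and threshold depend on how bursty the workload normally is, so treat both as tuning knobs.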

Average response time increase

One of the key indicators of performance is the time it takes for a request to be served. If there is a significant increase in the average response times, it might indicate a performance bottleneck or other malfunction. On the service status page charts, the maximum value of the average response times indicates the core/replica that is performing the slowest on average. By expanding the chart (double-arrow at the top right corner) and selecting a point on the maximum line, we can see which replica is reporting that average response time. We can then investigate the given collection to determine whether the performance drop is localized to the replica or collection, or if it affects the entire cluster.

Figure 4: Detailed view of Update Average Response Time chart in Cloudera Manager 5.12

For example, Figure 4 shows that one replica has a higher average response time on updates than the rest of the cluster.

Index distribution

On the Collection Statistics page, under each collection, we can see the index size and document count at the shard level. In an ideal situation, the document count (and therefore, the index size) is evenly distributed. If the shard-key contains an account ID (or any other unique prefix), the distribution of documents among shards can correlate with the document count of that account ID. Thus, if an account ID has significantly more documents than others, the shard it belongs to will also have more documents. That could lead to uneven resource utilization and performance issues. The collection-level chart of index sizes and document counts can help you easily identify unbalanced shards.
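One simple way to quantify how unbalanced the shards are is to compare the largest shard with the average shard. The Python sketch below does that for per-shard document counts read off the chart. The `shard_imbalance` helper and the sample numbers are hypothetical, and what ratio counts as a problem is a judgment call for your own workload.

```python
def shard_imbalance(doc_counts):
    """Return the ratio of the largest shard's document count to the mean.

    A value close to 1.0 means the shards are evenly filled.
    """
    counts = list(doc_counts.values())
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0

# Hypothetical document counts per shard of one collection.
doc_counts = {"shard1": 1_000_000, "shard2": 1_050_000, "shard3": 2_950_000}
ratio = shard_imbalance(doc_counts)
largest = max(doc_counts, key=doc_counts.get)
print(f"{largest} holds {ratio:.2f}x the average document count")
```

In this made-up example one shard holds well over the average, the pattern you would expect when a single account ID dominates the shard key.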

Setting up alerts on Solr metrics

Proactive monitoring of a Solr service is much easier using metrics and predefined charts. In addition to periodically viewing charts, it is useful to set up alerts and notifications for certain scenarios that we want to monitor.

Cloudera Manager supports triggers that let you track performance and perform predefined actions when conditions are met. You can create any tsquery on any metrics and set the health state depending on the returned values. For example, you can create a trigger that sets the cluster state to Concerning whenever the average response time goes above 2 seconds.

You can create triggers using custom queries, or you can use queries from existing charts. In this example, we are going to create a trigger based on the Select Average Response Time chart found on the Solr service status page.

  1. Click the 'gear' icon in the corner of the Select Average Response Time chart.
  2. Enter a Name for the trigger (for example, 'Slow select requests').
  3. Modify the pre-populated trigger formula so that the query only returns values over 0.5 seconds.
  4. Click Create Trigger.

The newly created trigger will set the Solr service state to Concerning whenever the highest average response time among replicas exceeds 0.5 seconds.
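The condition this trigger evaluates can be sketched in a few lines of Python. This is not Cloudera Manager's trigger syntax, only an illustration of the logic: the `service_health` helper, the replica names, and the "GOOD" label for the non-firing case are assumptions.

```python
def service_health(avg_response_times, threshold_seconds=0.5):
    """Return 'CONCERNING' if the slowest replica's average response time
    exceeds the threshold, otherwise 'GOOD'."""
    worst = max(avg_response_times.values(), default=0.0)
    return "CONCERNING" if worst > threshold_seconds else "GOOD"

# Hypothetical average select response times per replica, in seconds.
healthy = {"replica1": 0.12, "replica2": 0.20, "replica3": 0.18}
degraded = {"replica1": 0.12, "replica2": 0.73, "replica3": 0.18}
print(service_health(healthy), service_health(degraded))
```

Because the check uses the maximum, a single slow replica is enough to change the state of the whole service.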

Creating Custom Dashboards

The default charts show essential information about the Solr service. We can also create custom dashboards by selecting the metrics that best suit our requirements.

As a demonstration, we are going to create a Service Level dashboard that displays 99th percentile response times for different query types on each collection.

  1. Select Charts > Dashboards in Cloudera Manager.
  2. Enter a dashboard name (for example: 'Solr Service Level').
  3. Click the Create button.
  4. Click the View Dashboard link.

This creates an empty dashboard. We are going to add charts with the 99th percentile request time metric, which measures the slowest request time among the fastest 99% of requests. In other words, only 1% of the requests are served slower than the reported value. You can also use the 75th, 95th, or 99.9th percentiles.
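For readers less familiar with percentiles, the following Python sketch computes one with the nearest-rank method on a list of request times. It only illustrates the definition above; how Solr and Cloudera Manager actually compute the reported percentile is not covered here, and the `percentile` helper and sample data are hypothetical.

```python
import math

def percentile(request_times, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    if not request_times:
        raise ValueError("no samples")
    ordered = sorted(request_times)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical request times in milliseconds: 1, 2, ..., 100.
times_ms = list(range(1, 101))
p75 = percentile(times_ms, 75)
p99 = percentile(times_ms, 99)
print(p75, p99)
```

With 100 samples, exactly one request (1%) is slower than the 99th percentile value.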

  1. On the Service Level dashboard page, click the Add Chart button.
  2. Enter the following tsquery:

    select select_99th_pc_request_time_across_solr_replicas where category='SOLR_COLLECTION'

  3. Enter a title: Select Request Time 99th Percentile
  4. Optionally, select All Separate under Facets to have one chart per collection.
  5. Click Save.
  6. Repeat steps 1-5, changing the metric prefix from select to update and changing the title accordingly.

We now have a custom dashboard displaying Solr service level metrics for select and update requests.

Summary

Cloudera Manager is a powerful tool for making efficient use of the metrics exposed by Solr and by other services running in a CDH deployment. Custom CM queries let you tailor the monitoring to your specific requirements.

In this blog post you have learned how to use custom CM queries to aggregate metrics. You have seen the new predefined monitoring and statistics charts that are available for more granular Cloudera Search monitoring and troubleshooting. You have also learned how these charts can help in specific troubleshooting scenarios and how to set up triggers. We hope that these new capabilities will help you in your production environment, and we look forward to your feedback!

If you would like to learn more about the subject, you can read the corresponding documentation.

Cloudera Inc. published this content on 20 July 2017 and is solely responsible for the information contained herein.
Distributed by Public, unedited and unaltered, on 20 July 2017 16:03:13 UTC.

Original document: http://blog.cloudera.com/blog/2017/07/quicker-insight-into-apache-solr-and-collection-health/

Public permalink: http://www.publicnow.com/view/FEECE2DC08381A1FE4FC8A0B81E22FBF7395C35E