Log Monitoring and Alerting with Grafana Loki

Log Monitoring and Alerting with Grafana Loki by ruturaj header img

In a production environment, a downtime of even a few microseconds is intolerable. Debugging such issues is time-critical. Proper logging and monitoring of infrastructure help in debugging such scenarios. It also helps in optimizing cost and other resources proactively. It also helps to detect any impending issue which may arise in the near future. There are various logging and monitoring solutions available in the market. In this post, we will walk through the steps to deploy Grafana Loki in a Kubernetes environment. This is due to its seamless compatibility with Prometheus; a widely used software for collecting metrics. Grafana Loki consists of three components Promtail, Loki, and, Grafana (PLG) which we will see in brief before proceeding to the deployment. This article provides a better insight into the architectural differences of PLG and other primary logging and monitoring stack like Elasticsearch-FluentD-Kibana (EFK).

Logging, Monitoring and Alerting with Grafana Loki

Before proceeding with the steps for deploying Grafana Loki, we will see each tool briefly.

Promtail

Promtail helps to monitor applications by shipping the container logs to Loki or Grafana cloud. This process primarily involves discovering targets, attaching labels to log streams from both log files and the systemd journal, and shipping them to Loki. Promtail’s service discovery is based on the Prometheus’ service discovery mechanism.

Loki

As quoted by creators of Loki, Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. Loki uses the same service discovery mechanism as that of Prometheus and adds labels to the log stream instead of indexing. Due to which, logs received from Promtail consist of the same set of labels as that of application metrics. Thus, it not only provides better context switching between logs and metrics but also avoids full index logging.

Grafana

Grafana is an open-source platform for monitoring and observability. It specifically operates on time-series data coming from sources like Prometheus and Loki. Moreover, it allows you to query, visualize, alert on the metrics regardless of its stored location. It helps to create, explore, and share dashboards and encourages data-driven culture.

Deploying the Promtail-Loki-Grafana

To deploy this stack, we will use the respective Helm charts and modify their values according to our requirements. At first, we will install Prometheus-Operator (since it includes Prometheus, Alert-Manager, and Grafana). Then, we will install Loki-stack (which includes Promtail) following with Nginx.

Loki

For deploying Loki, let us create a loki-stack-values.yaml and add the values as per our requirement.

# Loki Stack Values
promtail:
    serviceMonitor:
      enabled: true
      additionalLabels:
        app: prometheus-operator
        release: prometheus
  
    pipelineStages:
    - docker: {}
    - match:
        selector: '{app="nginx"}'
        stages:
        - regex:
            expression: '.*(?P<hits>GET /.*)'
        - metrics:
            nginx_hits: 
              type: Counter
              description: "Total nginx requests"
              source: hits
              config:
                action: inc

We enable service monitor for the Promtail and add appropriate labels so that it syncs with the Prometheus easily. Loki transforms a log line, changes its labels, and change the format of the timestamp using the various stages structured in a pipeline as given in here. In the snippet shown above, there is a match stage that runs only if the scraped config has key=app and value=nginx. The next stage filters out the log line that contains the word hits followed by GET with the help of regular expression.

In the metrics stage, we define the nginx_hits metric. Promtail exposes this custom metric through its /metrics endpoint. We have defined the metric type as a counter that will get incremented  (as defined in action) when hits (source) are filtered out from the regex stage. In order to reflect this metric in Prometheus; Prometheus should scrape Promtail. In this way, we can create a custom metric from the container logs. Moreover, we can also scale the deployments based on such custom metrics as explained in this post.

To install Loki run the following commands:

helm3 repo add loki https://grafana.github.io/loki/charts
helm3 repo update
helm3 upgrade --install loki loki/loki-stack --values=loki-stack-values.yaml

Prometheus-Operator

Create a prom-oper-values.yaml and add the following values to it:

grafana:
    additionalDataSources:
      - name: loki
        access: proxy
        orgId: 1
        type: loki
        url: http://loki:3100
        version: 1
  
additionalPrometheusRules:
  - name: loki-alert
    groups:
      - name: test_nginx_logs
        rules:
        - alert: nginx_hits
          expr: sum(increase(promtail_custom_nginx_hits[1m])) > 2
          for: 2m

alertmanager:
  config:
    global:
      resolve_timeout: 1m
    route:
      group_by: ['alertname']
      group_wait: 3s
      group_interval: 5s
      repeat_interval: 1m
      receiver: slack-notifications
      routes:
      - match:
          alertname: nginx_hits
        receiver: slack-notifications
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/ABCD1234/ABCD567'
        channel: '#test'

In this, we have added configuration for the Grafana to add Loki as the data source. Next, we have configured the rule to push alert with the name nginx_hits. These rules are grouped together to execute them sequentially at regular intervals. The threshold for firing an alert is set in the expr. The rule defined in this way helps to precompute expressions, owing to which you can have faster execution of expressions.

The PromQL expression (expr) defined as threshold, evaluates at the current time. It records the result as a new set of time series under the defined metric name. The expression which we have used consists of increase function. It calculates the difference between the counter values over a period of time (1 min). sum aggregates the nginx_hits from multiple counters. An alert is fired when the condition in the expr persists for a given period of time. The for clause helps Prometheus to verify whether the alert is active during each evaluation time and fires alert accordingly. Alerts remain in Pending state before entering into the Firing state.

Lastly, we configure Alertmanager to send the notification over Slack. Alertmanager follows a tree structure for routing the incoming alerts. We can form groups of alerts based on alertname, jobs, cluster, etc. A notification is sent on the pre-defined receiver when a matching alert is identified. In this case, we have defined the receiver as slack-notifications. receiver consists of configuration for the required Slack channel and its webhook URL.

helm3 upgrade --install prometheus stable/prometheus-operator --values=prom-oper-values.yaml

Next, we deploy Nginx and the yaml for the same can be found here. Nginx is exposed via NodePort.

kubectl apply -f nginx.yaml

We will once verify the running status of all the Pods.

└─ $ ▶ kubectl get pods
alertmanager-prometheus-prometheus-oper-alertmanager-0   2/2     Running   0          3m14s
loki-0                                                   1/1     Running   0          82s
loki-promtail-n494s                                      1/1     Running   0          82s
nginx-deployment-55bcb6c8f7-f8mhg                        1/1     Running   0          42s
prometheus-grafana-c4bcbbd46-8npgt                       2/2     Running   0          3m44s
prometheus-kube-state-metrics-6d6fc7946-ts5z4            1/1     Running   0          3m44s
prometheus-prometheus-node-exporter-jwz64                1/1     Running   0          3m44s
prometheus-prometheus-oper-operator-f8df9fcf9-kc5wc      2/2     Running   0          3m44s
prometheus-prometheus-prometheus-oper-prometheus-0       3/3     Running   1          3m2s

Logging, Monitoring and Alerting

To simulate the condition for firing the alert we will run the following shell commands. This will hit the Nginx URL after every 10 seconds.

└─ $ ▶ while true; do curl --silent --output /dev/null --write-out '%{http_code}' $(minikube service nginx --url); sleep 10; echo; done
200
200

After login into the Grafana, click on the Explore tab. Select data source as Loki and enter the {app=”nginx”} in Log labels. Note that metadata of Nginx deployment contains the label app=nginx which is the same that filters out logs in Loki.

Grafana Dashboard displayig logs from Loki

Grafana Dashboard displaying logs from Loki

Alerts enter into the Pending state as soon as its condition satisfies. This is verified in Prometheus’ dashboard under ‘Alerts’ tab.

Alert in Pending state

If this alert is active for 2 mins, it then enters into the Firing state.

Alert in Firing state

Once the alert enters into the Firing state, it propagates and gets notified to the respective Slack channel. The figure below shows nginx_hits alert notification in the test channel.

nignx_hits alert received inn slack channel

Conclusion

The main motive of this post is to demonstrate how you can enable logging, monitoring, and alerting with PLG stack. In short, one should consider using Grafana Loki a.k.a PLG due to the following aspects:

  • All the three tools are open-source with a large community support
  • It can be easily integrated with Prometheus
  • You can easily create a metric from logs e.g. nginx_hits as shown in the previous section
  • With a single Grafana dashboard, you can monitor both logs and metrics.

 

References

Header and Featured image from Freepik

Ruturaj Kadikar

Author Ruturaj Kadikar

More posts by Ruturaj Kadikar

Leave a Reply