
Switching From Fluentd to Vector Log Aggregation Tool

Log files are extremely important to the data analysis process, as they contain essential information about usage patterns, activities and operations within an operating system, application, server or device. This data is relevant to a number of use cases across an organization, from resource management, application troubleshooting, regulatory compliance and SIEM to business analytics and marketing insights. To manage the logs these use cases depend on and make use of this wealth of data, log aggregation tools enable organizations to systematically collect and standardize log files. However, choosing the right tool can be quite challenging.

This blog details and compares Fluentd and Vector, two popular open source tools for log aggregation.

Fluentd Configuration and Efficiency Calculation

When using orchestration tools like Kubernetes to deploy containers or other API resources, a log aggregator is needed to store the pod or node logs in a cloud platform. For one such requirement, Fluentd was used as the log aggregation tool to push K8s pod logs to cloud storage buckets, with a sample configuration as shown below:

<match kubernetes.**>
  @type <cloud platform name>
  project <project name in cloud platform>
  keyfile <credential json to access the cloud storage>
  bucket <cloud storage bucket name>
  object_key_format <name for the file to be used>
  path <file prefix/path where the file has to be stored>
  <buffer tag,time>
    @type file
    path /var/log/fluent/gcs
    timekey 1m
    timekey_wait 30
    timekey_use_utc true
    flush_thread_count 16
    flush_at_shutdown true
    flush_mode interval
    flush_interval 1
    chunk_limit_size 10MB
    retry_max_interval 30
    retry_wait 60
  </buffer>
  <format>
    @type json
  </format>
</match>
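For concreteness, a filled-in version of the match block might look like the following. This is a minimal sketch assuming the fluent-plugin-gcs output plugin (suggested by the /var/log/fluent/gcs buffer path); the project id, keyfile path and bucket name are hypothetical, as the original masks all platform-specific values:

<match kubernetes.**>
  @type gcs                                        # assuming fluent-plugin-gcs
  project my-gcp-project                           # hypothetical project id
  keyfile /etc/fluent/keys/service-account.json    # hypothetical credential path
  bucket my-log-bucket                             # hypothetical bucket name
  object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
  path k8s-logs/                                   # hypothetical prefix inside the bucket
  <buffer tag,time>
    @type file
    path /var/log/fluent/gcs
    timekey 1m
    timekey_wait 30
    flush_at_shutdown true
  </buffer>
  <format>
    @type json
  </format>
</match>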

Using this system, Fluentd was pushing only 47.62% of total logs to cloud storage (here, efficiency means the percentage of log lines generated by the pods that actually reached the bucket). Since the loss was more than 50%, changes were made to the configuration. With most of the changes, the efficiency stayed somewhere between 40% and 50%; the maximum achieved was an average of 67% over an entire day. Below are some of the changes made, along with the percentage of logs that were pushed to cloud storage:

Change 1:

<buffer tag,time>
  @type file
  path /var/log/fluent/gcs
  timekey 1m
  timekey_wait 30
  timekey_use_utc true
  flush_thread_count 16
  flush_at_shutdown true
  retry_max_interval 60
  retry_wait 30
</buffer>

Efficiency: 46.32%

Change 2:

<buffer tag,time>
  @type file
  path /var/log/fluent/gcs
  timekey 1m
  timekey_wait 30
  timekey_use_utc true
  flush_thread_count 16
  flush_at_shutdown true
</buffer>

Efficiency: 49.89%

Change 3:

<buffer tag,time>
  @type file
  path /var/log/fluent/gcs
  timekey 10m
  timekey_wait 0
  timekey_use_utc true
  flush_at_shutdown true
</buffer>

Efficiency: 37%

Change 4:

<buffer tag,time>
  @type file
  path /var/log/fluent/gcs
  timekey 30
  timekey_wait 0
  timekey_use_utc true
  flush_thread_count 15
  flush_at_shutdown true
</buffer>

Efficiency: 60.88%

Change 5:

<buffer tag,time>
  @type file
  path /var/log/fluent/gcs
  timekey 1
  timekey_wait 0
  timekey_use_utc true
  flush_thread_count 16
  flush_at_shutdown true
  flush_mode immediate
</buffer>

Efficiency: 66.77%

Vector Deployment, Configuration and Resultant Efficiency

To improve this further, the open source Vector tool by Datadog was considered. The tool was suitable for the K8s setup, with a configuration similar to Fluentd's, and was installed on the nodes.

A Helm command was used to clone the official chart repository onto the VMs; the configuration was changed as described below, and Vector was installed as an agent. Vector has two working modes: agent and aggregator. Agent is the plain mode that pushes logs/events from a source to a destination, while aggregator is used to transform and ship data collected by other agents (in this case, Vector itself).
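With the official Helm chart, the mode is selected through a chart value. A minimal sketch of a values override, assuming the current vector/vector chart layout (the role key and its options come from the chart, not from the original setup):

# values.yaml -- override for the vector/vector Helm chart
# "Agent" deploys Vector as a DaemonSet on every node;
# "Aggregator" deploys it as a StatefulSet that receives from other agents.
role: Agent

Such an override would then be passed at install time with -f values.yaml.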

Installing the tool requires adding its Helm repository on the local machine to fetch the chart. Hence, the commands below were run in sequence before installing Vector in a K8s cluster:

helm repo add vector https://helm.vector.dev (add the Vector repo to the Helm list)

helm repo update (update the Helm repos)

helm fetch --untar vector/vector (clone the chart to the local machine)

Configuration:

data_dir: /vector-data-dir

sources:
  <custom source id>:
    # kubernetes_logs, because Kubernetes is the source
    type: kubernetes_logs
    # optional: array of directories to be excluded when collecting logs from the nodes
    exclude_paths_glob_patterns: <array of directories to exclude>

sinks:
  <custom sink id>:
    type: <destination cloud storage>
    # array of source ids whose logs have to be pushed
    inputs: <array of source ids>
    bucket: <bucket name of the cloud storage>
    # optional: path inside the bucket where the logs have to be collected
    key_prefix: <path inside the bucket>
    encoding:
      # optional: encoding of the log file
      codec: <encoding of the log file>

Command to install Vector:

helm install vector . --namespace vector
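For concreteness, a filled-in version of the configuration above might look like the following. The source and sink ids, bucket name, key prefix and exclusion pattern are hypothetical, and the gcp_cloud_storage sink type is an assumption based on the GCS paths in the Fluentd setup (the original keeps the destination generic):

data_dir: /vector-data-dir

sources:
  k8s_logs:                                  # hypothetical source id
    type: kubernetes_logs
    exclude_paths_glob_patterns:
      - /var/log/pods/kube-system_*/**       # hypothetical exclusion

sinks:
  gcs_out:                                   # hypothetical sink id
    type: gcp_cloud_storage                  # assuming a GCS destination
    inputs:
      - k8s_logs
    bucket: my-log-bucket                    # hypothetical bucket name
    key_prefix: k8s-logs/                    # hypothetical path inside the bucket
    encoding:
      codec: json

With the official chart, a block like this is typically supplied under the chart's customConfig value rather than shipped as a standalone file.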

After deploying Vector in the development environment and testing it, the efficiency was ~100%, with negligible loss. The switch was then made to Vector, and it was deployed in the production environment. Vector can ship up to 100,000 events (logs) per second, a very high throughput compared to other log aggregation tools. Vector was able to achieve 99.98-100% efficiency even in the Kubernetes production cluster.
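As a side note (our suggestion, not a step from the original setup), a quick way to sanity-check per-component throughput after deployment is Vector's built-in top command, which requires the Vector API to be enabled in the configuration:

# requires the API to be enabled in the Vector config:
#   api:
#     enabled: true
# then, from the cluster:
kubectl -n vector exec -it <vector pod name> -- vector top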

To learn more about how DataOps can enable highly performant data pipelines with real-time logging and monitoring, watch this video. 

B E Harsha Vardhan

B E Harsha Vardhan is working as a DevOps Engineer at Sigmoid. He has vast experience with cloud platforms; open source tools such as Linux, Kubernetes, Docker, Jenkins, Apache Spark and YARN; and databases such as BigQuery, MySQL and MongoDB; and is well versed in programming in Python, Java, shell script, C++ and C.
