Elasticsearch: an introduction to the database and heart of the Elastic Stack

By Lais Ortiz

In this article, I will present the basic concepts of Elasticsearch, a tool first released in 2010 and developed by Elasticsearch N.V. (now Elastic N.V.), and you will understand why it is considered so powerful.

If you’re a non-relational database and technology enthusiast, you’ve probably heard of what is currently one of the most popular NoSQL databases: Elasticsearch.

First things first, who belongs to the Elastic Stack family?

Elasticsearch is one of the technologies in the Elastic Stack, which consists of Kibana, Elasticsearch, Logstash, and Beats.

Let’s talk briefly about Elasticsearch’s three siblings:

Kibana: A data visualization and management tool for Elasticsearch that provides real-time histograms, line charts, pie charts, and maps. A very common use of Kibana is analyzing and visualizing data coming from Elasticsearch.
Logstash: A data processing pipeline that aggregates and processes data. It receives data of many shapes and sizes from multiple sources simultaneously and sends it to Elasticsearch for storage.
Beats: A free and open-source platform made up of purpose-built agents. Beats can run on many machines and systems, whether on servers or in containers, collecting large volumes of data and feeding them to Elasticsearch.

After all, what is Elasticsearch?

In a few words, Elasticsearch (aka ES) is a distributed search and analytics engine that provides near real-time search and analytics for all types of data. No matter what type of data you work with, Elasticsearch can probably handle it and meet your needs, since ES supports search and analysis of structured or unstructured text, numerical data, and geospatial data.

Elasticsearch is built on Apache Lucene and developed in Java. It was designed to be, and it surely is, a tool that stores and searches huge volumes of data quickly and, as said before (it’s worth mentioning again), in near real-time, answering requests in a matter of milliseconds. This is possible thanks to ES’s architecture and to the fact that it works over a REST API: it receives JSON requests through HTTP methods such as GET, POST, PUT, and DELETE, and returns the data in JSON, a lightweight data format.
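To make the JSON-over-HTTP idea concrete, here is a minimal sketch that builds a search request without sending it. The host (a local node on the default port 9200), the index name "articles", and the field name "title" are assumptions made up for illustration; the _search endpoint and the match query are part of the Elasticsearch REST API.

```python
import json
from urllib.request import Request

# Assumed values for illustration: a local node on the default port
# and a hypothetical index called "articles".
ES_URL = "http://localhost:9200"
INDEX = "articles"


def build_search_request(field: str, text: str) -> Request:
    """Build (but do not send) a JSON search request for the _search endpoint."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("title", "elastic stack")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing this request to urllib.request.urlopen would send it, but that requires a running cluster, so the sketch stops at building the request.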

The key components of Elasticsearch architecture

Elasticsearch’s architecture is built to be horizontally and vertically scalable, fast, and flexible. In this section, I’m going to show you its core components: cluster, nodes, shards, replicas, and analyzers.

Cluster

An Elasticsearch Cluster is a group of one or more connected Node instances that together store your data. The purpose of the Cluster is to distribute all indexing and search tasks across its Nodes, and best of all, you can run as many Clusters as you want!

Node

A Node is a single running instance of Elasticsearch that belongs to a cluster, stores data, and takes part in indexing and searching it. Whenever an Elasticsearch instance starts, a Node starts running, so it’s possible to run as many Nodes as needed.

There are three main roles to choose from when configuring an ES Node: Master Node (controls the Elasticsearch cluster), Data Node (holds the data), and Client Node (called a coordinating node in recent versions; it serves as a load balancer that routes incoming requests to the other Nodes).

Shards

It’s possible to split each index into smaller pieces called Shards. Each Shard holds a subset of the index’s data and is fully functional and independent, so it can be hosted on any Node in the cluster. You don’t query a Shard directly: you search the index, and Elasticsearch routes the request to the Shards that belong to it.

This approach keeps an index from exceeding the storage limits of a single hosting server, helps protect the index from hardware failures, and increases query performance, because Shards can be searched in parallel.
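As a rough illustration of how documents end up spread across Shards, the sketch below hashes a routing key (by default the document ID) and takes it modulo the number of primary shards. This is a simplified version of the routing rule described in the Elasticsearch documentation. The hash function here is a stand-in: Elasticsearch uses its own Murmur3-based hash, and zlib.crc32 is used only because it ships with Python and is deterministic. The document IDs are made up.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Simplified routing: hash(routing_key) % num_primary_shards.

    crc32 is a stand-in hash for illustration, not the one Elasticsearch uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs distributed over an index with 3 primary shards.
doc_ids = [f"doc-{n}" for n in range(10)]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}

for doc_id, shard in placement.items():
    print(f"{doc_id} -> shard {shard}")
```

Because the result depends on the number of primary shards, the same document would land on a different shard if that number changed, which is why the primary shard count is fixed when an index is created.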

Replicas

In addition to splitting the indices into smaller pieces called Shards, Elasticsearch also allows us to create copies of these Shards, called Replicas. Each Replica is a full copy of a primary Shard and is stored on a different Node than the primary.

A Shard can have as many Replicas as the project needs! It’s important to mention that, in addition to improving search performance (searches can be served by Replicas too), a Replica works as a fail-safe mechanism for recovery purposes: if the Node holding a primary Shard fails, a Replica can take over.
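The relationship between primary Shards and Replicas can be sketched with the two index settings that control them, number_of_shards and number_of_replicas. Since each Replica is a full copy of a primary Shard, the total number of shard copies in the cluster is primaries × (1 + replicas per primary). The helper function names and the numbers below are illustrative assumptions.

```python
import json


def index_settings(primary_shards: int, replicas_per_primary: int) -> dict:
    """Build a settings body such as the one sent when creating an index."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(body: dict) -> int:
    """Primaries plus all of their replica copies."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = index_settings(primary_shards=3, replicas_per_primary=2)
print(json.dumps(body, indent=2))
print("shard copies in the cluster:", total_shard_copies(body))
```

With 3 primaries and 2 replicas each, the cluster holds 9 shard copies, and it needs at least 3 Nodes for all of them to be assigned, because a Replica is never placed on the same Node as its primary.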

Analyzers

At this point, I could not find a better explanation of what an Analyzer is than the official Elasticsearch reference: “The analyzer parameter specifies the analyzer used for text analysis when indexing or searching a text field.” There are various built-in analyzers, each suited to a different purpose: Standard, Simple, Whitespace, Stop, Keyword, Pattern, Language, and Fingerprint. You can also define custom analyzers for your needs.
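To give a feel for how analyzers differ, here is a toy Python imitation of three of them, based on how the Elasticsearch reference describes their behavior: Whitespace splits on whitespace only, Simple splits on any non-letter character and lowercases, and Stop does the same as Simple and then removes stop words. These functions are not the real implementations (those run inside Elasticsearch), and the stop word set is a small illustrative subset.

```python
import re

# Small illustrative subset; the real Stop analyzer uses a longer English list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def whitespace_analyzer(text: str) -> list[str]:
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text: str) -> list[str]:
    """Lowercase, then keep runs of letters (digits and punctuation split tokens)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop_analyzer(text: str) -> list[str]:
    """Same as simple_analyzer, with stop words removed."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


sample = "The QUICK brown-fox 2 jumps!"
print(whitespace_analyzer(sample))  # ['The', 'QUICK', 'brown-fox', '2', 'jumps!']
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'fox', 'jumps']
print(stop_analyzer(sample))        # ['quick', 'brown', 'fox', 'jumps']
```

The same text produces different tokens under each analyzer, which is why the analyzer chosen for a field affects which searches will match it.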

Elasticsearch Documents

And last but not least, let’s talk about Elasticsearch Documents. To better understand what a document is, we can compare it to a row in a relational database, except that the document is stored in an ES index in JSON format.

ES also lets you define the type of data stored in each field of the index. There are many field types, for example text, keyword, float, date, and geo_point, among others. A document is made of key-value pairs: the key is the name of the field, and the value can be any of the data types described.
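As a sketch of how field types and documents fit together, the snippet below defines a mapping (the part of an index definition that declares field types) and a document that follows it. The field names and values are invented for illustration; the type names text, keyword, float, date, and geo_point are real Elasticsearch field types. The helper unmapped_fields is a hypothetical convenience, not an Elasticsearch API.

```python
import json

# Hypothetical mapping: each key is a field name, each value declares its type.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "tag": {"type": "keyword"},
            "rating": {"type": "float"},
            "published_at": {"type": "date"},
            "location": {"type": "geo_point"},
        }
    }
}

# A document is a JSON object whose keys are field names.
document = {
    "title": "An introduction to Elasticsearch",
    "tag": "nosql",
    "rating": 4.5,
    "published_at": "2023-01-15",
    "location": {"lat": 52.37, "lon": 4.89},
}


def unmapped_fields(doc: dict, index_mapping: dict) -> list[str]:
    """Return document fields that the mapping does not declare."""
    declared = index_mapping["mappings"]["properties"]
    return sorted(set(doc) - set(declared))


print(json.dumps(document, indent=2))
print("fields missing from the mapping:", unmapped_fields(document, mapping))
```

By default, Elasticsearch would add fields missing from the mapping on its own through dynamic mapping, but declaring types up front gives you control over how each field is indexed.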

Let’s look at this simple and interesting image above comparing Elasticsearch and SQL database terminologies. It’s important to understand what each comparison represents in order to understand a Document’s reserved fields, aka metadata, such as _index, _type (deprecated in Elasticsearch 7 and removed in version 8), and _id (the unique identifier for the document), as shown in the image below: