From the series: Elasticsearch
How Does Elasticsearch Work? Elasticsearch Architecture: A General Overview
Elasticsearch is a fast and scalable search and analytics engine built on top of the Apache Lucene library. It is known for its powerful full-text search capabilities and is widely used in applications ranging from enterprise search to log and event data analysis.
Here are some of the key reasons to use Elasticsearch:
- Near Real-Time (NRT) Search: Elasticsearch provides near-instant search results, making it an excellent choice for applications that require real-time search or analytics functionality. This is achieved through efficient indexing processes that update the search indices almost immediately after data is ingested.
- Powerful Querying Abilities: Elasticsearch offers various search and query options, from simple keyword searches to complex compound queries, which can filter, aggregate, and sort results in ways that closely match user needs. This flexibility makes it suitable for a broad range of applications, including e-commerce, data monitoring, and enterprise search.
- Scalability: Designed to scale horizontally, Elasticsearch allows you to manage growing data volumes by adding more nodes to the cluster, distributing both data and search load efficiently. This enables it to handle billions of documents and high query volumes without compromising performance.
- High Availability: With built-in support for sharding and replication, Elasticsearch ensures high availability and fault tolerance. By distributing data across multiple nodes and replicating it, Elasticsearch can continue to serve queries even if some nodes fail.
- Analytics: Beyond search, Elasticsearch provides robust analytics capabilities, including aggregations that enable complex data analysis and visualization directly within the engine. This allows users to derive insights from their data, such as identifying trends, patterns, and anomalies.
- Integrations: Elasticsearch integrates seamlessly with a wide variety of tools and platforms, including log management systems (like Logstash and Kibana), security frameworks, and business intelligence tools. It also supports multiple programming languages, making it easy to embed search capabilities in various applications.
Due to these capabilities, Elasticsearch is currently the most popular enterprise search engine according to the DB-Engines ranking, making it a go-to solution for many organizations looking for scalable search and analytics.
Architecture Overview
An Elasticsearch cluster is composed of individual service units called nodes, which store data and perform indexing and querying operations. A cluster can have many nodes, and each node plays a specific role depending on its configuration:
- Master Node: Responsible for cluster-wide operations such as creating and deleting indices, managing cluster state, and distributing shards. The master node is selected through an election among master-eligible nodes.
- Data Node: Stores data and handles CRUD (Create, Read, Update, Delete) operations and search requests. Data nodes perform the heavy lifting of indexing and querying.
- Ingest Node: Pre-processes documents before indexing, such as enriching data, transforming data formats, or removing unwanted fields. This helps in optimizing the data before it's stored in the indices.
- Client (Coordinating) Node: Acts as a load balancer by routing requests to the appropriate data nodes and aggregating results. It does not hold data or perform indexing but plays a crucial role in distributing search load efficiently across the cluster.
Indices are made up of shards, and shards live on nodes. An index, which can be thought of as the rough equivalent of a database, is distributed across different nodes in a multi-node cluster. This distribution allows Elasticsearch to scale and handle large datasets effectively.
Shards are of two types: primary shards and replica shards. Each primary shard can have one or more replica shards. Elasticsearch ensures that a primary shard and its replicas are never located on the same node, enhancing data redundancy and fault tolerance.
- Primary Shard: Each document is stored in one primary shard. Elasticsearch uses these primary shards to create replica shards.
- Replica Shard: A copy of the primary shard that provides redundancy. Replica shards improve query performance and provide failover in case the primary shard becomes unavailable.
Each shard is essentially a self-contained search engine that indexes and handles queries. This architecture allows Elasticsearch to split large datasets across multiple nodes and shards, improving both search performance and data availability.
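How does Elasticsearch decide which primary shard holds a given document? By default it applies a routing formula, `shard = hash(_routing) % number_of_primary_shards`, where the routing value defaults to the document ID. The sketch below illustrates the idea with a toy hash function standing in for the Murmur3 hash Elasticsearch actually uses:

```python
def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document.

    Elasticsearch hashes the _routing value (the document ID by default)
    with Murmur3; a simple deterministic toy hash stands in for it here.
    """
    h = sum(ord(c) * 31 ** i for i, c in enumerate(doc_id))
    return h % num_primary_shards

# The same ID always routes to the same shard, so reads find the
# document that an earlier write stored:
assert route_to_shard("doc-42", 5) == route_to_shard("doc-42", 5)
```

This is also why the number of primary shards is fixed at index creation time: changing it would change the result of the modulo and break routing for existing documents.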
Shards and Index
An index is a logical namespace that maps to one or more primary shards and their replicas. Each shard is a Lucene index, and these Lucene indices consist of segments, which are fully functional inverted indices.
But what is an inverted index?
An inverted index is a data structure commonly used in search engines to map terms to the documents containing them. Instead of having a list of documents and the terms they contain, an inverted index has a list of terms and the corresponding documents. This structure allows for quick full-text searches across large datasets, enabling fast retrieval of documents that match search criteria.
For example, consider the following documents:
- “What is an inverted index?”
- “Thinking about inverted index is so much fun.”
- “I don't often index my thoughts as much as I should.”
The inverted index would look something like this:
- "index": [1, 2, 3]
- "inverted": [1, 2]
- "much": [2, 3]
Word | Document Frequency | Document Numbers |
---|---|---|
What | 1 | 1 |
is | 2 | 1, 2 |
an | 1 | 1 |
inverted | 2 | 1, 2 |
index | 3 | 1, 2, 3 |
Thinking | 1 | 2 |
about | 1 | 2 |
so | 1 | 2 |
much | 2 | 2, 3 |
fun | 1 | 2 |
I | 1 | 3 |
don't | 1 | 3 |
often | 1 | 3 |
my | 1 | 3 |
thoughts | 1 | 3 |
as | 1 | 3 |
should | 1 | 3 |
This structure allows Elasticsearch to efficiently retrieve documents containing specific terms, even in very large datasets.
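A minimal inverted index over the three example sentences above can be built in a few lines of Python (real analyzers do much more, such as lowercasing, stemming, and stop-word removal, but the core data structure is this simple term-to-documents map):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_num, text in enumerate(docs, start=1):
        # Tokenize on word characters, keeping apostrophes ("don't");
        # lowercase so "Index" and "index" map to the same term.
        for token in re.findall(r"[\w']+", text.lower()):
            index[token].add(doc_num)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [
    "What is an inverted index?",
    "Thinking about inverted index is so much fun.",
    "I don't often index my thoughts as much as I should.",
]
inv = build_inverted_index(docs)
# inv["index"] -> [1, 2, 3], inv["inverted"] -> [1, 2], inv["much"] -> [2, 3]
```

Answering the query "inverted index" is then just a matter of intersecting two posting lists instead of scanning every document.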
What Happens When a Document is Indexed?
When a document is indexed, Elasticsearch performs several steps:
- Parsing: The document data is parsed and converted into a structured format (JSON) that Elasticsearch understands. This ensures that the data can be processed and indexed correctly.
- Analysis: The fields of the document are analyzed using a tokenizer and a set of filters (e.g., lowercasing, removing stop words). This step helps break down text fields into tokens, which are the searchable terms stored in the inverted index.
- Inverted Index Creation: From the analyzed fields, terms are generated, and an inverted index is constructed. This data structure maps each term to the documents that contain it, enabling efficient full-text searches.
- Vector Index Creation (if applicable): If the document contains vector fields (e.g., dense vectors from embeddings), Elasticsearch stores these vectors in a special type of index that allows for approximate nearest neighbor (ANN) search. This enables similarity search capabilities, such as finding documents with similar content based on vector representations from machine learning models. The vectors are indexed using algorithms like HNSW (Hierarchical Navigable Small World graphs) to support fast and efficient vector-based queries.
- Document Storage: The document, along with its metadata (e.g., unique ID, version number), is stored in the index. This allows Elasticsearch to retrieve the document directly when a search query matches it.
- Replication: If replica shards are configured, Elasticsearch replicates the document data across these shards to ensure high availability and redundancy. This replication process helps maintain data integrity and provides failover protection in case of node failures.
- Updating the Cluster State: The cluster state is updated to reflect the new or modified document, including updates to shard allocation and indexing statistics. This step ensures the cluster's consistency and availability of the new data for search operations.
- Index Refresh: Periodically, Elasticsearch refreshes the index to make newly indexed documents searchable. This involves making the latest changes visible to search queries by creating new segment files that include the newly indexed data.
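The analysis step above can be sketched as a small pipeline: a tokenizer followed by token filters. This is a simplified stand-in for Elasticsearch's configurable analyzers, with a tiny illustrative stop-word list rather than a real one:

```python
import re

STOP_WORDS = {"is", "an", "the", "a", "of"}  # tiny illustrative stop list

def analyze(text):
    """Sketch of an analyzer: tokenizer + lowercase filter + stop filter."""
    tokens = re.findall(r"\w+", text)                   # tokenizer
    tokens = [t.lower() for t in tokens]                # lowercase token filter
    return [t for t in tokens if t not in STOP_WORDS]   # stop token filter

analyze("What is an Inverted Index?")
# -> ['what', 'inverted', 'index']
```

The tokens this pipeline emits are exactly the terms that end up as keys in the inverted index, which is why the same analyzer must be applied to query text at search time.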
Elasticsearch’s vector search functionality allows for more advanced search capabilities beyond traditional keyword matching. By using dense vectors (like those generated by neural networks or other embedding models), Elasticsearch can perform similarity searches based on the mathematical properties of the vectors. This is particularly useful in applications such as:
- Recommendation Systems: Finding items that are similar to a user's preferences based on vector representations.
- Semantic Search: Enabling searches that understand the meaning and context of queries rather than just matching keywords.
- Image and Audio Search: Indexing vectors representing images or audio clips to find similar media content.
To use vector search, you define vector fields in your index mappings and specify the appropriate settings for vector search, such as the dimensionality of the vectors and the type of distance metric (e.g., cosine similarity, Euclidean distance). By incorporating vector fields and leveraging Elasticsearch's vector search capabilities, you can enhance your search experience with powerful, data-driven insights that go beyond traditional text search.
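To make this concrete, here is what such a mapping and a kNN search body look like, shown as Python dictionaries you would send to the index and search APIs. The field names (`title`, `embedding`) and the dimensionality are illustrative, and the query vector is shortened for readability:

```python
# Index mapping with a dense_vector field (field names are illustrative).
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {                      # vector from an embedding model
                "type": "dense_vector",
                "dims": 384,                    # must match the model's output size
                "index": True,
                "similarity": "cosine",         # distance metric for ANN search
            },
        }
    }
}

# A kNN search body against that field (query_vector shortened for brevity).
knn_query = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.03, 0.87],   # normally 384 floats
        "k": 10,                                # top-k neighbours to return
        "num_candidates": 100,                  # candidates examined per shard
    }
}
```

Raising `num_candidates` trades query speed for recall: the HNSW graph is explored more thoroughly before the top `k` results are returned.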
What Happens When a Query is Run?
When a query is executed in Elasticsearch, the search operation occurs in two main phases: the query phase and the fetch phase. This two-step process is designed to efficiently search across distributed data while optimizing performance.
- Query Phase: The search query is broadcast to one copy (primary or replica) of each relevant shard. Each shard executes the query independently, processing it locally to produce a list of matching documents. During this phase, Elasticsearch uses a customizable scoring algorithm to rank the results, with Okapi BM25 (an evolution of TF-IDF) being the default method for text relevancy scoring. This algorithm considers factors like term frequency, inverse document frequency, and field-length normalization to calculate the relevance score of each document. The top-ranked results from each shard are then collected and merged into a single sorted list to produce the final ranked result set.
- Fetch Phase: Once the top results are identified in the query phase, the fetch phase retrieves the actual documents from the shards using their document IDs. In this phase, Elasticsearch ensures that only the necessary documents are fetched, which optimizes performance by minimizing data transfer and reducing load on the nodes. The fetch phase is responsible for assembling the final result set that is returned to the user, including any requested fields, highlights, or additional metadata.
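The BM25 scoring mentioned in the query phase can be written out directly. The sketch below uses the IDF variant that Lucene (and therefore Elasticsearch) applies, with the default parameters k1 = 1.2 and b = 0.75; it computes the score contribution of a single query term:

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Per-term BM25 score, as in Lucene's default similarity.

    tf:       term frequency within the document
    doc_freq: number of documents in the index containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# A rare term (low doc_freq) scores higher than a common one, all else equal:
rare = bm25_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=5)
common = bm25_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=500)
assert rare > common
```

Note the saturation in the `tf / (tf + norm)` term: unlike raw TF-IDF, repeating a keyword many times yields diminishing returns, which makes BM25 harder to game with keyword stuffing.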
Coordination and Merging: During the query phase, a coordinating node aggregates the partial results from each shard, merges them based on the relevancy scores, and selects the top results that should be fetched in the fetch phase. This coordination reduces the amount of data moved across the cluster and ensures efficient use of resources.
Handling Large Results: For queries that return a large number of results, Elasticsearch uses a technique called pagination, which allows the client to request results in smaller batches (e.g., 10 results per page). This is achieved using the `from` and `size` parameters, which specify the offset and the number of results to return. This approach helps manage memory usage and network load, ensuring that the cluster remains responsive even under heavy query loads.
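Translating a page number into these `from` and `size` parameters is a one-liner; a small helper (hypothetical, for illustration) makes the arithmetic explicit:

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into Elasticsearch from/size parameters."""
    return {"from": (page - 1) * page_size, "size": page_size}

page_params(3)   # -> {"from": 20, "size": 10}
```

Note that deep offsets are expensive, because every shard must still collect and sort `from + size` hits before the coordinating node can merge them; Elasticsearch caps `from + size` at 10,000 by default for this reason.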
Support for Vector Search: In addition to traditional text-based queries, Elasticsearch can handle vector searches during the query phase. When dealing with vector fields, Elasticsearch uses algorithms like HNSW (Hierarchical Navigable Small World graphs) to perform approximate nearest neighbor searches, which are essential for tasks such as similarity search, recommendations, and semantic search. The query phase for vector searches also involves ranking and merging results based on vector similarity scores.
Query Optimization: Elasticsearch automatically optimizes query execution by using various techniques such as query caching, query rewriting, and shard preference settings. These optimizations help reduce the computational cost of frequent or complex queries and improve response times by reusing previously computed results when possible.
Security and Access Control: During the query execution, Elasticsearch enforces security rules and access controls defined by roles and permissions. This ensures that users can only access documents and fields that they are authorized to view, providing an additional layer of security in multi-user environments.
Conclusion
This brief introduction to Elasticsearch provides an overview of its architecture and core functionalities, highlighting why it's a popular choice for scalable search and analytics solutions.
In the next parts of the series, we'll dive deeper into more advanced topics, including performance tuning, advanced query techniques, and integrating Elasticsearch into your existing tech stack. Stay tuned!