From the series: Elasticsearch.

How Does Elasticsearch Work? - Elasticsearch Architecture - A General Overview

Elasticsearch is a fast and scalable search and analytics engine built on top of the Apache Lucene library. It is known for its powerful full-text search capabilities and is widely used in applications ranging from enterprise search to log and event data analysis.

Here are some of the key reasons to use Elasticsearch:

  - Fast full-text search powered by Lucene's inverted indices
  - Horizontal scalability through sharding and replication
  - Near-real-time indexing, so new documents become searchable within seconds
  - A rich query DSL exposed over a simple JSON/REST API
  - Built-in aggregations for analytics alongside search

Due to these capabilities, Elasticsearch is currently the most popular search engine according to the DB-Engines ranking, making it a go-to solution for many organizations looking for scalable search and analytics.

Architecture Overview


An Elasticsearch cluster is composed of individual server processes called nodes, which store data and perform indexing and querying operations. A cluster can have many nodes, and each node plays one or more roles depending on its configuration:

  - Master-eligible nodes manage cluster-wide state, such as creating indices and allocating shards
  - Data nodes hold shards and execute indexing and search operations on them
  - Ingest nodes pre-process documents through ingest pipelines before indexing
  - Coordinating-only nodes route requests to data nodes and gather the results

Nodes contain shards, and indices are made up of these shards. An index, which can be thought of like a database, is distributed across different nodes in a multi-node cluster. This distribution allows Elasticsearch to scale and handle large datasets effectively.
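How does Elasticsearch decide which shard a document lands on? It applies a modulo formula over a hash of the document's routing value (the document ID by default), so every node computes the same answer. A dependency-free sketch, using CRC32 as a stand-in for the murmur3 hash Elasticsearch actually uses:

```python
import zlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document: hash(routing) % num_primary_shards.
    Elasticsearch hashes with murmur3; crc32 is a stand-in for this sketch."""
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Deterministic: any node computes the same shard for the same document ID.
shard = route_to_shard("doc-42", 5)
print(0 <= shard < 5)  # -> True
```

This formula is also why the number of primary shards is fixed at index creation time: changing it would re-route existing documents to different shards.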


Shards are of two types: primary shards and replica shards. Each primary shard can have one or more replica shards. Elasticsearch ensures that a primary shard and its replicas are never located on the same node, enhancing data redundancy and fault tolerance.

Each shard is essentially a self-contained search engine that indexes and handles queries. This architecture allows Elasticsearch to split large datasets across multiple nodes and shards, improving both search performance and data availability.

Shards and Indices

An index is a logical namespace that maps to one or more primary shards and their replicas. Each shard is a Lucene index, and these Lucene indices consist of segments, which are fully functional inverted indices.


But what is an inverted index?

An inverted index is a data structure commonly used in search engines to map terms to the documents containing them. Instead of having a list of documents and the terms they contain, an inverted index has a list of terms and the corresponding documents. This structure allows for quick full-text searches across large datasets, enabling fast retrieval of documents that match search criteria.

For example, suppose we index three short documents. After analysis, the inverted index would look something like this:

| Word     | Frequency | Document Numbers |
|----------|-----------|------------------|
| What     | 1         | 1                |
| is       | 2         | 1, 2             |
| an       | 1         | 1                |
| inverted | 2         | 1, 2             |
| index    | 3         | 1, 2, 3          |
| Thinking | 1         | 2                |
| about    | 1         | 2                |
| so       | 2         | 2, 3             |
| much     | 2         | 2, 3             |
| fun      | 1         | 2                |
| I        | 1         | 3                |
| don't    | 1         | 3                |
| often    | 1         | 3                |
| my       | 1         | 3                |
| thoughts | 1         | 3                |
| as       | 2         | 2, 3             |
| should   | 1         | 3                |

This structure allows Elasticsearch to efficiently retrieve documents containing specific terms, even in very large datasets.
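A minimal inverted index like the one above can be built in a few lines of Python. This is an illustrative sketch of the data structure, not how Lucene actually encodes its postings on disk:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():          # naive whitespace tokenizer
            index[term.lower()].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "What is an inverted index",
    2: "An inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(index["inverted"])  # -> [1, 2]
print(index["what"])      # -> [1]
```

Answering "which documents contain term X?" is then a single dictionary lookup, which is exactly what makes full-text search fast regardless of corpus size.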

What Happens When a Document is Indexed?

When a document is indexed, Elasticsearch performs several steps:

  1. Parsing: The incoming JSON document is parsed into an internal structured representation that Elasticsearch understands. This ensures that the data can be processed and indexed correctly.
  2. Analysis: The fields of the document are analyzed using a tokenizer and a set of filters (e.g., lowercasing, removing stop words). This step helps break down text fields into tokens, which are the searchable terms stored in the inverted index.
  3. Inverted Index Creation: From the analyzed fields, terms are generated, and an inverted index is constructed. This data structure maps each term to the documents that contain it, enabling efficient full-text searches.
  4. Vector Index Creation (if applicable): If the document contains vector fields (e.g., dense vectors from embeddings), Elasticsearch stores these vectors in a special type of index that allows for approximate nearest neighbor (ANN) search. This enables similarity search capabilities, such as finding documents with similar content based on vector representations from machine learning models. The vectors are indexed using algorithms like HNSW (Hierarchical Navigable Small World graphs) to support fast and efficient vector-based queries.
  5. Document Storage: The document, along with its metadata (e.g., unique ID, version number), is stored in the index. This allows Elasticsearch to retrieve the document directly when a search query matches it.
  6. Replication: If replica shards are configured, Elasticsearch replicates the document data across these shards to ensure high availability and redundancy. This replication process helps maintain data integrity and provides failover protection in case of node failures.
  7. Updating the Cluster State: The cluster state is updated to reflect the new or modified document, including updates to shard allocation and indexing statistics. This step ensures the cluster's consistency and availability of the new data for search operations.
  8. Index Refresh: Periodically (every second by default), Elasticsearch refreshes the index to make newly indexed documents searchable. This involves making the latest changes visible to search queries by creating new segment files that include the newly indexed data.
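The analysis step (step 2) can be sketched in plain Python as a tokenizer followed by a chain of token filters. This is a toy stand-in for Lucene's analyzers, with a deliberately tiny stop-word list:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to"}  # tiny illustrative list

def analyze(text: str) -> list[str]:
    """Toy analysis pipeline: tokenizer -> lowercase filter -> stop-word filter."""
    tokens = re.findall(r"[a-zA-Z0-9']+", text)        # tokenizer: extract word chunks
    tokens = [t.lower() for t in tokens]               # token filter: lowercase
    return [t for t in tokens if t not in STOP_WORDS]  # token filter: drop stop words

print(analyze("The Quick Brown Fox is an animal"))
# -> ['quick', 'brown', 'fox', 'animal']
```

The tokens this pipeline emits are the terms that end up in the inverted index; real analyzers add steps such as stemming, synonym expansion, and language-specific tokenization.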

Elasticsearch’s vector search functionality allows for more advanced search capabilities beyond traditional keyword matching. By using dense vectors (like those generated by neural networks or other embedding models), Elasticsearch can perform similarity searches based on the mathematical properties of the vectors. This is particularly useful in applications such as:

  - Semantic search, where results match the meaning of a query rather than its exact words
  - Recommendation systems that surface items similar to ones a user has engaged with
  - Retrieval for question answering and retrieval-augmented generation (RAG)
  - Image, audio, or other multimedia similarity search via multimodal embeddings

To use vector search, you define vector fields in your index mappings and specify the appropriate settings for vector search algorithms, such as the size of the vectors and the type of distance metric (e.g., cosine similarity, Euclidean distance). By incorporating vector fields and leveraging Elasticsearch's vector search capabilities, you can enhance your search experience with powerful, data-driven insights that go beyond traditional text search.
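As a sketch, a `dense_vector` field mapping and a `knn` search body might look like the following. The index and field names, the `dims` value, and the placeholder query vector are all illustrative; check the mapping parameters against your Elasticsearch version:

```python
# Illustrative request bodies for vector search; "title_embedding" and the
# dimensions are hypothetical. These dicts would be sent to Elasticsearch
# via its REST API or a client library.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "title_embedding": {
                "type": "dense_vector",
                "dims": 384,              # must match your embedding model
                "index": True,            # build an HNSW structure for ANN search
                "similarity": "cosine",   # distance metric used for ranking
            },
        }
    }
}

knn_query = {
    "knn": {
        "field": "title_embedding",
        "query_vector": [0.1] * 384,      # embedding of the query text
        "k": 10,                          # top-k nearest neighbors to return
        "num_candidates": 100,            # candidates examined per shard
    }
}

print(knn_query["knn"]["k"])  # -> 10
```

Raising `num_candidates` trades query speed for recall, since each shard examines more graph entry points before returning its top `k` hits.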

What Happens When a Query is Run?

When a query is executed in Elasticsearch, the search operation occurs in two main phases: the query phase and the fetch phase. This two-step process is designed to efficiently search across distributed data while optimizing performance.

  1. Query Phase: The search query is broadcast to all relevant shards, whether primary or replica. Each shard executes the query independently, processing it locally to produce a list of matching documents. During this phase, Elasticsearch uses a customizable scoring algorithm to rank the results, with Okapi BM25 (a refinement of TF-IDF-style scoring) being the default method for text relevancy. This algorithm considers factors like term frequency, inverse document frequency, and field-length normalization to calculate the relevance score of each document. The top-ranked results from each shard are then collected and merged into a single sorted list to produce the final ranked result set.
  2. Fetch Phase: Once the top results are identified in the query phase, the fetch phase retrieves the actual documents from the shards using their document IDs. In this phase, Elasticsearch ensures that only the necessary documents are fetched, which optimizes performance by minimizing data transfer and reducing load on the nodes. The fetch phase is responsible for assembling the final result set that is returned to the user, including any requested fields, highlights, or additional metadata.

    Coordination and Merging: During the query phase, a coordinating node aggregates the partial results from each shard, merges them based on the relevancy scores, and selects the top results that should be fetched in the fetch phase. This coordination reduces the amount of data moved across the cluster and ensures efficient use of resources.

    Handling Large Results: For queries that return a large number of results, Elasticsearch uses a technique called pagination, which allows the client to request results in smaller batches (e.g., 10 results per page). This is achieved using the `from` and `size` parameters, which specify the offset and the number of results to return. This approach helps manage memory usage and network load, ensuring that the cluster remains responsive even under heavy query loads.

    Support for Vector Search: In addition to traditional text-based queries, Elasticsearch can handle vector searches during the query phase. When dealing with vector fields, Elasticsearch uses algorithms like HNSW (Hierarchical Navigable Small World graphs) to perform approximate nearest neighbor searches, which are essential for tasks such as similarity search, recommendations, and semantic search. The query phase for vector searches also involves ranking and merging results based on vector similarity scores.

    Query Optimization: Elasticsearch automatically optimizes query execution by using various techniques such as query caching, query rewriting, and shard preference settings. These optimizations help reduce the computational cost of frequent or complex queries and improve response times by reusing previously computed results when possible.

    Security and Access Control: During the query execution, Elasticsearch enforces security rules and access controls defined by roles and permissions. This ensures that users can only access documents and fields that they are authorized to view, providing an additional layer of security in multi-user environments.
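The BM25 relevance score mentioned in the query phase can be sketched for a single query term. This is the textbook formula with the usual default parameters (k1 = 1.2, b = 0.75), not Lucene's exact implementation:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one document: an IDF weight times a saturating
    term-frequency component normalized by document length."""
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A rare term (in 2 of 1000 docs) outscores a common one (in 500 of 1000),
# all else being equal:
rare = bm25_term_score(tf=3, doc_len=100, avg_doc_len=120,
                       n_docs=1000, docs_with_term=2)
common = bm25_term_score(tf=3, doc_len=100, avg_doc_len=120,
                         n_docs=1000, docs_with_term=500)
print(rare > common)  # -> True
```

A document's full score is the sum of these per-term scores over the query terms, which is what each shard computes locally before the coordinating node merges the ranked lists.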

Conclusion

This brief introduction to Elasticsearch provides an overview of its architecture and core functionalities, highlighting why it's a popular choice for scalable search and analytics solutions.

In the next parts of the series, we'll dive deeper into more advanced topics, including performance tuning, advanced query techniques, and integrating Elasticsearch into your existing tech stack. Stay tuned!