02 Elasticsearch

Elasticsearch stores data as JSON documents, which are distributed across the cluster and can be accessed from any ES node.

When stored, new documents are indexed and made fully searchable.

Indices

Data organization mechanism
Define the structure of documents via mappings
Partitioned in shards
Automatic indexing of new data

Elasticsearch is based on the concept of an inverted index: a structure that lists every unique word from every document and identifies which documents each word appears in.
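The idea of an inverted index can be sketched in a few lines of Python. This is only a toy illustration: the function name `build_inverted_index`, the sample documents, and the whitespace tokenization are assumptions made for the example, not how Elasticsearch implements it internally.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique lowercase word to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2]
```

Looking up a word is then a single dictionary access, which is why searching by term is fast even with many documents.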

Shards and Replicas

Shards distribute operations to:

Increase resistance to faults
Improve performance

Replicas are copies of primary shards stored on a different node.

Write operations are performed on the primary shard first and then copied to its replicas.

Read operations are performed on either the primary shard or one of its replicas.
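The write and read paths described above can be mimicked with a small in-memory model. The class `ShardGroup` and its methods are hypothetical names introduced for this sketch; it ignores nodes, failures, and networking, and is not an Elasticsearch API.

```python
import random

class ShardGroup:
    """Toy model of one primary shard plus its replica copies."""

    def __init__(self, n_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, doc_id, doc):
        # Writes go to the primary first, then are copied to every replica.
        self.primary[doc_id] = doc
        for replica in self.replicas:
            replica[doc_id] = dict(doc)

    def read(self, doc_id):
        # Reads may be served by the primary or by any replica.
        copy = random.choice([self.primary, *self.replicas])
        return copy.get(doc_id)

group = ShardGroup(n_replicas=2)
group.write("1", {"title": "hello"})
print(group.read("1"))  # {'title': 'hello'}
```

Because every copy holds the same documents after a write, a read returns the same result whichever copy serves it.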

Term Frequency-Inverse Document Frequency (TF-IDF)

Results are scored with a Practical Scoring Function, which uses Term Frequency-Inverse Document Frequency (TF-IDF):

Measures the relevance of a term inside a document
Penalizes the words that appear too often across different documents

Term Frequency measures the frequency of term i in document j: $$tf_{i,j} = \frac{n_{i,j}}{|d_j|}$$ where $n_{i,j}$ is the number of times term i appears in document j, while $|d_j|$ is the number of terms in document j.

Inverse Document Frequency measures how rare a term is across the collection of documents: the lower the value, the more common and therefore the less important the term is (because it is an inverse frequency): $$idf_{i} = \log{\frac{|D|}{|\lbrace d \in D: i \in d \rbrace|}}$$ where $|D|$ is the number of documents in the collection, while $|\lbrace d \in D: i \in d \rbrace|$ is the number of documents containing term i.

TF-IDF is the product of the two: $$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_{i}$$
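As a worked example, the three formulas can be computed directly in Python. This is a minimal sketch: the function names, the sample documents, and the choice of the natural logarithm are assumptions (the formulas above do not fix the base), and it assumes the term occurs in at least one document.

```python
import math

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(|D| / number of documents containing the term), natural log assumed.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical collection of already tokenized documents.
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog", "barks"],
]
print(tf_idf("the", docs[0], docs))            # 0.0
print(round(tf_idf("fox", docs[0], docs), 3))  # 0.275
```

The term "the" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing to the score, while "fox" appears in a single document and scores highest.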

Mapping

By defining a mapping for an index, you tell Elasticsearch the types of your documents’ fields. This lets you:

Define searchable fields
Enable full-text search, time and geo-based queries

N.B. you can’t change the mapping of existing fields on an index that already has documents! To change it, create a new index with the correct mapping and reindex the data.

Dynamic mapping is risky (e.g., dates could be parsed as strings).
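To make this concrete, the snippet below builds the JSON body of an explicit mapping that could be sent with `PUT /index_name` when creating the index. The field names are hypothetical; `text`, `keyword`, `date`, and `geo_point` are Elasticsearch field types. The body layout assumes Elasticsearch 7 or later, where mappings no longer contain a type name.

```python
import json

# Hypothetical fields; only the "type" values are Elasticsearch field types.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},            # full-text search
            "category": {"type": "keyword"},      # exact-match filtering
            "published_at": {"type": "date"},     # time-based queries
            "location": {"type": "geo_point"},    # geo-based queries
        }
    }
}

# Body of the request: PUT /index_name
print(json.dumps(mapping_body, indent=2))
```

Declaring `published_at` as `date` up front avoids relying on dynamic mapping to guess the type from the first document it sees.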

Schemaless

Elasticsearch is schemaless, i.e., if a schema is not provided, it tries to guess the structure of the documents.

Interaction

Interactions with Elasticsearch happen through requests to REST endpoints. The action that is performed depends on the HTTP verb:

GET is used to read documents, index metadata, etc.
POST and PUT are often used to create new documents, indices, etc.
DELETE is used to delete documents, indices, etc.

Beware of the differences between POST and PUT:

POST doesn’t require the ID of the resource → indexing the same document twice creates duplicates
- When the ID is omitted, Elasticsearch takes care of creating and assigning IDs to documents

POST /index_name/_doc

PUT requires the ID of the resource → creates the document, or updates the existing one with that ID

PUT /index_name/_doc/document_id
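The difference between the two calls can be sketched with Python's standard library. The helper `index_request` is a hypothetical name, and `http://localhost:9200` assumes a local node on the default port. The code only builds the requests and does not send them.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port

def index_request(index, doc, doc_id=None):
    """Build (but do not send) the HTTP request that indexes `doc`."""
    body = json.dumps(doc).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if doc_id is None:
        # No ID given: POST, and Elasticsearch generates the ID.
        return Request(f"{BASE_URL}/{index}/_doc", data=body,
                       headers=headers, method="POST")
    # Explicit ID: PUT creates or updates the document with this ID.
    return Request(f"{BASE_URL}/{index}/_doc/{doc_id}", data=body,
                   headers=headers, method="PUT")

auto = index_request("index_name", {"title": "hello"})
explicit = index_request("index_name", {"title": "hello"}, doc_id="document_id")
print(auto.get_method(), auto.full_url)
print(explicit.get_method(), explicit.full_url)
```

Against a running cluster, either request could be passed to `urllib.request.urlopen`. Sending the POST twice would store two documents with different generated IDs, while sending the PUT twice would leave a single document.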

Language analyzer

Language analyzers preprocess text fields, for example:

Remove the stopwords of a given language
Perform stemming

Standard Analyzer: The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.

Simple Analyzer: The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.

Whitespace Analyzer: The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.

Stop Analyzer: The stop analyzer is like the simple analyzer, but also supports removal of stop words.

Keyword Analyzer: The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.

Pattern Analyzer: The pattern analyzer splits text into terms according to a custom regular expression.
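The behaviour of some of these analyzers can be approximated in plain Python to compare their output on the same input. These functions are rough imitations written for illustration, not the Elasticsearch implementations, and `STOP_WORDS` is a tiny made-up list.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # illustrative only

def simple_analyzer(text):
    # Split on any non-letter character and lowercase every term.
    return re.findall(r"[^\W\d_]+", text.lower())

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are kept.
    return text.split()

def stop_analyzer(text):
    # Like the simple analyzer, but stop words are dropped.
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

def keyword_analyzer(text):
    # The whole input becomes a single term.
    return [text]

text = "The QUICK brown-fox's 2 dogs"
print(simple_analyzer(text))      # ['the', 'quick', 'brown', 'fox', 's', 'dogs']
print(whitespace_analyzer(text))  # ['The', 'QUICK', "brown-fox's", '2', 'dogs']
print(stop_analyzer(text))        # ['quick', 'brown', 'fox', 's', 'dogs']
print(keyword_analyzer(text))     # ["The QUICK brown-fox's 2 dogs"]
```

The same sentence yields different terms under each analyzer, which is why the analyzer chosen for a field determines which queries will match it.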
