Lucene-Grep a.k.a. lmgrep
whoami
{
"name": "Dainius Jocas",
"company": {
"name": "Vinted",
"mission": "Make second-hand the first choice worldwide"
},
"role": "Staff Engineer",
"website": "https://www.jocas.lt",
"twitter": "@dainius_jocas",
"github": "dainiusjocas",
"author_of_oss": ["lucene-grep", "ket"]
}
Agenda
Intro
What's inside Lucene-Grep?
Use cases
Future work
Discussion
Intro
lmgrep
is a CLI full-text search tool
Interface is similar to grep
Based on Lucene
Lucene Monitor library is the main building block
Compiled with the GraalVM native-image
Single binary file, no external dependencies
Supports Linux, macOS, Windows
Origin
Used Elasticsearch Percolator for some basic named entity recognition (NER)
Needed to deploy to AWS Lambda, Elasticsearch was not an option
However, I really liked the idea of expressing entities as full-text queries
Found the Luwak library and deployed it on AWS Lambda; however, it ran on the JVM
Gunnar Morling's blog post about running Lucene as a GraalVM native image on AWS Lambda
Convinced Red Hat devs to open source and release quarkiverse/quarkus-lucene
Hacked together Lucene-Grep
grep vs lmgrep
echo "Lucene is awesome" | grep Lucene
echo "Lucene is awesome" | lmgrep Lucene
Installing lmgrep
brew or a shell script on Linux
wget https://github.com/dainiusjocas/lucene-grep/releases/download/v2021.05.23/lmgrep-v2021.05.23-linux-static-amd64.zip
unzip lmgrep-v2021.05.23-linux-static-amd64.zip
mv lmgrep /usr/local/bin
brew on macOS
brew install dainiusjocas/brew/lmgrep
scoop on Windows
scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep
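Whichever installation route you pick, a quick smoke test (reusing the example from earlier; assumes lmgrep ended up on your PATH):
echo "Lucene is awesome" | lmgrep Lucene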
What's inside?
Reading from file(s)
Searching for files with a GLOB, e.g. '**/*.txt'
Reading from STDIN
Writing to STDOUT in various formats, e.g. JSON
Text analysis pipeline
Multiple query parsers
Text tokenization with the --only-analyze flag
Loading multiple queries from a file
Full-text search
lmgrep -h for the full list of available options
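These options compose; a sketch combining a GLOB with JSON output (both flags appear later in this deck, the query string is just an illustration):
lmgrep "lucene" '**/*.txt' --format=json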
Text Analysis
The same good ol' Lucene text analysis
45 predefined analyzers available, e.g. LithuanianAnalyzer
5 character filters
14 tokenizers
113 token filters
However, not everything that Lucene provides is available in lmgrep because of limitations of the GraalVM native-image
https://github.com/dainiusjocas/lucene-grep/blob/main/docs/analysis-components.md
Custom Text Analysis Issue
At first, exposed several CLI flags for text analysis
A problem: the order of execution of those flags
Lucene analyzers are Java classes
For a CLI tool, exposing Java classes is not a good option
Something similar to Elasticsearch analysis syntax is needed
Text Analysis Definition
{
"char-filters": [
{"name": "htmlStrip"},
{
"name": "patternReplace",
"args": {
"pattern": "foo",
"replacement": "bar"
}
}
],
"tokenizer": {"name": "standard"},
"token-filters": [
{"name": "englishMinimalStem"},
{"name": "uppercase"}
]
}
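A sketch of feeding such a definition to lmgrep, assuming the JSON above is saved in a hypothetical analysis.json file (the --analysis flag is demonstrated with --only-analyze later in this deck):
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis="$(cat analysis.json)"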
Various Query Parsers --query-parser
--query-parser=classic
The default one
When googling for the Lucene query syntax, this is the first hit
echo "Lucene is awesome" | lmgrep --query-parser=classic "lucene is aweso~"
echo "Lucene is awesome" | lmgrep --query-parser=classic "\"lucene is\""
--query-parser=complex-phrase
Similar to the classic query parser, but phrase queries are more expressive
echo "jonathann jon peterson" | lmgrep --query-parser=complex-phrase "\"(john jon jonathan~) peters*\""
--query-parser=simple
Similar to the classic query parser, BUT any errors in the query syntax are ignored and the parser attempts to decipher what it can
E.g. given term1\* it searches for the term term1*
Probably should be the default query parser in lmgrep
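A sketch of that leniency, with an unbalanced quote that the classic parser would reject as a syntax error:
echo "Lucene is awesome" | lmgrep --query-parser=simple "\"lucene is"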
--query-parser=standard
Implementation of the Lucene classic query parser using the flexible query parser framework
There must be a reason why it comes with the default lucene dependency
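Usage mirrors the classic parser; a sketch:
echo "Lucene is awesome" | lmgrep --query-parser=standard "lucene is aweso~"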
--query-parser=surround
Constructs span queries that use positional information
echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(lucene, awesome)"
If the term order is NOT important, replace W with N:
echo "Lucene is awesome" | lmgrep --query-parser=surround "2N(awesome, lucene)"
WARNING: query terms are not analyzed
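A sketch of that pitfall, assuming the default analyzer lowercases input tokens; the capitalized query term would likely match nothing:
echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(Lucene, awesome)"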
--only-analyze
Just apply the text analyzer on the input text and output the list(s) of tokens
--only-analyze: basic example
echo "Lucene is awesome" | lmgrep --only-analyze
--only-analyze: custom text analysis pipeline
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis='
{
"char-filters": [
{"name": "htmlStrip"},
{
"name": "patternReplace",
"args": {
"pattern": "foo",
"replacement": "bar"
}
}
],
"tokenizer": {"name": "standard"},
"token-filters": [
{"name": "englishMinimalStem"},
{"name": "uppercase"}
]
}
'
["BAR","BAR","BAZ"]
--only-analyze with --explain
echo "Dogs and CAt" | lmgrep --only-analyze --explain | jq
[
  {
    "token": "dog",
    "type": "<ALPHANUM>",
    "position": 0,
    "positionLength": 1,
    "start_offset": 0,
    "end_offset": 4
  },
  {
    "token": "and",
    "type": "<ALPHANUM>",
    "position": 1,
    "positionLength": 1,
    "start_offset": 5,
    "end_offset": 8
  },
  {
    "token": "cat",
    "type": "<ALPHANUM>",
    "position": 2,
    "positionLength": 1,
    "start_offset": 9,
    "end_offset": 12
  }
]
The idea is similar to Elasticsearch's _analyze API
No need to recreate an index on every custom analyzer change
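For comparison, a roughly equivalent Elasticsearch request (a sketch; assumes a local cluster and the built-in english analyzer):
curl -s -XPOST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{"analyzer": "english", "text": "Dogs and CAt"}'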
--only-analyze: output for Graphviz
TODO
Loading queries from a file
echo "I have two dogs" | lmgrep --queries-file=dog-lovers.json
[
{
"id": "german_language",
"query": "hund",
"stemmer": "german"
},
{
"id": "english_language",
"query": "dog",
"stemmer": "english"
}
]
Load all queries once
100K queries take about 1s to load on my laptop
Full-text search
mkdir demo
cd demo
echo "Lucene is awesome" > lucene.txt
echo "Grep is awesome" > grep.txt
lmgrep lucene '**.txt'
Full-text File Search with Score
cd
mkdir full-text-search || true
cd full-text-search
echo "Lucene is awesome" > lucene.txt
echo "Lucene Grep is build on Lucene Monitor library" > lucene-grep.txt
lmgrep "Lucene" '**.txt' --no-split --with-score --format=json | jq -s -c 'sort_by(.score)[]' | tac | head -3 | jq
Source Code Search
Specify a custom analyzer for your programming language
E.g. a WordDelimiterGraphFilter that splits "MyFooClass" => ["My", "Foo", "Class"] (see the sketch below)
Enable scoring
Output hyperlinks in a (supported) terminal emulator to the specific line number
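A sketch of such an analyzer, assuming the filter is exposed under the name wordDelimiterGraph (a hypothetical name here; check docs/analysis-components.md for the exact list):
echo "MyFooClass" | lmgrep --only-analyze --analysis='
{
  "tokenizer": {"name": "whitespace"},
  "token-filters": [
    {"name": "wordDelimiterGraph"},
    {"name": "lowercase"}
  ]
}'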
Alternative to Elasticsearch Percolator
Start an lmgrep process with open STDIN, STDOUT, and STDERR pipes for inter-process communication
require 'open3'
# Spawn a long-lived lmgrep process and keep its pipes for IPC
@stdin, @stdout, @stderr, @wait_thr = Open3.popen3("lmgrep lucene")
# Send a line of text; read back the matching line
@stdin.puts "Lucene is awesome"
@stdout.gets
Future work
Your issues https://github.com/dainiusjocas/lucene-grep/issues
Mechanism for shared analysis components
Now only inlined text analysis config is supported
LMGREP_HOME for keeping all the resources in one place
Release analyzer construction code as a standalone library
Melt your CPU
Use all CPU cores to the max for as short as possible
Do not preserve the input order
Optimize the --with-scored-highlights option
Sort output by score
Analysis components with inlined data
E.g. an inlined stopwords list, not a file
Discussion