Lucene-Grep a.k.a. lmgrep
whoami
{ "name": "Dainius Jocas", "company": { "name": "Vinted", "mission": "Make second-hand the first choice worldwide" }, "role": "Staff Engineer", "website": "https://www.jocas.lt", "twitter": "@dainius_jocas", "github": "dainiusjocas", "author_of_oss": ["lucene-grep", "ket"]}Agenda
Intro
What's inside Lucene-Grep?
Use cases
Future work
Discussion
Intro
lmgrep is a CLI full-text search tool
Interface is similar to grep
Based on Lucene
Lucene Monitor library is the main building block
Compiled with the GraalVM native-image
Single binary file, no external dependencies
Supports Linux, MacOS, Windows
Origin
Used Elasticsearch Percolator for some basic named entity recognition (NER)
Needed to deploy to AWS Lambda, Elasticsearch was not an option
However, I really liked the idea of expressing entities as full-text queries
Found the Luwak library and deployed it on AWS Lambda; however, it ran on the JVM
Gunnar Morling's blog post about running Lucene as a GraalVM native-image on AWS Lambda
Convinced Red Hat devs to open-source and release quarkiverse/quarkus-lucene
Hacked together Lucene-Grep
grep vs lmgrep
echo "Lucene is awesome" | grep Lucene echo "Lucene is awesome" | lmgrep LuceneInstalling the lmgrep
brew or a shell script on Linux
wget https://github.com/dainiusjocas/lucene-grep/releases/download/v2021.05.23/lmgrep-v2021.05.23-linux-static-amd64.zip
unzip lmgrep-v2021.05.23-linux-static-amd64.zip
mv lmgrep /usr/local/bin
brew on MacOS
brew install dainiusjocas/brew/lmgrep
scoop on Windows
scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep
What's inside?
Reading from file(s)
Searching for files with GLOB, e.g. '**/*.txt'
Reading from STDIN
Writing to STDOUT in various formats, e.g. JSON
Text analysis pipeline
Multiple query parsers
Text tokenization with the --only-analyze flag
Loading multiple queries from a file
Full-text search
lmgrep -h for the full list of available options
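A minimal sketch tying a few of these together (the GLOB pattern and the --format=json flag both appear in later slides):
echo "Lucene is awesome" | lmgrep "awesome" --format=json
lmgrep "lucene" '**/*.txt' --format=json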
Text Analysis
The same good ol' Lucene text analysis
45 predefined analyzers available, e.g. LithuanianAnalyzer
5 character filters
14 tokenizers
113 token filters
However, not everything that Lucene provides is available in lmgrep because of limitations of the GraalVM native-image
https://github.com/dainiusjocas/lucene-grep/blob/main/docs/analysis-components.md
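A quick way to try a predefined analyzer (a sketch, assuming analyzers can be referenced by name via the analyzer key of the --analysis flag; the Lithuanian words are illustrative):
echo "namas namai namus" | lmgrep --only-analyze --analysis='{"analyzer": {"name": "Lithuanian"}}'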
Custom Text Analysis Issue
At first exposed several CLI flags for text analysis
There was a problem with the order of execution of the analysis components
Lucene analyzers are Java classes
For a CLI tool, exposing Java classes is not a good option
Something similar to Elasticsearch analysis syntax is needed
Text Analysis Definition
{ "char-filters": [ {"name": "htmlStrip"}, { "name": "patternReplace", "args": { "pattern": "foo", "replacement": "bar" } } ], "tokenizer": {"name": "standard"}, "token-filters": [ {"name": "englishMinimalStem"}, {"name": "uppercase"} ] }Various Query Parsers --query-parser
--query-parser=classic
The default one
When googling for the Lucene query syntax, it is the first hit
echo "Lucene is awesome" | lmgrep --query-parser=classic "lucene is aweso~"
echo "Lucene is awesome" | lmgrep --query-parser=classic "\"lucene is\""
--query-parser=complex-phrase
Similar to the classic query parser, but phrase queries are more expressive
echo "jonathann jon peterson" | lmgrep --query-parser=complex-phrase "\"(john jon jonathan~) peters*\""
--query-parser=simple
Similar to the classic query parser, BUT any errors in the query syntax are ignored and the parser attempts to decipher what it can
E.g. given term1\* it searches for the term term1*
Probably should be the default query parser in lmgrep
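A sketch, assuming Lucene's SimpleQueryParser operators (+ for AND, | for OR):
echo "Lucene is awesome" | lmgrep --query-parser=simple "lucene + awesome"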
--query-parser=standard
Implementation of the Lucene classic query parser using the flexible query parser framework
There must be a reason why it comes with the default lucene dependency
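A sketch, assuming it accepts the same syntax as the classic parser (which is what the flexible reimplementation is meant to provide):
echo "Lucene is awesome" | lmgrep --query-parser=standard "lucene AND awesome"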
--query-parser=surround
Constructs span queries that use positional information
echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(lucene, awesome)"if the term order is NOT important: W->N
echo "Lucene is awesome" | lmgrep --query-parser=surround "2N(awesome, lucene)"WARNING: query terms are not analyzed
--only-analyze
Just apply the text analyzer on the input text and output the list(s) of tokens
--only-analyze: basic example
echo "Lucene is awesome" | lmgrep --only-analyze--only-analyze: custom text analysis pipeline
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis=' { "char-filters": [ {"name": "htmlStrip"}, { "name": "patternReplace", "args": { "pattern": "foo", "replacement": "bar" } } ], "tokenizer": {"name": "standard"}, "token-filters": [ {"name": "englishMinimalStem"}, {"name": "uppercase"} ] } '["BAR","BAR","BAZ"]--only-analyze with --explain
echo "Dogs and CAt" | lmgrep --only-analyze --explain | jq[ { "token": "dog", "position": 0, "positionLength": 1, "type": "<ALPHANUM>", "end_offset": 4, "start_offset": 0 }, { "end_offset": 8, "positionLength": 1, "position": 1, "start_offset": 5, "type": "<ALPHANUM>", "token": "and" }, { "position": 2, "token": "cat", "positionLength": 1, "end_offset": 12, "type": "<ALPHANUM>", "start_offset": 9 }]The idea is similar to the Elasticsearch's
_analyzeAPINo need to recreate an index on every custom analyzer change
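For comparison, a sketch of the Elasticsearch call (assuming a local node on port 9200):
curl -s -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "standard", "text": "Dogs and CAt"}'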
--only-analyze: output for graphviz
TODO

[figure: token graph]
Loading queries from a file
echo "I have two dogs" | lmgrep --queries-file=dog-lovers.json[ { "id": "german_language", "query": "hund", "stemmer": "german" }, { "id": "english_language", "query": "dog", "stemmer": "english" }]load all queries once
Loading 100K queries takes about 1s on my laptop
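A self-contained version of the example above (writes the queries file first; file name as in the slide):
cat > dog-lovers.json <<'EOF'
[
  {"id": "german_language", "query": "hund", "stemmer": "german"},
  {"id": "english_language", "query": "dog", "stemmer": "english"}
]
EOF
echo "I have two dogs" | lmgrep --queries-file=dog-lovers.json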
Full-text search
mkdir demo
cd demo
echo "Lucene is awesome" > lucene.txt
echo "Grep is awesome" > grep.txt
lmgrep lucene **.txt
Full-text File Search with Score
cd
mkdir full-text-search || true
cd full-text-search
echo "Lucene is awesome" > lucene.txt
echo "Lucene Grep is built on Lucene Monitor library" > lucene-grep.txt
lmgrep "Lucene" '**.txt' --no-split --with-score --format=json | jq -s -c 'sort_by(.score)[]' | tac | head -3 | jq
Source Code Search
Specify a custom analyzer for your programming language
E.g. WordDelimiterGraphFilter splits "MyFooClass" => ["My", "Foo", "Class"]
Enable scoring
Output hyperlinks to the specific line number in a (supported) terminal emulator
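A sketch of such an analyzer (assuming the filter is exposed under its Lucene SPI name, wordDelimiterGraph, just as englishMinimalStem is in the earlier slides):
echo "MyFooClass" | lmgrep --only-analyze --analysis='
{
  "tokenizer": {"name": "whitespace"},
  "token-filters": [{"name": "wordDelimiterGraph"}]
}'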
Alternative to Elasticsearch Percolator
Start lmgrep with open STDIN, STDOUT, and STDERR pipes for inter-process communication
require 'open3'
@stdin, @stdout, @stderr, @wait_thr = Open3.popen3("lmgrep lucene")
@stdin.puts "Lucene is awesome"
@stdout.gets
Future work
Your issues https://github.com/dainiusjocas/lucene-grep/issues
Mechanism for shared analysis components
Currently only an inlined text analysis config is supported
LMGREP_HOME for keeping all the resources in one place
Release analyzer construction code as a standalone library
Melt your CPU
Use all CPU cores to the max for as short a time as possible
Do not preserve the input order
Optimize the --with-scored-highlights option
Sort output by score
Analysis components with inlined data
E.g. an inlined stopwords list, not a file
Discussion