Lucene-Grep a.k.a. lmgrep
whoami
{
"name": "Dainius Jocas",
"company": {
"name": "Vinted",
"mission": "Make second-hand the first choice worldwide"
},
"role": "Staff Engineer",
"website": "https://www.jocas.lt",
"twitter": "@dainius_jocas",
"github": "dainiusjocas",
"author_of_oss": ["lucene-grep", "ket"]
}
Agenda
Intro
What's inside Lucene-Grep?
Use cases
Future work
Discussion
Intro
lmgrep
is a CLI full-text search tool
Interface is similar to grep
Based on Lucene
Lucene Monitor library is the main building block
Compiled with the GraalVM native-image
Single binary file, no external dependencies
Supports Linux, macOS, Windows
Origin
Used Elasticsearch Percolator for some basic named entity recognition (NER)
Needed to deploy to AWS Lambda, Elasticsearch was not an option
However, I really liked the idea of expressing entities as full-text queries
Found the Luwak library and deployed it on AWS Lambda; however, it ran on the JVM
Gunnar Morling's blog post about running Lucene as a GraalVM native image on AWS Lambda
Convinced Red Hat devs to open source and release quarkiverse/quarkus-lucene
Hacked together Lucene-Grep
grep vs lmgrep
echo "Lucene is awesome" | grep Lucene
echo "Lucene is awesome" | lmgrep Lucene
Installing lmgrep
brew or a shell script on Linux
wget https://github.com/dainiusjocas/lucene-grep/releases/download/v2021.05.23/lmgrep-v2021.05.23-linux-static-amd64.zip
unzip lmgrep-v2021.05.23-linux-static-amd64.zip
mv lmgrep /usr/local/bin
brew on macOS
brew install dainiusjocas/brew/lmgrep
scoop on Windows
scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep
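Whichever installation route you pick, a quick smoke test (reusing the example from earlier; assumes lmgrep ended up on your PATH):
echo "Lucene is awesome" | lmgrep Lucene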
What's inside?
Reading from file(s)
Searching for files with a GLOB, e.g. '**/*.txt'
Reading from STDIN
Writing to STDOUT in various formats, e.g. JSON
Text analysis pipeline
Multiple query parsers
Text tokenization with the --only-analyze flag
Loading multiple queries from a file
Full-text search
lmgrep -h for the full list of available options
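These options compose; a sketch combining a GLOB with JSON output (both flags appear later in this deck, the query string is just an illustration):
lmgrep "lucene" '**/*.txt' --format=json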
Text Analysis
The same good ol' Lucene text analysis
45 predefined analyzers available, e.g. LithuanianAnalyzer
5 character filters
14 tokenizers
113 token filters
However, not everything that Lucene provides is available in lmgrep because of limitations of the GraalVM native-image
https://github.com/dainiusjocas/lucene-grep/blob/main/docs/analysis-components.md
Custom Text Analysis Issue
At first, exposed several CLI flags for text analysis
A problem: the order of execution of those flags
Lucene analyzers are Java classes
For a CLI tool, exposing Java classes is not a good option
Something similar to Elasticsearch analysis syntax is needed
Text Analysis Definition
{
"char-filters": [
{"name": "htmlStrip"},
{
"name": "patternReplace",
"args": {
"pattern": "foo",
"replacement": "bar"
}
}
],
"tokenizer": {"name": "standard"},
"token-filters": [
{"name": "englishMinimalStem"},
{"name": "uppercase"}
]
}
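A sketch of feeding such a definition to lmgrep, assuming the JSON above is saved in a hypothetical analysis.json file (the --analysis flag is demonstrated with --only-analyze later in this deck):
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis="$(cat analysis.json)"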
Various Query Parsers --query-parser
--query-parser=classic
The default one
When googling for the Lucene query syntax, this is the first hit
echo "Lucene is awesome" | lmgrep --query-parser=classic "lucene is aweso~"
echo "Lucene is awesome" | lmgrep --query-parser=classic "\"lucene is\""
--query-parser=complex-phrase
Similar to the classic query parser, but phrase queries are more expressive
echo "jonathann jon peterson" | lmgrep --query-parser=complex-phrase "\"(john jon jonathan~) peters*\""
--query-parser=simple
Similar to the classic query parser, BUT any errors in the query syntax are ignored and the parser attempts to decipher what it can
E.g. given term1\* it searches for the term term1*
Probably should be the default query parser in lmgrep
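A sketch of that leniency, with an unbalanced quote that the classic parser would reject as a syntax error:
echo "Lucene is awesome" | lmgrep --query-parser=simple "\"lucene is"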
--query-parser=standard
Implementation of the Lucene classic query parser using the flexible query parser framework
There must be a reason why it comes with the default lucene dependency
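Usage mirrors the classic parser; a sketch:
echo "Lucene is awesome" | lmgrep --query-parser=standard "lucene is aweso~"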
--query-parser=surround
Constructs span queries that use positional information
echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(lucene, awesome)"
If the term order is NOT important, replace W with N:
echo "Lucene is awesome" | lmgrep --query-parser=surround "2N(awesome, lucene)"
WARNING: query terms are not analyzed
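A sketch of that pitfall, assuming the default analyzer lowercases input tokens; the capitalized query term would likely match nothing:
echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(Lucene, awesome)"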
--only-analyze
Just apply the text analyzer on the input text and output the list(s) of tokens
--only-analyze: basic example
echo "Lucene is awesome" | lmgrep --only-analyze
--only-analyze: custom text analysis pipeline
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis='
{
"char-filters": [
{"name": "htmlStrip"},
{
"name": "patternReplace",
"args": {
"pattern": "foo",
"replacement": "bar"
}
}
],
"tokenizer": {"name": "standard"},
"token-filters": [
{"name": "englishMinimalStem"},
{"name": "uppercase"}
]
}
'
["BAR","BAR","BAZ"]
--only-analyze with --explain
echo "Dogs and CAt" | lmgrep --only-analyze --explain | jq
[
  {
    "token": "dog",
    "type": "<ALPHANUM>",
    "position": 0,
    "positionLength": 1,
    "start_offset": 0,
    "end_offset": 4
  },
  {
    "token": "and",
    "type": "<ALPHANUM>",
    "position": 1,
    "positionLength": 1,
    "start_offset": 5,
    "end_offset": 8
  },
  {
    "token": "cat",
    "type": "<ALPHANUM>",
    "position": 2,
    "positionLength": 1,
    "start_offset": 9,
    "end_offset": 12
  }
]
The idea is similar to Elasticsearch's _analyze API
No need to recreate an index on every custom analyzer change
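For comparison, a roughly equivalent Elasticsearch request (a sketch; assumes a local cluster and the built-in english analyzer):
curl -s -XPOST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{"analyzer": "english", "text": "Dogs and CAt"}'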
--only-analyze: output for Graphviz
TODO
Loading queries from a file
echo "I have two dogs" | lmgrep --queries-file=dog-lovers.json
[
{
"id": "german_language",
"query": "hund",
"stemmer": "german"
},
{
"id": "english_language",
"query": "dog",
"stemmer": "english"
}
]
Load all queries once
100K queries take about 1s to load on my laptop
Full-text search
mkdir demo
cd demo
echo "Lucene is awesome" > lucene.txt
echo "Grep is awesome" > grep.txt
lmgrep lucene '**.txt'
Full-text File Search with Score
cd
mkdir full-text-search || true
cd full-text-search
echo "Lucene is awesome" > lucene.txt
echo "Lucene Grep is build on Lucene Monitor library" > lucene-grep.txt
lmgrep "Lucene" '**.txt' --no-split --with-score --format=json | jq -s -c 'sort_by(.score)[]' | tac | head -3 | jq
Source Code Search
Specify a custom analyzer for your programming language
E.g. a WordDelimiterGraphFilter that splits "MyFooClass" => ["My", "Foo", "Class"] (see the sketch below)
Enable scoring
Output hyperlinks in a (supported) terminal emulator to the specific line number
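A sketch of such an analyzer, assuming the filter is exposed under the name wordDelimiterGraph (a hypothetical name here; check docs/analysis-components.md for the exact list):
echo "MyFooClass" | lmgrep --only-analyze --analysis='
{
  "tokenizer": {"name": "whitespace"},
  "token-filters": [
    {"name": "wordDelimiterGraph"},
    {"name": "lowercase"}
  ]
}'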
Alternative to Elasticsearch Percolator
Start an lmgrep process with open STDIN, STDOUT, and STDERR pipes for inter-process communication
require 'open3'
# Spawn a long-lived lmgrep process and keep its pipes for IPC
@stdin, @stdout, @stderr, @wait_thr = Open3.popen3("lmgrep lucene")
# Send a line of text; read back the matching line
@stdin.puts "Lucene is awesome"
@stdout.gets
Future work
Your issues https://github.com/dainiusjocas/lucene-grep/issues
Mechanism for shared analysis components
Now only inlined text analysis config is supported
LMGREP_HOME for keeping all the resources in one place
Release analyzer construction code as a standalone library
Melt your CPU
Use all CPU cores to the max for as short as possible
Do not preserve the input order
Optimize the --with-scored-highlights option
Sort output by score
Analysis components with inlined data
E.g. an inlined stopwords list, not a file
Discussion