Lucene-Grep a.k.a. lmgrep

whoami

{
  "name": "Dainius Jocas",
  "company": {
    "name": "Vinted",
    "mission": "Make second-hand the first choice worldwide"
  },
  "role": "Staff Engineer",
  "website": "https://www.jocas.lt",
  "twitter": "@dainius_jocas",
  "github": "dainiusjocas",
  "author_of_oss": ["lucene-grep", "ket"]
}
JSON

Agenda

  1. Intro

  2. What's inside Lucene-Grep?

  3. Use cases

  4. Future work

  5. Discussion

Intro

  • lmgrep is a CLI full-text search tool

  • Interface is similar to grep

  • Based on Lucene

  • Lucene Monitor library is the main building block

  • Compiled with the GraalVM native-image

  • Single binary file, no external dependencies

  • Supports Linux, macOS, and Windows

Origin

  • Used Elasticsearch Percolator for some basic named entity recognition (NER)

  • Needed to deploy to AWS Lambda, where Elasticsearch was not an option

  • However, I really liked the idea of expressing entities as full-text queries

  • Found the Luwak library and deployed it on AWS Lambda; however, it ran on the JVM

  • Found Gunnar Morling's blog post about running Lucene as a GraalVM native-image on AWS Lambda

  • Convinced Red Hat devs to open source and release quarkiverse/quarkus-lucene

  • Hacked together Lucene-Grep

grep vs lmgrep

 echo "Lucene is awesome" | grep Lucene
Bash
 echo "Lucene is awesome" | lmgrep Lucene
Bash

Installing lmgrep

brew or a shell script on Linux

wget https://github.com/dainiusjocas/lucene-grep/releases/download/v2021.05.23/lmgrep-v2021.05.23-linux-static-amd64.zip
unzip lmgrep-v2021.05.23-linux-static-amd64.zip
mv lmgrep /usr/local/bin
Bash

brew on macOS

brew install dainiusjocas/brew/lmgrep
Bash

scoop on Windows

scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep
Bash

What's inside?

  • Reading from file(s)

  • Searching for files with GLOB, e.g. '**/*.txt'

  • Reading from STDIN

  • Writing to STDOUT in various formats, e.g. JSON (combined example after this list)

  • Text analysis pipeline

  • Multiple query parsers

  • Text tokenization with the --only-analyze flag

  • Loading multiple queries from a file

  • Full-text search

  • lmgrep -h for the full list of available options
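
For example, the GLOB file search and the JSON output format can be combined in a single invocation; a minimal sketch using only flags shown in these slides:

# search all text files under the current directory, emit JSON to STDOUT
lmgrep lucene '**/*.txt' --format=json
Bash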

Text Analysis

Custom Text Analysis Issue

  • At first, lmgrep exposed several CLI flags for text analysis

    • the order in which they were applied was a problem

  • Lucene analyzers are Java classes

  • For a CLI tool, exposing Java classes is not a good option

  • Something similar to the Elasticsearch analysis syntax was needed

Text Analysis Definition

{
    "char-filters": [
      {"name": "htmlStrip"},
      {
        "name": "patternReplace",
        "args": {
          "pattern": "foo",
          "replacement": "bar"
        }
      }
    ],
    "tokenizer": {"name": "standard"},
    "token-filters": [
      {"name": "englishMinimalStem"},
      {"name": "uppercase"}
    ]
}
JSON
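
lmgrep currently accepts this definition only inlined through the --analysis flag (see Future work). A sketch of keeping it in a file anyway, assuming the definition above is saved as a hypothetical analysis.json:

# inline the analysis definition from a file at call time
echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis="$(cat analysis.json)"
Bash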

Various Query Parsers --query-parser

--query-parser=classic

  • The default one

  • It's the first hit when googling for the Lucene query syntax

echo "Lucene is awesome" | lmgrep --query-parser=classic "lucene is aweso~"
Bash
echo "Lucene is awesome" | lmgrep --query-parser=classic "\"lucene is\""
Bash

--query-parser=complex-phrase

  • similar to the classic query parser

  • but phrase queries are more expressive

echo "jonathann jon peterson" | lmgrep --query-parser=complex-phrase "\"(john jon jonathan~) peters*\""
Bash

--query-parser=simple

  • similar to the classic query parser

  • BUT any errors in the query syntax are ignored and the parser attempts to decipher what it can (sketch after this list)

  • E.g. given term1\*, it searches for the literal term term1*

  • Probably should be the default query parser in lmgrep
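
A sketch of that error tolerance (hypothetical example, not from the slides): the unbalanced parenthesis below would be a syntax error for the classic parser, while the simple parser ignores it and still matches:

echo "Lucene is awesome" | lmgrep --query-parser=simple "lucene awesome("
Bash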

--query-parser=standard

  • Implementation of the Lucene classic query parser using the flexible query parser framework

  • There must be a reason why it ships with the default Lucene dependency
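
Since it implements the same classic syntax, the earlier classic examples should work unchanged; a sketch:

echo "Lucene is awesome" | lmgrep --query-parser=standard "\"lucene is\""
Bash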

--query-parser=surround

  • Constructs span queries that use positional information

echo "Lucene is awesome" | lmgrep --query-parser=surround "2W(lucene, awesome)"
Bash
  • If the term order is NOT important, use N instead of W:

echo "Lucene is awesome" | lmgrep --query-parser=surround "2N(awesome, lucene)"
Bash
  • WARNING: query terms are not analyzed

--only-analyze

  • Just apply the text analyzer on the input text and output the list(s) of tokens

--only-analyze: basic example

echo "Lucene is awesome" | lmgrep --only-analyze
Bash

--only-analyze: custom text analysis pipeline

echo "<p>foo bars baz</p>" | lmgrep --only-analyze --analysis='
  {
    "char-filters": [
      {"name": "htmlStrip"},
      {
        "name": "patternReplace",
         "args": {
           "pattern": "foo",
           "replacement": "bar"
        }
      }
    ],
    "tokenizer": {"name": "standard"},
    "token-filters": [
      {"name": "englishMinimalStem"},
      {"name": "uppercase"}
    ]
  }
  '
Bash
["BAR","BAR","BAZ"]
JSON

--only-analyze with --explain

echo "Dogs and CAt" | lmgrep --only-analyze --explain | jq
Bash
[
  {
    "token": "dog",
    "position": 0,
    "positionLength": 1,
    "type": "<ALPHANUM>",
    "end_offset": 4,
    "start_offset": 0
  },
  {
    "end_offset": 8,
    "positionLength": 1,
    "position": 1,
    "start_offset": 5,
    "type": "<ALPHANUM>",
    "token": "and"
  },
  {
    "position": 2,
    "token": "cat",
    "positionLength": 1,
    "end_offset": 12,
    "type": "<ALPHANUM>",
    "start_offset": 9
  }
]
JSON
  • The idea is similar to Elasticsearch's _analyze API

  • No need to recreate an index on every custom analyzer change

--only-analyze: output for graphviz

  • TODO

Token graph

Loading queries from a file

echo "I have two dogs" | lmgrep --queries-file=dog-lovers.json
Bash
[
  {
    "id": "german_language",
    "query": "hund",
    "stemmer": "german"
  },
  {
    "id": "english_language",
    "query": "dog",
    "stemmer": "english"
  }
]
JSON
  • Load all queries once

  • Loading 100K queries takes about 1s on my laptop (benchmark sketch below)
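
A hypothetical way to check that claim yourself: generate a large queries file with jq and time a run, assuming id and query are the only required keys, as in the example above:

# generate 100K simple term queries, then time a single-line search
jq -n '[range(100000) | {id: (. | tostring), query: ("term" + (. | tostring))}]' > many-queries.json
time (echo "term1 is here" | lmgrep --queries-file=many-queries.json)
Bash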

Full-text search

mkdir demo
cd demo
echo "Lucene is awesome" > lucene.txt
echo "Grep is awesome" > grep.txt
lmgrep lucene '**.txt'
Bash

Full-text File Search with Score

cd
mkdir full-text-search || true
cd full-text-search
echo "Lucene is awesome" > lucene.txt
echo "Lucene Grep is build on Lucene Monitor library" > lucene-grep.txt
lmgrep "Lucene" '**.txt' --no-split --with-score --format=json | jq -s -c 'sort_by(.score)[]' | tac | head -3 | jq
Bash

Source Code Search

  • Specify a custom analyzer for your programming language

  • E.g. WordDelimiterGraphFilter splits "MyFooClass" => ["My", "Foo", "Class"] (see the sketch after this list)

  • Enable scoring

  • Output hyperlinks in a (supported) terminal emulator to the specific line number
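
A sketch of such an analyzer, assuming the Lucene factory names whitespace, wordDelimiterGraph, and lowercase are exposed under these names, like the other factories shown in these slides:

echo "MyFooClass extends BaseClass" | lmgrep --only-analyze --analysis='
  {
    "tokenizer": {"name": "whitespace"},
    "token-filters": [
      {"name": "wordDelimiterGraph"},
      {"name": "lowercase"}
    ]
  }'
Bash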

Alternative to Elasticsearch Percolator

  • Start lmgrep with open STDIN, STDOUT, and STDERR pipes for inter-process communication

require 'open3'
# start lmgrep once and keep the pipes open
@stdin, @stdout, @stderr, @wait_thr = Open3.popen3("lmgrep lucene")
@stdin.puts "Lucene is awesome" # write a line of input
@stdout.gets                    # read back the match
Ruby
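
The same long-running pipe can be sketched in plain Bash with a coprocess, assuming lmgrep emits one result line per matching input line, as the Ruby example implies:

# start lmgrep once, then stream input to it and read matches back
coproc LMGREP { lmgrep lucene; }
echo "Lucene is awesome" >&"${LMGREP[1]}"
read -r match <&"${LMGREP[0]}"
echo "$match"
Bash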

Future work

  • Your issues https://github.com/dainiusjocas/lucene-grep/issues

  • Mechanism for shared analysis components

    • currently only an inlined text analysis config is supported

  • LMGREP_HOME for keeping all the resources in one place

  • Release analyzer construction code as a standalone library

  • Melt your CPU

    • Use all CPU cores to the max for as short a time as possible

    • Do not preserve the input order

  • Optimize the --with-scored-highlights option

    • Sort output by score

  • Analysis components with inlined data

    • E.g. an inlined stopwords list instead of a file

Discussion

Runtimes (1)