Webscraping and Data Cleaning in Clojure for NLP

In this notebook, my goal is to:

  • [x] get the HTML of a story from a website: "The Shoes That Were Danced to Pieces" by Jacob and Wilhelm Grimm

  • [x] parse the HTML into plain text

  • [x] clean the text into a list of sentences...

  • [o] ... where each sentence is a list of words and punctuation

  • [ ] create a list of bigrams and trigrams in the text

  • [ ] count and sort bigrams and trigrams for frequency

  • [ ] investigate ways of conducting NLP on text to create value
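
The bigram and trigram steps in the list above are not implemented yet in this notebook. As a rough sketch of where they could go, assuming the text has already been turned into a flat sequence of tokens, something like the following might work. The names `ngrams`, `ngram-counts`, and `tokens` are made up for illustration.

```clojure
(defn ngrams
  "Returns every run of n consecutive tokens as a vector."
  [n tokens]
  (map vec (partition n 1 tokens)))

(defn ngram-counts
  "Returns [ngram count] pairs, most frequent first."
  [n tokens]
  (sort-by val > (frequencies (ngrams n tokens))))

;; Hypothetical sample tokens, not taken from the story.
(def tokens ["the" "king" "and" "the" "king" "slept"])

(first (ngram-counts 2 tokens))
;; => [["the" "king"] 2]
```

`partition` with a step of 1 slides a window over the tokens, and `frequencies` does the counting, so the same two functions cover both bigrams (n = 2) and trigrams (n = 3).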

;; deps.edn
{:deps {org.clojure/clojure {:mvn/version "1.10.1"}
        enlive/enlive {:mvn/version "1.1.6"}
        compliment/compliment {:mvn/version "0.3.9"}}}

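The Enlive library is helpful here. It is not strictly necessary for fetching a page (Clojure's built-in slurp can download raw HTML), but Enlive also parses the HTML into a tree of nodes that can be queried with selectors.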
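
To get a feel for what Enlive returns without touching the network, here is a small sketch that parses a made-up HTML fragment with `html-snippet` and pulls out the paragraph text. The fragment and the name `sample-nodes` are assumptions for illustration only; the real page is fetched further down.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; A made-up HTML fragment standing in for a downloaded page.
(def sample-nodes
  (html/html-snippet
    "<h1>A Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"))

;; Select every <p> node, then extract the text content of each one.
(def paragraph-texts
  (map html/text (html/select sample-nodes [:p])))
```

Each parsed node is a plain map with :tag, :attrs, and :content keys, which is why ordinary functions like first and :content can be used to walk the tree.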

(ns foo.bar
  (:require [clojure.string]                    ; used below as clojure.string/...
            [net.cgrand.enlive-html :as html]))
(defn fetch-page [url]
  (html/html-resource (java.net.URL. url)))
(-> (fetch-page "https://www.pitt.edu/~dash/grimm133.html")
  first
  :content
  (nth 3)
  :content)
(fetch-page "https://www.pitt.edu/~dash/grimm133.html")
(def html (fetch-page "https://www.pitt.edu/~dash/grimm133.html"))
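;; Join the text of every node inside <body>, then wrap the resulting string
;; in a map that asks Nextjournal to render it with the :text viewer.
(->> (html/select html [:body])
     first
     :content
     (map html/text)
     clojure.string/join
     (hash-map :nextjournal/viewer :text :nextjournal/value))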

By using the :hiccup viewer instead of :text, and by rendering via the :pre tag, new lines are rendered as actual line breaks rather than as plain text.

^{:nextjournal/viewer :hiccup}
[:pre (clojure.string/join (map html/text (html/select html [:body])))]
(first (map html/text (html/select html [:body])))
^{:nextjournal/viewer :text}
[:pre (map html/text (html/select html [:body]))]

Let's begin

Back when I studied data analytics, there was an assignment to parse a story into words and punctuation. This process of converting sentences into individual elements is called "tokenization."
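
As a small illustration of what tokenization means here, the sketch below splits one sentence into words and punctuation marks with a regular expression. The function name `tokenize` and the sample sentence are made up, and the regex is only one possible choice: it keeps apostrophes inside words and treats every other non-space symbol as its own token.

```clojure
(defn tokenize
  "Splits a sentence into word tokens and single-character punctuation tokens."
  [sentence]
  (re-seq #"[\w']+|[^\w\s]" sentence))

(tokenize "The king asked, \"Where did they dance?\"")
;; => ("The" "king" "asked" "," "\"" "Where" "did" "they" "dance" "?" "\"")
```

One known limitation of this pattern: an apostrophe used as an opening or closing single quote stays attached to the neighboring word.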

As a more fun challenge, I'm going to use the current text above as the starting point (because data cleanup isn't always pretty).

(def dirty-text (first (map html/text (html/select html [:body]))))
(prn dirty-text)
(def split-text (clojure.string/split dirty-text #"\n\n"))
split-text

Awesome, after scraping the HTML, parsing it into plain text, and splitting on two consecutive new lines (\n is a new line, also called a line feed; a carriage return would be \r), we have something that looks a little more manageable. We still don't have each sentence in its own separate string (that would be nice), and we still have a bunch of empty (or close to empty) elements.
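
Since it is easy to mix up new lines and carriage returns, here is a hedged sketch of a splitter that tolerates both Unix (\n) and Windows (\r\n) line endings and drops blank chunks in the same pass. The name `split-paragraphs` and the sample string are assumptions for illustration; the notebook itself splits on a plain #"\n\n".

```clojure
(require '[clojure.string :as str])

(defn split-paragraphs
  "Splits text wherever two or more line breaks occur in a row,
  then drops chunks that are empty or whitespace-only."
  [text]
  (->> (str/split text #"(?:\r?\n){2,}")
       (remove str/blank?)))

(split-paragraphs "Title\r\n\r\nOnce upon a time.\n\n\nThe end.")
;; => ("Title" "Once upon a time." "The end.")
```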

First, let's remove the empty elements.

;; (def blanks-removed (filter (complement clojure.string/blank?) split-text))
;; (take 10 blanks-removed)

Upon further inspection, it looks like we can get the entire story into a single string. Because we split on double new lines rather than single new lines, the entire story is now the third element (index 2). Let's see the story by itself:

(def story-only (nth split-text 2))
(print (subs story-only 0 250))
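
Picking the story by position works for this page, but it would break if the page layout changed. One possible alternative, assuming the story is always the longest chunk of text on the page, is to select by length instead. The name `longest-chunk` and the sample vector are hypothetical.

```clojure
(defn longest-chunk
  "Returns the longest string in chunks. Expects at least one chunk."
  [chunks]
  (apply max-key count chunks))

(longest-chunk ["Title" "" "Once upon a time there was a king." "Footer"])
;; => "Once upon a time there was a king."
```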

This looks good to me. Now, I'd like to split it into sentences, then words and punctuation.

(def sentences (clojure.string/split story-only #"\. "))
;; (print (nth sentences 0))
;; run! is meant for side effects; a lazy map would also return a seq of nils
(run! println sentences)

We have made some further progress. Some sentences are still "stuck" to each other, such as sentences that end inside quotation marks and sentences that end with punctuation other than a period. We can add (." ), (?" ), and (!" ) to the list of things to split on. Also, some other sentences are seemingly split by new lines...
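
One way the next step might look, as a sketch rather than a finished solution: collapse every run of whitespace (including new lines) into a single space, then split on whitespace that follows a period, question mark, or exclamation mark, optionally followed by a closing double quote. The name `split-sentences` and the sample text are made up for illustration. The lookbehind keeps the punctuation attached to its sentence, unlike splitting on #"\. ", which discards it.

```clojure
(require '[clojure.string :as str])

(defn split-sentences
  "Normalizes whitespace, then splits after . ? or ! (optionally followed
  by a closing double quote) when whitespace comes next."
  [text]
  (-> text
      (str/replace #"\s+" " ")
      str/trim
      (str/split #"(?<=[.?!]\"?)\s+")))

(split-sentences "He said, \"Stop.\"\nShe ran!  Why? Because.")
;; => ["He said, \"Stop.\"" "She ran!" "Why?" "Because."]
```

This still gets some cases wrong. Abbreviations such as "Mr." trigger a split, and a quotation that continues after its closing mark (for example "Stop!" he said.) is cut in two, so the output would need a manual check against the story.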

To be continued...
