Clojure Code Complexity Study

Context: I wanted to scrape the code examples from clojuredocs.org

Corresponding Clojureverse thread: https://clojureverse.org/t/first-time-webscraper-could-you-give-any-pointers/

Research Readings

Goals

  • Brainstorm ways to access full form docs from clojuredocs.org

    • Web scrape

    • Find function like "doc" that fulfills objective

    • List all the kinds of info you would like to get

      • Function description

      • Examples

      • Notes

  • Learn how to add new dependencies to a Nextjournal project smoothly

{:deps {org.clojure/clojure {:mvn/version "1.10.1"}
        ;; compliment is used for autocompletion
        ;; add your libs here (and restart the runtime to pick up changes)
        compliment/compliment {:mvn/version "0.3.9"}
        enlive/enlive {:mvn/version "1.1.6"}
        http-kit/http-kit {:mvn/version "2.5.3"}
        org.clojure/tools.reader {:mvn/version "1.3.4"}}}
{:hello (clojure-version)}

Testing...

(clojure.repl/doc print)
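Toward the goal of finding a function like "doc": a minimal sketch, assuming only clojure.core and clojure.repl. The docstring and argument lists live in a var's metadata, so they can be read as data instead of printed. The name var-summary is a hypothetical helper, not an existing function. Note that this only reaches the built-in docstring; the community examples and notes on clojuredocs.org are not part of var metadata.

```clojure
(require '[clojure.repl :as repl])

;; Hypothetical helper: pull the documentation-related keys out of a var's metadata.
(defn var-summary [v]
  (select-keys (meta v) [:name :arglists :doc]))

(var-summary #'assoc)

;; `doc` prints rather than returns, so capture its output as a string if needed.
(def print-doc-text (with-out-str (repl/doc print)))
```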
;; question: How can I find the current namespace I am in?
;; answer: Access namespace object, see docs: https://clojuredocs.org/clojure.core/*ns*
*ns*
(ns user (:require [net.cgrand.enlive-html :as html]
          [org.httpkit.client :as http]
          ;; question: Why does the following line result in an error?
          ;; [clojure.tools.reader.edn :as edn]
          ))
;; issue above research:
;; - [x] read https://github.com/clojure-emacs/cider/issues/2236
;; - [ ] read https://www.google.com/search?q=%22Could+not+locate+clojure%2Ftools%2Freader%2Fedn__init.class%22
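One way to narrow down the "Could not locate ..." error is to check from the REPL whether a namespace can be loaded at all. This is a minimal sketch; loadable? is a hypothetical helper name. If it returns false for clojure.tools.reader.edn, the jar is probably not on the classpath yet, which would fit the deps.edn comment about restarting the runtime to pick up changes. That cause is an assumption, not something confirmed here.

```clojure
;; Hypothetical helper: true if the namespace can be required, false if it
;; cannot be found on the classpath.
(defn loadable? [ns-sym]
  (try
    (require ns-sym)
    true
    (catch java.io.FileNotFoundException _
      false)))

(loadable? 'clojure.edn)              ; ships with Clojure itself
(loadable? 'clojure.tools.reader.edn) ; depends on the tools.reader jar being present
```

If only EDN parsing is needed, the built-in clojure.edn namespace may be enough, without the extra dependency.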
;; temporarily paused
#_(defn get-dom
    [address]
    (html/html-snippet
      (:body @(http/get address {:insecure? true}))))
(def assoc-docs "https://clojuredocs.org/clojure.core/assoc")
;; question: Why does this not return HTML or text as expected?
;; (get-dom assoc-docs)
(defn fetch-page [url]
  (html/html-resource (java.net.URL. url)))

first

:content

(nth 3)

:content

(def docs-content (-> (fetch-page assoc-docs)
  second :content second :content
  (->> (take 3))
  last :content second
  :content first
  :content first
  :content first
  :content first
  :content second
  :content))
;; question: How can I select on classes, IDs, specific HTML tags, or text content, for example?
;; question: Can this be further generalized to searching for a key or value within a nested map?
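A possible answer to both questions, sketched in plain Clojure. The parsed page is a tree of maps with :tag, :attrs and :content keys (as the :content paths above show), so tree-seq can flatten it and ordinary predicates can pick nodes by class or id. The sample data and the helper names (all-nodes, has-class?, select-by-class, select-by-id, text-of) are made up for illustration; only the "docstring" class name comes from the notes above. Enlive also ships its own selector function, net.cgrand.enlive-html/select, which is probably the more direct route and worth reading up on.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for a parsed page (same node shape: :tag, :attrs, :content).
(def sample-dom
  [{:tag :div
    :attrs {:class "var-header"}
    :content [{:tag :h1 :attrs {:class "var-name"} :content ["assoc"]}
              {:tag :div :attrs {:class "docstring" :id "doc"}
               :content [{:tag :pre :attrs nil :content ["docstring text here"]}]}]}])

;; Flatten the tree into a seq of every element node (text leaves are dropped).
(defn all-nodes [nodes]
  (->> nodes
       (mapcat #(tree-seq map? :content %))
       (filter map?)))

;; True-ish if the node's class attribute contains class-name as a whole word.
(defn has-class? [node class-name]
  (some #{class-name} (str/split (get-in node [:attrs :class] "") #"\s+")))

(defn select-by-class [nodes class-name]
  (filter #(has-class? % class-name) (all-nodes nodes)))

(defn select-by-id [nodes id]
  (first (filter #(= id (get-in % [:attrs :id])) (all-nodes nodes))))

;; Concatenate all text found underneath a node.
(defn text-of [node]
  (apply str (filter string? (tree-seq map? :content node))))

(text-of (first (select-by-class sample-dom "docstring")))
(:tag (select-by-id sample-dom "doc"))
```

The same tree-seq pattern generalizes to any nested map: swap the predicate for whatever key or value is being searched for.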
(def docs-data-map
  ;; search on class ???
  {:title (-> docs-content first :content first :content first :content first)
   ;; search on class "docstring"
   :doc-string (-> docs-content second :content first :content first :content first)
   ;; search on class "examples-widget" or id "examples" 
   :examples (-> docs-content rest second :content)})
;; question: Where are the examples? Why aren't they here? Lol...

Looking for the "window.PAGE_DATA"...

(def docs-entire-page
  (fetch-page assoc-docs))
(def docs-examples-only
  (-> docs-entire-page
    second :content first :content last :content first))
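The positional path above breaks as soon as the page layout shifts. A more tolerant sketch, assuming the same node shape (:tag, :attrs, :content): walk the whole tree and keep the first text leaf that mentions window.PAGE_DATA. The sample page and the name find-page-data-text are hypothetical. It also suggests why the examples were missing earlier: presumably they are not in the HTML body at all but are rendered in the browser from this script data, though that is an assumption here.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for a parsed page containing an inline script.
(def sample-page
  [{:tag :html :attrs nil
    :content [{:tag :head :attrs nil
               :content [{:tag :script :attrs nil
                          :content ["// <![CDATA[ window.PAGE_DATA=\"{:a 1}\";\n// ]]>"]}]}
              {:tag :body :attrs nil :content ["hello"]}]}])

;; Return the first text leaf anywhere in the tree that contains the marker, or nil.
(defn find-page-data-text [nodes]
  (->> nodes
       (mapcat #(tree-seq map? :content %))
       (filter string?)
       (filter #(str/includes? % "window.PAGE_DATA="))
       first))

(find-page-data-text sample-page)
```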

Question: How does one convert textual EDN into a Clojure hashmap?

;; Drop the 31 leading characters and the 8 trailing characters around the payload
(def examples-map-str
  (subs docs-examples-only 31 (- (count docs-examples-only) 8)))
#_(count "// <![CDATA[ window.PAGE_DATA=")
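The hard-coded offsets can be swapped for a search on the marker text. A minimal sketch, assuming the script text has the shape "... window.PAGE_DATA=<quoted string>; ..." and that the last semicolon in the text is the one ending that statement. Both are assumptions about the page, and page-data-literal is a hypothetical helper name. Unlike the cell above, this keeps the surrounding double quotes, so the result is still a complete string literal.

```clojure
(require '[clojure.string :as str])

;; Return the text between the marker and the last semicolon, or nil if the
;; marker is missing or no semicolon follows it.
(defn page-data-literal [script-text]
  (let [marker "window.PAGE_DATA="
        start  (str/index-of script-text marker)
        end    (str/last-index-of script-text ";")]
    (when (and start end (< (+ start (count marker)) end))
      (subs script-text (+ start (count marker)) end))))

(page-data-literal "// <![CDATA[ window.PAGE_DATA=\"{:a 1}\";\n// ]]>")
```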
(print examples-map-str)
(def x
  (-> examples-map-str
        (clojure.string/replace #"\\\\\\" " ")
        ;; (clojure.string/replace #"\\\\n" "\\\n")
        ;; (clojure.string/replace #"\n\\" " \n\n")
        ;; (clojure.string/replace #":body\\" ":body\n\n")
        (clojure.string/replace #"\\\"" "\"") ;;;; 
        ;; (clojure.string/replace #":body \"" ":body\n\n\"")
          ))
(print x)                         

Breakthrough! I think...

Initially, I tried two things. First, I removed slashes, quotes, or both, in different combinations. Second, I tried to convert the entire string, and its slash/quote/combo "reduced" permutations, into EDN via "read-string" (clojure.core) and "clojure.edn/read-string". Both consistently gave errors such as "invalid character" and a complaint about the wrong number of inputs (it should be an even number).

After looking at the data string for a while longer, I guessed that perhaps not all of this data is EDN (or a hashmap) after all. By splitting it into vectors on newlines or double newlines, I have now isolated something pretty close to what I was hoping to get.

I believe I now have something of a case analysis upon which to further differentiate items in these sub-vectors:

  • Metadata in the form of hashmaps

  • Code

  • Comments

  • Non-example "Notes"
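The case analysis above could start out as a line classifier. This is a rough sketch under loud assumptions: the text has already been unescaped so that it contains real newlines, comments start with a semicolon, and metadata lines start with an opening brace. The names classify-line and split-and-classify are hypothetical. It cannot tell non-example "Notes" apart from code, and a code line that begins with a map literal would be mislabeled as metadata.

```clojure
(require '[clojure.string :as str])

;; Tag a single line by its first non-blank character.
(defn classify-line [line]
  (let [t (str/trim line)]
    (cond
      (str/blank? t)           :blank
      (str/starts-with? t ";") :comment
      (str/starts-with? t "{") :metadata
      :else                    :code)))

;; Split into blocks on blank lines, then tag every line in each block.
(defn split-and-classify [text]
  (->> (str/split text #"\n\n")
       (mapv (fn [block]
               (mapv (juxt classify-line identity)
                     (str/split-lines block))))))

(split-and-classify ";; add a key\n(assoc {} :a 1)\n\n{:author \"someone\"}")
```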

;; TODOs:
;; - remove trailing slashes
;; - remove leading space + open double quote
;; - fix string-trailing space + close double quote to be only close double quote 
;; - fix #5 vector with unsplit strings
;;     - idea 1: add double newline into (", :)
;; - fix last line sitting by itself by "shunting it" to be the next vector's first line (see #5 and #7)
(def y (-> x
         (clojure.string/replace ":body" ":body\\n\\n")
         ;;(clojure.string/replace "\", :" "\",\\\\n:")
         (clojure.string/split #"\\\\n\\\\n")
         (->> (map #(clojure.string/split % #"\\n")))))
;; question: What are the issues with this code?
;; partial answer: y is a lazy seq of vectors of strings, but read-string expects a single string
#_(read-string y)
#_(clojure.edn/read-string y)
#_(clojure.edn/read-string examples-map-str)
;; source: https://clojureverse.org/t/first-time-webscraper-could-you-give-any-pointers/7663/12?u=avidrucker
(require '[clojure.string :as str])
(require '[clojure.edn :as edn])
(defn fetch-page-data [clojuredocs-url]
  (let [ ; get the window.PAGE_DATA=... line from var page HTML source
        page-data-str (-> (slurp clojuredocs-url)
                          (str/split #"\n")
                          (nth 2) ; get the 3rd line of the HTML source
                        )]
    (->> page-data-str
         (drop (count "window.PAGE_DATA=")) ; Drop leading characters
         butlast ; Drop trailing semicolon
         (apply str)
         edn/read-string
         edn/read-string ; Double decode (not a mistake)
      )))
(fetch-page-data "https://clojuredocs.org/clojure.core/map")
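Why the double decode works, shown on a tiny made-up payload rather than the real page: the value assigned to window.PAGE_DATA is a quoted string literal whose contents are EDN text. The first edn/read-string reads that literal and undoes the backslash escaping, yielding a plain string; the second parses that string into Clojure data. The sample payload and var names below are illustrative only, and this assumes the escapes used on the page are ones the EDN reader accepts.

```clojure
(require '[clojure.edn :as edn])

;; The raw text as it would appear after "window.PAGE_DATA=":
;;   "{:examples [{:body \"(assoc {} :a 1)\"}]}"
(def js-literal "\"{:examples [{:body \\\"(assoc {} :a 1)\\\"}]}\"")

;; First pass: string literal -> unescaped EDN text (still a string).
(def decoded-once (edn/read-string js-literal))

;; Second pass: EDN text -> Clojure map.
(def decoded-twice (edn/read-string decoded-once))

(get-in decoded-twice [:examples 0 :body])
```

This would also explain the earlier failures: stripping the outer quotes by hand leaves the inner backslash escapes in place, so the text is no longer one clean string literal for the reader to unescape.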