Clojure Code Complexity Study
Context: I wanted to scrape the code examples from clojuredocs.org
Corresponding Clojureverse thread: https://clojureverse.org/t/first-time-webscraper-could-you-give-any-pointers/
Research Readings
Practice web scraping using this as an example: https://stackoverflow.com/questions/14031133/using-clojure-to-scrape-a-web-page-with-dynamic-content
Goals
Brainstorm ways to access full form docs from clojuredocs.org
Web scrape
Find function like "doc" that fulfills objective
List all the kinds of info you would like to get
Function description
Examples
Notes
Learn how to add new dependencies to a Nextjournal project without any hitches
{:deps {org.clojure/clojure {:mvn/version "1.10.1"}
;; compliment is used for autocompletion
;; add your libs here (and restart the runtime to pick up changes)
compliment/compliment {:mvn/version "0.3.9"}
enlive/enlive {:mvn/version "1.1.6"}
http-kit/http-kit {:mvn/version "2.5.3"}
org.clojure/tools.reader {:mvn/version "1.3.4"}}}
{:hello (clojure-version)}
Testing...
(clojure.repl/doc print)
;; question: How can I find the current namespace I am in?
;; answer: Access namespace object, see docs: https://clojuredocs.org/clojure.core/*ns*
*ns*
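A small sketch of the answer above: `*ns*` evaluates to the namespace object itself, and `ns-name` (from clojure.core) turns it into a plain symbol, which is easier to print or compare. The var name `current-ns-name` is only for illustration.

```clojure
;; *ns* holds the current namespace object; ns-name extracts its symbol.
(def current-ns-name (ns-name *ns*))

;; Prints the namespace this code was evaluated in.
(println "Current namespace:" current-ns-name)
```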
(ns user (:require [net.cgrand.enlive-html :as html]
[org.httpkit.client :as http]
;; question: Why does the following line result in an error?
;; [clojure.tools.reader.edn :as edn]
))
;; issue above research:
;; - [x] read https://github.com/clojure-emacs/cider/issues/2236
;; - [ ] read https://www.google.com/search?q=%22Could+not+locate+clojure%2Ftools%2Freader%2Fedn__init.class%22
;; temporarily paused
#_(defn get-dom
[address]
(html/html-snippet
(:body (http/get address {:insecure? true}))))
(def assoc-docs "https://clojuredocs.org/clojure.core/assoc")
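;; question: Why does this not return HTML or text as expected?
;; answer: http-kit's http/get returns a promise, so (:body ...) on it is nil; dereference first, e.g. (:body @(http/get address {:insecure? true}))
;; (get-dom assoc-docs)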
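To see why `get-dom` comes back empty without touching the network, here is a minimal sketch that uses a plain `clojure.core/promise` as a stand-in for what http-kit's `http/get` returns. The response map and the names `fake-response`, `body-without-deref`, and `body-with-deref` are made up for illustration.

```clojure
;; Stand-in for the value returned by http/get; the response map is invented.
(def fake-response (promise))
(deliver fake-response {:status 200 :body "<html><body>hi</body></html>"})

;; Keyword lookup on the promise itself finds nothing, because a promise is not a map.
(def body-without-deref (:body fake-response))

;; Dereferencing first yields the response map, so :body works as expected.
(def body-with-deref (:body @fake-response))

(println "without deref:" (pr-str body-without-deref))
(println "with deref:   " (pr-str body-with-deref))
```

With a real request, `@` blocks until the response arrives, which is fine for a one-off scrape like this one.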
(defn fetch-page [url]
(html/html-resource (java.net.URL. url)))
;; scratch navigation steps: first :content (nth 3) :content
(def docs-content (-> (fetch-page assoc-docs)
second :content second :content
(->> (take 3))
last :content second
:content first
:content first
:content first
:content first
:content second
:content))
;; question: How can I select on classes, IDs, specific HTML tags, or text content, for example?
;; question: Can this be further generalized to searching for a key or value within a nested map?
(def docs-data-map
;; search on class ???
{:title (-> docs-content first :content first :content first :content first)
;; search on class "docstring"
:doc-string (-> docs-content second :content first :content first :content first)
;; search on class "examples-widget" or id "examples"
:examples (-> docs-content rest second :content)})
;; question: Where are the examples? Why aren't they here? Lol...
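On the two selection questions above: Enlive ships `html/select`, which takes a node tree and a selector vector such as `[:div.docstring]`, and that is probably the better tool for this page. The sketch below only shows the general idea with clojure.core, walking nested Enlive-style node maps with `tree-seq` instead of hard-coding `first`/`second` paths. The sample tree, the "title" class, and the helper names `all-nodes`, `has-class?`, `select-by-class`, and `node-text` are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for parsed HTML. The shape mimics Enlive nodes
;; ({:tag ... :attrs ... :content ...}); the content itself is invented.
(def sample-dom
  [{:tag :div
    :attrs {:class "title"}
    :content [{:tag :h1 :attrs nil :content ["assoc"]}]}
   {:tag :div
    :attrs {:class "docstring extra"}
    :content [{:tag :pre :attrs nil :content ["Sample docstring text"]}]}])

(defn all-nodes
  "Returns every element map in the tree, at any depth."
  [nodes]
  (filter map?
          (tree-seq #(or (sequential? %) (map? %))
                    #(if (map? %) (:content %) (seq %))
                    nodes)))

(defn has-class?
  "True when the node's class attribute contains class-name."
  [node class-name]
  (let [classes (str/split (get-in node [:attrs :class] "") #"\s+")]
    (boolean (some #{class-name} classes))))

(defn select-by-class
  "Finds all element maps carrying the given class, however deeply nested."
  [nodes class-name]
  (filter #(has-class? % class-name) (all-nodes nodes)))

(defn node-text
  "Concatenates all strings found beneath a node."
  [node]
  (apply str (filter string? (tree-seq map? :content node))))

(def docstring-text
  (node-text (first (select-by-class sample-dom "docstring"))))

(println docstring-text)
```

The same `tree-seq` walk generalizes to any nested map: swap `has-class?` for whatever predicate on keys or values is needed.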
Looking for the "window.PAGE_DATA"...
(def docs-entire-page
  (fetch-page assoc-docs))
(def docs-examples-only
(-> docs-entire-page
second :content first :content last :content first))
Question: How does one convert textual EDN into a Clojure hashmap?
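A minimal sketch of one answer, assuming the text really is well-formed EDN: `clojure.edn/read-string` reads the first form in a string and returns it as ordinary Clojure data. The sample text below is invented, not actual clojuredocs.org data, and the names `sample-edn-text`, `sample-map`, and `malformed-result` are only for illustration.

```clojure
(require '[clojure.edn :as edn])

;; Invented sample; not actual clojuredocs.org data.
(def sample-edn-text
  "{:name \"assoc\", :ns \"clojure.core\", :arglists [\"map key val\"]}")

;; Well-formed EDN text becomes a regular hashmap.
(def sample-map (edn/read-string sample-edn-text))

(println (:name sample-map))
(println (:arglists sample-map))

;; A map literal with an odd number of forms cannot be read and throws.
(def malformed-result
  (try
    (edn/read-string "{:name \"assoc\" :ns}")
    (catch Exception _ :unreadable)))

(println malformed-result)
```

`clojure.edn/read-string` is generally the safer choice for scraped text, since it only reads data and never evaluates code.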
;; drop the first 31 and the last 8 characters surrounding the data
(def examples-map-str (subs docs-examples-only 31 (- (count docs-examples-only) 8)))
#_(count "// <![CDATA[ window.PAGE_DATA=")
(print examples-map-str)
(def x
(-> examples-map-str
(clojure.string/replace "\\\\\\" " ")
;; (clojure.string/replace #"\\\\n" "\\\n")
;; (clojure.string/replace #"\n\\" " \n\n")
;; (clojure.string/replace #":body\\" ":body\n\n")
(clojure.string/replace "\\\"" "\"") ;;;;
;; (clojure.string/replace #":body \"" ":body\n\n\"")
))
(print x)
Breakthrough! I think...
Initially, I tried two things. First, I removed slashes, quotes, and both, in different combinations. Second, I tried to convert the entire string, and its slash/quote/combo "reduced" permutations, into EDN via "read-string" (clojure.core) and "clojure.edn/read-string". I consistently got errors such as "invalid character" and "wrong # of inputs" (should be an even number).
After looking at the data string for some more time, I guessed that perhaps not all of this data is EDN (or a hashmap) after all. By splitting it into vectors on newline or double newline, I have now isolated something pretty close to what I was hoping to get.
I believe I now have something of a case analysis by which to further differentiate the items in these sub-vectors:
Metadata in the form of hashmaps
Code
Comments
Non-example "Notes"
;; TODOs:
;; - remove trailing slashes
;; - remove leading space + open double quote
;; - fix string-trailing space + close double quote to be only close double quote
;; - fix #5 vector with unsplit strings
;; - idea 1: add double newline into (", :)
;; - fix last line sitting by itself by "shunting it" to be the next vector's first line (see #5 and #7)
(def y (-> x
           (clojure.string/replace ":body" ":body\\n\\n")
           ;;(clojure.string/replace "\", :" "\",\\\\n:")
           ;; clojure.string/split needs a regex pattern, not a plain string
           (clojure.string/split #"\\\\n\\\\n")
           (->> (map #(clojure.string/split % #"\\n")))))
;; question: What are the issues with this code?
#_(read-string y)
#_(clojure.edn/read-string y)
#_(clojure.edn/read-string examples-map-str)
;; source: https://clojureverse.org/t/first-time-webscraper-could-you-give-any-pointers/7663/12?u=avidrucker
(require '[clojure.string :as str])
(require '[clojure.edn :as edn])
(defn fetch-page-data [clojuredocs-url]
  (let [;; get the window.PAGE_DATA=... line from the var page's HTML source
        page-data-str (-> (slurp clojuredocs-url)
                          (str/split #"\n") ; split needs a regex, not a string
                          (nth 2))]         ; get the 3rd line of the HTML source
    (->> page-data-str
         (drop (count "window.PAGE_DATA=")) ; drop leading characters
         butlast                            ; drop trailing semicolon
         (apply str)
         edn/read-string
         edn/read-string)))                 ; double decode (not a mistake)
(fetch-page-data "https://clojuredocs.org/clojure.core/map")
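To make the "double decode" step easier to follow without hitting the network, here is a sketch that builds a line shaped the way `fetch-page-data` expects and decodes it. It assumes the payload is an EDN map that was printed as EDN and then wrapped once more in an EDN string literal, which is what needing two `edn/read-string` calls suggests. The data, `sample-line`, and the helper `decode-page-data` are invented for illustration.

```clojure
(require '[clojure.edn :as edn])

;; Invented stand-in for the page data; not real clojuredocs.org content.
(def inner-data
  {:name "map"
   :examples [{:body "(map inc [1 2 3])\n;;=> (2 3 4)"}]})

;; Encode twice: the first pr-str gives EDN text, the second wraps that text
;; in a quoted, escaped string literal. Then add the prefix and the semicolon.
(def sample-line
  (str "window.PAGE_DATA=" (pr-str (pr-str inner-data)) ";"))

(defn decode-page-data
  "Strips the assumed prefix and trailing semicolon, then reads twice."
  [line]
  (let [prefix "window.PAGE_DATA="
        quoted (subs line (count prefix) (dec (count line)))]
    (-> quoted
        edn/read-string     ; first read: the string literal becomes EDN text
        edn/read-string)))  ; second read: the EDN text becomes a map

;; What a single read returns, for comparison.
(def once-decoded
  (edn/read-string (subs sample-line (count "window.PAGE_DATA=")
                         (dec (count sample-line)))))

(def decoded (decode-page-data sample-line))

(println (type once-decoded))
(println (:name decoded))
```

This would also explain why the manual slash and quote replacements earlier were so fiddly: the first `edn/read-string` undoes one layer of escaping in a single step.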