The first step is to define names for every attribute we want to handle.
Here is a quick guide to help with this process:
1. Pick a general prefix, usually the product/service name. For this case, I'll pick: hacker-news.
2. For data points, add some entity context to the name. In the case of Hacker News, I see they call it item.
3. If you mean to read collections, give each collection a name.
note
If you are developing a product for a company, use the company name for step one.
Avoid short names like :user/name, since they have a higher chance of collision,
which makes them much harder to integrate with other names.
Let's see which data points are interesting to extract from the Hacker News front
page:
The circle cursors point to visible data points. The open diamond means the data
is hidden (inside the markup).
I used the name :hacker-news.page/news to name the collection of items for this page.
Here is a text version of all the declared attributes:
It's time to learn about the HTML structure of the Hacker News page. I like to use the
Chrome inspector to navigate. I can see there is a table with the class itemlist
wrapping the item elements.
Now we can extract the rows. Hacker News doesn't make it straightforward: each item
uses two table rows, followed by a spacer row with the class spacer before the next
pair. On top of that, there is a different row with the class morespace at the end.
To deal with this, we will query for rows and remove the ones with the class spacer
or morespace.
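As an illustrative sketch (not the tutorial's exact code), the filtering step could look like this; the row shape below assumes simplified hickory-style element maps with the class stored under [:attrs :class]:

```clojure
;; Sketch: drop the spacer/morespace rows. The row shapes here are
;; simplified hickory-style maps used only for illustration.
(defn item-rows [rows]
  (remove #(contains? #{"spacer" "morespace"}
                      (get-in % [:attrs :class]))
          rows))

(item-rows [{:tag :tr :attrs {:class "athing"}}
            {:tag :tr :attrs {}}
            {:tag :tr :attrs {:class "spacer"}}
            {:tag :tr :attrs {:class "morespace"}}])
; => keeps only the first two rows
```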
Not much yet, but we gained the ability to filter pieces of the results.
tip
Some editors, like Cursive, highlight keywords when your cursor is over them.
You can use this indication to see how inputs connect to outputs in the editor.
I used an optional input named :hacker-news.page/news-page-url to allow
customization while still having a default.
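A resolver with an optional input and a fallback default can be sketched with Pathom 3's pco/? marker. This is only a sketch: the output attribute name and the resolver body here are my assumptions, not necessarily the tutorial's exact code:

```clojure
(pco/defresolver news-page-html-string
  [{:hacker-news.page/keys [news-page-url]}]
  ;; pco/? marks the input as optional
  {::pco/input  [(pco/? :hacker-news.page/news-page-url)]
   ::pco/output [:hacker-news.page/news-html-string]}
  ;; fall back to the front page when no URL is provided
  {:hacker-news.page/news-html-string
   (slurp (or news-page-url "https://news.ycombinator.com/news"))})
```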
To provide the data with the URL for the next page, I'll add a new resolver. This resolver
will expose the attribute :hacker-news.page/news-next-page that contains the key
:hacker-news.page/news-page-url:
You can look at this resolver as an implementation of a linked list. The attribute
:hacker-news.page/news-next-page is a link to the next page item. In Pathom terms,
we make that happen by providing :hacker-news.page/news-page-url in that context,
from which we can navigate to the next :hacker-news.page/news-next-page, and so on.
I check if there is a More link; if there isn't, we return no data, telling Pathom
this attribute is unavailable.
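The shape of such a resolver might look like the sketch below; the input attribute name and the find-more-link-href helper are hypothetical, shown only to illustrate the linked-list idea:

```clojure
(pco/defresolver news-next-page
  [{:hacker-news.page/keys [news-hickory]}]
  {::pco/output [{:hacker-news.page/news-next-page
                  [:hacker-news.page/news-page-url]}]}
  ;; find-more-link-href is a hypothetical helper that locates the
  ;; "More" link; returning nil tells Pathom the attribute is
  ;; unavailable (i.e., we are on the last page)
  (when-let [more-href (find-more-link-href news-hickory)]
    {:hacker-news.page/news-next-page
     {:hacker-news.page/news-page-url
      (str "https://news.ycombinator.com/" more-href)}}))
```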
You may notice we now have two resolvers that fetch the HTML string for the news page
and parse it. Each resolver does its own parsing. We can make them share this work
by extracting the parsing step into a new resolver.
How much to break things apart is a crucial design choice when writing Pathom
resolvers: as you add more resolvers, you expand the connection points available to
other resolvers. In general, it is good practice to keep attributes spread across
resolvers, but it's fine to provide many attributes in a single resolver when they
share a closely related process. This reduces the amount of work Pathom has to do to
integrate them.
Let's play with our new resolvers:
; remember to update env to include all resolvers
(def env
  (-> {::durable-cache* cache*}
      (pci/register
        [news-page-html-string
         news-page-hickory
         news-page
         news-next-page])))
(comment
  ; get titles from first and second page
  (p.eql/process env
    [{:hacker-news.page/news
      [:hacker-news.item/title]}
     {:hacker-news.page/news-next-page
      [{:hacker-news.page/news
        [:hacker-news.item/title]}]}]))
Notice the query inside :hacker-news.page/news-next-page is the same as the one used
in the parent query. For cases like this we can use recursive queries; let's say we
want to pull the next three pages:
(comment
  (p.eql/process env
    [{:hacker-news.page/news
      [:hacker-news.item/title]}
     ; recurse bounded to 3 steps
     {:hacker-news.page/news-next-page 3}]))
How cool is that?! You may be saying now: ok, but that's a weird tree output.
To flatten the items out, we can use tree-seq:
(comment
  (->> (p.eql/process env
         [{:hacker-news.page/news
           [:hacker-news.item/title]}
          ; recurse bounded to 3 steps
          {:hacker-news.page/news-next-page 3}])
       (tree-seq :hacker-news.page/news-next-page
         ; we need vector at the end because tree-seq expects children to be a collection
         (comp vector :hacker-news.page/news-next-page))
       ; mapcat the news to have a single flat list
       (into [] (mapcat :hacker-news.page/news))))
note
Recursive queries can be a number (bounded) or the symbol ... (unbounded). If you use
the unbounded form, it will pull pages until Hacker News runs out of them. At the
time I tested, there were 21 pages, so if you try it, it may take some time to finish.
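The unbounded variant mentioned above swaps the number for the quoted symbol; a sketch:

```clojure
(comment
  ; pull every page until Hacker News runs out of them (may take a while)
  (p.eql/process env
    [{:hacker-news.page/news
      [:hacker-news.item/title]}
     ; unbounded recursion
     {:hacker-news.page/news-next-page '...}]))
```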
Pathom also supports nested inputs. This means we can create a resolver that performs
the same process we did with the query before:
(pco/defresolver all-news-pages [input]
  {::pco/input  [{:hacker-news.page/news
                  [:hacker-news.item/age
                   :hacker-news.item/author-name
                   :hacker-news.item/id
                   :hacker-news.item/comments-count
                   :hacker-news.item/score
                   :hacker-news.item/rank-in-page
                   :hacker-news.item/source
                   :hacker-news.item/title
                   :hacker-news.item/url]}
                 ; note the recursive query here
                 {:hacker-news.page/news-next-page '...}]
   ::pco/output [{:hacker-news.page/news-all-pages
                  [:hacker-news.item/age
                   :hacker-news.item/author-name
                   :hacker-news.item/id
                   :hacker-news.item/comments-count
                   :hacker-news.item/score
                   :hacker-news.item/rank-in-page
                   :hacker-news.item/source
                   :hacker-news.item/title
                   :hacker-news.item/url]}]}
  {:hacker-news.page/news-all-pages
   (->> input
        (tree-seq :hacker-news.page/news-next-page
          (comp vector :hacker-news.page/news-next-page))
        (into [] (mapcat :hacker-news.page/news)))})
Now we can, for example, make this query to read all titles in news, in all pages:
Similar to before, but this time we require some user id to load the page. The arrows
show that the same attribute we read on the page as :hacker-news.user/id is used
in the URL to load the page.
        first :content second :content first :attrs :href)
      [_ date] (re-find #"(\d{4}-\d{2}-\d{2})" str)]
  date)})
Note we also use the durable cache, so we can keep playing with it. When I created
this, I kept hitting the same cache entry until I got the extraction code right.
important
You may have noticed that we now have two different attributes that mean user id.
We have :hacker-news.item/author-name and now :hacker-news.user/id. If we try to load the karma for the user in the HN item, it won't be able to get there.
One idea is to change our previous resolver and rename :hacker-news.item/author-name
to :hacker-news.user/id. This would work, but it reduces the semantic accuracy of
the name: :hacker-news.item/author-name has a precise meaning, the author's name
in an item.
To reconcile this situation, we can create an alias resolver, which allows Pathom
to navigate from one name to another. This is what I'm going to use next.
It's also good to point out that aliases are directional. We are allowing
:hacker-news.item/author-name to be translated into :hacker-news.user/id, but not
the reverse.
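Pathom 3 ships a built-in helper for exactly this: alias-resolver from com.wsscode.pathom3.connect.built-in.resolvers. A sketch (the var name below is my own):

```clojure
(require '[com.wsscode.pathom3.connect.built-in.resolvers :as pbir])

;; one-way alias: an item's author-name can be used as a user id,
;; but a user id is not translated back into an author-name
(def author-name->user-id
  (pbir/alias-resolver :hacker-news.item/author-name :hacker-news.user/id))
```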
Let's see who has the most karma from the front-page:
We can see at the top we have almost the same data as we did on the news list, except
the rank position (which makes sense since it's relative to that page).
Let's start writing a resolver that can read this information given some item id:
By looking at the page, we can see comments are nested, and although this example
isn't showing it, they also support deep nesting.
Most of the time, the HTML will follow the structure of the data, but this isn't the
case here.
Try inspecting the page. You will see they use a flat table and manually add the
spacings to convey the nesting.
This means we need to do more work to reconstruct the tree from a flat structure.
Let's do that in parts. First, let's extract it in a form closer to what we have: a
flat structure. On top of the data I mentioned in the image, we will now also add
:hacker-news.comment/ident, which tells us the row's indentation level. Later I'll
use this to transform the list into a tree.
We can test that with (remember to add all resolvers to the env):
(comment
  ; get all comments author names
  (p.eql/process env
    {:hacker-news.item/id "25733200"}
    [{:hacker-news.item/comments-flat
      [:hacker-news.comment/author-name]}]))
Now it's time to transform the list into a tree.
To think about this process, let's make up some example data and see how it should
be transformed:
; first, visually, it looks like this:
; 1
; | 2
; | 3
; 4
; | 5
; | | 6

; we have a list like this, with ids and indentations; the rest of the data
; we can ignore for the purpose of this transformation
[{:id 1 :ident 0}
 {:id 2 :ident 1}
 {:id 3 :ident 1}
 {:id 4 :ident 0}
 {:id 5 :ident 1}
 {:id 6 :ident 2}]

; and our goal is to transform that into:
[{:id 1
  :children [{:id 2}
             {:id 3}]}
 {:id 4
  :children [{:id 5
              :children [{:id 6}]}]}]
To do this, we have to go over the list while remembering past items. My idea for
tracking the current level is a stack: when the indentation goes up, push the item
onto the stack; when it goes down, pop it off. The stack then contains the ids of
the parent items. As I scan, I'll also add the items to an index keyed by ID, so I
can modify any item at any time with ease.
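As a plain-Clojure illustration of the flat-to-tree idea (an alternative sketch, not this tutorial's actual implementation), a recursive version over the example data could look like this:

```clojure
;; Sketch: rebuild the nesting from the flat {:id .. :ident ..} rows.
;; Items more indented than the current head become its children.
(defn flat->tree [items]
  (when (seq items)
    (let [[{:keys [ident] :as head} & more] items
          [children siblings] (split-with #(> (:ident %) ident) more)
          node (cond-> (dissoc head :ident)
                 (seq children) (assoc :children (flat->tree children)))]
      (cons node (flat->tree siblings)))))

(flat->tree [{:id 1 :ident 0} {:id 2 :ident 1} {:id 3 :ident 1}
             {:id 4 :ident 0} {:id 5 :ident 1} {:id 6 :ident 2}])
; => ({:id 1 :children ({:id 2} {:id 3})}
;     {:id 4 :children ({:id 5 :children ({:id 6})})})
```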
I decided it would be fun to use Pathom as part of this process too; it's nice how
the things I discussed already feel like a graph.
Let's pause implementing HN things and play a bit with what we have.
In the previous examples, we requested information using the EQL interface. EQL is
the most efficient and precise way to use Pathom, because it lets Pathom look at
the full request and optimize as much as possible.
For our play here, I'll now introduce Smart Maps, a different interface to Pathom 3.
A Smart Map is a data structure that works like a Clojure map, but when you access
a key whose value the map doesn't know, it uses the Pathom resolvers to figure it
out (when possible).
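This on-demand behavior can be sketched like this, reusing the item id from earlier in this section; the exact demo code may differ:

```clojure
(require '[com.wsscode.pathom3.interface.smart-map :as psm])

;; the smart map starts knowing only the item id; reading any other
;; key runs the registered resolvers behind the scenes
(def item (psm/smart-map env {:hacker-news.item/id "25733200"}))

(:hacker-news.item/title item)
```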
For this demo, I'll start with some item id and fetch some data:
There we can see many attributes related to the item (given we provided an item id).
You can also see all the pages, and some user data (because of the relationship from
the item to the user).
important
The options offered by the Smart Map datafy are contextual. This means that if you
change the data, you may get different possible paths.
For some REPL play with it, here are some suggestions:
; author and content of first comment of the first post on news
(-> (psm/smart-map env {})
    :hacker-news.page/news
    first
    :hacker-news.item/comments
    first
    (select-keys [:hacker-news.comment/author-name
                  :hacker-news.comment/content
                  :hacker-news.user/join-date]))
tip
Values returned by smart maps are also smart maps. This is what allows the navigation
you have seen before. It also means you can datafy at any point to see which paths
are available, for example:
If we compare with the news resolvers, they are almost the same. The differences are:
The resolver names are different
The attribute names used for the collection node are different
The initial page uses a different URL
The table with items has some content before the actual first item, which our new implementation skips.
This similarity can make you feel like there is some higher-order thing to do here,
especially when we consider that the ask page will be another instance of the same
situation.
We can generalize this by writing a function that returns a set of resolvers. Let's
replace our previous page implementation with this idea: