You need to know:
You must use XPath to write element-finding rules for pageElement and nextLink. Here is a good cheatsheet I use: https://devhints.io/xpath
You must use RegExp to write URL-matching rules. A cheatsheet: https://devhints.io/regexp
You can write custom rules in the box below. Remember to press the save button.
url ^https:\/\/www\.linuxquestions\.org\/questions\/
pageElement //div[@id="posts"]//div[@class="page"]
nextLink //a[@rel="next"]
=== *** ===
To disable pagination on certain pages, place a RegExp matching their URLs in the box below.
^https?:\/\/www\.google\.com\/
Let's say that we want to pagerize threads such as this one: "Linux Mint 18.3 fails to install on HP laptop".
There are 3 steps:
Step 1: url, a JavaScript regular expression (RegExp)
The URL we are after is
https://www.linuxquestions.org/questions/linux-softwa....
All the forum threads happen to be nested under https://www.linuxquestions.org/questions/....
So a good RegExp for the URL above is ^https:\/\/www\.linuxquestions\.org\/questions\/
Meaning: match every URL that starts with https://www.linuxquestions.org/questions/
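A quick way to sanity-check a url rule before saving it is to run it against a few addresses in any JavaScript console. The sketch below does that for the rule above; the sample URLs are hypothetical and chosen only to show one match and two misses.

```javascript
// The url rule from the example above, written as a JavaScript RegExp literal.
const urlRule = /^https:\/\/www\.linuxquestions\.org\/questions\//;

// Hypothetical sample URLs, used only for illustration.
const samples = [
  "https://www.linuxquestions.org/questions/linux-software-2/",
  "https://www.linuxquestions.org/",
  "http://www.linuxquestions.org/questions/",
];

// Test every sample against the rule and keep the outcome next to the URL.
const results = samples.map((url) => ({ url, matches: urlRule.test(url) }));

for (const r of results) {
  console.log(r.matches, r.url);
}
```

Only the first sample matches. The second is outside /questions/, and the third uses http, which this rule does not allow because it spells out https.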
Step 2: nextLink XPath
Use Chrome DevTools to inspect the button that leads to the next page.
Now try to work out the most precise selector you can write to find that button.
My guess would be to target the a tag with rel="next". It seems to be a good target. It is not unique on the
webpage, but we don't have to worry, since all its copies take the user to the next page.
The XPath for it is: //a[@rel="next"]. Meaning: find all a tags, no matter where they
are in the DOM, with the attribute rel equal to next.
LPT: You can press ⌘+F (Ctrl+F on Windows and Linux) in the DevTools Elements panel to test your XPath live
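If you prefer the console to the search box, the standard document.evaluate API can count how many nodes an XPath selects. This is only a sketch: countXPathMatches is a hypothetical helper name, not something the extension provides, and it expects a browser Document such as the document object in DevTools.

```javascript
// Hypothetical helper: returns how many nodes an XPath expression selects.
// `doc` should be a browser Document, e.g. `document` in the DevTools console.
function countXPathMatches(doc, xpath) {
  // Same numeric value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in browsers.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const snapshot = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  return snapshot.snapshotLength;
}

// Usage in the DevTools console on a thread page:
//   countXPathMatches(document, '//a[@rel="next"]');
```

A result of 0 means the expression selects nothing on that page. In Chrome DevTools the console utility $x('//a[@rel="next"]') gives a similar check and returns the matching elements themselves.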
Step 3: pageElement XPath
It is important to note that pageElement is a single post. NextPage will fetch the next page
and look for all such posts on it. After that, NextPage will append the posts it found to the
end of the current series of posts.
Use the same logic to find the XPath for a single post as in Step 2.
I found a good XPath to be //div[@id="posts"]//div[@class="page"]
Put all 3 rules in the box and follow them with an empty line or a separator of your choice.
url ^https:\/\/www\.linuxquestions\.org\/questions\/
pageElement //div[@id="posts"]//div[@class="page"]
nextLink //a[@rel="next"]
=== *** ===
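To make the shape of a rule block concrete, here is a minimal sketch of how such text could be split into rule objects. It is an assumption for illustration only: parseRules is a hypothetical function, and the extension may read the box differently. It treats any line that does not start with url, pageElement or nextLink as the end of a rule, which covers both an empty line and a separator.

```javascript
// Hypothetical parser, for illustration only; not the extension's own code.
function parseRules(text) {
  const rules = [];
  let current = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    // A rule line is a known key, whitespace, then the value.
    const match = /^(url|pageElement|nextLink)\s+(.+)$/.exec(line);
    if (match) {
      current[match[1]] = match[2];
    } else if (Object.keys(current).length > 0) {
      // Empty line or separator: close the rule collected so far.
      rules.push(current);
      current = {};
    }
  }
  // Keep a trailing rule that was not followed by a separator.
  if (Object.keys(current).length > 0) rules.push(current);
  return rules;
}

// The rule block from this guide, exactly as typed in the box.
const box = [
  String.raw`url ^https:\/\/www\.linuxquestions\.org\/questions\/`,
  'pageElement //div[@id="posts"]//div[@class="page"]',
  'nextLink //a[@rel="next"]',
  "=== *** ===",
].join("\n");

const parsed = parseRules(box);
const urlPattern = new RegExp(parsed[0].url);
console.log(parsed.length, urlPattern.test("https://www.linuxquestions.org/questions/"));
```

Each parsed rule carries the three values from the steps above, and the url value can be passed straight to new RegExp because it is already written in RegExp syntax.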