The Notebook Review forums were hosted by TechTarget, which shut them down on January 31, 2022. This static read-only archive was pulled by NBR forum users between January 20 and January 31, 2022, in an effort to make sure that the valuable technical information posted on the forums was preserved. For current discussions, many NBR forum users moved over to NotebookTalk.net after the shutdown.

Forums closing at the end of January - Alternatives?

Discussion in 'Dell Latitude, Vostro, and Precision' started by mdsurveyor, Jan 18, 2022 at 10:10 AM.

  1. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    I haven't done it yet but I'm planning to do it with Notepad++ and the "find and replace in files" function. Should be able to tell it to check all *.html files, and then search for that line of JavaScript and replace it with nothing. (Once for each of the two lines. Or really just doing it for the second line should be fine.)

    Decided that the "spidering" from the "similar threads" links is out of control, so I came up with a different approach.
    1. Download the "forum" base URL only (it should grab all of the pages for that specific forum, i.e. the thread lists, but not the threads themselves).
    2. Search the output for lines containing data-previewUrl="threads/ ... grep works fine here (see the sketch after this list).
    3. Use that to build a list of all of the thread base URLs. (Notepad++ & some find-and-replace)
    4. Hand that to HTTrack and have it download those along with the forum base URL.
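    If you want to script steps 2-3, something like this should do it. It's an untested sketch: check one of the downloaded files first since the attribute casing/quoting may differ, the /preview stripping and the thread-urls.txt output name are just my guesses, and it assumes you run it in the folder where HTTrack saved the forum index pages.
    Code:
    # pull thread base URLs out of the saved forum index pages (rough sketch)
    grep -hoE 'data-previewUrl="threads/[^"]+' *.html \
      | sed -e 's|^data-previewUrl="|http://forum.notebookreview.com/|' \
            -e 's|/preview$|/|' \
      | sort -u > thread-urls.txt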
     
    Last edited: Jan 19, 2022 at 4:41 PM
    Reciever likes this.
  2. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    OK, almost there
    Thanks to your observation it's just the one line of JS that needs to go. Should be a sed one-liner. BTW if you are struggling with 403s, I noticed that --cookies=0 does the trick.
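    Something along these lines should work as the one-liner; untested, and THE_OFFENDING_JS_PATTERN is just a placeholder for a unique snippet of the actual line:
    Code:
    # delete every line matching the offending script from all saved pages
    # (THE_OFFENDING_JS_PATTERN is a placeholder; substitute text from the real line)
    find . -name '*.html' -exec sed -i '/THE_OFFENDING_JS_PATTERN/d' {} +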
     
    Reciever and Aaron44126 like this.
  3. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    Thanks, killing cookies did seem to help.
     
    etern4l and Reciever like this.
  4. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    Yep, everything is working now, thanks to the post-processing step to remove the if statement in the offending JS.

    Now waiting for the AW threads to get picked up at full depth.... It's impressive how much content has been produced since 2004.
     
    Last edited: Jan 20, 2022 at 3:17 AM
  5. Reciever

    Reciever D! For Dragon!

    Reputations:
    1,491
    Messages:
    5,320
    Likes Received:
    4,090
    Trophy Points:
    431
    Would one or both of you guys be willing to make a quick and dirty guide? I'd like to put it in my sig to help get the word going around.
     
    Tenoroon likes this.
  6. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    The problem is: what happens if lots of people start trying to do the same thing, which is grab the entire site? By my rough calculations, it would take about 10 days to fetch the whole thing, and if we effectively self-DDoS the site, we might not get even that much. Once someone fetches it all, it will be easy to share privately or perhaps over BitTorrent.

    We could try grabbing different sections of the forum, but I am not sure how specifying a different starting URL affects the process. Early observations indicated that it starts from the top index of the forum anyway.
     
    dmanti likes this.
  7. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    One more idea on how to grab a subsection: exclude the main index and all undesirable subforum indices. Hopefully that will stop it from accessing the other sections. It might still grab some extra content through direct links, but that'd hopefully be minor. Unfortunately, I won't have the time to try this until late this evening.
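    In scan-rule terms it would be something like the below, one line per subforum index to block. This assumes the subforum index pages live under /forums/; the slugs and IDs here are made up, the real ones are in the forum list URLs.
    Code:
    # exclude other subforum index pages (slugs and IDs below are placeholders)
    -*http://forum.notebookreview.com/forums/some-other-subforum.123/*
    -*http://forum.notebookreview.com/forums/another-subforum.456/*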
     
  8. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    I'm grabbing the Precision section. I produced a list of URLs and handed that to HTTrack (see attached). It ran overnight and grabbed the first page of about half of the threads. It'll still take a while to get through (some of the threads are a few hundred pages long). If I have time I'll expand to other Dell sections, but this is actually the smallest one so I'm not sure.

    I can produce URL lists for other sections. Just let me know what you want. I'll post instructions on how I am working with WinHTTrack here in a bit.
     

    Attached Files:

    etern4l likes this.
  9. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    Yes, brute forcing this is a no go. The index of "which notebook should I buy" alone has 5500 pages lol

    It seems to be doing a mostly breadth-first search, so it's getting the first and last few pages from each of the first and last index pages. The initial estimate of 100k objects was a gross underestimate. I imagine we are talking millions.

    A massive exclusion list should be able to help by blocking off key index nodes in the graph, but this is laborious. Curious to see what you came up with.
     
  10. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    Yeah, easily in the millions. There's not enough time to get "everything" (given the throttling that is required to make the firewall happy anyway).

    web.archive.org has a huge chunk of it cached. It might be possible to fetch pages from there to reconstruct parts of the forum once the real thing goes away. (...It's way slower to access through there.)
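    For reference, Wayback Machine captures live at predictable URLs, so individual pages could be pulled directly; the timestamp and thread URL below are just examples:
    Code:
    # fetch the capture nearest to the given timestamp (YYYYMMDDhhmmss);
    # the thread URL here is a made-up example
    wget 'https://web.archive.org/web/20220120000000/http://forum.notebookreview.com/threads/some-thread.123456/'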

    After messing with this a lot yesterday, I believe that I have settled on something that works. Here's how I am using WinHTTrack. (I know there's a command line for HTTrack as well, but I'm using the GUI right now; it was just easier to get started that way for me. I've put a rough command-line sketch after the steps below.)

    * Open WinHTTrack, and give your project a name and base path (which is where all of the stuff will be downloaded to).
    * Next it wants the web addresses to start from. You can hand it a forum URL, a thread URL, or a list of many URLs. If there are too many to fit in the box then you can drop them into a text file and put that in the "URL list" field.
    * Next, the "Set options" button.
    ** Browser ID tab: I put a "real" browser user agent in the browser identity field; not sure how much it matters but I don't want to look like HTTrack if I can help it. Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0
    ** Spider tab: Uncheck "Accept cookies".
    ** Flow control tab: Set to 1 connection, uncheck "Keep-Alive"
    ** Limits tab: I set "max size for any non-HTML file" to 26214400 (25 MB), and max connections / seconds to 0.5 (one connection every two seconds)
    ** Scan rules tab: This is what I have.
    Code:
    +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.zip +*.rar -ad.doubleclick.net/* -mime:application/foobar
    +*http://forum.notebookreview.com/members/*
    +*http://forum.notebookreview.com/goto/*
    +*http://forum.notebookreview.com/attachments/*/
    +*http://forum.notebookreview.com/attachments/*/*
    +*http://forum.notebookreview.com/media/*/
    +*http://forum.notebookreview.com/media/*/*
    -*http://forum.notebookreview.com/media/albums
    -*http://forum.notebookreview.com/media/*/albums
    -*http://forum.notebookreview.com/media/categories/*
    -*http://forum.notebookreview.com/media/*/report
    -*http://forum.notebookreview.com/media/*/like
    -*?order=*
    -*?direction=*
    -*members/*/trophies*
    -*members/*/following*
    -*members/*/followers*
    -*members/*/report*
    -*members/*/post*
    -*members/*/recent-activity*
    -*members/*/recent-content*
    -*members/*/reputation*
    -*web.archive.org*
    The second and third lines ("members" and "goto") could be excluded to speed things up. The "members" line has it download user profiles that it runs across. (It will also download the "members" page linked at the top, which has some top users in various categories, and those users' profiles. This could be excluded but I haven't bothered with that.) The "members/*/..." exclusions below keep it from checking other pages beyond the base member profile (this is sort of left over from past attempts, where I had to prevent it from noticing users' "recent posts" in other forums, but I still think they're good to keep). The "goto" line basically allows the "up arrow" link on quoted posts to work, but it means an extra hit for every quote that it runs across. I am excluding "web.archive.org" links at the bottom there because I actually ran into a link to this very forum, but on web.archive.org, and it started spidering around there too. (Quite possibly my own link from earlier in this thread.)

    That's it, then you basically start the job and wait.
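    As for the command-line sketch mentioned above: I haven't actually run it this way, so treat it as a rough, untested equivalent of the GUI settings. Linux-style shell syntax, flag spellings from memory (check httrack --help), urls.txt and the output path are placeholders, and I've trimmed the scan rules to a few examples.
    Code:
    # rough, untested CLI equivalent of the WinHTTrack settings above
    httrack -%L urls.txt \
            -O ./nbr-archive \
            -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" \
            -b0 -c1 -%c0.5 -m26214400 \
            "+*.png" "+*.gif" "+*.jpg" "+*.jpeg" "+*.css" "+*.js" "+*.zip" "+*.rar" \
            "+*http://forum.notebookreview.com/members/*" \
            "+*http://forum.notebookreview.com/goto/*" \
            "-*?order=*" "-*?direction=*" "-*web.archive.org*"
    # -%L: start URLs from a file      -O: output/project path
    # -F: browser identity             -b0: do not accept cookies
    # -c1: one connection              -%c0.5: max connections per second (not sure the CLI takes 0.5; use 1 if not)
    # -m: max size for non-HTML files  trailing +/- patterns: scan rules (add the full set from above)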

    As for what URLs to give it in the first step, I first handed it the forum URL only, so it would download the pages with the thread lists, and from there I generated the thread URLs for that forum (see above). I stuffed those into a .txt file (attached two posts up) and used it for the "URL list" on the next run.

    If you give it a single thread URL (the first page of the thread), it should be able to pull the whole thread down (all of the pages) and nothing else that isn't required.

    When it's done I plan to do some post-processing: removing the JS that causes it to link back to the real NBR when it shouldn't (discussed above), and adding an info bar at the top of each page identifying this as a "mostly working" read-only archive. Then I'll post it on my web server. Once this site actually disappears, I'll add it to Google and DuckDuckGo so the content can be indexed and remain searchable.
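    The info bar will probably just be another sed pass, something like the below; untested, the banner markup and text are placeholders, and the <body> match may need adjusting to the actual page source. (The JS removal pass was sketched earlier in the thread.)
    Code:
    # rough sketch: inject a notice right after the opening <body> tag of every page
    # (banner markup/text is a placeholder; adjust the <body ...> match to the real pages)
    find . -name '*.html' -exec sed -i \
      's|<body\([^>]*\)>|<body\1><div style="background:#ffd;padding:6px;text-align:center;">Read-only archive of the NotebookReview forums (saved January 2022); some features will not work.</div>|' {} +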
     
    Last edited: Jan 20, 2022 at 2:06 PM