The Notebook Review forums were hosted by TechTarget, which shut them down on January 31, 2022. This static read-only archive was pulled by NBR forum users between January 20 and January 31, 2022, in an effort to make sure that the valuable technical information posted on the forums was preserved. For current discussions, many NBR forum users moved over to NotebookTalk.net after the shutdown.

Forums closing at the end of January - Alternatives?

Discussion in 'Dell Latitude, Vostro, and Precision' started by mdsurveyor, Jan 18, 2022 at 10:10 AM.

  1. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    I haven't done it yet but I'm planning to do it with Notepad++ and the "find and replace in files" function. Should be able to tell it to check all *.html files, and then search for that line of JavaScript and replace it with nothing. (Once for each of the two lines. Or really just doing it for the second line should be fine.)

    Decided that the "spidering" from the "similar threads" links is out of control, so I came up with a different approach.
    1. Download the "forum" base URL only (it should grab all of the pages for that specific forum, i.e. the thread lists, but not the threads themselves).
    2. Search the output for lines containing data-previewUrl="threads/ ... grep works fine here (see the sketch after this list).
    3. Use that to build a list of all of the thread base URLs. (Notepad++ & some find-and-replace)
    4. Hand that to HTTrack and have it download those along with the forum base URL.
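    If you want to script steps 2-3, something like this should do it. It's an untested sketch: check one of the downloaded files first since the attribute casing/quoting may differ, the /preview stripping and the thread-urls.txt output name are just my guesses, and it assumes you run it in the folder where HTTrack saved the forum index pages.
    Code:
    # pull thread base URLs out of the saved forum index pages (rough sketch)
    grep -hoE 'data-previewUrl="threads/[^"]+' *.html \
      | sed -e 's|^data-previewUrl="|http://forum.notebookreview.com/|' \
            -e 's|/preview$|/|' \
      | sort -u > thread-urls.txt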
     
    Last edited: Jan 19, 2022 at 4:41 PM
    Reciever likes this.
  2. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    OK, almost there
    Thanks to your observation it's just the one line of JS that needs to go. Should be a sed one-liner. BTW if you are struggling with 403s, I noticed that --cookies=0 does the trick.
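    Something along these lines should work as the one-liner; untested, and THE_OFFENDING_JS_PATTERN is just a placeholder for a unique snippet of the actual line:
    Code:
    # delete every line matching the offending script from all saved pages
    # (THE_OFFENDING_JS_PATTERN is a placeholder; substitute text from the real line)
    find . -name '*.html' -exec sed -i '/THE_OFFENDING_JS_PATTERN/d' {} +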
     
    Reciever and Aaron44126 like this.
  3. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    Thanks, killing cookies did seem to help.
     
    etern4l and Reciever like this.
  4. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    Yep, everything is working now, thanks to the post-processing step to remove the if statement in the offending JS.

    Now waiting for the AW threads to get picked up at full depth.... It's impressive how much content has been produced since 2004.
     
    Last edited: Jan 20, 2022 at 3:17 AM
  5. Reciever

    Reciever D! For Dragon!

    Reputations:
    1,491
    Messages:
    5,320
    Likes Received:
    4,090
    Trophy Points:
    431
    Would one or both of you guys be willing to make a quick and dirty guide? I'd like to put it in my sig to help get the word going around.
     
    Tenoroon likes this.
  6. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    The problem is: what happens if lots of people start trying to do the same thing, which is grab the entire site? By my rough calculations, it would take about 10 days to fetch the whole thing, and if we effectively self-DDoS the site, we might not get even that much. Once someone fetches it all, it will be easy to share privately or perhaps over BitTorrent.

    We could try grabbing different sections of the forum, but I am not sure how specifying a different starting URL affects the process. Early observations indicated that it starts from the top index of the forum anyway.
     
    dmanti likes this.
  7. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    One more idea on how to grab a subsection: exclude the main index and all undesirable subforum indices. Hopefully that will stop it from accessing the other sections. It might still grab some extra content through direct links, but that'd hopefully be minor. Unfortunately, I won't have the time to try this until late this evening.
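    In scan-rule terms it would be something like the below, one line per subforum index to block. This assumes the subforum index pages live under /forums/; the slugs and IDs here are made up, the real ones are in the forum list URLs.
    Code:
    # exclude other subforum index pages (slugs and IDs below are placeholders)
    -*http://forum.notebookreview.com/forums/some-other-subforum.123/*
    -*http://forum.notebookreview.com/forums/another-subforum.456/*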
     
  8. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    I'm grabbing the Precision section. I produced a list of URLs and handed that to HTTrack (see attached). It ran overnight and grabbed the first page of about half of the threads. It'll still take a while to get through (some of the threads are a few hundred pages long). If I have time I'll expand to other Dell sections, but this is actually the smallest one so I'm not sure.

    I can produce URL lists for other sections. Just let me know what you want. I'll post instructions on how I am working with WinHTTrack here in a bit.
     

    Attached Files:

    etern4l likes this.
  9. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    Yes, brute forcing this is a no go. The index of "which notebook should I buy" alone has 5500 pages lol

    It seems to be doing a mostly breadth-first search, so it's getting the first and last few pages from each of the first and last index pages. The initial estimate of 100k objects was a gross underestimate. I imagine we are talking millions.

    A massive exclusion list should be able to help by blocking off key index nodes in the graph, but this is laborious. Curious to see what you came up with.
     
  10. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,543
    Likes Received:
    2,038
    Trophy Points:
    331
    Yeah, easily in the millions. There's not enough time to get "everything" (given the throttling that is required to make the firewall happy anyway).

    web.archive.org has a huge chunk of it cached. It might be possible to fetch pages from there to reconstruct parts of the forum once the real thing goes away. (...It's way slower to access through there.)
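    For reference, Wayback Machine captures live at predictable URLs, so individual pages could be pulled directly; the timestamp and thread URL below are just examples:
    Code:
    # fetch the capture nearest to the given timestamp (YYYYMMDDhhmmss);
    # the thread URL here is a made-up example
    wget 'https://web.archive.org/web/20220120000000/http://forum.notebookreview.com/threads/some-thread.123456/'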

    After messing with this a lot yesterday, I believe that I have settled on something that works. Here's how I am using WinHTTrack. (I know there's a command line for HTTrack as well, but I'm using the GUI right now; it was just easier to get started that way for me. I've put a rough command-line sketch after the steps below.)

    * Open WinHTTrack, and give your project a name and base path (which is where all of the stuff will be downloaded to).
    * Next it wants the web addresses to start from. You can hand it a forum URL, a thread URL, or a list of many URLs. If there are too many to fit in the box then you can drop them into a text file and put that in the "URL list" field.
    * Next, the "Set options" button.
    ** Browser ID tab: I put a "real" browser user agent in the browser identity field; not sure how much it matters but I don't want to look like HTTrack if I can help it. Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0
    ** Spider tab: Uncheck "Accept cookies".
    ** Flow control tab: Set to 1 connection, uncheck "Keep-Alive"
    ** Limits tab: I set "max size for any non-HTML file" to 26214400 (25 MB), and max connections / seconds to 0.5 (one connection every two seconds)
    ** Scan rules tab: This is what I have.
    Code:
    +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.zip +*.rar -ad.doubleclick.net/* -mime:application/foobar
    +*http://forum.notebookreview.com/members/*
    +*http://forum.notebookreview.com/goto/*
    +*http://forum.notebookreview.com/attachments/*/
    +*http://forum.notebookreview.com/attachments/*/*
    +*http://forum.notebookreview.com/media/*/
    +*http://forum.notebookreview.com/media/*/*
    -*http://forum.notebookreview.com/media/albums
    -*http://forum.notebookreview.com/media/*/albums
    -*http://forum.notebookreview.com/media/categories/*
    -*http://forum.notebookreview.com/media/*/report
    -*http://forum.notebookreview.com/media/*/like
    -*?order=*
    -*?direction=*
    -*members/*/trophies*
    -*members/*/following*
    -*members/*/followers*
    -*members/*/report*
    -*members/*/post*
    -*members/*/recent-activity*
    -*members/*/recent-content*
    -*members/*/reputation*
    -*web.archive.org*
    The second and third lines ("members" and "goto") could be excluded to speed things up. The "members" line has it download user profiles that it runs across. (It will also download the "members" page linked at the top, which has some top users in various categories, and those users' profiles. This could be excluded but I haven't bothered with that.) The "members/*/..." exclusions below keep it from checking other pages beyond the base member profile (this is sort of left over from past attempts, where I had to prevent it from noticing users' "recent posts" in other forums, but I still think they're good to keep). The "goto" line basically allows the "up arrow" link on quoted posts to work, but it means an extra hit for every quote that it runs across. I am excluding "web.archive.org" links at the bottom there because I actually ran into a link to this very forum, but on web.archive.org, and it started spidering around there too. (Quite possibly my own link from earlier in this thread.)

    That's it, then you basically start the job and wait.
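    As for the command-line sketch mentioned above: I haven't actually run it this way, so treat it as a rough, untested equivalent of the GUI settings. Linux-style shell syntax, flag spellings from memory (check httrack --help), urls.txt and the output path are placeholders, and I've trimmed the scan rules to a few examples.
    Code:
    # rough, untested CLI equivalent of the WinHTTrack settings above
    httrack -%L urls.txt \
            -O ./nbr-archive \
            -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" \
            -b0 -c1 -%c0.5 -m26214400 \
            "+*.png" "+*.gif" "+*.jpg" "+*.jpeg" "+*.css" "+*.js" "+*.zip" "+*.rar" \
            "+*http://forum.notebookreview.com/members/*" \
            "+*http://forum.notebookreview.com/goto/*" \
            "-*?order=*" "-*?direction=*" "-*web.archive.org*"
    # -%L: start URLs from a file      -O: output/project path
    # -F: browser identity             -b0: do not accept cookies
    # -c1: one connection              -%c0.5: max connections per second (not sure the CLI takes 0.5; use 1 if not)
    # -m: max size for non-HTML files  trailing +/- patterns: scan rules (add the full set from above)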

    As for what URLs to give it in the first step, I first handed it the forum URL only, so it would download the pages with the thread lists, and from there I generated the thread URLs for that forum (see above). I stuffed those into a .txt file (attached two posts up) and used it for the "URL list" on the next run.

    If you give it a single thread URL (the first page of the thread), it should be able to pull the whole thread down (all of the pages) and nothing else that isn't required.

    When it's done I plan to do some post-processing: removing the JS that causes it to link back to the real NBR when it shouldn't (discussed above), and adding an info bar at the top of each page identifying this as a "mostly working" read-only archive. Then I'll post it on my web server. Once this site actually disappears, I'll add it to Google and DuckDuckGo so the content can be indexed and remain searchable.
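    The info bar will probably just be another sed pass, something like the below; untested, the banner markup and text are placeholders, and the <body> match may need adjusting to the actual page source. (The JS removal pass was sketched earlier in the thread.)
    Code:
    # rough sketch: inject a notice right after the opening <body> tag of every page
    # (banner markup/text is a placeholder; adjust the <body ...> match to the real pages)
    find . -name '*.html' -exec sed -i \
      's|<body\([^>]*\)>|<body\1><div style="background:#ffd;padding:6px;text-align:center;">Read-only archive of the NotebookReview forums (saved January 2022); some features will not work.</div>|' {} +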
     
    Last edited: Jan 20, 2022 at 2:06 PM