The Notebook Review forums were hosted by TechTarget, which shut them down on January 31, 2022. This static read-only archive was pulled by NBR forum users between January 20 and January 31, 2022, in an effort to make sure that the valuable technical information that had been posted on the forums is preserved. For current discussions, many NBR forum users moved over to NotebookTalk.net after the shutdown.

Forums closing at the end of January - Alternatives?

Discussion in 'Dell Latitude, Vostro, and Precision' started by mdsurveyor, Jan 18, 2022 at 10:10 AM.

  1. etern4l

    etern4l Notebook Virtuoso

    Reputations:
    2,911
    Messages:
    3,524
    Likes Received:
    3,442
    Trophy Points:
    331
    Very interesting, thanks for sharing. Will try to use it. Two things are not clear initially:
    1) Why did you bother excluding external domains?
    2) What is mime:application/foobar?
     
  2. Aaron44126

    Aaron44126 Notebook Prophet

    Reputations:
    874
    Messages:
    5,542
    Likes Received:
    2,038
    Trophy Points:
    331
    1. External domains are included. The only one that I excluded is web.archive.org (explained in post). It'll grab images and such from external domains: avatars, images embedded into posts, .zip files linked from posts, etc. If this is going to be a long-lasting archive, might as well make sure that stuff sticks around too.
    2. I dunno, WinHTTrack had that in there by default and I didn't touch it.
     
    etern4l likes this.
  3. etern4l

    Thanks, makes sense. That way parts of NBR might yet outlive imgur!

    I have a sed command which removes the JS - pretty straightforward, will post later.
     
  4. Aaron44126

    If grabbing Precision takes just a few days, I’ll try to grab the XPS section as well. That’s twice as large. I don’t think that I’ll have time for much else, but if anyone else grabs a section then I can help with hosting it.
     
    Reciever and etern4l like this.
  5. Aaron44126

    Adding these lines to capture:
    Code:
    +*http://forum.notebookreview.com/attachments/*/
    +*http://forum.notebookreview.com/attachments/*/*
    +*http://forum.notebookreview.com/media/*/
    +*http://forum.notebookreview.com/media/*/*
    -*http://forum.notebookreview.com/media/albums
    -*http://forum.notebookreview.com/media/*/albums
    -*http://forum.notebookreview.com/media/categories/*
    -*http://forum.notebookreview.com/media/*/report
    -*http://forum.notebookreview.com/media/*/like
    (Otherwise, high-quality versions of attachments do not get downloaded.)
     
    Last edited: Jan 20, 2022 at 2:05 PM
    Reciever and etern4l like this.
  6. etern4l

    I ran into an issue whereby the links get messed up despite the JS fix, when I specify a subforum to download.
    Luckily deleting the cache and restarting seems to have fixed things.
    The global forum exclusion route works OK though in conjunction with your member exclusions list.
    Setting a depth limit also helps control the process.
     
  7. Aaron44126

    The goto links (quote up arrow) are going to need post-processing. HTTrack doesn't handle them right, and clicking on one in the offline copy gives a 404 error. Rearranging the hashtag (fragment) part of the link would fix it, so a regex find-and-replace should do the job. (All of the 404 error HTML files that it leaves on the disk can be cleaned up too with a simple script.)
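The cleanup of leftover 404 pages mentioned above could look roughly like the following. This is only a sketch: the thread does not say what those error pages contain, so `ERROR_MARKER` is a hypothetical placeholder that has to be replaced with text found only on the error pages in your own mirror, and the function names are made up for illustration.

```python
from pathlib import Path

# Hypothetical marker: inspect one of the error pages in your own mirror and
# replace this with text that appears only on those pages.
ERROR_MARKER = "Error 404"


def find_error_pages(root, marker=ERROR_MARKER):
    """Return the HTML files under root whose text contains the marker."""
    hits = []
    for path in sorted(Path(root).rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if marker in text:
            hits.append(path)
    return hits


def remove_error_pages(root, marker=ERROR_MARKER, dry_run=True):
    """Delete matching files unless dry_run is set; return the matches."""
    hits = find_error_pages(root, marker)
    if not dry_run:
        for path in hits:
            path.unlink()
    return hits
```

Running it with `dry_run=True` first and reading the returned list is the safer route, since a real forum post that happens to quote the marker text would match as well.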

    Probably could use some post-processing on the attachments, too. HTTrack doesn't recognize the file type and names them all "index.html". You can get the file type from the name of the folder that each one is in.
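A rough sketch of that attachment renaming, for anyone who wants a starting point. It assumes the attachment folders are named like `<name>-<ext>.<id>` (for example `screenshot-png.12345`), which is a guess about the layout and not something confirmed in this thread; `target_name` and `rename_attachments` are illustrative names.

```python
import re
from pathlib import Path

# Assumed folder layout: "<name>-<ext>.<id>", e.g. "screenshot-png.12345".
FOLDER_RE = re.compile(r"^(?P<stem>.+)-(?P<ext>[A-Za-z0-9]+)\.(?P<id>\d+)$")


def target_name(folder_name):
    """Derive a file name such as 'screenshot.png' from the folder name."""
    m = FOLDER_RE.match(folder_name)
    if m is None:
        return None  # Folder does not follow the assumed pattern.
    return f"{m['stem']}.{m['ext'].lower()}"


def rename_attachments(attachments_root, dry_run=True):
    """Plan (and optionally perform) renames of */index.html files."""
    planned = []
    for index_file in sorted(Path(attachments_root).glob("*/index.html")):
        name = target_name(index_file.parent.name)
        if name is None:
            continue
        destination = index_file.with_name(name)
        planned.append((index_file, destination))
        if not dry_run:
            index_file.rename(destination)
    return planned
```

Note that renaming the files on disk does not update the links in the saved pages that still point at `index.html`; those would need a matching find-and-replace pass.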

    After running a small-scale test to make sure that it would actually finish, I have another update to the rules. A lot of the stuff at the bottom has to do with the ads, I think. I also added some more limits to the media section to keep it from spidering into other users' media, just keeping it limited to what is embedded in a thread (hopefully).
    Code:
    +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.zip +*.rar -ad.doubleclick.net/* -mime:application/foobar
    +*http://forum.notebookreview.com/goto/*
    +*http://forum.notebookreview.com/attachments/*/
    +*http://forum.notebookreview.com/attachments/*/*
    +*http://forum.notebookreview.com/media/*/
    +*http://forum.notebookreview.com/media/*/*
    -*http://forum.notebookreview.com/media/albums
    -*http://forum.notebookreview.com/media/albums/*
    -*http://forum.notebookreview.com/media/users
    -*http://forum.notebookreview.com/media/users/*
    -*http://forum.notebookreview.com/media/*/albums
    -*http://forum.notebookreview.com/media/categories/*
    -*http://forum.notebookreview.com/media/*/report
    -*http://forum.notebookreview.com/media/*/like
    -*?order=*
    -*?direction=
    -*?type=
    -*?container=
    -*members/*/trophies*
    -*members/*/following*
    -*members/*/followers*
    -*members/*/report*
    -*members/*/post*
    -*members/*/recent-activity*
    -*members/*/recent-content*
    -*members/*/reputation*
    -*web.archive.org*
    -*nanoWidget* -*brighttalk.com* -*userActions* -*adCarousel* -*cardsCarouselBox* -*displayAd* -*dynamicCarousel*
    -*embeddedArticles* -*flyThrough* -*gridCarousel* -*inPlayerWidget* -*leadForms* -*loadMore* -*outbrain* -*performanceMonitor* -*playableAd*
    -*popupDescription* -*publisherTools* -*readMore* -*readNext* -*refreshWidget* -*singleAnimationOnFeed*
    -*singleCardcarousel* -*skyLander* -*stackCard* -*stashRenderer* -*streamFeed* -*swipeLayout*
    -*topBox* -*userZapping* -*webVitals* -*widgetInjector* -*widgetWizard*
    I'm also giving up on downloading member profile pages. That spiders out too much with people following each other.
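To sanity-check a rule list like the one above before a multi-day crawl, a small emulator can help. The sketch below is not HTTrack's actual matcher. It assumes that only `*` is a wildcard, that a pattern must match the whole URL, and that when several rules match, the later one wins; `mime:` rules are skipped. All function names are illustrative.

```python
import re


def rule_to_regex(pattern):
    """Turn a scan-rule pattern into a regex; only '*' is a wildcard here."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts), re.DOTALL)


def parse_rules(text):
    """Parse whitespace-separated '+pattern' / '-pattern' tokens in order."""
    rules = []
    for token in text.split():
        if token[0] not in "+-":
            continue
        if token[1:].startswith("mime:"):
            continue  # MIME-type rules are outside the scope of this sketch.
        rules.append((token[0] == "+", rule_to_regex(token[1:])))
    return rules


def decide(url, rules, default=None):
    """Return True (include), False (exclude), or default if nothing matches.

    Assumption: the last matching rule decides the outcome.
    """
    verdict = default
    for allow, regex in rules:
        if regex.fullmatch(url):
            verdict = allow
    return verdict
```

For example, with the `media` rules from the post, a URL ending in `/like` matches both the `+.../media/*/*` rule and the later `-.../media/*/like` rule, so under these assumptions `decide` returns `False` for it.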
     
    Last edited: Jan 20, 2022 at 3:44 PM
    etern4l and Reciever like this.
  8. etern4l

    I guess a lot of these additions are only needed if one wishes to mirror external links. I kind of gave up on direct individual forum mirroring. I'm just capturing everything minus the forums I don't care about (almost all of them, actually). Restricted depth to 10. Hopefully the links will work.
    I just hope that any pages with messed-up links will get refreshed, as I am doing the full update. So far so good.
     
    Last edited: Jan 20, 2022 at 4:14 PM
  9. etern4l

    Blowing away the cache effectively causes it to re-download all HTML files. Ugh. Unfortunately, it was likely necessary since some pages ended up with broken navigation after attempts to download specific forums only.

    Nevertheless, all is back to normal. Here is the httrack --userdef-cmd which automatically removes the offending JS statement in each html file, if that helps (on Windows probably best to use the sed in msys2):
    Code:
    --userdef-cmd "sed -i 's/if (\_b \&\& \_b\.href \!\= \_bH) \_b\.href \= \_bH\;//g' \$0"
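For anyone without sed available, the same removal can be sketched in Python. The statement below is the sed pattern above with the backslash escapes dropped; the function names are illustrative, and how the script gets called for each downloaded file (for example from `--userdef-cmd`) is left to the reader.

```python
from pathlib import Path

# The JS statement targeted by the sed command, without the escapes.
OFFENDING_JS = b"if (_b && _b.href != _bH) _b.href = _bH;"


def strip_statement(data, statement=OFFENDING_JS):
    """Remove every occurrence of the statement from raw file bytes."""
    return data.replace(statement, b"")


def clean_file(path, statement=OFFENDING_JS):
    """Rewrite the file only if the statement was present; report changes."""
    path = Path(path)
    original = path.read_bytes()
    cleaned = strip_statement(original, statement)
    if cleaned == original:
        return False
    path.write_bytes(cleaned)
    return True
```

Working on bytes keeps the page encoding and line endings untouched, which matters when the script is run over a large mirror.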
    
    All looking good now, obviously very slow, but what gets fetched looks and behaves as expected... Globally excluding the undesirable forums seems to be working OK so far (in conjunction with the member exclusions).

    The other option I found useful:

    --purge-old=0
    Very important: with this set, narrowing the filters doesn't delete previously downloaded files that are no longer included.
     
    Last edited: Jan 20, 2022 at 10:03 PM
    Reciever likes this.
  10. etern4l

    There is a snag: can't see any sigs. Could it be one of the member exclusions? Will take a look at the code.

    Edit: well, that was silly. They don't show up unless the user is logged in... anyway, they are still there in index.html for each member...
     
    Last edited: Jan 20, 2022 at 8:27 PM