Subject: Re: OT: Bill Ackman
Wow, that's impressive!!!
For some reason (mostly because it was possible, I think, and I couldn't believe how easy it was), in mid-2010 I started systematically downloading everything from the first post on. They allowed anonymous HTTP requests, exposed the API right in the URL, and used a very predictable integer PK, and anyone who posted there with any regularity "grokked" how it worked.
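For anyone curious what that kind of sequential-ID crawl looks like, here is a minimal sketch in Python (not batch files, which is what the original job ran on). The host name, URL template, and directory layout are all made up for illustration; only the general idea (count up through integer post IDs over anonymous HTTP and save each page) comes from the description above.

```python
import time
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical URL pattern: the real board's URL scheme is not given here.
URL_TEMPLATE = "http://boards.example.com/Message.asp?mid={post_id}"


def post_url(post_id: int) -> str:
    """Build the page URL for one integer post ID."""
    return URL_TEMPLATE.format(post_id=post_id)


def output_path(root: Path, post_id: int, per_dir: int = 10_000) -> Path:
    """Spread files over subdirectories so no folder holds millions of files."""
    bucket = post_id // per_dir
    return root / f"{bucket:05d}" / f"{post_id}.html"


def fetch_range(first_id, last_id, root, delay=1.0, opener=urlopen):
    """Download posts first_id..last_id (inclusive); return how many were saved.

    Files that already exist are skipped, so an interrupted run can resume.
    `opener` is injectable so the loop can be exercised without a network.
    """
    saved = 0
    for post_id in range(first_id, last_id + 1):
        target = output_path(Path(root), post_id)
        if target.exists():
            continue
        try:
            with opener(post_url(post_id)) as response:
                body = response.read()
        except URLError:
            # Covers HTTPError too (it is a subclass); skip missing posts.
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(body)
        saved += 1
        time.sleep(delay)  # be polite to the server
    return saved
```

Something like `fetch_range(1, 1000, "posts")` would be the starting point; at one request per second, 18 million pages works out to roughly 200 days, which fits with "it took months."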
There were all kinds of hacks to that system. Do you remember when they botched an upgrade and accidentally allowed font and color changes? They spent hours chasing down all the violating posts. Somewhere I still have a reference to the original "Breakfast!" post, which hit BestOf in 24pt font (and consisted of one tiny pancake). They tried to delete that too, of course. But the reply page contained an entire copy of the OP, and a bug meant it never checked whether a post had been deleted before loading, so nothing was ever *really* deleted, just hidden. That was a fine time for me.
Anyway, I grabbed everything to date in 2010 -- ~1TB in ~18 million .html files (it took months!) -- and it sat on an external hard drive for over a decade. These AI coding tools came out right as I was starting to worry that background radiation would scramble my precious files, giving me both the motivation and the means to do something with them. And using this non-trivial data processing project to build and validate the platform for my wife's research project was a perfect fit. Message board posts, genetic data, they're *basically* the same, right? :D
I do deeply regret not figuring out how to do authenticated requests in batch files, because you *did* have to be logged in to post a reply, and so I could not see any "deleted" posts anonymously. So I left all of those behind. I have a little side project where I search the Internet Archive for those deleted posts, hoping to catch a snapshot taken after a post went up but before it was deleted. Can't say for sure that I've recovered any that way, but I *have* found many posts there. There's just not enough time in the day to look at everything.
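A rough sketch of how that kind of Internet Archive lookup could be scripted, using the Wayback Machine's CDX search endpoint to list captures of one page inside a time window. The board URL is a placeholder and the helper names are invented for this example. Since the deletion time usually isn't known, the window is a guess, and any capture that comes back still has to be opened and checked by hand.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url: str, from_ts: str, to_ts: str) -> str:
    """Build a CDX query for successful captures of page_url in a window.

    Timestamps are digit strings of 1 to 14 digits, e.g. "2010" or "20100601".
    """
    params = {
        "url": page_url,
        "from": from_ts,
        "to": to_ts,
        "output": "json",
        "filter": "statuscode:200",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload: str) -> list:
    """Turn CDX JSON (a header row followed by data rows) into dicts."""
    rows = json.loads(payload) if payload.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record: dict) -> str:
    """Address of the archived copy for one capture record."""
    return f"http://web.archive.org/web/{record['timestamp']}/{record['original']}"


def find_snapshots(page_url, from_ts, to_ts, opener=urlopen):
    """Query the archive and return capture records (needs network access)."""
    with opener(cdx_query_url(page_url, from_ts, to_ts)) as response:
        return parse_cdx_rows(response.read().decode("utf-8"))
```

Calling `find_snapshots` with a post's URL and a window starting at its post date returns zero or more capture records; `snapshot_url` turns each into a link to open and inspect.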