More scraping obstacles.

Matt told me I really need to get at least 500 posts’ worth of data to proceed with the data exploration, so I’ve been at it since Tuesday. It’s so frustrating: I thought I had my scraping code finalized on Saturday, and yet it hasn’t worked well enough to get me the complete data set. Here are some of the obstacles:

There are 10 posts on each page of features on 538, so to scrape 500 I need to go through 50 pages. The first time I tried to get them all, the code errored out on page 18. The culprit was a live-blog post on the Senate run-off in Georgia, which had no comments section. Small obstacle. I modified the code to skip live-blog posts, and now I’m scraping 51 pages so I can still get at least 500 data points.
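The skip itself is simple. Here’s a minimal sketch of the loop, not my actual notebook code: the URL pattern, the selector, the live-blog marker, and the scrape_post helper are all stand-ins for whatever my code and 538’s markup actually use.

```python
import requests
from bs4 import BeautifulSoup

def get_post_urls(page_url):
    """Hypothetical helper: collect the 10 post links on one page of features."""
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    # the selector here is an assumption about 538's markup
    return [a["href"] for a in soup.select("h2.article-title a")]

def scrape_pages(num_pages=51):
    posts = []
    for page in range(1, num_pages + 1):
        page_url = f"https://fivethirtyeight.com/features/page/{page}/"
        for post_url in get_post_urls(page_url):
            if "live-blog" in post_url:  # assumed marker for live-blog posts
                continue                 # skip: they have no comments section
            posts.append(scrape_post(post_url))  # hypothetical: scrapes one post
    return posts
```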

Then the BIG obstacle came when I tried to scrape again, and this time the code got an error on page 21. I didn’t realize how big of a problem this was or how long it was going to take me to fix it. I looked at each of the posts on page 21 and didn’t see anything out of the ordinary. I tried scraping just that page and it worked fine, so I started at the beginning again, this time printing each url that was getting scraped. I got the error on page 21 again, but this time I knew exactly which url caused it. I went there, clicked on the comments button, and the plugin opened with a blank box. So, no comments? I should’ve been suspicious that it took 21 pages to find a post with no comments. But I modified the code again to return 0 comments when there’s a blank box. I ran it again, just on page 21, and stopped it after 2 posts. I looked at the output file, and both posts had 0 comments! Hm, I thought. Then I clicked on a post from page 1 and clicked on the comments button. Empty box. So this was a bigger problem than I thought.
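That print-the-url trick is worth writing down. A minimal sketch, with post_urls and scrape_comments standing in for my actual list and comment-scraping function:

```python
# Print each url right before scraping it: when the error hits,
# the last url printed is the culprit.
comment_counts = []
for post_url in post_urls:                  # post_urls: the list collected earlier
    print("scraping:", post_url)
    comment_counts.append(scrape_comments(post_url))  # hypothetical scraper
```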

I started Googling around to find out what was wrong. I cleared my cache, restarted my browser, tried opening a url in a different browser (Edge), and tried opening a url in Incognito Mode. Still the same problem. At this point I was already sure I’d made too many requests and my IP address was getting blocked. I tried waiting, and eventually the comments came back on some urls, so I tried scraping again, and this time I got 502 posts with no errors.

I happily loaded the output file into my Data Exploration notebook and tried out some scripts I had written to draw bar graphs of the average and median number of comments for posts with different features. (To standardize the language I use to describe all of this, I’ve started calling the individual post types, authors, and tags “features,” since that’s what they are in statistics language; the headings Post type, Author(s), and Tags “attributes”; and the data points “posts.”) Then I saw the bars for the medians weren’t showing up. I checked my scripts, and they seemed fine. When I had used the 179 data points I’d gotten when the scraping stopped on page 18 (oh yeah, I figured out that since I wrote the code that creates the output file in a separate cell, I could still run it and build the file even when the scraping errored out), the graphs showed up fine. Then I looked at the output file and saw 0s under comments starting at the 50th post. I’d forgotten to remove the code that recorded a 0 whenever the comments wouldn’t load.
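For reference, those exploration scripts boil down to something like this sketch. It assumes the output file is a CSV called posts.csv with columns named Post type and comments, which are stand-ins for my actual file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("posts.csv")  # assumed name/format of the output file

# mean and median comments for each value of one attribute, e.g. Post type
stats = df.groupby("Post type")["comments"].agg(["mean", "median"])
stats.plot.bar()
plt.ylabel("number of comments")
plt.show()
```

In hindsight the missing median bars make sense: a median of 0 draws an invisible bar, and with 0s recorded from the 50th post onward, most of the values were 0, so the medians collapsed to 0 even while the averages didn’t.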

Around that time, Matt’s office hours started, so I asked him about it. He asked for one of the urls and said the comments loaded when he opened it. So we were both sure I was getting blocked. He suggested I put a sleep timer in the code so my requests don’t come in too fast. He said just 1 or 5 seconds, but I wasn’t sure that would be enough, since scraping the comments from one post already took 15-30 seconds. But I added a 10-second timer, removed the code that gave me the 0s, and tried again. This time it didn’t work, because when I deleted that code I deleted too much: specifically, the line that defines the variable the comment-scraping function returns. And I couldn’t remember what I had written! I did my best to rewrite it, then ran it again overnight.
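The timer itself is one line. A sketch of where it sits, again with post_urls and scrape_comments as stand-ins:

```python
import time

comment_counts = []
for post_url in post_urls:
    comment_counts.append(scrape_comments(post_url))  # hypothetical scraper
    time.sleep(10)  # wait 10 seconds before the next request
```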

At first I was excited this morning when I saw it gave all 502 posts, but when I opened the output file there was an “N” under comments for every post! I’m still not sure how that happened. I started fiddling with the comment-scraping function, adding print statements and Googling to make sure my syntax was correct, and then I started getting errors saying the html tag I searched for with BeautifulSoup, where the number of comments is located, was returning “None”. But I checked the urls, and the comments plugin was loading. I thought there must be something wrong with the code I’d had to rewrite. I realized there might be an older version of the notebook somewhere that I could revert to, so I tried that. Jupyter lets you revert to the most recent checkpoint, but mine was too recent. I looked in the checkpoints folder and only found the most recent checkpoint; it looks like Jupyter deletes all but the latest one. Then I thought: since it’s been a while since I pushed my notebook changes to the remote GitHub repository, there must be an old enough version there. I was right! It turns out only one line of code was missing, and I had been using the correct syntax. So now what?
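For what it’s worth, that “None” error is BeautifulSoup saying the tag isn’t in the html it actually received, even if the page looks fine in a browser. Here’s a sketch of the kind of guard that keeps it from crashing; the tag and class name are assumptions about the comments plugin’s markup, not the real selectors:

```python
from bs4 import BeautifulSoup

def extract_comment_count(html):
    soup = BeautifulSoup(html, "html.parser")
    # the tag and class are assumptions about the comments plugin's markup
    span = soup.find("span", class_="comment-count")
    if span is None:
        # the tag isn't in the html we received (e.g. a blocked or empty
        # response); return None so it stays distinguishable from a real 0
        return None
    return int(span.get_text(strip=True))
```

Returning None instead of 0 also keeps a blocked response distinguishable from a post that genuinely has no comments, which is exactly the mistake I made earlier.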

At this point the comments were loading on every url I tried, but my code wasn’t finding them, so I thought maybe there was something wrong with Selenium’s JavaScript executor function. But I couldn’t see what could be wrong, and once again I was led to the conclusion that I was still being blocked. So I started looking for free IP address rotators. I should’ve known this would be something I’d have to pay for. I found a free trial of a service called Webshare. I’m not sure how long the trial lasts, a week? Hopefully that’s all I’ll need. I signed up, it took me to the dashboard, and from there I didn’t know what to do. The dashboard said I had 10 free proxies from 4 different countries. Is that enough? Then I tried the scraping code again, this time adding a print statement that tells me the number of comments, and it’s working now! It’s currently on page 27. I think I finally fixed the problem.
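Wiring one of those proxies into Selenium looks roughly like this. Treat it as a sketch: the address is a placeholder (Webshare’s proxies may also need username/password authentication, which this doesn’t show), and the selector in the executor call is an assumption, not the plugin’s real markup:

```python
from selenium import webdriver

PROXY = "203.0.113.5:8080"  # placeholder address/port from the Webshare dashboard

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")  # route requests via the proxy
driver = webdriver.Chrome(options=options)

driver.get("https://fivethirtyeight.com/features/some-post/")  # placeholder url
# the JavaScript executor I mentioned; the selector is an assumption
n = driver.execute_script("return document.querySelectorAll('.comment').length;")
driver.quit()
```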

There was one other curious thing that happened. This morning I either tried to log in to Facebook or saw an email from them, I don’t remember which, but I found out my account had been locked. They cited suspicious activity because I had given them my phone number about a week ago. They also asked if I had created an app on the developer page (I had, back when I was trying to find a Facebook comments plugin scraper API; that ended up not working). I went through the steps to restore my account, and now it’s unlocked again. I’m wondering if all of that was related to me not being able to scrape. I’m not sure, but at this point I’m just glad my scraper works.

So, another long post, this time not really about coding, which is what I had intended this blog to be about, but I guess that’s OK. In the next couple of days I’m going to write about the trouble I had with a function I was trying to write yesterday, and about PCA, an unsupervised learning technique. So now this blog is also going to have posts about machine learning. Looking forward to it.

Written on March 30, 2023