Web Scraping Webpages That Require Login Info

Someone recently asked me how to go about web scraping a webpage that requires you to log in. The issue in these situations is that scraping the URL won’t give you access to the data, even if you are signed into the website in your browser. The solution is to set up your login info within R before scraping the desired URL.

Let’s take a simple example. Here is the player Approximate Value page from Pro-Football-Reference:

URL

Notice that the first 10 rows are not accessible to us because they are part of a subscription service. Also, if you were to navigate to the URL (as opposed to looking at my small screenshot), you would see only 20 rows of data, when the actual page (once logged in) has 200 rows!

Once I log into the account I can see all of the hidden info:

Let’s start by seeing what happens when I scrape this page now that I’m logged in:
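Here is a minimal sketch of that first attempt, assuming the rvest package. The `demo_page` object is a made-up, offline stand-in for the real page (with one blanked-out row, the way the paywalled rows appear), so the snippet runs without a network connection; the player name and value in it are invented for illustration.

```r
library(rvest)

# Made-up stand-in for the subscription page: the first data row is blank,
# mimicking the rows that are covered up for visitors who are not logged in.
demo_page <- minimal_html('
  <table>
    <tr><th>Player</th><th>AV</th></tr>
    <tr><td></td><td></td></tr>
    <tr><td>Player B</td><td>15</td></tr>
  </table>')

# Grab the first table on the page and convert it to a data frame.
av_table <- demo_page %>%
  html_element("table") %>%
  html_table()

# On the live site you would start from read_html(av_url) instead of demo_page,
# where av_url (a hypothetical name) holds the address of the page above.
av_table
```

The blank row comes back as part of the table, which is the same thing that happens with the covered-up rows on the real page.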



ACK!! It didn’t work. Even though I’m logged into the website, R doesn’t know it. It just reads the URL as is and thus only returns 20 rows of data with 10 of the rows being blank, since they are covered up on the website.

Solving the Problem

The way we will solve this problem is to get the URL for the login page and pass it to the session() function from the rvest package:
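As a rough sketch of this step: `login_url` below is a placeholder address, not the site's real login page, and `open_login_session()` is a hypothetical wrapper I've added so that no request is made until you call it.

```r
library(rvest)

# Placeholder address (an assumption): swap in the site's real login page.
login_url <- "https://example.com/login"

# Hypothetical wrapper around session(). session() requests the page and
# keeps the cookies it receives, so later requests reuse the same session.
open_login_session <- function(url) {
  session(url)
}

# Uncomment once login_url points at a real login page:
# login_session <- open_login_session(login_url)
# login_session  # printing the session shows its URL and response status
```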


Once you’ve loaded the login page, you need to grab the login form and fill it out with your username and password, like this:
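A minimal sketch of the form-filling step, using rvest's html_form() and html_form_set(). So that it runs offline, it parses a made-up login form rather than a live session; the field names `username` and `password`, the credentials, and the base URL are all assumptions, so print the form for your own site to see what its fields are actually called.

```r
library(rvest)

# Made-up stand-in for a login page. On a live site you would call
# html_form() on the session object created from the login URL instead.
login_page <- minimal_html('
  <form action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="submit" name="login" value="Log in">
  </form>')

# html_form() returns every form on the page; here the login form is the first.
login_form <- html_form(login_page, base_url = "https://example.com")[[1]]

# Argument names must match the field names that print(login_form) reports.
filled_form <- html_form_set(
  login_form,
  username = "my_username",  # placeholder credentials
  password = "my_password"
)

# The "button" tweak from the technical note is site-specific. I'm assuming it
# edits the field's name; check print(filled_form) for the right index first:
# filled_form$fields[[4]]$name <- "button"
```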

Technical Note: The last step, where you change the info in field 4 to “button”, is required; otherwise you will get an error in the next step. I’m not entirely sure why it is required, but after failing a few times and doing some googling, I found this to be an easy enough solution.

Once you have filled out the login form, you simply submit it to the website with the session_submit() function. Then repeat the web scraping process from the beginning, except that instead of passing the URL to read_html(), you use the session_jump_to() function, giving it the login session and the URL you are scraping:
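Put together, the submit-and-scrape step might look something like the sketch below. `scrape_behind_login()` is a hypothetical helper, and its arguments stand for the session, the filled-in form, and the data URL from the earlier steps. One assumption to flag: I jump from the session returned by session_submit(), which may differ slightly from jumping from the original login session.

```r
library(rvest)

# Hypothetical helper tying the steps together:
#   login_session - the object returned by session() for the login page
#   filled_form   - the login form after html_form_set()
#   data_url      - the page you actually want to scrape
scrape_behind_login <- function(login_session, filled_form, data_url) {
  # Submit the login form inside the session so the login cookies are kept
  logged_in <- session_submit(login_session, filled_form)

  # Navigate within the logged-in session, then scrape as before
  logged_in %>%
    session_jump_to(data_url) %>%
    read_html() %>%
    html_element("table") %>%
    html_table()
}

# Example call (all three arguments come from your own earlier steps):
# av_table <- scrape_behind_login(login_session, filled_form, av_url)
```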


Now, all the data on that page will be available to you:

Happy scraping!!