Author Archives: Patrick

Web Scraping Webpages that Require Login info

Someone recently asked me how to go about web scraping a webpage that requires you to log in. The issue in these situations is that scraping the URL won't give you access to the data, even if you are signed into the website in your browser. The solution is to set up your login info within R before scraping the desired URL.

Let’s take a simple example. Here is the player Approximate Value page from Pro-Football-Reference:

URL

Notice that the first 10 rows are not accessible to us because they are part of a subscription service. Also, if you navigate to the URL (as opposed to looking at my small screenshot), you'll see only 20 rows of data, when the actual page (once logged in) has 200 rows!

Once I log into the account I can see all of the hidden info:

Let’s start by seeing what happens when I scrape this page now that I’m logged in.
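
A minimal sketch of this first attempt might look like the following. It assumes the {rvest} package. The first_table() helper and the small stand-in page built with minimal_html() are made up for illustration so the example runs offline, and av_url is only a placeholder for the real page address.

```r
library(rvest)

# Hypothetical helper: return the first HTML table on a parsed page
first_table <- function(page) {
  html_table(page)[[1]]
}

# Stand-in page (made up) so the example runs without a network connection
demo_page <- minimal_html('
  <table>
    <tr><th>Player</th><th>AV</th></tr>
    <tr><td>Player A</td><td>20</td></tr>
    <tr><td>Player B</td><td>18</td></tr>
  </table>')

first_table(demo_page)

# Against the live site you would read the page straight from its address:
# av_page <- read_html(av_url)   # av_url: placeholder for the page's URL
# first_table(av_page)
```

Reading the page this way sends a fresh request from R, with none of the cookies your browser holds.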



ACK!! It didn’t work. Even though I’m logged into the website, R doesn’t know it. It just reads the URL as is and thus returns only 20 rows of data, with 10 of those rows blank because they are covered up on the website.

Solving the Problem

The way we will solve this problem is to get the URL for the login page and pass it to the session() function from the {rvest} package in R.
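
As a rough sketch of that step, assuming {rvest} version 1.0 or later: the login address below is a placeholder, not the site's real login page, and start_login_session() is a hypothetical wrapper so that nothing hits the network until you call it.

```r
library(rvest)

# Placeholder address: swap in the real login page of the site you are scraping
login_url <- "https://www.example.com/login"

# Hypothetical wrapper: session() opens a browser-like session that keeps
# cookies between requests, which is what lets a later login "stick"
start_login_session <- function(url) {
  session(url)
}

# Not run here because it needs a network connection:
# login_session <- start_login_session(login_url)
```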


Once you’ve loaded the login page, you need to grab the login form and fill it out with your username and password.
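
Here is a hedged sketch of filling out a form with {rvest}. The form is a made-up stand-in built with minimal_html() so the example runs offline. The field names (username, password), their order, and the credentials are all assumptions, so inspect the real form to see what it actually contains.

```r
library(rvest)

# Made-up login page so the example runs offline; with a live session you
# would call html_form() on the session object instead
login_page <- minimal_html('
  <form action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="hidden" name="token" value="abc">
    <input type="submit" name="login" value="Log in">
  </form>')

# Grab the first form on the page and check which fields it has
login_form <- html_form(login_page)[[1]]
names(login_form$fields)

# Fill in the credentials (placeholders), matching the form's field names
filled_form <- html_form_set(
  login_form,
  username = "my_username",
  password = "my_password"
)

# Workaround: set the type of field 4 to "button". Which field, if any,
# needs this depends on the real form.
filled_form$fields[[4]]$type <- "button"
```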

Technical Note: The last step, where you change the type of field 4 to “button”, is required; otherwise you will get an error in the next step. I’m not entirely sure why it is required, but after failing a few times and doing some googling, I found this to be an easy enough solution.

Once you have filled out the login form, you simply submit it to the website with the session_submit() function. Then repeat the web scraping process from the beginning, but instead of passing the URL to read_html(), use the session_jump_to() function and give it the session you created from the login page and the URL you are scraping.
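
Put together, the submit-and-scrape step could be sketched like this. It assumes {rvest} 1.0 or later, and scrape_after_login() with its argument names is a hypothetical helper. It expects a session created with session(), a form filled with html_form_set(), and the address of the protected page.

```r
library(rvest)

# Hypothetical helper: log in, move to the protected page within the same
# session, and return the first table found there
scrape_after_login <- function(login_session, filled_form, target_url) {
  # Send the filled-out form; the returned session carries the login cookies
  logged_in <- session_submit(login_session, filled_form)

  # Navigate to the page we actually want, still inside that session
  target_page <- session_jump_to(logged_in, target_url)

  # Parse the page held by the session and pull out its first table
  html_table(read_html(target_page))[[1]]
}

# Not run here because it needs a network connection and real credentials:
# av_table <- scrape_after_login(login_session, filled_form, av_url)
```

Wrapping the steps in a function keeps the login and the scrape in one session, so the second request is made as the logged-in user.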


Now, all the data on that page will be available to you:

Happy scraping!!

TidyX 95: Interactive RMarkdown Reports with DT and plotly

In addition to being our 95th episode, this one also marks 2 full years of TidyX! What an incredible journey, and thanks to all of those who continue to watch and support our work.

This week, Ellis Hughes and I discuss how you can make your RMarkdown reports interactive using HTML widgets. More specifically, we go through how to build data tables with the {DT} package and visualizations with the {plotly} package to make your reports come to life!

To watch the screen cast, CLICK HERE.

To access our code, CLICK HERE.

Issues with ‘Black Box’ Machine Learning Models in Injury Prediction

Injury prediction models developed using machine learning approaches have become more common due to the substantial rise of proprietary software in the sports science and sports medicine space. However, such ‘black box’ approaches are not without limitations. Aside from a lack of transparency, which prevents independent evaluation of model performance, these types of models are hard to interpret, making them difficult to use for practitioners who are required to make decisions about athlete health and plan interventions.

I recently had the pleasure of working on a paper headed up by Garrett Bullock and a list of wonderful co-authors where we discuss some of these issues:

Black Box Prediction Methods in Sports Medicine Deserve a Red Card for Reckless Practice: A Change of Tactics is Needed to Advance Athlete Care. Sports Med.

TidyX 94: RMarkdown Parameterized Reports

This week, Ellis Hughes and I discuss how you can add controls to your RMarkdown reports by setting parameters within the YAML.

If you have a custom report that you need to reproduce frequently, changing different groups or pieces of information, parameterized reports are a great way to save time and ensure reproducibility.

For example, say you work for an NBA team and the head coach wants to see a team report on the Miami Heat, Dallas Mavericks, and Phoenix Suns. Rather than changing the contents within the RMarkdown itself (copying and pasting the new team name, seasons, weeks of year, etc.), which opens you up to making errors, you can set the specific parameters that you want to control within the YAML. When you Knit the document with those parameters, you can make the changes you need (i.e., select the team, season, and weeks of the year) and the report will be produced with the desired info.
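
To make the idea concrete, here is a small sketch rather than the code from the episode. It writes a tiny parameterized report to a temporary file and defines a helper that renders it for one team. The file name, the parameter names (team, season), their default values, and render_team_report() are all made up for illustration. Rendering needs the {rmarkdown} package and pandoc, so those calls are left commented out.

```r
# A tiny parameterized report; the params block in the YAML sets the defaults
rmd_lines <- c(
  "---",
  'title: "Team Report"',
  "output: html_document",
  "params:",
  '  team: "Miami Heat"',
  "  season: 2022",
  "---",
  "",
  "This report covers the `r params$team` for the `r params$season` season."
)

rmd_path <- file.path(tempdir(), "team_report.Rmd")
writeLines(rmd_lines, rmd_path)

teams <- c("Miami Heat", "Dallas Mavericks", "Phoenix Suns")

# Hypothetical helper: render the same report for one team by overriding
# the default value of the `team` parameter
render_team_report <- function(team) {
  rmarkdown::render(
    rmd_path,
    params = list(team = team),
    output_file = paste0(gsub(" ", "_", team), ".html"),
    quiet = TRUE
  )
}

# Not run here (requires rmarkdown and pandoc):
# for (team in teams) render_team_report(team)
```

Inside the report, the chosen values are available as params$team and params$season, so the body of the document never has to be edited by hand.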

To watch our screen cast, CLICK HERE.

To access our code, CLICK HERE.