SQL Saturday statistics – Web Scraping with R and SQL Server

By tomaztsql

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

I wanted to check a simple query: How many times has a particular topic been presented and from how many different presenters.

Sounds interesting, tackling the problem should not be a problem, just that the end numbers may vary, since there will be some text analysis included.

First of all, some web scraping and getting the information from Sqlsaturday web page. Reading the information from the website, and with R/Python integration into SQL Server, this is fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

#URL to schedule
 url_schedule 

Python offers Beautifulsoup library that will do pretty much the same (or even better) job as rvest and XML packages combined. Nevertheless, once we have the data from a test page out (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because, it is awesome), we can “walk though” the whole web page and generate all the needed information.

SQLSaturday website has every event enumerated, making it very easy to parametrize the web scrapping process:

So we will scrape through last 100 events, by simply incrementing the integer of the event; so input parameter will be parsed as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of the fact if the website functions or not. Results will be returned back to the SQL Server database.

Creating stored procedure will go the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule %
 read_html()

# Event schedule
 schedule_info 

Before you run this, just a little environement setup:

USE [master];
GO

CREATE DATABASe SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

There you go! Now you can run a stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and web scrape data from SQLSaturday.com website instantly.

For Slovenian SQLSaturday, I get the following sessions and speakers list:

2017-11-13 19_19_46-49_blog_post.sql - SICN-KASTRUN.SQLSaturday (SPAR_si01017988 (57))_ - Microsoft .png

Please note that you are running this code behind the firewall and proxy, so some additional changes for the proxy or firewall might be needed!

So going to original question, how many times has the query store been presented on SQL Saturdays (from SQLSat600 until SqlSat690), here is the frequency table:

2017-11-13 19_57_04-Statistics_on_web_scraping_results.sql - SICN-KASTRUN.SQLSaturday (SPAR_si010179

Or presented with pandas graph:

session_stats

Query store is popular, beyond all R, Python or Azure ML topics, but Powershell is gaining its popularity like crazy. Good work PowerShell people! 🙂

As always, code is available at Github.

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.