How to Use googlesheets to Connect R to Google Sheets

By Rob Grant

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Often I use R to handle large datasets, analyze the data and filter out the data I don’t need.

When all this is done, I usually use write.csv() to write my data out and reopen it in Google Sheets.

My workflow would look something like this:

# a sketch of the old workflow (file and column names are placeholders)
full_data <- read.csv("full_data.csv")
filtered_data <- subset(full_data, keep_me)  # analysis and filtering happen here
write.csv(filtered_data, "filtered_data.csv", row.names = FALSE)
# ...then manually re-upload filtered_data.csv to Google Sheets

However, there's an R package that provides a bridge between your Google account and your R environment: googlesheets.

Using this package we can read data directly from Google, modify it and create new files within our Google Drive.

Step 1: Install googlesheets

install.packages("googlesheets")
library(googlesheets)

Step 2: Authenticate your Google account

Before we can do anything we need to allow googlesheets to access our account.
We can do this by running:

gs_auth(new_user = TRUE)

Have a browser open (Google Chrome worked for me) and it should open a new tab asking you to connect via an account:

Click on an account below this message and then ‘allow’ and it should take you to a page saying it has worked and to go back to R.

You can rerun this command any time you want to change accounts.

If you don’t use the token for a while it will expire and need to be refreshed. The refresh happens automatically the next time you run a command that connects to the Google API (i.e. any of the specialised googlesheets functions).
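
You can check which account your current token belongs to with gs_user(), which prints details about the authenticated user:

gs_user()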

Step 3: See what’s in your Google Account

Calling the function gs_ls() will show you spreadsheets in your account.

gs_ls()
# A tibble: 15 x 10
 sheet_title author perm version updated sheet_key
 1 for googlesheets rforjournali… rw new 2017-12-11 09:44:54 1Y0WCfTW…
 2 Avon and Somerset Septe… rforjournali… rw new 2017-11-19 12:46:55 1TfC5Fs6…
 3 Mid year 2015 UK popula… rforjournali… rw new 2017-10-25 21:19:52 1Vqg560s…
 4 Cleveland 2016-09 rforjournali… rw new 2016-11-26 10:18:16 19xBr8nU…
 5 Rankings of US presiden… rforjournali… rw new 2016-11-08 04:39:55 11PZxq7y…
 6 Tennis #1s rforjournali… rw new 2016-11-06 20:04:42 1Riz8GRs…
 7 Young persons railcard rob.grant rw new 2016-11-06 13:05:55 1XZsjJxu…
 8 Copy of Young persons r… rforjournali… rw new 2016-11-05 18:14:38 1oUpRS-D…
 9 defective rforjournali… rw new 2016-11-05 11:40:30 1jWZBILC…
10 Asylum rforjournali… rw new 2016-10-27 19:04:05 1CRMl2_1…
11 Buses rforjournali… rw new 2016-10-24 20:07:41 1qy9Z-sn…
12 Untitled spreadsheet rforjournali… rw new 2016-10-24 19:22:42 1_f_FI5n…
13 Population rforjournali… rw new 2016-10-24 18:29:17 1rrOQuV5…
14 Drugs rforjournali… rw new 2016-10-18 21:37:29 1UTsnGM6…
15 Food rforjournali… rw new 2016-10-15 13:24:30 1aWEAPR4…

Step 4: Read a spreadsheet

I am going to select the first spreadsheet ‘for googlesheets’ by its title. It’s a selection of 50 random numbers between 0 and 1 (you can recreate this with runif() in R).

for_gs <- gs_title("for googlesheets")

You can also locate the sheet by the key (the letters, numbers and characters after the /d/ in the URL) for the same result

for_gs <- gs_key("1Y0WCfTW...")  # key truncated here, as in the listing above; use your sheet's full key

This gives us a registered sheet object, which we can read into a data frame using the gs_read() command.

for_gs_sheet <- gs_read(for_gs)
str(for_gs_sheet)

Classes 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 2 variables:
 $ Number.x: num 0.4696 0.1587 0.0949 0.1823 0.0885 ...
 $ Number.y: num 0.67551 0.7041 0.00167 0.51302 0.20114 ...
 - attr(*, "spec")=List of 2
 ..$ cols :List of 2
 .. ..$ Number.x: list()
 .. .. ..- attr(*, "class")= chr "collector_double" "collector"
 .. ..$ Number.y: list()
 .. .. ..- attr(*, "class")= chr "collector_double" "collector"
 ..$ default: list()
 .. ..- attr(*, "class")= chr "collector_guess" "collector"
 ..- attr(*, "class")= chr "col_spec"

Step 5: Modify the spreadsheet

Next up, we modify our spreadsheet using the gs_edit_cells() function.

This function has several arguments that we need to employ to edit our spreadsheet properly.

gs_edit_cells(for_gs, ws = "Sheet1", anchor = "A2", input = c(1,2), byrow = TRUE)

The ws argument refers to the sheet name in the spreadsheet. The anchor argument refers to the cell from which the modification will begin. In my example, I am editing two cells, where the first one will be the anchor cell A2. The byrow argument indicates that the modification will apply horizontally (change to FALSE for vertical editing).

Note that this won’t change our data frame for_gs_sheet that is based on this spreadsheet; just the spreadsheet itself.

Cell A2 now has a value of 1. A3 is 2.
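
If you want for_gs_sheet to pick up the edit, just read the sheet again:

for_gs_sheet <- gs_read(for_gs)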

Step 6: Create a Google Sheets file using R

We can create new spreadsheets from R using gs_new().

We’ll use the mtcars dataset as a test:

gs_new(title = "mtcars", ws_title = "first_sheet", input = mtcars)

It worked, except it didn’t include the rownames, which contain the car names.

That doesn’t matter, we can add them using gs_edit_cells(), changing the byrow argument to FALSE this time.

#register the new mtcars sheet in R
mtcars_sheet <- gs_title("mtcars")

#insert the rownames vertically in column L
gs_edit_cells(mtcars_sheet, ws = "first_sheet", anchor = "L2", input = rownames(mtcars), byrow = FALSE)

Final thoughts

That was a quick overview of the most basic functions of the googlesheets package.

This is a really useful package. A lot of my work involves reading data in Google Sheets either before or after using R.

Googlesheets means I won’t have to bother with read.csv() or write.csv() as much in the future, saving me time.

So thanks to Jenny Bryan for creating it!


    R⁶ — Capture Tweets with tweet_shot()

    By hrbrmstr

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)
    (You can find all R⁶ posts here)

    A Twitter discussion:

    I’m going to keep my eyes out for this one! Would love to have an easy way to embed tweets in Rmd talks!

    — Jeff Hollister (@jhollist) December 30, 2017

    that spawned from Maëlle’s recent look-back post turned into a quick function for capturing an image of a Tweet/thread using webshot, rtweet, magick and glue.

    Pass in a status id or a twitter URL and the function will grab an image of the mobile version of the tweet.

    The ultimate goal is to make a function that builds a tweet using only R and magick. This will have to do until the new year.

    Only fragments of the function survived extraction, so here is a minimal sketch of the idea rather than hrbrmstr’s exact code (the URL handling and helper calls are assumptions):

    tweet_shot <- function(statusid_or_url, zoom = 3) {
      x <- statusid_or_url
      # a bare status id is expanded into a full twitter.com URL via rtweet
      if (!grepl("^http", x)) {
        tw <- rtweet::lookup_statuses(x)
        x <- glue::glue("https://twitter.com/{tw$screen_name}/status/{tw$status_id}")
      }
      # capture the mobile rendering of the tweet, then read it in with magick
      x <- sub("://twitter.com", "://mobile.twitter.com", x, fixed = TRUE)
      png_file <- tempfile(fileext = ".png")
      webshot::webshot(url = x, file = png_file, zoom = zoom)
      magick::image_read(png_file)
    }

    Now just do one of these:

    tweet_shot("947082036019388416")
    tweet_shot("https://twitter.com/jhollist/status/947082036019388416")

    to get:


    Five tips to improve your R code

    By Simon Jackson


    (This article was first published on blogR, and kindly contributed to R-bloggers)

    @drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

    This post was originally published on DataCamp’s community as one of their top 10 articles in 2017

    1. More fun to sequence from 1

    Next time you use the colon operator to create a sequence from 1 like 1:n, try seq().

    # Sequence a vector (the original x was lost in extraction; any length-10 vector works)
    x <- runif(10)
    seq(x)
    #>  [1]  1  2  3  4  5  6  7  8  9 10
    
    # Sequence an integer
    seq(nrow(mtcars))
    #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    #> [24] 24 25 26 27 28 29 30 31 32
    

    The colon operator can produce unexpected results that can create all sorts of problems without you noticing! Take a look at what happens when you want to sequence the length of an empty vector:

    # Empty vector
    x <- c()
    
    # the colon operator happily counts down: 1:length(x) is 1:0
    1:length(x)
    #> [1] 1 0
    
    seq(x)
    #> integer(0)
    

    You’ll also notice that this saves you from using functions like length(). When applied to an object of a certain length, seq() will automatically create a sequence from 1 to the length of the object.
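
    Two related base helpers, seq_along() and seq_len(), make the intent explicit and avoid the 1:0 trap entirely:

    seq_along(c())  # sequence along an object, even an empty one
    #> integer(0)
    seq_len(0)      # sequence of a given length
    #> integer(0)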

    2. vector() what you c()

    Next time you create an empty vector with c(), try to replace it with vector("type", length).

    # A numeric vector with 5 elements
    vector("numeric", 5)
    #> [1] 0 0 0 0 0
    
    # A character vector with 3 elements
    vector("character", 3)
    #> [1] "" "" ""
    

    Doing this improves memory usage and increases speed! You often know upfront what type of values will go into a vector, and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

    A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c() like this:

    x <- c()
    for (i in seq(5)) {
      x <- c(x, i)
      cat("x at step", i, ":", paste(x, collapse = ", "), "\n")
    }
    #> x at step 1 : 1
    #> x at step 2 : 1, 2
    #> x at step 3 : 1, 2, 3
    #> x at step 4 : 1, 2, 3, 4
    #> x at step 5 : 1, 2, 3, 4, 5
    

    Instead, pre-define the type and length with vector(), and reference positions by index, like this:

    n <- 5
    x <- vector("numeric", n)
    for (i in seq(n)) {
      x[i] <- i
      cat("x at step", i, ":", paste(x, collapse = ", "), "\n")
    }
    #> x at step 1 : 1, 0, 0, 0, 0
    #> x at step 2 : 1, 2, 0, 0, 0
    #> x at step 3 : 1, 2, 3, 0, 0
    #> x at step 4 : 1, 2, 3, 4, 0
    #> x at step 5 : 1, 2, 3, 4, 5
    

    Here’s a quick speed comparison:

    n <- 1e5
    
    # growing with c() (this variable name is an assumption; the original was lost)
    x_empty <- c()
    system.time(for (i in seq(n)) x_empty <- c(x_empty, i))
    #>    user  system elapsed 
    #>  16.147   2.402  20.158
    
    # filling a pre-allocated vector
    x_zeros <- vector("numeric", n)
    system.time(for (i in seq(n)) x_zeros[i] <- i)
    #>    user  system elapsed 
    #>   0.008   0.000   0.009
    

    That should be convincing enough!

    3. Ditch the which()

    Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

    Getting vector elements greater than 5:

    x <- c(1, 3, 5, 6, 7)  # example values chosen to match the outputs below
    
    # Using which
    x[which(x > 5)]
    #> [1] 6 7
    
    # No which
    x[x > 5]
    #> [1] 6 7
    

    Or counting number of values greater than 5:

    # Using which
    length(which(x > 5))
    #> [1] 2
    
    # Without which
    sum(x > 5)
    #> [1] 2
    

    Why should you ditch which()? It’s often unnecessary and boolean vectors are all you need.

    For example, R lets you select elements flagged as TRUE in a boolean vector:

    condition <- x > 5
    condition
    #> [1] FALSE FALSE FALSE  TRUE  TRUE
    x[condition]
    #> [1] 6 7
    

    Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

    sum(condition)
    #> [1] 2
    mean(condition)
    #> [1] 0.4
    

    which() tells you the indices of TRUE values:

    which(condition)
    #> [1] 4 5
    

    And while the results are not wrong, it’s just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():

    x <- c(4, 11, 7)  # example values (assumed; any vector with a value over 10 works)
    
    # Using `which()` and `length()` to test if any values are greater than 10
    if (length(which(x > 10)) > 0)
      print("At least one value is greater than 10")
    #> [1] "At least one value is greater than 10"
    
    # Wrapping a boolean vector with `any()`
    if (any(x > 10))
      print("At least one value is greater than 10")
    #> [1] "At least one value is greater than 10"
    
    # Using `which()` and `length()` to test if all values are positive
    if (length(which(x > 0)) == length(x))
      print("All values are positive")
    #> [1] "All values are positive"
    
    # Wrapping a boolean vector with `all()`
    if (all(x > 0))
      print("All values are positive")
    #> [1] "All values are positive"
    

    Oh, and it saves you a little time…

    x <- runif(1e8)  # a big vector (size assumed)
    
    system.time(x[which(x > .5)])
    #>    user  system elapsed 
    #>   1.245   0.486   1.856
    
    system.time(x[x > .5])
    #>    user  system elapsed 
    #>   1.085   0.395   1.541
    

    4. factor that factor!

    Ever removed values from a factor and found you’re stuck with old levels that don’t exist anymore? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

    This example creates a factor with four levels ("a", "b", "c" and "d"):

    # A factor with four levels
    x <- factor(c("a", "b", "c", "d"))
    x
    #> [1] a b c d
    #> Levels: a b c d
    
    plot(x)
    

    If you drop all cases of one level ("d"), the level is still recorded in the factor:

    # Drop all values for one level
    x <- x[x != "d"]
    x
    #> [1] a b c
    #> Levels: a b c d
    
    plot(x)
    

    A super simple method for removing it is to use factor() again:

    x <- factor(x)
    x
    #> [1] a b c
    #> Levels: a b c
    
    plot(x)
    


    This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!

    Aside, thanks to Amy Szczepanski who contacted me after the original publication of this article and mentioned droplevels(). Check it out if this is a problem for you!
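
    For reference, a quick sketch of droplevels() doing the same job:

    x <- factor(c("a", "b", "c", "d"))
    x <- droplevels(x[x != "d"])
    x
    #> [1] a b c
    #> Levels: a b c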

    5. First you get the $, then you get the power

    Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

    Examples

    Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these:

    # rows first, column second - not ideal
    mtcars[mtcars$cyl == 4, ]$hp
    #>  [1]  93  62  95  66  52  65  97  66  91 113 109
    
    # column first, rows second - much better
    mtcars$hp[mtcars$cyl == 4]
    #>  [1]  93  62  95  66  52  65  97  66  91 113 109
    

    The tip here is to use the second approach.

    But why is that?

    First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4,]$hp. When you specify column first, this means that you’re now referring to a vector, and don’t need the comma!

    Second reason: speed! Let’s test it out on a larger data frame:

    # Simulate a data frame...
    n <- 1e7  # size assumed; the original value was lost
    d <- data.frame(a = rnorm(n), b = runif(n))
    
    # rows first, column second - not ideal
    system.time(d[d$b > .5, ]$a)
    #>    user  system elapsed 
    #>   0.559   0.152   0.758
    
    # column first, rows second - much better
    system.time(d$a[d$b > .5])
    #>    user  system elapsed 
    #>   0.093   0.013   0.107
    

    Worth it, right?

    Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website or really learn the ropes with online courses like DataCamp’s Data Manipulation in R with dplyr.

    Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.


    Looking back in 2017 and plans for 2018

    By Marcelo S. Perlin

    (This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

    As we come close to the end of 2017, it's time to look back. This has
    been a great year for me in many ways. This blog started as a way to
    write short pieces about using R for finance and promote my
    book in an organic way.
    Today, I’m very happy with my decision. Discovering and trying new
    writing styles keeps my interest very alive. Academic research is very
    strict on what you can write and publish. It is satisfying to see that I
    can promote my work and have an impact in different ways, not only
    through the publication of academic papers.

    My blog is built using a Jekyll
    template, meaning the whole
    site, including individual posts, is built and controlled with editable
    text files and Github. All files related to posts follow the same
    structure, meaning I can easily gather the textual data and organize it
    in a nice tibble. Let's first have a look at all the post files:

    # list all post files (folder path is an assumption; the original was lost)
    post.folder <- '_posts/'
    post.files <- list.files(post.folder)
    head(post.files)

    I posted 26 posts during 2017. Notice how all dates are at the beginning
    of the file name. I can easily convert that to a Date object using
    as.Date, as in the quick example below. Let's then organize it all in a nice tibble.
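
    For instance (file name invented for illustration):

    as.Date(substr('2017-01-15-my-first-post.md', 1, 10))

    ## [1] "2017-01-15"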

    library(tidyverse)
    
    ## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
    
    ## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
    ## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
    ## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
    ## ✔ readr   1.1.1     ✔ forcats 0.2.0
    
    ## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    
    # a reconstruction: the original pipeline was garbled in extraction
    df.posts <- tibble(ref.date = as.Date(substr(post.files, 1, 10)),
                       ref.month = format(ref.date, '%m'),
                       content = map_chr(file.path(post.folder, post.files),
                                         ~ paste(readLines(.x), collapse = '\n')),
                       char.length = nchar(content)) %>%  # includes output code in length calculation..
      filter(ref.date > as.Date('2017-01-01'),
             ref.date < as.Date('2018-01-01'))
    
    glimpse(df.posts)
    
    ## $ ref.date     <date> 2017-01-15, 2017-01-16, 2017-01-17, 2017-01-18, 2...
    ## $ ref.month    <chr> "01", "01", "01", "01", "01", "01", "02", "02", "0...
    ## $ content      <chr> "---\nlayout: post\ntitle: \"My first post!\"\nsub...
    ## $ char.length  <int> 1734, 5833, 6632, 17265, 23414, 12974, 18899, 1779...
    

    First, let's look at the frequency of posts by month:

    print( ggplot(df.posts, aes(x = ref.month)) + geom_histogram(stat='count')) 
    
    ## Warning: Ignoring unknown parameters: binwidth, bins, pad
    

    It is not accidental that January was the month with the highest number
    of posts. This is when I had material reserved for the book. June and
    July (0!) were the worst months as I traveled a lot. In June I attended
    R and Finance in Chicago and SER in Rio de Janeiro, and in July I was
    visiting Goethe University in Germany for the whole month. On average, I
    created 2.1666667 posts per month overall, which feels quite alright. I
    hope I can keep that pace in the upcoming years.

    As for the length of posts, below we can see a nice pattern in their
    distribution conditional on the month of the year.

    print(ggplot(df.posts, aes(x=ref.month, y = char.length)) + geom_boxplot())
    

    I was not very productive from May to August, writing fewer and shorter
    posts compared to other months. This was probably due to my
    travels.

    Plans for 2018

    Besides the usual effort in research and teaching, my plans for 2018
    are:

    • Work on the second edition of the Portuguese
      book. It significantly
      lags the English version in content and this needs to be fixed. I
      already have some ideas laid out for new chapters and new packages
      to cover. I'll write more about this update as soon as I have it
      figured out.

    • Start a portal for financial data in Brazil. I want to make it
      easy for people to visualize and download organized financial data,
      especially those without programming experience. It will include the
      usual datasets such as prices in equity/bond/derivative markets for
      various frequencies, historical yield curves, financial statements
      of companies, and so on. The idea is to offer the datasets in
      various file formats, facilitating their use in research.

    That's it. If you got this far, happy New Year! Enjoy your family and the
    holidays!


    2017. Quantified. In. R.

    By hrbrmstr

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    2017 is nearly at an end. We humans seem to need these cycles to help us on our path forward and have, throughout history, used these annual demarcation points as a time of reflection on what was, what is and what shall come next.

    To that end, I decided it was about time to help quantify a part of the soon-to-be previous annum in R through the fabrication of a reusable template. Said template contains various incantations that will enable the wielder to enumerate their social contributions on:

    • StackOverflow
    • GitHub
    • Twitter
    • WordPress

    through the use of a parameterized R markdown document.
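
    A parameterized report like this boils down to a single rmarkdown::render() call with a params list (the file and parameter names below are made up for illustration):

    rmarkdown::render("2017-quantified.Rmd",
                      params = list(twitter_user = "hrbrmstr",
                                    github_user = "hrbrmstr"))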

    The result of one such execution can be found here (for those who want a glimpse of what I was publicly up to in 2017).

    Want to see where you contributed the most on SO? There’s a vis for that:

    What about your GitHub activity? There’s a vis for that, too:

    Perhaps you just want to see your top blog posts for the year. There’s also a vis for that:

    Or — maybe — you just want to see how much you blathered on Twitter. There’s even a vis for that:

    Take the Rmd for a spin. File issues & PRs for things that need work and take some time to look back on 2017 with a more quantified eye than you may have in years past.

    Here’s to 2018 being full of magic, awe, wonder and delight for us all!


    My #Best9of2017 tweets

    By Maëlle Salmon

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    You’ve probably seen people posting their #Best9of2017, primarily on Instagram I’d say. I’m not an Instagram user, although I do have an account to spy on my younger sister and cousins, so I don’t even have 9 Instagram posts in total but I do love the collage people get to show off… So what about my best 9 tweets of 2017?

    Get my 9 best tweets by number of likes

    I first wanted to use rtweet::get_timeline but it only returned tweets going back to July, even when using include_rts = FALSE, so I downloaded my analytics files from the Twitter website, one per quarter.

    my_files <- c("tweet_activity_metrics_ma_salmon_20170101_20170402_en.csv",
                  "tweet_activity_metrics_ma_salmon_20170402_20170702_en.csv",
                  "tweet_activity_metrics_ma_salmon_20170702_20171001_en.csv",
                  "tweet_activity_metrics_ma_salmon_20171001_20171231_en.csv")
    paths <- paste0("data/", my_files)
    # read them all at once
    my_tweets <- purrr::map_df(paths, readr::read_csv)
    # just in case I got some data ranges wrong
    my_tweets <- unique(my_tweets)
    # get the top 9!
    my_tweets <- dplyr::arrange(my_tweets, - likes)
    my_tweets <- janitor::clean_names(my_tweets)
    
    best9 <- my_tweets$tweet_permalink[1:9]
    

    My husband advised me to use something more elaborate than the number of likes, which is a wise idea, but I was happy with that simple method.

    Take screenshots and paste them

    There's a great R package for taking screenshots from R, webshot. I was a bit annoyed at the "Follow" button appearing, but I did not want to have to write Javascript code to log in first in the hope of making that thing disappear. I tried using a CSS selector instead of a rectangle, but I was less satisfied with the result. An obvious problem here is that, contrary to Instagram images, tweets have different heights depending on the text length and on the size of the optional attached media... It's a bit sad, but not too sad: my collage will still give a look at my Twitter 2017.

    library("magrittr")
    
    save_one_file <- function(url, name){
      filename <- paste0(name, ".png")
      # save and output filename
      webshot::webshot(url, filename,
                       cliprect = c(0, 150, 750, 750))
      filename
    }
    
    files <- purrr::map2_chr(best9, 1:9, save_one_file)
    

    Regarding the collage part using magick, I used my “Faces of R” post as a reference, which is funny since it features in my top 9 tweets.

    no_rows <- 3
    no_cols <- 3
    
    make_column <- function(i, files, no_rows){
      filename <- paste0("col", i, ".jpg")
    
      magick::image_read(files[(i*no_rows+1):((i+1)*no_rows)]) %>%
        magick::image_border("salmon", "20x20") %>%
        magick::image_append(stack = TRUE) %>%
        magick::image_write(filename)
      
      filename
    }
    
    purrr::map_chr(0:(no_cols-1), make_column, files = files,
        no_rows = no_rows) %>%
      magick::image_read() %>%
    magick::image_append(stack = FALSE) %>%
      magick::image_border("salmon", "20x20") %>%
      magick::image_write("2017-12-30-best9of2017.jpg")
    

    And since I'm well behaved, I clean up after myself.
    
    # clean up
    file.remove(files)
    file.remove(paste0("col", 1:3, ".jpg"))
    

    So, apart from fun blog posts (Where to live in the US, Faces of #rstats Twitter, A plot against the CatterPlots complot) and more serious ones (Automatic tools for improving R packages, How to develop good R packages (for open science), Where have you been? Getting my Github activity), my top 9 tweets include a good ggplot2 tip (this website) and above all, the birth of baby Émile! Why have I only included two heart emojis?

    What about your 2017?

    I've now read a few awesome posts from other R bloggers, off the top of my head:

    What’s your take on your 2017, dear reader? In any case, I wish you a happy New Year!


    The first step to becoming a top performing data scientist

    By Sharp Sight

    (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

    photo by https://unsplash.com/@joshuaearle

    Nearly every day, I see a new article talking about the benefits of data: “data will change the world” … “data is transforming business” … “data is the new oil.”

    Setting aside the hyperbolic language, most of this is true.

    So when you hear that “data scientist is the sexiest job of the 21st century,” you should mostly believe it. Companies are fighting to hire great data scientists.

    But there’s a catch.

    Even though there’s a huge demand for data scientists, a lot of people who study data science still can’t get jobs.

    I regularly hear from young data science students who tell me that they can’t get a job. Or they can get a “job,” but it’s actually an unpaid internship.

    What’s going on here?

    The dirty little secret is that companies are desperate for highly skilled data scientists.

    Companies want data scientists that are great at what they do. They want people who create more value than they cost in terms of salary.

    What this means is that to get a data science job, you actually need to be able to “get things done.”

    … and if you want a highly-paid data job, you need to be a top performer.

    I can’t stress this enough: if you want to get a great data science job, certificates aren’t enough. You need to become a top performer.

    Your first steps towards becoming a top performer

    Your first step towards becoming a top-performing data scientist is mastering the foundations:

    • data visualization
    • data manipulation
    • exploratory data analysis

    Have you mastered these? Have you memorized the syntax to accomplish these? Are you “fluent” in the foundations?

    If not, you need to go back and practice. Believe me. You’ll thank me later. (You’re welcome.)

    The reason is that these skills are used in almost every part of the data science workflow, particularly at earlier parts of your career.

    Given almost any data task, you'll almost certainly need to clean your data, visualize it, and do some exploratory data analysis.
    
    These skills are also important as you move into more advanced topics. Do you want to start doing machine learning, artificial intelligence, and deep learning? You had better know how to clean and explore a dataset. If you can't, you'll basically be lost.

    “Fluency” with the basics … what does this mean?

    I want to explain a little more about what I mean by “master of the foundations.” By “mastery,” I mean something like “fluency.”

    As I’ve said before, programming languages are a lot like human languages.

    To communicate effectively and “get things done” in a language, you essentially need to be “fluent.” You need to be able to express yourself in that language, and you need to be able to do so in a way that’s accurate and performed with ease.

    Granted, you can “get by” without fluency, but you couldn’t expect to be hired for a language-dependent job without fluency.

    For example, do you think you could get a job as a journalist at the New York Times if you hadn’t mastered basic English grammar? Do you think you could get a job at the Wall Street Journal if you needed to look up 50% of the words you used?

    Of course not. If you wanted to be a journalist (in the USA), you would absolutely need to be fluent in English.

    Data science is similar. You can’t expect to get a paid job as a data scientist if you’re doing google searches for syntax every few minutes.

    If you eventually want a great job as a data scientist, you need to be fluent in writing data science code.

    Can you write this code fluently?

    Here’s an example. This is some code to analyze some data.

    Ask yourself, can you write this code fluently, from memory?

    #---------------
    # LOAD PACKAGES
    #---------------
    library(tidyverse)
    library(forcats)
    
    
    #------------------------------------------------------
    # BUILD DATASET
    # NOTE:
    # In this post, we will be using data from an analysis
    # performed by pwc
    # source: pwc.to/2totbnj
    #------------------------------------------------------
    
    df.ai_growth <- tibble::tribble(
      # the pwc figures themselves did not survive extraction; see pwc.to/2totbnj
      ~year, ~predicted_ai_growth  # column names are placeholders
    )
    

    You should be able to write most of this code fluently, from memory. You shouldn't have to use many google searches or external resources at all.

    Will you maybe forget a few things? Sure, every now and again. Will you write it all in one go? No. Even the best data scientists write code iteratively.

    But in terms of remembering the syntax, you should know most of this cold. You should know most of the syntax by memory.

    That's what fluency means.

    … and that's what it will take to be one of the best.

    Mastering data science is easier than you think

    I get it. This probably sounds hard.

    I don't want to lie to you. It's not “easy” in the sense that you can achieve “fluency” without any effort.

    But it is much easier than you think. With some discipline, and a good practice system, you can master the essentials within a couple of months.

    If you know how to practice, within a few months you can learn to write data science code fluently and from memory.

    Discover how to become a top-performing data scientist

    If you want to become a top-performing data scientist, then make sure you sign up for our email list.

    Next week, we will re-open enrollment for our data science training course, Starting Data Science.

    This course will teach you the essentials of data science in R, and give you a practice system that will enable you to memorize everything you learn.

    Want to become a top performer? Our course will show you how.

    Sign up for our email list and you'll get an exclusive invitation to join the course when it opens.

    SIGN UP NOW

    The post The first step to becoming a top performing data scientist appeared first on SHARP SIGHT LABS.


    Downtime Reading

    By R Views

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Not everyone has the luxury of taking some downtime at the end of the year, but if you do have some free time, you may enjoy something on my short list of downtime reading. The books and articles here are not exactly “light reading”, nor are they literature for cuddling by the fire. Nevertheless, you may find something that catches your eye.

    The Syncfusion series of free eBooks contains more than a few gems on a variety of programming subjects, including James McCaffrey’s R Programming Succinctly and Barton Poulson’s R Succinctly.

    For a more ambitious read, mine the rich vein of SUNY Open Textbooks. My pick is Hiroki Sayama’s Introduction to the Modeling and Analysis of Complex Systems.

    If you just can’t get enough of data science, then a few articles that caught my attention are:

    Starry Night through 6,667 uniform random samples

    Finally, if you really have some time on your hands, try searching through the 318M+ papers on PDFDRIVE.

    Happy reading, and have a Happy and Prosperous New Year from all of us at RStudio!!


    Working with PDFs – scraping the PASS budget

    By R on Locke Data Blog

    (This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

    Using tabulizer we’re able to extract information from PDFs so it comes in really handy when people publish data as a PDF! This post takes you through using tabulizer and tidyverse packages to scrape and clean up some budget data from PASS, an association for the Microsoft Data Platform community. The goal is to mainly show some of the tricks of the data wrangling trade that you may need to utilise when you scrape data from PDFs.
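
    As a taste of what that looks like (the file name below is made up for illustration), tabulizer can pull every detected table out of a PDF in one call:

    library(tabulizer)

    # each detected table comes back as a character matrix in a list
    tables <- extract_tables("pass-budget.pdf")
    str(tables[[1]])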


    Adding bananas from the commandline (extending the oomsifier)

    By Roel M. Hogervorst

    (This article was first published on Clean Code, and kindly contributed to R-bloggers)

    Sometimes you just want to add bananas from the commandline. Previously
    I created a small script that takes an image and adds a dancing banana to the bottom left of the image. I wanted to make an API too, but that will have to wait till next year. Today we will create a commandline script that will do the same thing.

    With the excellent explanation in Mark Sellors’ guide I have now created a cmdline thingy in very few steps.

    I can now add bananas from the commandline with:

    ./bananafy.R ../images/ggplotexample.png out.gif
    

    This executes and says:

    Linking to ImageMagick 6.9.7.4
    Enabled features: fontconfig, freetype, fftw, lcms, pango, x11
    Disabled features: cairo, ghostscript, rsvg, webp
    writing bananafied image to out.gif
    

    The modified script

    First the script itself, saved as bananafy.R

    #!/usr/bin/Rscript --vanilla
    args <- commandArgs(trailingOnly = TRUE)
    # the exact argument check was lost in extraction; the script needs two arguments
    if (length(args) < 2){
        stop("I think you forgot to input an image and output name? \n")
    }
    
    library(magick)
    ## Commandline version of add banana
    offset <- NULL # maybe a third argument here would be cool?
    debug <- FALSE
    image_in <- magick::image_read(args[[1]])
    banana <- image_read("../images/banana.gif") # 365w 360h
    image_info <- image_info(image_in)
    if("gif" %in% image_info$format ){stop("gifs are too difficult for me now")}
    stopifnot(nrow(image_info)==1)
    # scale banana to correct size:
    # take smallest dimension.
    target_height <- min(image_info$width, image_info$height)
    # scale banana to 1/3 of the size
    scaling <- (target_height / 3)
    front <- image_scale(banana, scaling)
    # place in lower right corner
    # offset is width and height minus the dimensions of the scaled banana
    scaled_dims <- image_info(front)
    x_c <- image_info$width - scaled_dims$width
    y_c <- image_info$height - scaled_dims$height
    offset_value <- ifelse(is.null(offset), paste0("+",x_c,"+",y_c), offset)
    if(debug) print(offset_value)
    frames <- lapply(as.list(front), function(x) image_composite(image_in, x, offset = offset_value))
    
    result <- image_animate(image_join(frames), fps = 10)
    message("writing bananafied image to ", args[[2]])
    image_write(image = result, path = args[[2]])
    

    As you might notice, I copied the entire thing from the previous post and added some extra things:

    • It starts with ‘#!/usr/bin/Rscript’

    According to Mark:

    Sometimes called a ‘shebang’, this line tells the Linux and MacOS command line interpreters (which both default to one called ‘bash’), what you want to use to run the rest of the code in the file. … The --vanilla on the end tells Rscript to run without saving or restoring anything in the process. This just keeps things nice and clean.

    I've added a message call that tells me where the script saves the image. I could have suppressed the magick messages, but meh, it is a proof of concept.

    To make it work, you have to tell Linux (which I'm working on) that it can execute the file. That means changing the permissions on that file.
    
    In the terminal you go to the project folder and type chmod +x bananafy.R. You CHange MODe by adding (+) eXecution rights to that file.

    Advanced use: making bananafy available always and everywhere in the terminal

    We could make the bananafyer available in every folder. To do that you could move the script to e.g. ~/scripts/, modify the code a bit and add the banana gif to that same folder. You then have to modify your bashrc file (a sketch follows the notes below).

    • I had to make the link to the banana hardcoded: ‘~/scripts/images/banana.gif’
    • you can call the code from anywhere and the output of the script will end up in the folder you currently are in. So if I’m in ~/pictures/reallynicepictures the bananafied image will be there.
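
    A minimal sketch of that bashrc change, assuming the script now lives in ~/scripts/:

    # in ~/.bashrc: put the scripts folder on the PATH so bananafy.R is found anywhere
    export PATH="$HOME/scripts:$PATH"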

    Adding bananas from the commandline (extending the oomsifier) was originally published by at Clean Code on December 29, 2017.
