Actuarial Outpost > Actuarial Discussion Forum > Property - Casualty / General Insurance
#11 | 11-14-2017, 09:08 AM
GwenAnderson (CAS; Join Date: Nov 2017; Posts: 10)
Hi, I am new to the Outpost, and I am wondering whether there should be a new category for code problems? I had not thought about code review as an ongoing board discussion.

Pragmatist, I see what you are saying about the data. I hadn't thought of downloading the data as a problem; I just do it while I am reading or opening my paper mail. But yes, you could download three years of data and run them through the loop. The code is short (half a page to one page each) and is intended to help beginners access raw weather data sets. Yes, per your suggestions, I will incorporate more features of data.table than I use at present. The intent is for the code to be extremely simple. The data source is here: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/

Since there seems to be a lot of enthusiasm here for code, but less for accessing data, I will ask about the download instructions for Linux: "Download the following TAR file: 'ghcnd-all.tar.gz' if you want all of GHCN-Daily. Then uncompress and untar the contents of the tar file, e.g., by using the following Linux command: tar xzvf ghcnd_all.tar.gz"

I do not use Linux, so I do not know how to do this on my computer. Do you? Are you in a qualified IT role? Since I am not an IT person, if something will take a day to figure out and I can't refer it to IT, I will just put it in Excel, Access, or R and do it another way. Will anyone see this post as a reply, or do I need to begin a new thread? My intention here was to seek reviewers.
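That said, if R's built-in utilities do what their help pages say, something like the following sketch might do the download and extraction on any system, with no Linux shell needed. I have not tested it against the full archive, which is several gigabytes:

Code:
# Untested sketch: fetch and unpack the GHCN-Daily archive from within R,
# so the Linux tar command is not needed on Windows or macOS.
url  <- "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_all.tar.gz"
dest <- "ghcnd_all.tar.gz"
download.file(url, destfile = dest, mode = "wb")  # "wb" = binary-safe on Windows
untar(dest, exdir = "ghcnd_all")                  # R's equivalent of tar xzvf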
#12 | 11-14-2017, 10:57 AM
Vorian Atreides (Wiki/Note Contributor, CAS; Join Date: Apr 2005; Posts: 59,349)

There is already a software subforum where this sort of question could be posed and discussed.

But if there is an intended audience for the material (e.g., P&C applications or Life/Health applications), then posting in that targeted area would also be appropriate.
#13 | 11-14-2017, 11:37 AM
AMedActuary (Member, SOA; Join Date: May 2007; Posts: 367)

In the loop, are you creating any large temporary variables over and over? If so, you probably want to remove them with rm() and run the garbage collector with gc() at the end of each iteration. This removes the temporary variable from memory and frees the RAM for later use. I believe garbage collection runs automatically, but it may take some time, and in the meantime you are tying up a lot of RAM.

Also, pre-allocating an empty matrix and then filling it with data is more efficient than growing a matrix inside a loop with cbind() or something like that. If you can pre-allocate, that is easier on your memory usage.
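A minimal sketch of both ideas together (the sizes and names here are made up for illustration):

Code:
n_files <- 100
results <- matrix(NA_real_, nrow = n_files, ncol = 2)  # pre-allocate once

for (i in seq_len(n_files)) {
  tmp <- matrix(rnorm(1e6), ncol = 10)    # stand-in for one large temporary
  results[i, ] <- c(mean(tmp), sd(tmp))   # fill the pre-allocated row
  rm(tmp)                                 # drop the temporary...
  gc()                                    # ...and reclaim the RAM promptly
}

# The slow pattern grows the object on every pass, copying it each time:
# results <- rbind(results, c(mean(tmp), sd(tmp)))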
#14 | 11-14-2017, 03:45 PM
Heywood J (Member, CAS; Join Date: Jun 2006; Posts: 4,012)

I wouldn't use the data.table package, not if you're looking for simple code. It is fast, but it's quirky and has awful syntax, in a way that is very different from base R's awful syntax. I think tidyverse packages like dplyr and readr are almost as fast as data.table in most applications and are far easier to use effectively.
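For a taste of the difference, here is the same filter-group-summarize on a built-in data set in both styles (a toy sketch; judge the syntax for yourself):

Code:
library(data.table)
library(dplyr)

# data.table: everything in one bracket -- dt[filter, compute, by = group]
dt <- as.data.table(mtcars)
dt[mpg > 20, .(avg_hp = mean(hp)), by = cyl]

# dplyr: more verbose, but reads top to bottom as a pipeline
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(avg_hp = mean(hp))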
#15 | 11-14-2017, 04:50 PM
Avi (Wiki Contributor, Site Supporter, CAS AAA; Join Date: Aug 2002; Posts: 13,904)

Quote: Originally Posted by Heywood J
That is a matter of preference. I find data.table syntax easier to work with than dplyr, and it is orders of magnitude faster on the multi-hundred-million-row data sets with which I have to deal at times. Even on smaller data sets, I find I can extract and analyze more efficiently with data.table than by chaining or nesting dplyr calls. YMMV.
#16 | 11-14-2017, 05:46 PM
kevinykuo (CAS; Join Date: Nov 2017; Posts: 10)

Quote: Originally Posted by Avi
If your dataset fits in memory on one node, data.table will always be faster, and more obviously so in the 10-100 million row range like you mentioned, but for scaling up to big data (stuff that won't fit in memory on one node) you'll have to use dplyr with a distributed backend.

dplyr also plays along better with the rest of the tidyverse, so that's something to consider.
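As a rough sketch of what that looks like with sparklyr as the backend (assuming a local Spark installation; on a real cluster only the connection changes), the same dplyr verbs get translated to Spark SQL and run outside R's memory:

Code:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # local Spark; a cluster URL in production
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# Ordinary dplyr verbs, executed by Spark rather than in R's RAM
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_hp = mean(hp)) %>%
  collect()                             # pull only the small result into R

spark_disconnect(sc)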
#17 | 11-14-2017, 09:48 PM
Heywood J (Member, CAS; Join Date: Jun 2006; Posts: 4,012)

Quote: Originally Posted by kevinykuo
Another factor to consider is that data.table is less safe in the hands of non-expert programmers. R generally avoids passing objects by reference, but data.table objects are effectively pointers. Passing objects by reference is more efficient, but it introduces new ways to screw up in complicated, non-obvious ways.
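The classic trap, as a short sketch: plain assignment does not copy a data.table, so an apparently independent object gets modified behind your back unless you call copy() explicitly.

Code:
library(data.table)

dt1 <- data.table(x = 1:3)
dt2 <- dt1             # NOT a copy: both names point to the same table
dt2[, x := 0]          # := modifies in place, by reference
dt1$x                  # 0 0 0 -- dt1 changed too, silently

dt3 <- copy(dt1)       # an explicit deep copy
dt3[, x := 99]
dt1$x                  # still 0 0 0 -- this time dt1 is untouched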
#18 | 11-15-2017, 10:53 AM
GwenAnderson (CAS; Join Date: Nov 2017; Posts: 10)

Hi, good morning. You do have a nice conversation going about R packages, although you may be surprised that it is quite unrelated to the paper for which I am seeking review. I am not actually struggling with how to write the code or with choosing packages (the code is mostly complete and fairly straightforward). In fact, I am considering removing the bulk of this sort of coding from the article, since once the data is accessed, much of the summarizing and subsetting can already be done in Excel. The purpose of the article is to present tools for climate analyses that are not available in the Microsoft suite.

Vorian pointed out there is a code section of the Outpost where I can place any code questions I have. I was curious whether the Linux command would be helpful, since there were questions about downloading data; that is the one command I do not know how to use. I have not run into any trouble with downloading myself.

I have not had anyone sign up as a reviewer yet. If you did have the paper in hand, you probably have a good enough background in R to contribute. Under the review process, you are placed on an email list and make comments with a group; however, there are no specific requirements imposed other than to provide any comments in November.
#19 | 11-15-2017, 10:54 AM
Avi (Wiki Contributor, Site Supporter, CAS AAA; Join Date: Aug 2002; Posts: 13,904)

Quote: Originally Posted by kevinykuo
Matt Dowle (the data.table project lead) now works for H2O, but there isn't native support for Spark/H2O in data.table just yet; that is planned. So you're correct that if your data requires more than RAM, sparklyr is the way to go (to my chagrin, I'm facing that now even with 532GB of RAM), but that will be addressed eventually, and then we will be able to leverage the syntax (which for some is a problem, but for me is a feature) and inherent speed of data.table on distributed data.

All that being said, it is a really exciting time in the R universe with all the development going on, isn't it?
#20 | 11-15-2017, 10:54 AM
Avi (Wiki Contributor, Site Supporter, CAS AAA; Join Date: Aug 2002; Posts: 13,904)

Quote: Originally Posted by GwenAnderson
Welcome to the Outpost, where thread hijacks are a feature, not a bug.
Tags: climate, meteorology, programming, r language, volunteer