Actuarial Outpost
 
#1
09-21-2017, 01:55 PM
Colonel Smoothie (Member, CAS)

Best way to handle a few million JSON files?

I have a few million JSON files that I have to text mine, scattered across a bunch of folders. It's about 2 TB worth of data.

The data in each file are semi-structured, but the structure is consistent across all the files. Should I just mine the data in its native JSON format, or should I aggregate it all into two SQL tables and then text mine? Some of the fields hold unstructured text; others are structured.
__________________
Recommended Readings for the EL Actuary || Recommended Readings for the EB Actuary

Quote:
Originally Posted by Wigmeister General View Post
Don't you even think about sending me your resume. I'll turn it into an origami boulder and return it to you.

#2
09-21-2017, 04:32 PM
DiscreteAndDiscreet (Member, AAA)

XML, JSON, and Lisp S-expressions all have an underlying tree structure with no simple equivalent in tabular data. If every record has an identical structure, that doesn't matter: you can flatten the data into a table and still restore the original structure by re-emitting the fixed format for each row. Reversible transformations are your friend, because nothing is left behind in the previous format that isn't preserved in the translation. If you can establish that a transformation only mangles data you don't care about, use it whenever it makes your life significantly easier.

Evaluate the data on this basis and translate it into a format you're comfortable working with, using methods that preserve the original content you need. You may need to review a statistically representative sample of the entire dataset to figure out what you need to worry about. It's up to you to decide what "statistically representative" means, but it probably doesn't mean the first 100 records in the first file of each folder. Writing a program that picks a random record from a random file in a random directory may be a useful start in itself.
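
A minimal sketch of that sampling step in Python, assuming the files sit under one root folder and that each file is either a JSON array of records or a single document; the root path and the optional records_key are made up for illustration:

Code:
import json
import random
from pathlib import Path

ROOT = Path("/data/json_dump")  # hypothetical root folder for the JSON files

def sample_records(n_samples=100, records_key=None):
    """Pull random records from random files for an eyeball check.

    If the records sit under a top-level key (e.g. "records"),
    pass that key name via records_key.
    """
    files = list(ROOT.rglob("*.json"))  # every file in every subfolder
    picked = []
    for path in random.sample(files, min(n_samples, len(files))):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        records = doc[records_key] if records_key else doc
        if isinstance(records, list) and records:
            picked.append((path.name, random.choice(records)))
        else:
            picked.append((path.name, records))
    return picked

for fname, rec in sample_records(5):
    print(fname, json.dumps(rec)[:200])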

For unstructured and semi-structured text, you'll probably want regular expressions to identify categories of text values. The quality of regular expression support varies by SQL implementation, but anything other than Access should have something decent. JavaScript has good native regex support, which would be a point toward making JS one of your tools.
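
To illustrate the regex categorization idea (in Python here, for consistency with the sketch above; the patterns and category names are invented, not taken from the actual data):

Code:
import re

# Invented categories; in practice these come from inspecting a
# representative sample of the text fields.
PATTERNS = {
    "claim_number": re.compile(r"\bCLM-\d{6,}\b"),
    "dollar_amount": re.compile(r"\$\s?\d{1,3}(,\d{3})*(\.\d{2})?"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def categorize(text):
    """Return the set of category names whose pattern matches the text."""
    return {name for name, pat in PATTERNS.items() if pat.search(text)}

print(categorize("Paid $1,250.00 on 09/21/2017 for CLM-004512"))
# e.g. {'claim_number', 'dollar_amount', 'date'} (set order varies)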

Remember that as the time invested in a text processing project increases, the software asymptotically approaches a parser for an informally specified grammar. There are parser generators that will let you parse text against a formally specified grammar, so there is somewhere to go if you start to feel like you're reinventing the wheel.

#3
09-21-2017, 07:28 PM
Colonel Smoothie (Member, CAS)

Quote:
Originally Posted by DiscreteAndDiscreet View Post

The JSON structure only has a couple of nested attributes, and they're only one level deep. The attributes I want to mine are all at the first level, so I ought to be able to aggregate everything at the first level into a single table. For the nested attributes, I can put them in separate tables and link them back with the parent ID.
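
Roughly that two-table flattening, as a sketch; the field names id, notes, and claims below are placeholders, not the real attributes:

Code:
import json

def flatten(doc):
    """Split one JSON document into a parent row plus child rows.

    Placeholder layout: scalar fields go to the parent table, the
    one-level-deep "claims" list goes to a child table keyed on parent_id.
    """
    parent = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}
    children = [{"parent_id": doc["id"], **item} for item in doc.get("claims", [])]
    return parent, children

doc = json.loads('{"id": 1, "notes": "free text here", '
                 '"claims": [{"amount": 100}, {"amount": 250}]}')
parent_row, child_rows = flatten(doc)
print(parent_row)  # {'id': 1, 'notes': 'free text here'}
print(child_rows)  # [{'parent_id': 1, 'amount': 100}, {'parent_id': 1, 'amount': 250}]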

Alternatively, my coworker suggested I just leave the files as they are and see whether there are any modern tools that let me skip the aggregation phase. I'm more familiar with working with tables, though.

#4
09-21-2017, 08:09 PM
DiscreteAndDiscreet (Member, AAA)

Quote:
Originally Posted by Colonel Smoothie View Post

The advantages of JSON are that it's easy to parse and generate and that it can represent a wide range of structured data effectively. Those are basically the same advantages as Lisp S-expressions, which date back to 1958. Nothing in CS is ever really new; it's mostly finding modern applications for a set of well-studied techniques. In light of that, I'd say your instinct to go with what you know is probably right. Nothing out there is going to give you an easier way to execute the queries you already have in mind.

There may be ways to optimize the processing time with particular approaches, but that's probably not worthwhile unless this is your full-time job now.

#5
09-26-2017, 10:19 PM
OME76 (Member, Non-Actuary)

I've been processing JSON and XML files a lot in the last year or so, usually into Excel or CSV files, because those are what coworkers want to see and are most comfortable with.

DiscreteAndDiscreet really seems to know what he's doing. I would take his advice.

#6
09-26-2017, 10:23 PM
OME76 (Member, Non-Actuary)

Just make sure that when you're flattening the output, the highest level of the sort is the filename or some other identifier for each source file.
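
Something along these lines, sketched in Python; the root folder, output file, and column names are placeholders:

Code:
import csv
import json
from pathlib import Path

ROOT = Path("/data/json_dump")           # placeholder root folder
FIELDS = ["source_file", "id", "notes"]  # placeholder columns, filename first

with open("flattened.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for path in sorted(ROOT.rglob("*.json")):  # filename is the top-level sort
        doc = json.loads(path.read_text(encoding="utf-8"))
        for record in (doc if isinstance(doc, list) else [doc]):
            writer.writerow({"source_file": path.name, **record})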

#7
09-27-2017, 08:46 PM
Colonel Smoothie (Member, CAS)

Quote:
Originally Posted by OME76 View Post

It took a few tries, but I've been able to upload a few files into SQL Server without a hiccup, so I think I have the schema right.

However, I calculated how long it would take to upload a few million files, and it's just going to take way too long (roughly two months of continuous uploading). I'm experimenting with various parallel upload methods to get everything in. Each file itself contains about 40,000 to 60,000 records. I'm working exclusively in the cloud on Azure.
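
One way to parallelize the loads, sketched with pyodbc and a thread pool; the connection string, table, columns, and worker count are all placeholders that would need tuning, and this isn't necessarily the fastest path into SQL Server:

Code:
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pyodbc  # assumes an ODBC driver for SQL Server is installed

CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=myserver;Database=mydb;"  # placeholder
INSERT_SQL = "INSERT INTO dbo.Records (Id, Notes) VALUES (?, ?)"                     # placeholder schema

def load_file(path):
    """Parse one JSON file (assumed to be an array of records) and
    batch-insert its rows on its own connection."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    rows = [(rec["id"], rec.get("notes")) for rec in doc]  # placeholder fields
    conn = pyodbc.connect(CONN_STR)
    try:
        cur = conn.cursor()
        cur.fast_executemany = True  # send parameter batches instead of row-by-row inserts
        cur.executemany(INSERT_SQL, rows)
        conn.commit()
    finally:
        conn.close()
    return path

files = sorted(Path("/data/json_dump").rglob("*.json"))
with ThreadPoolExecutor(max_workers=8) as pool:  # worker count needs tuning
    for _ in pool.map(load_file, files):
        pass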

#8
09-28-2017, 07:30 AM
Incredible Hulctuary (Member, Non-Actuary)

Quote:
Originally Posted by Colonel Smoothie View Post
Each file itself contains about 40,000 to 60,000 records. I'm working exclusively in the cloud on Azure.
Millions of files, each with tens of thousands of records, means tens of billions of records, and you're trying to parse and upload all of that to the cloud? You will go crazy.

Do they (Azure/MS) have some kind of hard-drive seeding service? If they do, I'd look for a way to use it: either load the data into a local database and ship the database to them, or, if that isn't feasible, ship a hard drive with all the JSON files and load them into a cloud database from there (after perfecting the parsing/loading routine locally).
__________________
A lot of people are afraid of heights; not me, I'm afraid of widths. - Steven Wright

#9
09-28-2017, 08:37 AM
Colonel Smoothie (Member, CAS)

Quote:
Originally Posted by Incredible Hulctuary View Post

Everything's already in the cloud. Azure has a 1.5 gigabit connection, so I downloaded the original data into the cloud via BitTorrent using a virtual machine. I've got a few TB of data sitting in Azure storage; moving it all from JSON into a SQL database is the next hurdle.

Throughput has been really slow. I'm thinking it might have something to do with the format, because there are tutorials showing upload rates of around 1 TB per 15 minutes with other formats.

#10
10-09-2017, 10:11 PM
Colonel Smoothie (Member, CAS)

Turns out storing the data in SQL Server is just going to be cost prohibitive. We're turning toward leaving the JSON files as they are and using a Spark cluster to mine them. I've found that PySpark reads the format pretty well; I just need to get used to the functional programming style for manipulating the data.
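
A minimal PySpark sketch of that setup; the wasbs:// path and the field names are placeholders, and it assumes the cluster already has the Azure storage connector configured:

Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name

spark = SparkSession.builder.appName("json_mining").getOrCreate()

# Placeholder path; the wildcard lets Spark pick up every file in every folder.
df = spark.read.json("wasbs://container@account.blob.core.windows.net/json_dump/*/*.json")

# Tag each record with the file it came from, then run a regex filter
# over a placeholder free-text field.
mined = (df.withColumn("source_file", input_file_name())
           .filter(col("notes").rlike(r"CLM-\d{6,}"))
           .select("source_file", "id", "notes"))

mined.show(20, truncate=False)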