How to query large JSON file with Dremio and Exploratory

In the previous blog post, I explained how you can setup Dremio and access it from Exploratory. In this post, I’m going to walk you through how you can query large JSON file whose size is more than 1 GB on your local machine with Dremio and Exploratory.

Create Data Set on Dremio

In this example, I’m going to use Yelp Dat Set to show you a way to work with large data set. In Yelp Data Set, user JSON file’s size is 1.57 GB. So let’s use the file for now.

Upload JSON file.

Login to your local Dremio, and click Purple icon on right hand side top.

Select Yelp’s user.json (file size 1.57GB) and click Next

Check the preview and click Save button.

You can see that user JSON file is registered as a physical dataset (purple color icon on Dremio UI) user under you home (i.e. @hidekoji in my case). And this means that it’s ready query this JSON data with SQL from Exploratory.

Query JSON File on Dremio from Exploratory

Let’s query this JSON data set on Dremio from Exploratory.

On left hand side tree, click plus (+) button next to Data Frames label and select Database Data.

On Data Import Dialog, select Dremio Connection (i.e. Dremio Local Mac) that you created and expand your home directory (in my case @hidekoji) and you can see user data set under it that you uploaded to Dremio.

Click the Table name, which would automatically generate a SQL query to get the whole data. By clicking Preview button, you will see the data returned in Exploratory ! And you’ll notice it has around 1.18 million rows.

select * from "@hidekoji"."user"

Let’s say out of 1.18 million users, you only want to see users who joined Yelp after “2015-01-01” and have more than 20 fans and more than 50 review counts. Well you can write a SQL query like this.

select * from "@hidekoji"."user"
where fans > 20
and review_count > 50
and yelping_since => '2015-01-01'

And now the number of users that match the condition becomes 228 with just a few lines of SQL query. If you’re familiar with SQL query, it’s really powerful that you can filter your JSON file with SQL query like this!

Let’s import it to Exploratory and analyze the Yelp users data. Click Import button and set some name to your data frame like yelp_popular_user_latest and click create button.

Then data is imported into Exploratory and you can see summary information like this.

Advanced Analytics (K-means Clustering)

With this data, we can build a clustering model to cluster these Yelp users into a number of groups based on some attributes presented in other columns. We can use K-means clustering algorithm for this.

Normalize Data

First, normalize numeric columns to prevent big numbers would play big roles compared to other columns. You can normalize column from column menu Work with Numeric Function -> normalize.

K-means Clustering

Click plus button and select Run Analytics... -> Cluster with K-means

In the Cluster K-means dialog, select review_count, userful, funny, cool, fans, average_stars columns and set Number of Clusters as 4.

Once calculation is finished, a new column called cluster is added to the data frame like below. This cluster column holds each cluster number ranging from 1 to 4 because we specify 4 as number of cluster.

Let’s gather review_count, userful, funny, cool, fans, average_stars to key and value columns like this.

Then you can visualize the clustering with a boxplot chart by assigning cluster to X-Axis and value to Y-Axis and key column to Color By. This boxplot chart helps you to figure out characteristics of each cluster.

Loading...

Summary

As you have seen, thanks to Dremio, you can use SQL to query against JSON data and get only the necessary data by filtering. I didn’t show it in this post, but of course you can do all sorts of things with SQL such as aggregating the data, selecting only the columns you want, etc. And this would be very useful especially when you have large JSON data that is hard to fit into your PC’s memory.


If you don’t have Exploratory Desktop yet, you can sign up from here for a 30 days free trial. If you are currently a student or teacher, then it’s free!