Regrid Blog

Regrid’s nationwide parcel data, now available in Parquet format!

Written by Regrid Product Team | Oct 31, 2022 3:26:41 PM

We are very excited to announce that our nationwide parcel data in Premium Schema is now available in Apache Parquet format, delivered via SFTP, for all the data modelers, analysts and scientists who prefer a columnar file format. You asked, and we answered!

 

Parquet is a column based file format that is often used for data pipelines. Part of the Apache Hadoop ecosystem, it is designed to be compact and efficient for large scale data analysis.

Popular amongst the Python community, Parquet is ideal for fast querying and consuming less storage space and as a result yields itself naturally for large datasets such as our nationwide parcel data. AWS once said that Parquet is “2x faster to unload and consumes up to 6x less storage in Amazon S3, compared to text formats”.

 

 

What is Parquet and why is it valuable as a data file format?

Parquet is a columnar file format, which is fundamentally different from row-based formats such as CSV. It is free and open source & is self-describing - typically containing the metadata & schema. 

  1. Columnar - The file format is columnar - which means that data is encoded and stored by columns instead of by rows. This allows the user to restrict disk i/o(input/output) operations to a minimum.
    In contrast, the popular file format CSV is row-oriented. Row-oriented formats are optimized for OLTP workloads while column-oriented formats are better suited for analytical workloads.


    (Image source: TowardsDataScience)
  2. High-performant analytical querying - Unlike row-based file formats, Parquet is optimized for performance. When running queries on Parquet, one can query only on fewer fields very quickly, instead of the entire schema. Moreover, the amount of data scanned will be way smaller and will result in less I/O usage. Parquet is a self-described format, so each file contains both data and metadata. This structure is well-optimized both for fast query performance & saving space.
  3. Schema customization - We deliver all our premium schema columns in our Parquet files. That said, as mentioned above, our customers can decide to load the most relevant fields from our schema to start off with, allowing for cleaner, faster querying. Our users can gradually choose to add more fields.

When does it make sense to use a Parquet file format?

  1. Data modeling, AI/ML/DL use cases that require working with very large datasets - Since Parquet is inherently optimized for faster querying performance and compression, it's a format of choice for analysis and modeling in languages like Python. It saves time, space and allows very focused, tailored querying. Common big data systems like BigQuery and Snowflake can ingest Parquet files. One of the reasons why many of our data modeling customers requested we provide our in this format, and voila, we now have it for you.
  2. Use cases that require working with subsets of columns from a large dataset schema - We pride ourselves on the many fields we offer in our parcel schema. That said, depending on the use case, we often have customers working with a core subset of columns most of the time and this subset varies from use case to use case, sometimes within the same team. Parquet’s column-oriented storage makes it ideal for querying only on relevant data. 
  3. Cut down querying cost if your team uses multiple analytics services - Most DBs charge by the amount of data scanned per query and this multiplies when using multiple services and datasets. The faster querying and compression in Parquet not only helps in performance and speed but also helps cut costs.

How to get our data in the Parquet format?

At the moment, this file format is available for our Premium schema customers via SFTP, in their regular download directories. However,  it’s not one of the readily available formats in our data store yet. However, please let our team know when you reach out to evaluate our nationwide data that you’d like to sample it in Parquet and we will get that set up for you.

 

There is a full list of file formats that Regrid supports available on our site. Multiple formats for your multiple use cases!

 

If you are interested in our parcel data or would like to start a conversation with us to evaluate our data for your project, please reach out to us at parcels@regrid.com .

 

 

Works Cited

Berk, Michael. “Demystifying the Parquet File Format | by Michael Berk.” Towards Data Science,

https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705. Accessed 31 October 2022.

Levy, Eran. “What is the Parquet File Format? Use Cases & Benefits.” Upsolver, 27 February 2022,

https://www.upsolver.com/blog/apache-parquet-why-use. Accessed 31 October 2022.