Keep Parquet and ORC from the data graveyard with new BigQuery features

“At Pandora, we have petabytes of data spread across multiple Google Cloud storage services; accordingly, we expect BigQuery’s federated query capability to be a useful tool for integrating our diverse data assets into a unified analytics ecosystem,” says Greg Kurzhals, product manager at Pandora. “The support for Parquet and other external data source formats will give us the ability to choose the best underlying storage option for each use case, while still surfacing all our data within a centralized, BigQuery-based data lake optimized for analytics and insights.” Gaining this flexibility eliminated some difficult architectural trade-offs, helping to simplify the design process and ultimately facilitate the creation of an efficient, accessible data structure in the cloud for the music services company.

When Cardinal Health started their journey to the cloud, they chose a lift-and-shift strategy, migrating all of their Hadoop jobs to run in Cloud Dataproc. “We also wanted to leverage cloud-native options like BigQuery but without necessarily rewriting our entire ingestion pipeline,” says Ken Flannery, senior enterprise architect at Cardinal Health. “We needed a quick and cost-effective way to allow our users the flexibility of using different compute options (BigQuery or Hive) without necessarily sacrificing performance or data integrity. Adding ORC federation support to BigQuery was exactly what we needed and was timed perfectly for our migration.”

As soon as Cardinal Health started migrating jobs to Cloud Dataproc, the same datasets that users were already querying from Cloud Dataproc became simultaneously available to them in BigQuery. “ORC federation helped us take advantage of BigQuery much sooner than otherwise possible and gave us the needed flexibility of choosing when and how much of BigQuery we would use,” says Flannery.

Software company Truecaller was using Hive/Spark to query data before it tested BigQuery’s external table support for the columnar format, but the Hive/Spark approach was slower and cost twice as much. They were working on onboarding teams to BigQuery quickly, so they decided to compare external tables with managed tables. “We were impressed by how convenient it was: There is zero setup cost and it is incredibly simple,” says Juliana Araujo, data product manager at Truecaller. “All we had to do was set the Cloud Storage URL path to our data and make a permanent table that references the data source. Now we can have our EDWH and data lake under the same stack.

“The greatest benefit of using BigQuery external tables for Truecaller is that it provides unprecedented opportunity to do ad-hoc analysis on enormous datasets that we don’t want to store in BigQuery and are too big for usual Hadoop processing.” This has saved hours of time for Truecaller. For example, in one use case, querying external tables was 30 times faster than querying Hive/Spark in the Truecaller data platform.

With the release of support for querying Parquet and ORC files in Cloud Storage, you can continue to use Cloud Storage as your storage system while taking advantage of BigQuery’s data processing capabilities. Moreover, BigQuery’s managed storage provides a higher level of automation, performance, security, and capability, which is worth keeping in mind as you move forward.
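To make that concrete, here is a minimal sketch of the workflow using the google-cloud-bigquery Python client. The project, dataset, table, and bucket names are placeholders for illustration, not anything referenced above:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Point an external table definition at Parquet files kept in Cloud Storage.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://my-bucket/events/*.parquet"]

    # Create a permanent table that references the data in place; the files
    # themselves never leave Cloud Storage.
    table = bigquery.Table("my-project.my_dataset.events_ext")
    table.external_data_configuration = external_config
    client.create_table(table)

    # Query it with standard SQL, just like a native BigQuery table.
    query = """
        SELECT event_type, COUNT(*) AS events
        FROM `my-project.my_dataset.events_ext`
        GROUP BY event_type
        ORDER BY events DESC
    """
    for row in client.query(query).result():
        print(row.event_type, row.events)

The same pattern works for ORC by swapping the source format, and the equivalent steps are available through the bq command-line tool and the BigQuery web UI.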

Loading Hive partitioned data into BigQuery
In addition to the native functionality provided by BigQuery, you can use the Hive External Table Loader for BigQuery, a convenient open source command-line utility that helps load Hive partitioned data into BigQuery.

You may want to use this tool, rather than BigQuery’s native support alone (sketched after the list below), if:

  • Your Hive partitioned data does not have a default Hive partitioned layout encoding all partition keys and values

  • Your Hive partitioned data does not share a common source URI prefix for all URIs and requires the metastore to resolve partition locations

  • Your Hive partitioned data relies on metastore positional column matching for schema detection
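If your files do follow the default layout (key=value directories under a common source URI prefix), BigQuery’s native Hive partitioning support is typically all you need, and the utility above is reserved for the edge cases listed. Here is a minimal sketch, assuming a hypothetical bucket laid out as gs://my-bucket/warehouse/orders/dt=2019-10-31/... and using the Hive partitioning options exposed by the google-cloud-bigquery Python client:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Infer partition keys (e.g. dt) and their types from the key=value paths.
    hive_options = bigquery.HivePartitioningOptions()
    hive_options.mode = "AUTO"
    hive_options.source_uri_prefix = "gs://my-bucket/warehouse/orders"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.ORC,
        hive_partitioning=hive_options,
    )

    # Load the ORC files into a native BigQuery table, preserving the partition
    # columns that were encoded in the directory names.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/warehouse/orders/*",
        "my-project.my_dataset.orders",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete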

Commitment to open data warehousing
BigQuery’s original columnar file format ColumnIO inspired the open source ecosystem to develop open columnar file formats, including Parquet. Today, dozens of exabytes are stored in Parquet across organizations of all shapes and sizes. This data format has come full circle: Parquet is now a first-class citizen of the BigQuery ecosystem. We’re pleased to be able to continue our commitment to open source with this integration.

“In 2012, I worked on a side project that was going to become the basis for Apache Parquet: I implemented the column-striping algorithm from ColumnIO based on the Dremel paper,” says Julien Le Dem, vice president, Apache Parquet. “At the time, Google had recently made that technology available through BigQuery. I didn’t imagine that one day they would support Parquet, integrating the work of its contributors. That’s the magic of open source!”

Learn more about staying in the land of the living with BigQuery
For more information and practical examples on how to take advantage of Parquet, ORC, and Hive partitioned data, head over to the documentation. As always, you can try BigQuery with our perpetual free tier of 1 TB of data processed and 10 GB of data stored per month. Keep your data well away from the land of the undead with our rich ecosystem of supported file formats and storage types.
