Caching in PySpark

Internals of Caching in PySpark

Caching DataFrames in PySpark is a powerful technique to improve query performance. However, there’s a subtle difference in how you can cache DataFrames in PySpark.

cached_df = orders_df.cache() and orders_df.cache() are two common approaches, and they serve different purposes. The choice between them depends on your specific use case and on whether you plan to reuse the cached DataFrame later in your code.

Let’s walk through an example:

Consider a DataFrame named orders_df with 9 partitions.

Approach 1:

1. orders_df.select("order_id", "order_status").filter("order_id > 10").cache()

When statement 1 is executed, no data is actually stored yet, because cache() is a lazy operation: Spark only marks the plan for caching, and partitions are materialized in memory only when an action runs against it.

2. orders_df.select("order_id", "order_status").filter("order_id > 10").count()

Statement 2 hits the cache: count() is an action that scans every partition, so the full result is now materialized in the cache.

3. orders_df.select("order_id", "order_status").filter("order_id > 100").count()

Ideally statement 3 should hit the cache, because the rows with order_id > 100 are a subset of the data already cached by statement 2, but it does not. Spark decides whether to serve a query from the cache by comparing its analyzed logical plan against the plans of cached DataFrames, and the analyzed logical plans of statements 2 and 3 differ (the filter predicates are different). The analyzed logical plan is one of the stages in Spark's query execution pipeline.

4. orders_df.select("order_id", "order_status").filter("order_id > 100").cache()

5. orders_df.select("order_id", "order_status").filter("order_id > 100").count()

Now statement 5 hits the cache, because its analyzed logical plan is the same as the one cached in statement 4.

In short, when a DataFrame is cached and an operation is later invoked on a related query, Spark checks whether the query's analyzed logical plan matches a cached plan before serving data from the cache.

Approach 2:

Now, let's see how assigning the cached DataFrame to a new variable helps:

cached_df = orders_df.select("order_id", "order_status").filter("order_id > 10").cache()

Here, if we run:

cached_df.select("order_id", "order_status").filter("order_id > 10").count()

cached_df.select("order_id", "order_status").filter("order_id > 100").count()

In both cases the query hits the cache, because each query is built on top of cached_df, so its analyzed logical plan contains the cached plan as a subtree.

In the second approach, it's clear that cached_df is the cached DataFrame, and we can reuse it without calling cache() again and again. This approach promotes code clarity and maintainability and reduces the risk of caching-related errors.

PS~ I teach big data. Visit my website https://lnkd.in/gt_jpCyE to learn more about my big data program. The new batch just started on Saturday.

#bigdata #dataengineering #apachespark #interview
