Optimizing Performance in Databricks: Best Practices for Caching

When working with large-scale data pipelines, query performance is often the deciding factor in how quickly insights can be derived. Databricks provides two primary caching mechanisms: disk caching (formerly known as Delta cache or DBIO cache) and Apache Spark caching. While both aim to improve performance, understanding their differences and best practices can help data teams optimize their workloads effectively.

On worker node types with local SSDs, Databricks enables disk caching by default for Parquet and Delta files stored on cloud object storage (e.g., S3, ADLS, GCS). This cache keeps copies of remote data files on the workers' local SSDs, significantly improving read speeds. In contrast, Apache Spark caching stores DataFrames or RDDs in memory, reducing recomputation but consuming valuable RAM.

Key Differences:

Feature            | Disk Cache                                 | Apache Spark Cache
Storage Medium     | Local SSDs on worker nodes                 | JVM memory (depends on storage level)
Applicable To      | Any Parquet table in cloud storage         | Any DataFrame or RDD
Trigger Mechanism  | Automatic on first read (if enabled)       | Manual (.cache() or .persist())
Eviction Policy    | LRU or on file modification                | LRU or manual (unpersist())
Performance Impact | Faster reads from SSDs, reduced I/O        | Eliminates recomputation but uses RAM
Best Use Case      | Large, frequently accessed Parquet tables  | Smaller datasets needing frequent reuse
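
For illustration, here is a minimal sketch of how each mechanism is invoked (the paths below are hypothetical): the disk cache needs no code changes on clusters where it is enabled, whereas the Spark cache is managed explicitly.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

# Disk cache: nothing to call explicitly. On clusters where it is enabled, the first
# read of this Delta table also copies the scanned files onto the workers' local SSDs.
events_df = spark.read.format("delta").load("/path/events")   # hypothetical path
events_df.count()

# Spark cache: explicitly persist the DataFrame in executor memory (spilling to disk
# if it does not fit) and release it manually when it is no longer needed.
lookup_df = spark.read.format("delta").load("/path/lookup")   # hypothetical path
lookup_df.persist(StorageLevel.MEMORY_AND_DISK)
lookup_df.count()      # an action materializes the Spark cache
lookup_df.unpersist()  # manual eviction, unlike the disk cache's automatic LRU policy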

To ensure the cache is being used effectively, monitoring is essential. You can analyze cache efficiency in the Spark UI by looking at storage read patterns and task execution breakdowns.
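
Alongside the Spark UI, a quick programmatic check can confirm that the disk cache is enabled and that a given table is held in the Spark cache; a minimal sketch, assuming a hypothetical table name:

# Confirm the disk cache is switched on for this cluster
print(spark.conf.get("spark.databricks.io.cache.enabled"))

# Spark cache side: register a table in the in-memory cache and verify it
spark.catalog.cacheTable("sales.orders")           # hypothetical table name
spark.table("sales.orders").count()                # an action materializes the cache
print(spark.catalog.isCached("sales.orders"))      # True once cached
print(spark.table("sales.orders").storageLevel)    # reports the storage level in use

The disk cache hit and eviction figures discussed below are read from the Spark UI rather than from this code, so the check complements UI monitoring rather than replacing it.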

Interpreting Cache Metrics

  • Hit ratio: High values indicate efficient caching; low values suggest frequent remote reads.
  • Eviction count: Frequent evictions may mean cache size is too small or the workload is too volatile.
  • Compression impact: If high CPU usage is observed, compression settings should be reassessed.

Best Practices for Optimizing Performance with Disk Cache

1. Enable Disk Caching for Parquet-Based Workloads

Databricks enables disk caching by default on specific node types with SSDs. However, if you want to ensure it’s enabled, configure the following setting:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
 

2. Optimize Cache Storage Allocation

To control how much disk space the cache consumes per node, specify the following parameters in the cluster's Spark configuration (they take effect when the cluster is created):

spark.databricks.io.cache.maxDiskUsage 100g  # Adjust per workload
spark.databricks.io.cache.maxMetaDataCache 2g
spark.databricks.io.cache.compression.enabled true  # Saves space but may impact CPU
 

Tip: Compression reduces storage usage but may introduce additional CPU overhead. Test with different settings to find the optimal balance.
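
One way to run that test is to time a representative query with the disk cache already warm, once on a cluster with compression enabled and once on an otherwise identical cluster with it disabled; a minimal sketch, assuming a hypothetical path and grouping column:

import time

def timed_scan(path):
    df = spark.read.format("delta").load(path)
    # Run the same query twice: the first pass populates the disk cache,
    # the second pass measures cached read performance.
    df.groupBy("region").count().collect()      # hypothetical grouping column
    start = time.time()
    df.groupBy("region").count().collect()
    return time.time() - start

# Compare the timing on a cluster with spark.databricks.io.cache.compression.enabled
# set to true against an otherwise identical cluster where it is set to false.
print(timed_scan("/path/large_table"))          # hypothetical path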

3. Balance Autoscaling with Caching

In an autoscaling environment, disk cache persistence can be challenging. When a worker is decommissioned, its cached data is lost, impacting query performance. To mitigate this:

  • Use a core group of fixed nodes with caching enabled, while allowing ephemeral nodes for burst scaling.
  • Increase spark.databricks.io.cache.maxDiskUsage to maintain more cached data per node.
  • Use manual cache warming, preloading frequently accessed data before scaling out (see the sketch after this list).
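
Databricks provides a CACHE SELECT statement that preloads the disk cache for exactly this kind of warming; a minimal sketch, assuming a hypothetical table, columns, and date filter:

# Preload the columns and rows the dashboards touch most, so newly added workers
# serve them from local SSDs instead of remote object storage.
spark.sql("""
    CACHE SELECT order_id, order_date, amount
    FROM sales.orders
    WHERE order_date >= date_sub(current_date(), 30)
""")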

4. Use Compression Wisely

Disk cache compression can improve storage efficiency but may introduce CPU overhead. Consider the following trade-offs:

Compression | Storage Savings | CPU Overhead
Enabled     | ~40% reduction  | +15% CPU usage
Disabled    | No reduction    | No additional CPU load

If CPU is a bottleneck, disabling compression may yield better query latencies.

5. Keep Hot Data in Spark Memory and Warm Data in the Disk Cache

For optimal performance, frequently accessed hot data should reside in Spark memory (.cache()), while warm data can remain in the disk cache. This balances memory efficiency with performance.

# Example: persist only the critical DataFrame in memory
critical_df = spark.read.format("delta").load("/path/critical_data").cache()
critical_df.count()  # cache() is lazy; an action is needed to materialize the Spark cache

# Less critical data can rely on the automatic disk cache
secondary_df = spark.read.format("delta").load("/path/secondary_data")
 

6. Optimize Data Layout for Caching

To maximize caching efficiency:

  • Use Z-Ordering on frequently filtered columns to improve cache locality.
  • Optimize file layouts to reduce small file fragmentation, which can degrade caching benefits.
  • Regularly compact small Parquet files using Delta Lake’s OPTIMIZE command (see the example after this list).
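
The compaction and Z-Ordering steps can be combined in a single statement; a minimal sketch, assuming a hypothetical table and filter column:

# Compact small files and cluster the data on the column most queries filter by,
# so cached file segments line up with the predicates that read them.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")   # hypothetical table and column

Because OPTIMIZE rewrites the underlying Parquet files, previously cached copies of the old files are invalidated and the next read repopulates the disk cache with the compacted layout.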

Troubleshooting Common Caching Issues

1. Cache Not Being Used

  • Verify that caching is enabled:
spark.conf.get("spark.databricks.io.cache.enabled")

  • Ensure Parquet or Delta formats are being used (other formats do not benefit from disk caching).

2. Cache Evicting Too Frequently

  • Increase spark.databricks.io.cache.maxDiskUsage to store more data locally.

  • Use larger SSD instance types for more cache space.

3. Query Performance Not Improving

  • Check partitioning strategy to avoid unnecessary cache misses.
  • Use Spark UI to verify where data is being read from (remote vs. local cache).

Caching and Streaming Workloads

Disk caching is not ideal for streaming workloads because the underlying data changes frequently. For Structured Streaming:

  • Prefer in-memory caching for small reference data (see the sketch after this list).
  • Use efficient partitioning strategies to avoid excessive cache invalidation.
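
A minimal sketch of that pattern, assuming hypothetical paths and a hypothetical product_id join key: the small dimension table sits in the Spark cache, while the stream itself reads fresh data in every micro-batch.

# Small, slowly changing reference data: pin it in executor memory.
dim_products = spark.read.format("delta").load("/path/dim_products").cache()
dim_products.count()   # materialize the Spark cache

# Streaming fact data: the disk cache helps little here because new files arrive constantly.
enriched = (
    spark.readStream.format("delta")
    .load("/path/orders_stream")
    .join(dim_products, "product_id")   # stream-static join against cached reference data
)

query = (
    enriched.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/checkpoints/orders_enriched")
    .start("/path/orders_enriched")
)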

Caching is one of the most powerful ways to enhance query performance in Databricks, especially for cloud-based Parquet and Delta workloads. By choosing the right worker types, fine-tuning cache settings, and monitoring cache efficiency, you can significantly improve read speeds and reduce overall query execution times.

Need help optimizing your Databricks performance? B EYE’s Databricks experts can assist with caching strategies, cluster tuning, and end-to-end data architecture improvements. Explore our Databricks Consulting Services for more information.
