When working with large-scale data pipelines, query performance is often the deciding factor in how quickly insights can be derived. Databricks provides two primary caching mechanisms: disk caching (formerly known as Delta cache or DBIO cache) and Apache Spark caching. While both aim to improve performance, understanding their differences and best practices can help data teams optimize their workloads effectively.
Understanding Disk Cache vs. Spark Cache
Databricks enables disk caching automatically on worker types with local SSDs for Parquet and Delta files stored in cloud object storage (e.g., S3, ADLS, GCS). This cache keeps copies of remote data files on the workers' local SSDs, significantly speeding up repeated reads. In contrast, Apache Spark caching stores DataFrames or RDDs in executor memory (or spills to disk, depending on the storage level), which avoids recomputation but consumes valuable RAM.
Key Differences:
| Feature | Disk Cache | Apache Spark Cache |
| --- | --- | --- |
| Storage Medium | Local SSDs on worker nodes | JVM memory (depends on storage level) |
| Applicable To | Any Parquet table in cloud storage | Any DataFrame or RDD |
| Trigger Mechanism | Automatic on first read (if enabled) | Manual (.cache() or .persist()) |
| Eviction Policy | LRU or on file modification | LRU or manual (unpersist()) |
| Performance Impact | Faster reads from SSDs, reduced remote I/O | Eliminates recomputation but uses RAM |
| Best Use Case | Large, frequently accessed Parquet tables | Smaller datasets needing frequent reuse |
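The trigger-mechanism difference is easy to see in code. Below is a minimal sketch, assuming a Databricks notebook context where the spark session already exists; the Delta paths are placeholders.
# Spark cache: explicit, held in executor memory (storage level is configurable)
from pyspark import StorageLevel

events_df = spark.read.format("delta").load("/path/events")  # placeholder path
events_df.persist(StorageLevel.MEMORY_AND_DISK)  # or .cache() for the default level
events_df.count()  # an action materializes the Spark cache

# Disk cache: no API call needed -- simply reading a Parquet/Delta table on a
# cache-enabled cluster copies the scanned files onto the workers' local SSDs
other_df = spark.read.format("delta").load("/path/other")
other_df.count()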
Monitoring Cache Hit Rates
To ensure the cache is being used effectively, monitoring is essential.
# Python snippet to observe cache usage and hit rates.
# Note: this goes through an internal Databricks JVM class rather than a public API,
# so availability and output format can vary between runtime versions.
stats = spark._jvm.com.databricks.sql.io.cache.CacheManager.cacheStats()
print(stats)
Additionally, you can gauge cache efficiency in the Spark UI, for example via the Storage tab and per-stage input metrics, which show how much data is read locally versus from remote storage.
Interpreting Cache Metrics
- Hit ratio: High values indicate efficient caching; low values suggest frequent remote reads.
- Eviction count: Frequent evictions may mean cache size is too small or the workload is too volatile.
- Compression impact: If high CPU usage is observed, compression settings should be reassessed.
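If the exact counters are hard to come by, a rough sanity check is to time the same scan twice on a cache-enabled cluster. The sketch below uses a placeholder Delta path and the notebook's spark session; the second read should be noticeably faster if the disk cache is doing its job.
# Time the same scan twice: the first read pulls data from cloud storage,
# the second should be served largely from the local SSD cache.
import time

def timed_scan(path):
    start = time.time()
    rows = spark.read.format("delta").load(path).count()
    return rows, time.time() - start

_, cold = timed_scan("/path/sales")  # placeholder path
_, warm = timed_scan("/path/sales")
print(f"cold read: {cold:.1f}s, warm read: {warm:.1f}s")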
Best Practices for Optimizing Performance with Disk Cache
1. Enable Disk Caching for Parquet-Based Workloads
Databricks enables disk caching by default on specific node types with SSDs. However, if you want to ensure it’s enabled, configure the following setting:
spark.conf.set("spark.databricks.io.cache.enabled", "true")
2. Optimize Cache Storage Allocation
To control how much disk space the cache consumes per node, specify the following parameters in the cluster's Spark configuration (these are typically set when the cluster is created):
spark.databricks.io.cache.maxDiskUsage 100g # Adjust per workload
spark.databricks.io.cache.maxMetaDataCache 2g
spark.databricks.io.cache.compression.enabled true # Saves space but may impact CPU
Tip: Compression reduces storage usage but may introduce additional CPU overhead. Test with different settings to find the optimal balance.
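To double-check what a running cluster is actually using, you can read these settings back at runtime. A small sketch; values fall back to "not set" when a key was never configured.
# Print the disk cache settings the current cluster is running with
for key in (
    "spark.databricks.io.cache.enabled",
    "spark.databricks.io.cache.maxDiskUsage",
    "spark.databricks.io.cache.maxMetaDataCache",
    "spark.databricks.io.cache.compression.enabled",
):
    print(key, "=", spark.conf.get(key, "not set"))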
3. Balancing Autoscaling with Caching
In an autoscaling environment, disk cache persistence can be challenging. When a worker is decommissioned, its cached data is lost, impacting query performance. To mitigate this:
- Use a core group of fixed nodes with caching enabled, while allowing ephemeral nodes for burst scaling.
- Increase spark.databricks.io.cache.maxDiskUsage to maintain more cached data per node.
- Use manual cache warming strategies, preloading frequently accessed data before scaling (see the sketch below).
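For the cache-warming point, one option is Databricks' CACHE SELECT statement, which preloads the selected data into the disk cache. A sketch, where the table name and 30-day filter are placeholders for your own hot data:
# Warm the disk cache with the hot slice of data before peak query load
spark.sql("""
    CACHE SELECT *
    FROM sales_gold
    WHERE sale_date >= date_sub(current_date(), 30)
""")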
Compression Trade-Offs in Disk Caching
Disk cache compression can improve storage efficiency but may introduce CPU overhead. Consider the following:
| Compression | Storage Savings | CPU Overhead |
| --- | --- | --- |
| Enabled | ~40% reduction | +15% CPU usage |
| Disabled | No reduction | No additional CPU load |
If CPU is a bottleneck, disabling compression may yield better query latencies.
Hybrid Caching: Layered Strategy for Hot vs. Warm Data
Frequently accessed hot data should reside in Spark memory (.cache()), while warm data can stay in the disk cache. This balances memory usage against read performance.
# Example: Persist only the critical DataFrame in memory
critical_df = spark.read.format("delta").load("/path/critical_data").cache()
critical_df.count()  # .cache() is lazy; an action materializes the in-memory copy

# Less critical data can rely on the automatic disk cache
secondary_df = spark.read.format("delta").load("/path/secondary_data")
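When the hot DataFrame is no longer needed, it is worth releasing the memory explicitly so other jobs are not squeezed; subsequent reads then fall back to the disk cache or cloud storage.
# Free executor memory once the hot data has served its purpose
critical_df.unpersist()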
File Layout and Partitioning for Optimal Caching
To maximize caching efficiency:
- Use Z-Ordering on frequently filtered columns to improve cache locality.
- Optimize file layouts to reduce small file fragmentation, which can degrade caching benefits.
- Regularly compact small Parquet files using Delta Lake’s OPTIMIZE command (see the sketch below).
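A sketch of the compaction step, using placeholder table and column names:
# Compact small files and co-locate data on a commonly filtered column,
# so cached blocks line up with typical query predicates
spark.sql("OPTIMIZE events ZORDER BY (event_date)")  # "events"/"event_date" are placeholders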
Troubleshooting Disk Caching Issues
1. Cache Not Being Used
- Verify that caching is enabled:
spark.conf.get("spark.databricks.io.cache.enabled")
- Ensure Parquet or Delta formats are being used (other formats do not benefit from disk caching); a quick check for both points follows below.
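A quick check covering both points, assuming a placeholder Delta table name (for Delta tables, DESCRIBE DETAIL reports the storage format):
# 1) Is disk caching switched on for this cluster?
print(spark.conf.get("spark.databricks.io.cache.enabled", "not set"))

# 2) Is the table actually Delta? ("my_table" is a placeholder)
print(spark.sql("DESCRIBE DETAIL my_table").select("format").first())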
2. Cache Evicting Too Frequently
- Increase spark.databricks.io.cache.maxDiskUsage to store more data locally.
- Use larger SSD instance types for more cache space.
3. Query Performance Not Improving
- Check partitioning strategy to avoid unnecessary cache misses.
- Use Spark UI to verify where data is being read from (remote vs. local cache).
Considerations for Streaming Workloads
Disk caching is not ideal for streaming workloads due to frequent data changes. For structured streaming:
- Prefer in-memory caching for reference data (see the sketch after this list).
- Use efficient partitioning strategies to avoid excessive cache invalidation.
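A minimal sketch of that pattern, assuming placeholder paths and a shared customer_id join key: the small dimension table is cached in memory, while the streaming source is read normally.
# Cache the small, slowly changing reference data in executor memory
dim_df = spark.read.format("delta").load("/path/dim_customers").cache()
dim_df.count()  # materialize the in-memory copy

# Stream-static join: the stream itself is not disk-cache friendly,
# but every micro-batch reuses the cached dimension table
stream_df = (
    spark.readStream.format("delta").load("/path/events_stream")
    .join(dim_df, "customer_id")
)
# ...then write out with stream_df.writeStream as usual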
Optimizing Your Databricks Performance: Next Steps
Caching is one of the most powerful ways to enhance query performance in Databricks, especially for cloud-based Parquet and Delta workloads. By choosing the right worker types, fine-tuning cache settings, and monitoring cache efficiency, you can significantly improve read speeds and reduce overall query execution times.
Need help optimizing your Databricks performance? B EYE’s Databricks experts can assist with caching strategies, cluster tuning, and end-to-end data architecture improvements. Explore our Databricks Consulting Services for more information.