Caching in PySpark: Techniques and Best Practices

  1. In-memory caching: data is stored in the memory of the executor nodes in the cluster. This gives the fastest access, but memory is expensive and may not be sufficient for very large datasets.
  2. Disk-based caching: data is stored on disk, either on each node's local disk or on shared storage accessible to all nodes. This balances performance and cost, though it is slower than in-memory caching.
  3. Recomputation-based caching: instead of storing the data at all, results are recomputed from their lineage whenever they are needed. This suits results that are expensive to hold in storage but cheap to rebuild, such as simple aggregations or transformations. All three strategies map onto PySpark storage levels, as sketched just below.
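Here is a minimal sketch of how these three strategies correspond to PySpark storage levels. The session setup and the df DataFrame are purely illustrative:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Hypothetical session and DataFrame, just for demonstration
spark = SparkSession.builder.appName("caching-strategies").getOrCreate()
df = spark.range(1_000_000)

# 1. In-memory caching: keep deserialized partitions in executor memory
df.persist(StorageLevel.MEMORY_ONLY)
df.unpersist()  # release before switching to a different storage level

# 2. Disk-based caching: write partitions to each executor's local disk
df.persist(StorageLevel.DISK_ONLY)
df.unpersist()

# 3. Recomputation: persist nothing; Spark rebuilds the result from the
#    lineage graph every time an action runs
df.count()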
In day-to-day PySpark code, the two entry points are cache(), which uses the default storage level, and persist(), which accepts an explicit StorageLevel:

from pyspark import StorageLevel

# Cache with the default storage level (memory first, spilling to disk)
df.cache()

# Or choose an explicit storage level, such as disk only
df.persist(StorageLevel.DISK_ONLY)
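Once the cached data is no longer needed, df.unpersist() frees the memory or disk space it occupies; otherwise the data is held until the application ends or Spark evicts it under memory pressure.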
The same pattern applies to RDDs. Suppose we want to count transactions per day from a CSV file and reuse the result in several downstream computations (assuming sc is an existing SparkContext):

# Load the transactions dataset into an RDD
transactions = sc.textFile("transactions.csv")

# Map each transaction to a tuple containing the date and the number 1
transactions_by_date = transactions.map(lambda x: (x.split(",")[0], 1))

# Reduce the mapped transactions by date to compute the total number of transactions per day
transactions_per_day = transactions_by_date.reduceByKey(lambda x, y: x + y)

# Cache the results of the computation
transactions_per_day.cache()
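Keep in mind that cache() is lazy: nothing is materialized until an action runs. A sketch of how the cached RDD might then be reused (the follow-up queries are illustrative):

# The first action computes transactions_per_day and populates the cache
print(transactions_per_day.count())

# Subsequent actions reuse the cached partitions instead of re-reading
# and re-aggregating the CSV file
busiest_days = transactions_per_day.sortBy(lambda kv: kv[1], ascending=False).take(5)
print(busiest_days)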
