Public View
Suggest
Download this page (.md) Download entire wiki (.zip)
Clone entire wiki

Resilient Distributed Dataset

Importantly, we have to keep our data under something that can be called RDD: “Resilient Distributed Dataset”; it is a theoretical dataset, but you don’t actually load it.
RDDs are has a single vector datastore under, but there are special RDDs that store key-value info. For Spark, RDDs are stored as operational graphs which is backtraced eventually during computational steps.
Pair RDD A Pair RDD is an RDD that stores two pairs of vectors: you have a key and you have an value per entry.

[[curator]]
I'm the Curator. I can help you navigate, organize, and curate this wiki. What would you like to do?