The Databricks blog (Databricks being the creators of Spark) has several interesting articles for learning how Spark works. Their examples are usually written in Python or Scala and only rarely in R (for now!). Here we follow the article “Statistical and Mathematical Functions with DataFrames in Spark” and port its examples to SparkR, as far as that is possible.

Link: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

We start the Spark session and convert the examples one by one:

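library(SparkR)    # assumes the SparkR package is on the R library path
library(magrittr)  # provides the %>% pipe used throughout

sc <- sparkR.init(master = "local[*]", appName = "Prueba V")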
## Launching java with spark-submit command ./spark-1.5.1-bin-hadoop2.6//bin/spark-submit   sparkr-shell /tmp/Rtmp8iJL3c/backend_port271017fff889
sqlContext <- sc %>% sparkRSQL.init()

1. Random Data Generation

The first function the article uses is range, but this function does not exist in SparkR for now…
It can be worked around by creating a local data.frame first and converting it to a Spark DataFrame:

df <- sqlContext %>% createDataFrame(data.frame(id=0:9))
df %>% collect()
##    id
## 1   0
## 2   1
## 3   2
## 4   3
## 5   4
## 6   5
## 7   6
## 8   7
## 9   8
## 10  9
# Add a uniform and a normal random column, each with a fixed seed
df <- df %>% 
      mutate(
        uniform = rand(10),
        normal  = randn(27)
      )

df %>% head
##   id   uniform     normal
## 1  0 0.7224978 -0.1875349
## 2  1 0.3312021 -0.8692578
## 3  2 0.2438486 -2.3723400
## 4  3 0.4875104 -1.2455888
## 5  4 0.6684544 -0.6032161
## 6  5 0.2437810 -0.5759153

2. Summary and Descriptive Statistics

df %>% describe() %>% collect()
##   summary                 id             uniform              normal
## 1   count                 10                  10                  10
## 2    mean                4.5 0.46856673400291776 -0.5036050224529982
## 3  stddev 2.8722813232690143  0.2358201303075058  0.8648117073061448
## 4     min                  0  0.2009218252543502  -2.372340011831022
## 5     max                  9  0.9528301401955117   0.600053806707523
df %>% describe('uniform', 'normal') %>% collect()
##   summary             uniform              normal
## 1   count                  10                  10
## 2    mean 0.46856673400291776 -0.5036050224529982
## 3  stddev  0.2358201303075058  0.8648117073061448
## 4     min  0.2009218252543502  -2.372340011831022
## 5     max  0.9528301401955117   0.600053806707523
df %>% select(mean(.$uniform), min(.$uniform), max(.$uniform)) %>% collect()
##   avg(uniform) min(uniform) max(uniform)
## 1    0.4685667    0.2009218    0.9528301

3. Sample covariance and correlation

df %>% select("id") %>% 
          withColumn('rand1', rand(seed=10)) %>% 
          withColumn('rand2', rand(seed=27)) %>% 
          head
##   id     rand1      rand2
## 1  0 0.7224978 0.41346995
## 2  1 0.3312021 0.09891769
## 3  2 0.2438486 0.42644756
## 4  3 0.4875104 0.46138155
## 5  4 0.6684544 0.19607867
## 6  5 0.2437810 0.20983378

The cov and corr functions are not yet available from SparkR; they are planned for version 1.6: https://issues.apache.org/jira/browse/SPARK-10752.
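In the meantime, one possible workaround for small data is to collect the two columns to the driver and use base R's cov and cor. The sketch below assumes the data fits in local memory. It uses a hypothetical local data.frame, local_df, standing in for the result of collect(); its values are generated locally with runif and will not match the Spark output above.

```r
# Hypothetical stand-in for:
#   local_df <- df %>% select("rand1", "rand2") %>% collect()
set.seed(10)
local_df <- data.frame(rand1 = runif(10), rand2 = runif(10))

# Sample covariance and Pearson correlation, computed locally in base R
sample_cov  <- cov(local_df$rand1, local_df$rand2)
sample_corr <- cor(local_df$rand1, local_df$rand2)

# Sanity check: a column is perfectly correlated with itself
self_corr <- cor(local_df$rand1, local_df$rand1)
```

This only makes sense when the collected columns are small; for large DataFrames the computation should stay in Spark once the functions become available.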

4. Cross Tabulation (Contingency Table)

names = c("Alice", "Bob", "Mike")
items = c("milk", "bread", "butter", "apples", "oranges")

# Build 100 (name, item) pairs locally, then convert to a Spark DataFrame
df <- sqlContext %>% 
        createDataFrame(
          data.frame(name = names[rep_len(1:3, 100)],
                     item = items[rep_len(1:5, 100)])
        )

df %>% head(10)
##     name    item
## 1  Alice    milk
## 2    Bob   bread
## 3   Mike  butter
## 4  Alice  apples
## 5    Bob oranges
## 6   Mike    milk
## 7  Alice   bread
## 8    Bob  butter
## 9   Mike  apples
## 10 Alice oranges
df %>% crosstab('name', 'item') %>% head
##   name_item apples oranges butter milk bread
## 1       Bob      6       7      7    6     7
## 2      Mike      7       6      7    7     6
## 3     Alice      7       7      6    7     7

5. Frequent Items

Again, the required function freqItems is not available yet and is expected in version 1.6: https://issues.apache.org/jira/browse/SPARK-10905
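For data small enough to collect, a rough local substitute is to count values with base R's table and keep those above a support threshold. Note that this is an exact count, not the approximate algorithm behind freqItems. The names frequent_values and min_support are hypothetical, introduced here only for illustration.

```r
# Hypothetical helper: values whose relative frequency is at least min_support
frequent_values <- function(x, min_support = 0.4) {
  shares <- table(x) / length(x)
  names(shares)[as.vector(shares) >= min_support]
}

# Illustrative local vector standing in for a collected column
a <- c(1, 1, 1, 3, 1, 5, 1, 7, 1, 9)
frequent_a <- frequent_values(a, min_support = 0.4)
```

The helper returns the frequent values as character strings, because they are taken from the names of the frequency table.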

6. Mathematical Functions

df <- sqlContext %>% 
        createDataFrame(data.frame(id=0:9)) %>% 
        withColumn('uniform', rand(seed=10) * 3.14)

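df %>% select('uniform') %>% 
  select(
    .$uniform,
    # degrees column reconstructed from the output header below; toDegrees() is assumed
    toDegrees(.$uniform),
    (cos(df[['uniform']]) ** 2 + sin(.$uniform) ** 2) %>% alias("cos^2 + sin^2")
  ) %>% head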
##     uniform DEGREES(uniform) cos^2 + sin^2
## 1 2.2686431        129.98367             1
## 2 1.0399746         59.58616             1
## 3 0.7656847         43.87050             1
## 4 1.5307826         87.70738             1
## 5 2.0989468        120.26079             1
## 6 0.7654724         43.85834             1

We shut down Spark:

sparkR.stop()
