Exercises: from Python to R

The Databricks blog (Databricks is the company founded by the creators of Spark) has several interesting articles for learning how Spark works. These articles are usually written in Python or Scala, but rarely in R (for now!). We will follow the article “Statistical and Mathematical Functions with DataFrames in Spark” and port its examples to SparkR, as far as that is possible.

Link: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

We start the Spark session and convert the examples one by one:

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R/lib/"),.libPaths()))
library(SparkR)
library(magrittr)
sc <- sparkR.init(master = "local[*]",appName = "Prueba V")

## Launching java with spark-submit command ./spark-1.5.1-bin-hadoop2.6//bin/spark-submit   sparkr-shell /tmp/Rtmp8iJL3c/backend_port271017fff889

sqlContext <- sc %>% sparkRSQL.init()

Statistical and Mathematical Functions with DataFrames in Spark

1. Random Data Generation

The first function the article uses is range, but this function is not yet available in SparkR (version 1.5.1 is used here).
A workaround is to create a local data.frame first and turn it into a Spark DataFrame:

df <- sqlContext %>% createDataFrame(data.frame(id=0:9))
        
df %>% collect()

##    id
## 1   0
## 2   1
## 3   2
## 4   3
## 5   4
## 6   5
## 7   6
## 8   7
## 9   8
## 10  9

df <- df %>% 
      mutate(
        uniform = rand(10),
        normal  = randn(27)
      )

df %>% head

##   id   uniform     normal
## 1  0 0.7224978 -0.1875349
## 2  1 0.3312021 -0.8692578
## 3  2 0.2438486 -2.3723400
## 4  3 0.4875104 -1.2455888
## 5  4 0.6684544 -0.6032161
## 6  5 0.2437810 -0.5759153

2. Summary and Descriptive Statistics

df %>% describe() %>% collect()

##   summary                 id             uniform              normal
## 1   count                 10                  10                  10
## 2    mean                4.5 0.46856673400291776 -0.5036050224529982
## 3  stddev 2.8722813232690143  0.2358201303075058  0.8648117073061448
## 4     min                  0  0.2009218252543502  -2.372340011831022
## 5     max                  9  0.9528301401955117   0.600053806707523

df %>% describe('uniform', 'normal') %>% collect()

##   summary             uniform              normal
## 1   count                  10                  10
## 2    mean 0.46856673400291776 -0.5036050224529982
## 3  stddev  0.2358201303075058  0.8648117073061448
## 4     min  0.2009218252543502  -2.372340011831022
## 5     max  0.9528301401955117   0.600053806707523

df %>% select(mean(.$uniform), min(.$uniform), max(.$uniform)) %>% collect()

##   avg(uniform) min(uniform) max(uniform)
## 1    0.4685667    0.2009218    0.9528301

3. Sample covariance and correlation

df %>% select("id") %>% 
          withColumn('rand1', rand(seed=10)) %>% 
          withColumn('rand2', rand(seed=27)) %>% 
          head

##   id     rand1      rand2
## 1  0 0.7224978 0.41346995
## 2  1 0.3312021 0.09891769
## 3  2 0.2438486 0.42644756
## 4  3 0.4875104 0.46138155
## 5  4 0.6684544 0.19607867
## 6  5 0.2437810 0.20983378

The cov and corr functions are not yet available in SparkR; they are planned for version 1.6: https://issues.apache.org/jira/browse/SPARK-10752.
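Until those functions reach SparkR, one possible workaround for small data is to collect() the DataFrame and use base R's stats::cov and stats::cor on the resulting data.frame. The sketch below assumes the data fits in the driver's memory. It uses a locally generated data.frame (local_df, a hypothetical stand-in for the collected result) so that it runs without a Spark session.

```r
# Stand-in for `df %>% collect()`: in a real session this data.frame would
# come from Spark. The values here are generated locally for illustration.
set.seed(10)
local_df <- data.frame(
  id    = 0:9,
  rand1 = runif(10),
  rand2 = runif(10)
)

# Sample covariance and Pearson correlation between two columns.
cov_r1_r2 <- stats::cov(local_df$rand1, local_df$rand2)
cor_r1_r2 <- stats::cor(local_df$rand1, local_df$rand2)

# A column is perfectly correlated with itself.
cor_id_id <- stats::cor(local_df$id, local_df$id)

print(c(cov = cov_r1_r2, cor = cor_r1_r2, cor_id = cor_id_id))
```

The stats:: prefix is used on purpose: it keeps the calls pointing at base R even if a later SparkR version exports functions with the same names.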

4. Cross Tabulation (Contingency Table)

names <- c("Alice", "Bob", "Mike")
items <- c("milk", "bread", "butter", "apples", "oranges")

df <- sqlContext %>% 
        createDataFrame(
          data.frame(name = names[rep_len(1:3, 100)] ,
                     item = items[rep_len(1:5, 100)]
                     ))

df %>% head(10)

##     name    item
## 1  Alice    milk
## 2    Bob   bread
## 3   Mike  butter
## 4  Alice  apples
## 5    Bob oranges
## 6   Mike    milk
## 7  Alice   bread
## 8    Bob  butter
## 9   Mike  apples
## 10 Alice oranges

df %>% crosstab('name', 'item') %>% head

##   name_item apples oranges butter milk bread
## 1       Bob      6       7      7    6     7
## 2      Mike      7       6      7    7     6
## 3     Alice      7       7      6    7     7
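As a sanity check, the same contingency table can be computed locally with base R's table() on the data that was sent to Spark. This is only a sketch: it rebuilds the 100 rows under the illustrative names people, products and local_df (none of which appear in the original article), and it assumes the data is small enough to live in the R session.

```r
# Rebuild the same 100 rows locally (the names `people`, `products` and
# `local_df` are illustrative only).
people   <- c("Alice", "Bob", "Mike")
products <- c("milk", "bread", "butter", "apples", "oranges")

local_df <- data.frame(
  name = people[rep_len(1:3, 100)],
  item = products[rep_len(1:5, 100)],
  stringsAsFactors = FALSE
)

# Base R contingency table: rows are names, columns are items.
local_tab <- table(local_df$name, local_df$item)
print(local_tab)
```

The counts should agree with the crosstab output above, although table() orders rows and columns alphabetically, so the layout may differ.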

5. Frequent Items

Again, the required function, freqItems, is not available yet; it is expected in version 1.6: https://issues.apache.org/jira/browse/SPARK-10905
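For data that fits in memory, an exact local substitute can be sketched with base R's table(). The helper frequent_values below is hypothetical (it is not part of SparkR); it returns the values whose relative frequency reaches a given support threshold. Spark's freqItems is an approximate algorithm designed for large data, so this is a different technique that only plays a similar role on small data.

```r
# Hypothetical helper (not a SparkR function): exact frequent values of a
# vector, i.e. values whose relative frequency is at least `support`.
frequent_values <- function(x, support = 0.25) {
  freq <- table(x) / length(x)
  names(freq)[as.vector(freq) >= support]
}

# Same name/item pattern as in the cross-tabulation example.
name_col <- c("Alice", "Bob", "Mike")[rep_len(1:3, 100)]
item_col <- c("milk", "bread", "butter", "apples", "oranges")[rep_len(1:5, 100)]

# Each name covers about 1/3 of the rows, each item exactly 1/5.
frequent_names <- frequent_values(name_col, support = 0.25)
frequent_items <- frequent_values(item_col, support = 0.25)

print(frequent_names)
print(frequent_items)
```

With a support of 0.25, every name qualifies while no item does, because each item appears in only 20% of the rows.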

6. Mathematical Functions

df <- sqlContext %>% 
        createDataFrame(data.frame(id=0:9)) %>% 
        withColumn('uniform', rand(seed=10) * 3.14)

df %>% select('uniform') %>% 
  mutate(
    toDegrees(.$uniform),
    (cos(.$uniform)^2 + sin(.$uniform)^2) %>% alias("cos^2 + sin^2")
  ) %>% head

##     uniform DEGREES(uniform) cos^2 + sin^2
## 1 2.2686431        129.98367             1
## 2 1.0399746         59.58616             1
## 3 0.7656847         43.87050             1
## 4 1.5307826         87.70738             1
## 5 2.0989468        120.26079             1
## 6 0.7654724         43.85834             1

We stop Spark:

sparkR.stop()

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

Workshop: SparkR (R on Spark) V

VII JORNADAS DE USUARIOS DE R

Salamanca, 5 November 2015
Jorge Ayuso Rejas
