Dr. Luciano Nocera
Examples: mean, median, SD
Examples: hypothesis testing, regression analysis
Observed | ML | Stats |
---|---|---|
Observations | Samples | Cases |
Attribute | Feature | Independent variable |
Class | Label | Dependent variable |
Height depends on age
Time spent studying affects test score
Medication in persons with Parkinson's Disease affects the SD of the step length
Statistic | Nominal | Ordinal | Interval | Ratio |
---|---|---|---|---|
Frequency | Yes | Yes | Yes | Yes |
Median and percentile | No | Yes | Yes | Yes |
Mean, SD, SEM* | No | No | Yes | Yes |
Ratio, rate of variation | No | No | No | Yes |
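The statistics in the table above can be computed with Python's standard library. This is a minimal sketch on a small sample of ten ratio-scale values; the variable names are illustrative, and SEM is assumed to mean the sample SD divided by the square root of n.

```python
import math
import statistics

# Ten ratio-scale measurements (illustrative sample)
values = [73, 42, 67, 78, 99, 84, 91, 82, 86, 122]

mean = statistics.mean(values)          # arithmetic mean
median = statistics.median(values)      # middle value of the sorted data
sd = statistics.stdev(values)           # sample SD (n - 1 in the denominator)
sem = sd / math.sqrt(len(values))       # standard error of the mean (assumed definition)

print(f"mean={mean:.1f} median={median:.1f} SD={sd:.2f} SEM={sem:.2f}")
```

The median only needs the values to be ordered, so it also applies to ordinal data; the mean, SD, and SEM need at least an interval scale.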
Chol. (mg/dL) | No. | Rel. Freq. (%) | Cum. Freq. (%) |
---|---|---|---|
80-119 | 13 | 1.2 | 1.2 |
120-159 | 150 | 14.1 | 15.3 |
160-199 | 442 | 41.4 | 56.7 |
200-239 | 299 | 28.0 | 84.7 |
240-279 | 115 | 10.8 | 95.5 |
280-319 | 34 | 3.2 | 98.7 |
320-359 | 9 | 0.8 | 99.5 |
360-399 | 5 | 0.5 | 100.0 |
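The relative and cumulative frequency columns follow directly from the counts. A short sketch, assuming both are percentages rounded to one decimal (the names `counts` and `rows` are illustrative):

```python
# Counts per cholesterol interval, copied from the table above
counts = {
    "80-119": 13, "120-159": 150, "160-199": 442, "200-239": 299,
    "240-279": 115, "280-319": 34, "320-359": 9, "360-399": 5,
}

total = sum(counts.values())
running = 0
rows = []
for interval, n in counts.items():
    running += n                                  # cumulative count so far
    rel = round(100 * n / total, 1)               # relative frequency in percent
    cum = round(100 * running / total, 1)         # cumulative frequency in percent
    rows.append((interval, n, rel, cum))

for row in rows:
    print(*row, sep=" | ")
```

Accumulating the counts rather than the rounded percentages avoids rounding drift in the last column.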
73, 42, 67, 78, 99, 84, 91, 82, 86, 122
42, 67, 73, 78, 82, 84, 86, 91, 99, 122
4 | 2
5 |
6 | 7
7 | 38
8 | 246
9 | 19
10 |
11 |
12 | 2
4 | 2
6 | 738
8 | 24619
10 |
12 | 2
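A stem-and-leaf plot like the first one above (one stem per ten) can be built in a few lines. This is a sketch, not a library routine: `stem_and_leaf` is a hypothetical helper, and it assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Return 'stem | leaves' lines; the leaf is the remainder modulo stem_unit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // stem_unit].append(v % stem_unit)
    lines = []
    # Include empty stems so gaps in the data stay visible
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}".rstrip())
    return lines

data = [73, 42, 67, 78, 99, 84, 91, 82, 86, 122]
print("\n".join(stem_and_leaf(data)))
```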
73, 42, 67, 78, 99, 84, 91, 82, 86, 122
42, 67, 73, 78, 82, 84, 86, 91, 99, 122
range = max - min = 122 - 42 = 80
Bin size 20:
Interval | Frequency |
---|---|
40-60 | 1 |
60-80 | 3 |
80-100 | 5 |
100-120 | 0 |
120-140 | 1 |
Bin size 40:
Interval | Frequency |
---|---|
40-80 | 4 |
80-120 | 5 |
120-160 | 1 |
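The effect of the bin size on the frequency table can be reproduced with a small helper. This is a sketch under two assumptions: bins start at 40 and each interval is half-open, [low, high). The function name `bin_counts` is illustrative.

```python
def bin_counts(values, start, width):
    """Count values per half-open bin [low, low + width), starting at `start`."""
    counts = {}
    low = start
    while low <= max(values):
        counts[(low, low + width)] = 0          # create empty bins up to the maximum
        low += width
    for v in values:
        low = start + ((v - start) // width) * width
        counts[(low, low + width)] += 1
    return counts

data = [73, 42, 67, 78, 99, 84, 91, 82, 86, 122]
for width in (20, 40):
    print(f"bin size {width}:", bin_counts(data, start=40, width=width))
```

With the wider bins the empty interval disappears, which is why the choice of bin size changes the apparent shape of the distribution.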
Scree plot
Biplot [Gabriel 71]
# d1: Int. Derang. (DDWR) / Int. Derang. (eDDNR)
No Yes
188 112
Call:
randomForest(formula = target, data = df, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 11
OOB estimate of error rate: 3%
Confusion matrix:
No Yes class.error
No 187 1 0.005319149
Yes 8 104 0.071428571
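The class.error column and the OOB error rate can be recomputed from the confusion matrix. A minimal sketch, assuming rows are the actual classes and columns the predicted classes:

```python
# Confusion matrix from the output above: confusion[actual][predicted]
confusion = {
    "No": {"No": 187, "Yes": 1},
    "Yes": {"No": 8, "Yes": 104},
}

# Per-class error: share of each actual class that was misclassified
class_error = {}
for actual, row in confusion.items():
    total = sum(row.values())
    class_error[actual] = (total - row[actual]) / total

# Overall error: all off-diagonal counts divided by all cases
n_total = sum(sum(row.values()) for row in confusion.values())
n_wrong = sum(n for actual, row in confusion.items()
              for predicted, n in row.items() if predicted != actual)
overall_error = n_wrong / n_total

print(class_error, overall_error)
```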
Top 10 variables
No Yes
1 0.990 0.010
2 0.988 0.012
3 0.992 0.008
4 0.108 0.892
5 0.970 0.030
6 0.990 0.010
7 0.962 0.038
8 0.040 0.960
9 0.986 0.014
10 0.042 0.958
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Area under the curve: 0.9974
> df <- sample_n(mpg, 36)
> df$manufacturer <- factor(df$manufacturer)
> df
# A tibble: 36 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<fct> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 toyota camry 2.4 2008 4 auto(l5) f 21 31 r midsize
2 toyota camry solara 2.4 2008 4 manual(m5) f 21 31 r compact
3 dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
4 chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
5 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
6 jeep grand cherokee 4wd 4.7 1999 8 auto(l4) 4 14 17 r suv
7 hyundai tiburon 2 1999 4 manual(m5) f 19 29 r subcompact
8 dodge dakota pickup 4wd 3.9 1999 6 manual(m5) 4 14 17 r pickup
9 toyota camry solara 3 1999 6 auto(l4) f 18 26 r compact
10 ford expedition 2wd 4.6 1999 8 auto(l4) r 11 17 r suv
# ... with 26 more rows
> summary(df$manufacturer)
audi chevrolet dodge ford honda hyundai jeep land rover
3 2 5 5 2 2 2 1
nissan pontiac subaru toyota volkswagen
2 1 1 7 3
type | Granite | Limestone | Sandstone |
---|---|---|---|
Trad | 36 | 0 | 52 |
Sport | 76 | 8 | 41 |
Bouldering | 102 | 0 | 13 |
rock | type | count |
---|---|---|
Granite | Trad | 36 |
Granite | Sport | 76 |
Granite | Bouldering | 102 |
Limestone | Trad | 0 |
Limestone | Sport | 8 |
Limestone | Bouldering | 0 |
Sandstone | Trad | 52 |
Sandstone | Sport | 41 |
Sandstone | Bouldering | 13 |
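Going from the wide table to the long (rock, type, count) table is a simple reshaping step. The sketch below uses plain Python; the variable names are illustrative, and in practice a data-frame library's melt or pivot operation would typically do the same job.

```python
# Wide form: one row per climbing type, one column per rock
wide = {
    "Trad": {"Granite": 36, "Limestone": 0, "Sandstone": 52},
    "Sport": {"Granite": 76, "Limestone": 8, "Sandstone": 41},
    "Bouldering": {"Granite": 102, "Limestone": 0, "Sandstone": 13},
}
rocks = ["Granite", "Limestone", "Sandstone"]

# Long form: one row per (rock, type) pair with its count
long_rows = [
    (rock, climb_type, wide[climb_type][rock])
    for rock in rocks
    for climb_type in wide
]

for row in long_rows:
    print(*row, sep=" | ")
```

The long form has one observation per row, which is the layout that grammar-of-graphics tools expect when mapping columns to aesthetics.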
import matplotlib.pyplot as plt
import numpy as np
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2*np.pi*t)
plt.plot(t, s)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.title('About as simple as it gets, folks')
plt.grid(True)
plt.show()
import numpy as np
import seaborn as sns
x = 5 + np.arange(20) + np.random.randn(20)
y = 10 + np.arange(20) + 5 * np.random.randn(20)
sns.regplot(x=x, y=y)  # keyword arguments; recent seaborn versions require them
Acceleration Cylinders Displacement Horsepower Miles_per_Gallon Name Origin Weight_in_lbs Year
0 12.0 8 307.0 130.0 18.0 chevrolet chevelle malibu USA 3504 1970-01-01
1 11.5 8 350.0 165.0 15.0 buick skylark 320 USA 3693 1970-01-01
2 11.0 8 318.0 150.0 18.0 plymouth satellite USA 3436 1970-01-01
3 12.0 8 304.0 150.0 16.0 amc rebel sst USA 3433 1970-01-01
4 10.5 8 302.0 140.0 17.0 ford torino USA 3449 1970-01-01
...
import seaborn as sns
from vega_datasets import data
cars = data.cars()
sns.scatterplot(
    x='Horsepower',
    y='Miles_per_Gallon',
    hue='Origin',
    data=cars);
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
...
#ggplot(Data, Mapping) + Geom
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
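from plotnine import ggplot, aes, geom_point, stat_smooth, facet_wrap
from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + stat_smooth(method='lm')
 + facet_wrap('~gear'))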
import altair as alt
# load a simple dataset as a pandas DataFrame
from vega_datasets import data
cars = data.cars()
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
).interactive()
Graphic defined by a grammar of components
Component | Description |
---|---|
Defaults (Data, Mapping) | A default dataset and set of mappings from variables to aesthetics |
Layer (Data, Mapping, Geom, Stat, Position) | One or more layers, each composed of a geometric object, a statistical transformation, a position adjustment, and optionally, a dataset and aesthetic mappings |
Coord | A coordinate system |
Facet | The facetting specification |
ggplot(data=mpg, aes(x=hwy, y=cty)) + geom_point() #Defaults
ggplot(mpg, aes(hwy, cty)) + geom_point() #positional args
ggplot(mpg) + geom_point(aes(hwy, cty)) #Mapping in layer
# Same using a variable
p <- ggplot(mpg, aes(hwy, cty)) #set Defaults
p + geom_point() #add Layer with Geom
# mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
aes(x = mpg, y = wt)
#> Aesthetic mapping:
#> * `x` -> `mpg`
#> * `y` -> `wt`
# You can also map aesthetics to functions of variables
aes(x = mpg ^ 2, y = wt / cyl)
#> Aesthetic mapping:
#> * `x` -> `mpg^2`
#> * `y` -> `wt/cyl`
# Or to constants
aes(x = 1, colour = "smooth")
#> Aesthetic mapping:
#> * `x` -> 1
#> * `colour` -> "smooth"
ggplot(mpg, aes(x=hwy, y=cty, color=manufacturer, size=displ)) + geom_point() #x, y named
ggplot(mpg, aes(hwy, cty, color=manufacturer, size=displ)) + geom_point() #color
ggplot(mpg, aes(hwy, cty), color=manufacturer, size=displ) + geom_point() #bad: color and size must be inside aes()
ggplot(mpg, aes(hwy, cty, col=manufacturer, size=displ)) + geom_point() #col (alias of color)
ggplot(mpg, aes(hwy, cty, colour=manufacturer, size=displ)) + geom_point() #colour (alias of color)
ggplot(mpg, aes(hwy, cty)) + geom_point(aes(color=manufacturer, size=displ))
ggplot(mpg, aes(hwy, cty)) + geom_point(color=manufacturer, size=displ) #bad! variables must be mapped inside aes()
> ggplot(mpg, aes(hwy, cty)) + #Defaults
geom_point() + #add Geom point Layer
geom_smooth() #add Geom smooth Layer (smoothed fit; loess by default for a dataset of this size)
All understand x, y, color and size aesthetics.
Filled geoms also understand fill.
Scatterplot | geom_point() |
Text | geom_text() |
Bar chart | geom_bar() |
Line chart | geom_line() |
Area chart | geom_area() |
Dot plot | geom_dotplot() |
Histogram | geom_histogram() |
Frequency polygon | geom_freqpoly() |
Box plot | geom_boxplot() |
Violin plot | geom_violin() |
Tilde operator (~): separates the left- and right-hand sides of a formula
# Multiple linear regression
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
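To make the formula idea concrete, here is a sketch of the simplest case, y ~ x, fitted by ordinary least squares in plain Python. The left-hand side is the dependent variable and the right-hand side the independent variable. The data are made up for illustration (they lie exactly on y = 2x + 1), and `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(x, y):
    """Least-squares fit of y ~ x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = fit_line(x, y)
print(f"y ~ x: intercept={intercept:.2f}, slope={slope:.2f}")
```

A multiple regression such as y ~ x1 + x2 + x3 generalizes this to several predictors and is normally solved with a linear algebra routine rather than by hand.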
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
[Table: New notation vs. Old formula interface*]
p <- ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()
p + theme_bw() + ggtitle("theme_bw")
p + theme_minimal() + ggtitle("theme_minimal")
library(ggthemes) #extra themes
p + theme_tufte() + ggtitle("theme_tufte")
theme_set(theme_bw()) #sets the theme for all subsequent ggplot plots
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species, size=Petal.Length)) + geom_point()
geom_point(shape=1) to draw circle outline
x ↔ Column
y ↔ Rows