Introduction

Hi everyone, this page is to introduce the package AnalysisLin, which is my personal package for exploratory data analysis. It includes several useful functions designed to assist with exploratory data analysis (EDA). These functions are based on my learnings throughout my academic years, and I personally use them for EDA.

Table below summarize the functions that would be going over in this page

df <- data.frame(
  Descriptive_Statistics = c("desc_stat()","","","","",""),
  Data_Visualization = c("hist_plot()","dens_plot()", "bar_plot()","pie_plot()","qq_plot()","missing_value_plot()"),
  Correlation_Analysis = c("corr_matrix()", "corr_cluster()","","","",""),
  Feature_Engineering = c("missing_impute()", "pca()","","","","")
)
kable(df)

Descriptive_Statistics	Data_Visualization	Correlation_Analysis	Feature_Engineering
desc_stat()	hist_plot()	corr_matrix()	missing_impute()
	dens_plot()	corr_cluster()	pca()
	bar_plot()
	pie_plot()
	qq_plot()
	missing_value_plot()

Some famous and very useful pre-installed datasets, such as iris, mtcars, and airquality, would be used to demonstrate what does each function in the package do. If you have not installed the package, please do the following:

install.packages(“AnalysisLin”)

data("iris")
data("mtcars")
data("Titanic")
data("airquality")

Exploratory Data Analysis, in simple words, is the process to get to know your data.

Descriptive Statistic

First function in package is desc_stat. This function computes numerous useful statistical metrics so that you gain a profound understanding of your data

Count: Number of values in a variable.
Unique: Number of values that are unique in a variable.
Duplicate: Number of rows that are duplicate in a dataset.
Null: Number of values that are missing in a variable.
Null Rate: Percentage of values that are missing in a variable.
Type: Type of variable (e.g., numeric, character, factor).
Min: Smallest value.
P25: Median of the first half.
Mean: Mean value.
Median: Median value.
P75: Median of the second half.
Max: Largest value.
SD: Standard deviation.
Kurtosis: A measure of the tailedness of a distribution.
Skewness: A measure of the asymmetry of a distribution.
Shapiro-Wilk Test: Checks if a sample follows a normal distribution by comparing its statistics to expected values under the assumption of normality.
Kolmogorov-Smirnov Test:Checks if a sample follows a normal distribution by comparing its cumulative distribution function to the expected normal distribution.
Anderson-Darling Test:Assesses normality by emphasizing tail behavior, determining if a sample conforms to a specified distribution.
Lilliefors Test:A variant of Kolmogorov-Smirnov, is tailored for small sample sizes, testing whether data is normally distributed.
Jarque-Bera Test P-value: Checks whether data have the skewness and kurtosis matching a normal distribution.

desc_stat(mtcars)

desc_stat(iris)

desc_stat(airquality)

These metrics provide valuable insights into the dataset in a deep level. If you don’t want any of these metrics to be computed, you can set them to FALSE. This way, the unwanted metrics won’t appear in the output.

Furthermore, desc_stat() can also compute Kurtosis, Skewness, Shapiro-Wilk Test, Anderson-Darling Test, Lilliefors Test,Jarque-Bera Test

desc_stat(mtcars,max = F, min=F, sd=F,kurtosis = T,skewness = T,shapiro = T,anderson = T,lilliefors = T, jarque = T)

Data Visualization

Numeric Plot

To visualize histogram for all numerical variables

hist_plot(iris,subplot=F)

To visualize desnity for all numerical variables in two rows of subplots

dens_plot(iris,subplot=T,nrow=2)

A Quantile-Quantile (QQ) plot is a graphical tool used to assess whether a dataset follows a normal distribution. It compares the quantiles of the observed data to the quantiles of the expected distribution.

if you want to check the normality for numerical variables by drawing QQ plot.

qq_plot(iris,subplot = T)

Categorical Plot

To visualize bar charts for all categorical variables

bar_plot(iris)

if you want pie chart:

pie_plot(iris)

Correlation Analysis

Correlation Matrix

To visualize correlation table for all variables.

corr_matrix(mtcars)

if you want to visualize correlation map along with correlation table:

corr_matrix(mtcars,corr_plot=T)

you may also choose type of correlation:Pearson correlation and Spearman correlation.

corr_matrix(mtcars,type='pearson')
corr_matrix(mtcars,type='spearman')

Correlation Clustering

Correlation clustering partitioning data points into groups based on their similarity(correlation)

corr_cluster(mtcars,type='pearson')

corr_cluster(mtcars, type='spearman')

Feature_Engineering

Missing Value Plot

To visualize the percentage of missing values in each variable.

missing_values_plot(airquality)

Missing Value Imputation

Imputing mssing value is a way to get more information from a data with missing values. However, one need to carefully choose what method to use to impute missing values in order to reach most accuracy.

mean: use mean value to replace missing value.

impute_missing(airquality,method='mean')

mode: use most frequency value to replace missing value.
median: use median value to replace missing value.
locf: use last observation value to replace missing value.
knn: use k-nearest nerighbor to replace missing value, k needs to be chosen.

impute_missing(airquality,method='mode')
impute_missing(airquality,method='median')
impute_missing(airquality,method='locf')
impute_missing(airquality,method='knn',k=5)

Principle Component Analysis

Principle Component Analysis can help you to reduce the number of variables in a dataset. To perform and visualize PCA on some selected variables

pca(mtcars,variance_threshold = 0.9,scale=T)

to visualize the scree plot and biplot

pca(mtcars,variance_threshold = 0.9,scale=TRUE,scree_plot=TRUE,biplot=TRUE)

AnalysisLin-vignette