KolmogorovSmirnovTest
- class pyspark.ml.stat.KolmogorovSmirnovTest
- Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution.
- By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can test the null hypothesis that the sample data comes from that theoretical distribution.
- New in version 2.4.0.
- Methods
- test(dataset, sampleCol, distName, *params): Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.
- Methods Documentation
- static test(dataset, sampleCol, distName, *params)
- Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution, taking as parameters the mean and standard deviation.
- New in version 2.4.0.
- Parameters
- dataset : pyspark.sql.DataFrame
- A Dataset or a DataFrame containing the sample of data to test.
- sampleCol : str
- Name of the sample column in dataset, of any numerical type.
- distName : str
- Name of the theoretical distribution; currently only “norm” is supported.
- params : float
- Float values specifying the parameters of the theoretical distribution. For the “norm” distribution, the parameters are the mean and the standard deviation, in that order.
 
- Returns
- A DataFrame that contains the Kolmogorov-Smirnov test result for the input sampled data.
- This DataFrame will contain a single Row with the following fields:
 - pValue: Double
 - statistic: Double
 - Examples
 >>> from pyspark.ml.stat import KolmogorovSmirnovTest
 >>> dataset = [[-1.0], [0.0], [1.0]]
 >>> dataset = spark.createDataFrame(dataset, ['sample'])
 >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).first()
 >>> round(ksResult.pValue, 3)
 1.0
 >>> round(ksResult.statistic, 3)
 0.175
 >>> dataset = [[2.0], [3.0], [4.0]]
 >>> dataset = spark.createDataFrame(dataset, ['sample'])
 >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).first()
 >>> round(ksResult.pValue, 3)
 1.0
 >>> round(ksResult.statistic, 3)
 0.175
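To make the statistic field concrete, the following is a minimal plain-Python sketch of the quantity the class description refers to: the largest difference between the empirical cumulative distribution of the sample and the normal CDF. It is an illustration only, not Spark's implementation. The helper names normal_cdf and ks_statistic_norm are hypothetical and not part of PySpark, and the sketch does not compute the p-value.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic_norm(sample, mean=0.0, std=1.0):
    """One-sample KS statistic against a normal distribution (hypothetical helper)."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, std)
        # The empirical CDF jumps from i/n to (i + 1)/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Same samples and parameters as the doctest examples above.
print(round(ks_statistic_norm([-1.0, 0.0, 1.0], 0.0, 1.0), 3))  # 0.175
print(round(ks_statistic_norm([2.0, 3.0, 4.0], 3.0, 1.0), 3))   # 0.175
```

Both calls reproduce the 0.175 statistic shown in the examples, since the second sample is the first one shifted by the same amount as the mean.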