Description
1 Introduction
The analysis regards the AIDA dataset. It presents itself with 80 features concerning 1894412
companies that compose the data frame. Most of these features are about economical and
financial indicators of the companies during the last three years of activity. Some other features
are present and regard information such as the geographical headquarter, opening year, last
accounting closing year and legal form. Furthermore, the feature Legal status shows the financial
situation of the firm at its last accounting closing date and it is the one used in order to assign
a target value Failed equal to 1 or 0, indicating respectively failure or not. Indeed, the goals of
this study are both the analysis of the characteristics of failing companies and business failure
prediction. In order to perform such an analysis, the report presents a distinct section for every
Task from A to E.
2 Overview and Methods
The first three Tasks require comparing the distributions and conditional probability of failure
of some features such as Age or Size and detecting any change depending on the target value
Failed, company form, ATECO industry sector, Last accounting closing year of the company
and location (Area). The last two Tasks ask to develop a classification model to predict a
company’s probability of failure and analyze the results.
2.1 Code
The project was developed by using R programming language and R Studio IDE. Any required
library such as DescTools, dplyr, ggplot2 and caret is indicated in the code scripts. The code
scripts address Tasks A, B, C, D and E. The first three Tasks’ scripts are saved as ”Q% data”,
where the % character changes based on the Task. For Task D there are three distinct scripts:
QD preprocessing, QD feature selection and QD E classification, where the latter addresses
both Tasks D and E.
12.2 Report structure
The current report has a distinct section for every Task from A to E. Concerning the Prepro
cessing procedures, a preliminary one was performed for Tasks A to C, while a more in-depth
one was necessary for the Parametric and Machine Learning Models construction. Therefore,
section 3 shows the preprocessing for the first three Tasks, which is the basis of Tasks D and
E’s Preprocessing shown in section 7.1.
3 Preliminary Preprocessing
Initially, it was necessary to find the definition of failure of a company. In the feature Legal
status there are different values. The final choice was to set as Active only the companies
already reporting the value ”Active” in Legal status, while the ones reporting any declination
of ”Dissolved”, ”Bankruptcy” or ”Liquidation” values were set to Failed.
The few records containing the values ”Active (default of payments)” and ”Active (receiver
ship)” were deleted; this was done in order not to have any value coming from a failure-facing
company among those considered active.
The records containing the values ”Dissolved merger” or ”Dissolved demerger” were also
deleted, as a merge or demerge of a company does not provide sufficient information about the
failure.
Therefore, the final dataset presents 65% of active companies and 35% of failed, respectively
detected by the value 0 in the new column Failed and 1.
Afterwards, two new features were added: Age and Size.
- Age
It was created by subtracting the values of Last accounting closing year and Incorporation
year, that is between the year of publication of the financial statements and the year of
the beginning of the company’s activity. Since the total number of records was more than
sufficient, all those presenting NA on at least one of the two features were deleted, as well
as those presenting a negative value for Age.
- Size
This feature was created by calculating the average number of employees in the three
available years.
Considering the high number of companies with 0 or 1 employees, the final division was set
in table 1
Size
Single
Micro
Small
Medium
Large
Extra Large
- of Employees
0
[1,3]
[3,6]
[6,10]
[10,25]
[25 +]
Table 1: Relation between Size and Employees
Subsequently, many categorical attributes presenting many distinct values were modified,
so as to reduce their quantity.
2The attribute concerning the geographical region of the company’s seat was modified by
dividing the country (Italy) into four regions: ”Sud”, ”Centro”, ”Nord-Ovest” and ”Nord-Est”.
The attribute Legal Form presented some values with very few records; these were joined un
der the value ”Others”. In particular, the following values were joined: ”Association”,”Foreign
company”,”Foundation””Mutual aid society”, ”Public agency”, ”S.A.P.A.”, ”S.N.C.”, ”S.A.S.”.
Furthermore, the following legal forms were united under the value ”S.C.A.R.L.” :”S.C.A.R.I.”,
”S.C.A.R.L.P.A.”, ”Social cooperative company”.
Finally, the feature Ateco 2007 code presented numerical values with six digits, with the
first two indicating the macrocategory of the classification of the reference economical asset.
Such values were therefore substituted with a letter indicating the macrocategory. In this case
too, the categories presented too few records and were joined under the value ”Others”.
4 Task A
In order to perform this task, it was necessary to initialy choose a year for the analysis. By
comparing the distribution of the active and failed companies throughout the various years,
it is possible to notice how, before 2005, only records of failed companies are available. From
2005 on, the records presenting value 0 on the feature Failed start increasing, which overcome
the number of failed companies in 2017 and become exponentially grater in 2018. In order to
keep a balanced analysis, the year 2016 was selected, since it has a similar number of records
for active and failed companies.
- Age
Once the data for the year 2016 were selected, it was initially attempted to find the
distribution of Age starting from the ”Cullen and Frey graph”. As shown in figure 1, the
observed data and bootstrap samples indicate a possible Pareto or Gamma distribution.
Through the Maximum Likelihood Estimator they were estimated the parameters for
both distributions and it was performed a Kolmogorov-Smirnov (KF) test in order to test
the distance from the assumed distribution. In both tests, the hypotheses were rejected,
as they had a very low pvalue.
Figure 1: Cullen and Frey graph Age
3Subsequently, the dataset was divided between the failed and active companies. In figure
2 it is shown the density of the estimation for Age for both Failed values. Despite the
curves’ differences, it was attempted to see if they had the same mean: it was impossible
to use the Shapiro test for verifying the normality of the distribution, because of the high
number of data (5000 records are permitted at most). Therefore, it was used the z-test
(the t test was also usable alternatively). The hypothesis was rejected with a very low
pvalue. The difference between the two averages has confidence intervals of 95% ranging
from 2.7 to 3, pointing out that the failed companies are older than active ones on average.
Figure 2: Density curve
As requested, it was analyzed whether there were any statistical differences or not, by
fixing the attribute Legal form’s values. Since many records were available, the z-score
was used once again, this time combined with the Bonferroni correction. The results
are reported in table 2. It is visible that the hypothesis for active and failed companies
with Legal form equal to ”Others” to have the same mean on the Age attribute is not
rejectable. The same can be said for companies having Legal form equal to ”S.P.A.”. The
companies having other Legal form values have very low p-values and are all rejected.
It is worth to evidence that the value ”Consortium” presents a difference between the
means with 99.2% confidence, ranging from -9.3 to -5.4 and this strongly deviates from
the previously seen general value. This indicates that, in this case, active companies are
older than failed ones. The same analysis was performed by fixing the values on the basis
of the macrocategory of the ATECO industrial sector. In table 3 the test results are
reported, conducted once again through a z-test and with the Bonferroni correction. This
time, the hypothesis of same mean were all rejected, presenting very low p-values.
4AGE – 2016
Failed
Active
Difference
Statistical significance
Legal Form
Mean
Mean
99.3% CI
p-value
S.R.L.
11.81
9.14
[2.44 ; 3]
<2.2e-16
S.C.A.R.L.
11.09
9.18
[1.15 ; 2.67]
1.576e-11
S.R.L.
one-person
12.03
10.34
[1.11 ; 2.27]
3.043e-15
S.P.A.
25.19
23.59
[-3.03 ; 6.22]
0.353
S.R.L.
simplified
1.26
1.01
[0.2 ; 0.3]
<2.2e-16
Other
13.9
15.1
[-3.84 ; 1.35]
0.1978
Consortium
12.65
20
[-9.3 ; -5.37]
<2.2e-16
Table 2: Z-test between distribution of Age in 2016 with a fixed Legal Form
AGE – 2016
Failed
Active
Difference
Statistical significance
ATECO
Mean
Mean
99.5% CI
p-value (alpha = )
G
10.01
6.56
[3.04 ; 3.87]]
<2.2e-16
C
14.07
9.05
[4.24; 5.80]
<2e-16
I
6.4
5.2
[0.67 ;1.73]
2.004e-10
F
12.46
8.74
[3.21 ; 4.20]
<2.2e-16
N
8.17
5.77
[1.79 ; 3.03]
<2.2e-16
Others
9.5
8
[0.95 ; 2.13]
2.22e-13
H
9.18
6.4
[1.87 ; 3.71]
<2.2e-16
J
9.29
7.78
[1.2 ; 2.82]
2.904e-12
L
16.5
15.4
[0.23 ; 1.91]
0.0003418
M
9.1
7.00
[1.51 ; 2.66]
<2.2e-16
Table 3: Z-test between distribution of Age in 2016 with a fixed Ateco value
- Size
Since it deals with a categorical attribute with 6 distinct values, they were transformed
into discrete values ranging from 0 to 5. A statistical test was conducted in order to check
if the distribution of Size was the same for active and failed companies. It was used the
Pearson’s Chi-squared test, which rejected the hypothesis. It was therefore checked if the
mean was the same for failed and active companies; once again, the z-test rejected the
hypothesis. It is possible to claim that the means’ difference between failed and active
5companies ranges from -0.11 to -0.07 with a 95% confidence, suggesting a small shift of
the active companies towards a bigger size. As with Age, it was checked whether the
distributions changed on the basis of the Legal form values. For every distinct value it
was performed a Bonferroni-corrected z-test where the H0 hypothesis states that the mean
of the two distributions is the same; the results are shown in table 4. From the results it
is not possible to reject the hypothesis for active and failed companies presenting Legal
form equal to ”S.C.A.R.L.”, ”S.P.A.”, ”Other” or ”Consortium” to have the same mean.
The same test was finally performed by fixing the ATECO industrial sector value with
values reported in table 5. For the categories G, C, and N it is not possible to reject the
null hypothesis. For category H it is visible an increase of the mean for failed companies
with difference confidence intervals ranging from 0.02 to 0.38 (99.5%).
SIZE – 2016
Failed
Active
Difference
Statistical significance
Legal Form
Mean
Mean
99.3% CI
p-value
S.R.L.
0.82
0.97
[-0.17 ; -0.12]]
<2.2e-16
S.C.A.R.L.
1.49
1.48
[-0.09 ; 0.11]
0.78
S.R.L.
one-person
0.79
0.91
[-0.2 ; -0.05]
8.286e-06
S.P.A.
1.72
2.09
[-0.93 ; 0.18]
0.07
S.R.L.
simplified
0.66
0.79
[-0.18 ; -0.08]
3.909e-11
Other
0.9
0.89
[-0.24 ; 0.27]
0.90
Consortium
0.33
0.3
[-0.09 ; 0.14]
0.60
Table 4: Z-test between distribution of Size in 2016 with a fixed Legal Form
6SIZE – 2016
Failed
Active
Difference
Statistical significance
ATECO
Mean
Mean
99.5% CI
p-value
G
0.82
0.85
[-0.075 ; 0.017]]
0.07875
C
1.4
1.44
[-0.13; 0.05]
0.204
I
1.33
1.45
[-0.2 ;-0.03]
0.000111
F
0.72
0.87
[-0.2 ; -0.095]
<9.978e-15
N
1.21
1.25
[-0.17 ; 0.08]
0.3063
Others
0.77
1
[-0.3 ; -0.15]
2.22e-16
H
2.02
1.81
[0.02 ; 0.4]
0.0017
J
0.66
0.75
[-0.18 ; 0.003]
0.006451
L
0.18
0.2
[-0.07 ; 0.004]
0.01378
M
0.5
0.55
[-0.14 ; -0.004]
0.002936
Table 5: Z-test between distribution of Size in 2016 with a fixed Ateco value
5 Task B
In this Task, they were compared the distributions of Age and Size of the failed companies
in two distinct years. The choice of the rears was based on the 2008 crisis: it was decided to
compare years 2009 (the first following the crisis, and therefore strongly affected by the crisis)
and 2016 (which was a long way from the crisis).
- Age
In figure 3 it is reported the density curve for Age in the two years.
Figure 3: Density estimation
Figure 4: Boxplot on Age : 2009 vs 2016
The pvalue was obtained through the Kolmogorov-Smirnov test, which refuses the hy
pothesis that the two curves come from the same distribution. Also the z-score pvalue
7which checks for the same mean has very low values and refuses the hypothesis. The dif
ference on average between 2009 and 2016 presents confidence intervals of 95% equivalent
to -2.02 and -1.75, suggesting a higher age for the failed companies in 2016. Through
the Cullen and Frey graph it was tried to figure out whether the data of 2009 and 2016
derived from a well-known distribution (figure 5 and 6).
Figure 5: Cullen and Frey graph – 2009
Figure 6: Cullen and Frey graph – 2016
From the graph, it seems the data of failed companies in 2009 are far away from such
distributions. The data of the 2016 failed companies suggest the following possible distri
butions: exponential, Pareto, Gamma. The Kolmogorov-Smirnov tests performed after
calculating the parameters with the Maximum Likelihood Estimation refuse all the three
previously hypothesized distributions. In this case too, it was checked if the distributions
change on the basis of the Legal form attribute. However, in this case it was necessary
to delete from the analysis the values ”S.R.L.”, ”Simplified” and ”Others”. The first
was deleted, because such formed-companies only exist from 2012 and the second and
third, because it presents very few records. Table shows the results of the z-test for the
mean with Bonferroni correction. The only non-rejectable hypothesis for the mean is the
”S.C.A.R.L.” one (table 6).
The same analysis was also conductedby selecting the location value of the Area attribute.
For all the four values the z-test rejects the same-mean hypothesis (table 7).
8AGE – 2009,2016
Failed
Active
Difference
Statistical significance
Legal Form
Mean
Mean
99% CI
p-value
S.R.L.
11.82
9
[2.68 ; 3.11]
<2.2e-16
S.C.A.R.L.
11.09
11.23
[-0.83 ; 0.56]
0.6091
S.R.L.
one-person
12.03
7.5
[4.11 ; 5]
2.2e-16
S.P.A.
25.19
18.2
[3.77 ; 10.19]
2.16e-8
Consortium
12.65
8.11
[3.29 ; 5.8]
<2.2e-16
Table 6: Z-test between distribution of Age in 2009 and 2016 with a fixed Legal Form
AGE – 2009,2016
Failed
Active
Difference
Statistical significance
Area
Mean
Mean
98.75% CI
p-value
Sud
10.9
8.42
[2.15 ; 2.83]
<2.2e-16
Nord Ovest
12.77
9.6
[2.82 ; 3.6]
<2.2e-16
Nord Est
12.23
9.14
[2.7 ; 3.5]
<2.2e-16
Centro
11.6
9
[2.26 ; 3]
<2.2e-16
Table 7: Z-test between distribution of AGE in 2009 and 2016 with a fixed Area
- Size
As requested by the task, the Size distributions were also examined for failed companies
in 2009 and 2016. In figure 7 it is reported the bar plot.
Figure 7: Bar plot of Size
9In order to check if they had the same distribution on Size, it was performed the Pearson’s
Chi Squared test. The hypothesis is rejected with a very low pvalue. The z-test for
testing the similarity of the means gets rejected. In table 8 they are reported the z-score
results with the Bonferroni correction, fixing the Legal status value. All the equal-mean
hypothesis are rejected. The same approach was followed for Area; in this case, we found
out that for the value ”Nord-Est” it is not possible to refuse the hypothesis of having the
same mean for the two distributions (table 9).
SIZE – 2009, 2016
Failed
Active
Difference
Statistical significance
Legal Form
Mean
Mean
99% CI
p-value
S.R.L.
0.82
0.7
[0.1 ; 0.15]]
<2.2e-16
S.C.A.R.L.
1.49
0.85
[0.54 ; 0.73]
<2.2e-16
S.R.L.
one-person
0.79
0.94
[-0.2 ; -0.09]
1.25e-09
S.P.A.
1.71
2.01
[-0.76 ; -0.015]
0.077248
Consortium
0.33
0.14
[0.084 ; 0.28]
1.72e-06
Table 8: Z-test between distribution of Size in 2009 and 2016 with a fixed Legal Form
SIZE – 2009,2016
Failed
Active
Difference
Statistical significance
Area
Mean
Mean
98.75% CI
p-value
Sud
0.97
0.8
[0.11 ; 0.21]
<2.2e-16
Nord Ovest
0.86
0.77
[0.05 ; 0.14]
8.017e-08
Nord Est
0.8
0.77
[-0.01 ; 0.08]
0.05511
Centro
0.9
0.7
[0.16 ; 0.24]
<2.2e-16
Table 9: Z-test between distribution of SIZE in 2009 and 2016 with a fixed Area
6 Task C
In this Task it was requested to analyze the distribution of the probability conditioned to the
failure of a company on Age and Size for a given year. Once again, it was chosen the year 2016,
since it has a good balance between the number of failed and active companies.
- Age
Since Age is a continuous attribute, it was divided into bins, so as to compute the con
ditional probability for every bin. The division was done by using quantiles, to have a
10similar number of data in each bin. The bar plot with conditional probability of failure
for the five bins is reported in figure 8.
Figure 8: Probability of failure on Age Bins in 2016
From the graph it is visible how the conditional probability of failure grows as the company
Age increases, reaching 63% for bin [16-114].
In order to check in the probability of failure on every age bin was equal to the probability
of failure for the whole year 2016, a series of Bonferroni-corrected binomial tests was also
performed. All the tests were rejected.
The binomial test was subsequently tested, this time failure for also fixing the values on
Legal form, Area and ATECO. In all three cases, the various conditional probabilities
were tested against the conditioned probability on Age, without fixing the latter features.
Hypothesis H0 always claims that the conditional probabilities are equal. The results are
shown in the three following tables 10, 11 and 12.
AGE BINS
[0,1]
[2,4]
[5,9]
[10,15]
[16,114]
LOCATION
p.value
p.value
p.value
p.value
p.value
SUD
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
CENTRO
0.001208
1.392e-09
2.534e-11
<2.2e-16
6.125e-10
NORD-EST
4.263e-15
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
NORD-OVET
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
Table 10: Binomial test on Probability of failure of Age Bins with a specific Area vs Probability
of failure of Age Bins in general
11AGE BINS
[0,1]
[2,4]
[5,9]
[10,15]
[16,114]
LEGAL FORM
p.value
p.value
p.value
p.value
p.value
S.P.A.
0.001253
0.09874
0.0992
0.07762
0.003121
Other
5.438e-09
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
Consortium
0.2525
0.02351
0.01707
0.07091
<2.2e-16
S.R.L. one-person
0.1722
<2.2e-16
4.65e-06
5.314e-06
7.597e-11
S.R.L. simplified
<2.2e-16
0.0001072
NULL
NULL
NULL
S.C.A.R.L.
0.0001283
3.8e-09
0.8836
0.3995
0.4708
S.R.L.
<2.2e-16
0.002284
0.8416
0.1066
2.257e-10
Table 11: Binomial test on Probability of failure of Age Bins with a specific Legal Form vs
Probability of failure of Age Bins in general
AGE BINS
[0,1]
[2,4]
[5,9]
[10,15]
[16,114]
ATECO
p.value
p.value
p.value
p.value
p.value
C
0.0007406
0.189
8.933e-07
1.941e-06
<2.2e-16
F
<2.2e-16
1.725e-11
4.895e-06
0.001849
0.244
G
0.2948
0.1222
6.089e-05
0.001128
2.165e-09
H
0.126
0.5034
0.3156
0.1682
0.9611
I
0.000202
6.751e-11
0.0007197
1.34e-10
4.375e-12
J
8.816e-16
8.683e-07
0.004909
0.0172
0.01165
L
1
0.001559
0.001717
0.02664
<2.2e-16
M
1.798e-11
2.003e-12
4.169e-09
5.372e-05
7.227e-06
N
4.598e-05
0.1165
0.1446
0.4079
0.0008334
OTHERS
0.4926
0.2662
0.004474
0.02425
1.145e-10
Table 12: Binomial test on Probability of failure of Age Bins with a specific Ateco vs Probability
of failure of Age Bins in general
With respect to Area, all tests concerning a fixed value for Area reject H0. however,
by fixing Legal Form and Ateco some hypotheses cannot be rejected. In particular, in
Legal Form the values”S.C.A.R.L.” and ”S.P.A.” present non-rejectable p-values by con
ditioning on three out of five age bins. In ATECO the value ”H” actually presents high
p-values for every bin: it is not possible to ever refuse the hypothesis that the conditioned
probability is equal to the one obtained by not conditioning on a specific ATECO value.
12• Size
As for Age it was analyzed the conditioned probability based on Size for companies in
- Barplot in figure 9 shows the various probabilities.
Figure 9: Probability of failure on Size in 2016
It is observable how the failure probability is higher for ”Single” and ”ExtraLarge” com
panies. Also in this case, the binomial test was performed between a company failure
probability based on Size and the general 2016 failure probability. All the hypotheses to
have the same value were rejected.
It was also tested to fix the values on Legal form, Area and ATECO and perform again
the binomial tests, where H0 is equal to having the same failure probability without fixing
the previous values. The results are reported in table 13, 14 and 15.
SIZE
SINGLE
MICRO
SMALL
MEDIUM
LARGE
EXTRA LARGE
AREA
p.value
p.value
p.value
p.value
p.value
p.value
SUD
<2.2e-16
<2.2e-16
<2.2e-16
2.743e-10
3.094e-06
9.771e-06
CENTRO
<2.2e-16
2.342e-06
3.686e-07
5.815e-05
0.01512
0.7534
NORD-EST
<2.2e-16
<2.2e-16
5.358e-12
1.574e-05
0.003145
0.05016
NORD-OVET
<2.2e-16
<2.2e-16
<2.2e-16
2.598e-15
9.185e-06
0.00524
Table 13: Binomial test on Probability of failure of Size with a specific Area vs Probability of
failure of Size in general
13SIZE
SINGLE
MICRO
SMALL
MEDIUM
LARGE
EXTRA LARGE
LEGAL FORM
p.value
p.value
p.value
p.value
p.value
p.value
S.P.A.
1.318e-08
0.001076
0.04028
0.1452
1
0.2001
Other
<2.2e-16
<2.2e-16
<2.2e-16
6.584e-11
1.287e-10
6.169e-06
Consortium
3.198e-10
0.1176
0.3747
0.6489
0.6818
0.407
S.R.L. one-person
<2.2e-16
2.319e-14
3.588e-08
0.0008439
0.0006375
0.02452
S.R.L. simplified
<2.2e-16
<2.2e-16
<2.2e-16
<2.2e-16
1.128e-09
1.323e-05
S.C.A.R.L.
0.004555
0.01157
0.2273
0.7973
0.3272
0.2219
S.R.L.
<2.2e-16
<2.2e-16
9.97e-07
0.009156
0.4459
0.8858
Table 14: Binomial test on Probability of failure of Size with a specific Legal form vs Probability
of failure of Size in general
SIZE
SINGLE
MICRO
SMALL
MEDIUM
LARGE
EXTRA LARGE
ATECO
p.value
p.value
p.value
p.value
p.value
p.value
C
<2.2e-16
1.203e-08
4.418e-06
0.0001484
3.838e-09
0.004969
F
0.2374
<2.2e-16
5.95e-08
0.2088
0.8168
0.4852
G
0.6734
2.05e-08
1.538e-07
0.000583
0.08204
0.6684
H
0.4685
0.0005254
0.4271
0.01033
0.4539
0.7941
I
<2.2e-16
1.791e-05
2.272e-09
1.947e-12
5.909e-08
0.05195
J
1.883e-10
2.821e-05
0.006281
0.0007734
0.8397
0.01041
L
5.773e-06
0.06327
0.4049
0.7959
0.2479
0.8188
M
3.419e-16
4.105e-08
6.845e-05
0.3947
0.2127
0.1923
N
0.09218
0.0674
1
0.7233
0.5625
0.5653
OTHERS
8.515e-05
0.5478
0.0007689
6.708e-06
1.194e-09
0.002218
Table 15: Binomial test on Probability of failure of Size with a specific Ateco vs Probability of
failure of Size in general
From these three table it is possible to observe as for the values Consortium e S.C.A.R.L.
in Legal Form there are some high pvalues. The same thing is observed in Ateco for the
category H, L and N. In Area all the H0 hypotesis are rejected.
147 Task D
7.1 In-depth Preprocessing
The first step was to remove irrelevant features, such as Tax Code Number, Company name
and File. Subsequently, since in every financial indicator only the values concerning the last
three years were reported, all the companies younger than three were deleted. Leaving them
would have involved several missing values. It was then decided to exploit the values of every
economical indicator to compute two derived columns. For every index, the average in the three
years (two years or even just one in case of missing values) and the trend were calculated. The
latter takes the difference between the value of the last year and the value of the two previous
years (one in case of missing values) and is normalized by dividing by the mean. By analyzing
the data at hand, it was decided to only keep years from 2007 to 2018, since the records were
unbalanced in the other ones. Furthermore, all the features presenting more than 50% missing
values were deleted. Otherwise, the analysis could have been incorrect.
More tests were also performed in order to check the independence of the variables with
respect to the target. Since the target is a discrete value, the continuous features were discretized
into bins to perform the Chi Square test. It was not possible to reject the hypothesis for the
attribute Total assets turnover (times) trend, which was then eliminated.
Then, multicollinearity was checked. Such a phenomenon can cause problems with logistic
regression. After rescaling the data, it was then realized a model with logistic regression, which
permitted the analysis of Variance Inflation Rate. Cash Flowth EUR mean presented a much
higher value with respect to the other features and was therefore eliminated.
Finally, to finish feature selection it was runned the Akaike Information Criterion algorithm,
which eliminated 16 features.
At this point, the dataset was divided into training (TR) and test set (TS); the TR takes
all the records ranging from 2007 to 2017 in the variables Last accounting closing year, while
the TS takes the records with year 2018. The data was scaled on the basis of the values of the
TR and outliers were eliminated only on TR, in order not to influence the TS. The features Age
and Current liabilities/Tot ass.% trend were eliminated, as they presented too many outliers.
Finally, all the rows still presenting missing values were dropped and were balanced on the
basis of the target value in TR and TS, by randomly eliminating some records of the target
dominant value.
The final result is a TS composed by 88126 records equally divided between active and
failed companies with 22 features and a test set with the same number of features and 28880
balanced records.
7.2 Logistic Regression
At this point it was fit a Linear Regression Model by using the library caret. It was performed
a 10-Fold Cross Validation repeated 5 times. The chosen scoring metric was the AUC.
15COEFFICIENTS
Estimate
Std. Error
(Intercept)
76.04753
7.87144
Cash Flowth EUR mean
-433.86489
49.95201
Current liabilities/Tot ass.% mean
2.14634
0.08731
Current ratio mean
0.88717
0.11323
EBITDA/Vendite% mean
-3.32183
0.22185
Interest/Turnover (%)% mean
0.43093
0.11042
Leverage mean
2.49336
0.59017
Liquidity ratio mean
0.75959
0.12018
Net financial positionth EUR mean
66.83668
42.38245
Number of employees mean
436.69347
50.92963
Profit (loss)th EUR mean
-142.95797
36.21179
Return on asset (ROA)% mean
-5.98936
0.36110
Return on equity (ROE)% mean
-0.09139
0.06986
Return on sales (ROS)% mean
-0.80479
0.05863
Solvency ratio (%)% mean
-0.86904
0.05927
Total assets turnover (times) mean
1.00450
0.04619
Interest/Turnover (%)% trend
0.34815
0.03012
Liquidity ratio trend
-0.08779
0.05420
Total assetsth EUR trend
-4.04397
0.06512
AreaCentro
0.04362
0.01759
AreaNord Est
0.58536
0.02044
AreaNord Ovest
0.67797
0.01948
ATECO CATF
-0.31417
0.02512
ATECO CATG
-0.21976
0.02408
ATECO CATH
-0.30546
0.03836
ATECO CATI
-0.67778
0.03295
ATECO CATJ
-0.22310
0.03688
ATECO CATL
-0.44379
0.03263
ATECO CATM
-0.17066
0.03203
ATECO CATN
-0.31797
0.03436
ATECO CATOthers
-0.50303
0.02970
Legal form‘Other‘
-2.22915
0.08429
Legal form‘S.C.A.R.L.‘
-0.01949
0.06287
Legal form‘S.P.A.‘
0.51846
0.10146
Legal form‘S.R.L.‘
-0.02129
0.05896
Legal form‘S.R.L. one-person‘
0.13272
0.06103
Legal form‘S.R.L. simplified‘
-0.91977
0.07449
Table 16: Logistic regression Coefficients estimation
16The estimated coefficients values are reported in table 16. Since the coefficient of the
logistic regression represent the log odds ratios, a positive one shows that as the value of the
independent variable increases, the mean of the dependent variable also tends to increase and
viceversa. It is clear how some coefficients present high values. In particular, the feature
Number of employees means presents a coefficient with value 436.69, which is visibly large; this
indicates that the feature weighs much more than the others.
7.2.1 Test set
In table 17 is reported the confusion matrix obtained by passing the TS to the model. The
value of accuracy is equal to 0.65 (with confidence interval at 95% equal to 0.6418 and 0.6528).
It is also reported in figure 10 the calibration curve that shows the no perfect calibration of
the model. It tends to assign a probability of failure greater than expected, in the range [0.2,
0.6] e lower in the range [0.9, 1].
Reference
Prediction
Active
Failed
Active
8699
4445
Failed
5741
9995
Table 17: Confusion Matrix Logistic Regression on test set
Figure 10: Calibration Logistic Regression
Figure 11: Probability density of failure be
tween Failed (Class) 1 and Active company
(class 0)
It was then performed the Wilcoxon test, which lets us understand that on average, the
predicted probabilities of failure are between 0.154 and 0.163 with a 95% confidence interval.
From figure 11 the shift between the two classes is visible.
177.3 Random Forest
A Random Forest Model was constructed by using the same dataset as before. Also in this
case, a 10 Fold Cross-Validation was performed 5 times. The grid-search approach was also
performed in order to set the parameter for the number of attributes to select, testing values
from 5 to 25 every 5. For the model selection 20 trees were used.
In table 18 it is reported the confusion matrix computed on the TS with Random Forest.
The accuracy value is 0.69 with 95%confidence intervals between 0.686 and 0.698.
Reference
Prediction
Active
Failed
Active
10597
5071
Failed
3843
9369
Table 18: Confusion Matrix Of Random Forest
In figure 12 it is reported the calibration curve performed on the TS. The improvement with
respect the the same curve shown for the logistic regression is evident. Also the Wilcoxon test
underlines the improvement: now the distance on the mean between the probability of failure
of the 2 class is in the range [0.20006, 0.24999]with 95% of confidence (figure 13). In figure 14
is showed the Roc Curve of the Random forest.
Figure 12: calibration Random Forest
Figure 13: Probability density of failure be
tween Failed (Class) 1 and Active company
(class 0)
18Figure 14: Roc Curve Rrandom Forest
7.4 Model Comparison
In order to compare the two classification models, it is created a dataframe with the values of
AUC got from the previous analysis. For each classificator there are 50 records. The box plot
in figure 15 show the quantile of the data.
Figure 15: Box Plot Logistic Regression and Random Forest
In order to execute a statistical test to compare the models, it was tested the normality
with the Shapiro test. In both the cases it is not possibile reject the H0 hypothesis. So a t-test
was applied with hypothesis of same mean: the pvalue very low suggest to reject H0.
7.5 Rating
Besides the companies probability of failure prediction thanks to the logistic regressor, it was
also decided to set a label on the basis of the default risk. Starting from range [0,0.1] to
1 increasing by 0.1, 10 labels from A to J were assigned. The final model shows that the
companies with rating equal to A are the most secure ones, while those with value J face a
default risk.
19It was also addressed a binomial test with the Bonferroni correction for every rating value.
The TS was divided on the basis of the rating and the H0 hypothesis was tested: it claims that
the true probability in every group is less than or equal to the upper limit of the rating class.
Rating
Threshold
P-VALUE
A
0.1
1
B
0.2
1
C
0.3
1
D
0.4
1
E
0.5
1
F
0.6
1
G
0.7
1
H
0.8
1
I
0.9
1
J
1
1
8 Task E
Task E required to study a selective classification for the logistic regression model realized in
Task D. For this purpose it was created a function which tests the best constraining-values for
the model to abstain the value prediction.
In figure 16 it is possible to see on the y axis the error calculated from the accuracy and on
the x axis the coverage of our TS. Indeed, by abstaining from some uncertain predictions, the
coverage of the predicted data.
Figure 16: Error – coverage curve
20







