Data Analytics on Transport and Infrastructures
Roads and Streets in Dublin City Council
1- Introduction
Transport and infrastructure is one of the most critical departments for the safe and sustainable development of any city or county. This analysis focuses on the Roads and Streets department of Dublin City Council (DCC). We will identify potential problems in the roads and streets dataset released by DCC, explore the dataset through descriptive statistical analysis and visualization, work towards a solution to the identified problem using machine learning techniques, and finally compare the results and findings of our analysis.
2- Description of Data
The dataset gives details on over 4,000 roads and streets in Dublin City. It was compiled by the Roads and Traffic Division of Dublin City Council and lists 4,772 roads, streets, lanes and bridges in the Dublin City Council administrative area. Details include an alphabetical listing of road names; street class (1 National Primary, 2 National Secondary, 3 Regional, 4-7 Local Road); new area, the five administrative areas within Dublin City Council (1 Central, 2 North Central, 3 North West, 4 South Central, 5 South East); and surface type (1 asphalt, 2 concrete, 3 macadam, 4 setts, 5 flags). The remaining columns are electoral area, line length, national route number (if applicable), year built, Ordnance Survey sheet number, road start, road finish and the Irish translation of the street name. Finally there is a target column indicating whether the road/street is in charge, with binary outcomes of Y/N. The final shape of the dataset is 4771 x 14, i.e. 4,771 street records with 14 features each.
You can see the details of the dataset at the following link: https://data.gov.ie/dataset/dublin-city-roads-and-streets?package_type=dataset
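For reference, the analysis below assumes the dataset has been downloaded as a CSV file and loaded into a pandas DataFrame called Data; the file name used here is a placeholder, not the official one.
import pandas as pd
# Hypothetical file name; download the CSV from the data.gov.ie link above
Data = pd.read_csv('dcc_roads_and_streets.csv')
print(Data.shape)    # roughly (4771, 14)
print(Data.columns)  # list the 14 available features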
3- Problem Statement
After reading and analyzing the information in the Roads and Streets dataset provided by DCC, we can see that, given the information about every street in each administrative area along with the different street classes and surface types, we can predict whether a road is in charge or not. Since we already know what we want to predict from this dataset, this is a supervised machine learning problem.
As the target outcome takes the discrete values Y and N, this is clearly a classification problem: we need to predict a discrete outcome. Since the outcome is binary, more specifically this is a binary classification problem. We will identify the classifiers most likely to solve this problem well.
4- Statistical Analysis of Data
We will apply different statistical analysis techniques to the dataset to extract useful insights that lead us towards a potential solution of the problem. Our statistical analysis comprises three main techniques.
1. Descriptive Analysis
2. Discrete Distribution
3. Normal Distribution
4.1- Descriptive Analysis
We compute the initial statistics shown in Table 1 for each continuous variable/feature of the dataset, to understand the potential ranges and the variance/spread in the data.
Statistic             line_length       os_sheets
Count                 4595              4163
Mean                  248.608923        322220.357090
Standard Deviation    319.085950        6684.623023
Min                   0.000000          306324.000000
25%                   81.000000         313321.000000
50%                   158.000000        319916.000000
75%                   301.000000        326402.000000
Max                   6422.000000       336421.000000
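These statistics can be reproduced with pandas; a minimal sketch, assuming the two continuous columns are named line_length and os_sheets:
# Count, mean, standard deviation, quartiles, min and max of the continuous columns
print(Data[['line_length', 'os_sheets']].describe())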
4.1.1 Visualize Street Classes
After analyzing the street class code column we found that there are 8 unique street codes, with more than 60% of records having street code 6 (Local Roads) and only about 0.3% having street code 1.5 (National Primary). There are 4768 records with a street code, 3014 of which are street code 6, which also tells us that 3 records are missing this value. The bar chart shown below describes how the street codes are spread across the data.
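Both this bar chart and the surface type chart in the next subsection can be produced with value_counts() and matplotlib; a minimal sketch, assuming the column is named street_class_code:
import matplotlib.pyplot as plt
# Count how many records fall under each street class code and plot the counts
Data['street_class_code'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Street class code')
plt.ylabel('Number of records')
plt.show()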
4.1.2 Visualize Surface types of streets
After analyzing the surface type code column we found that there are 5 unique surface type codes, with more than 80% of records having surface code 1 or 2 (asphalt and concrete respectively). There are 4769 records with a surface type, 3849 of which are surface codes 1 and 2, which also tells us that 2 records are missing this value. The bar chart shown below describes how the surface codes are spread across the data.
4.1.3 Visualize Length of Roads
The following scatter plot shows the trend and spread of the line_length column, which represents the lengths of the roads and streets recorded in the dataset.
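A minimal sketch of such a scatter plot, plotting each record's line_length against its position in the dataset:
import matplotlib.pyplot as plt
# Spread of road/street lengths across the dataset
plt.scatter(range(len(Data)), Data['line_length'], s=5, alpha=0.5)
plt.xlabel('Record index')
plt.ylabel('line_length')
plt.show()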
4.2 Discrete Distribution
Our data contains a mixture of discrete and continuous variables, but because the target outcome is binary, the most appropriate discrete distribution for gaining insight into the dataset is the binomial distribution: it gives the probability distribution of the discrete values for each trial with respect to the probability of a successful binary outcome. We calculate the binomial distribution on the new area code, which represents the administrative areas, using a probability of success (i.e. road in charge) of 88%.
# Binomial distribution of new_area_code, with probability of success 0.88
from scipy.stats import binom
y = Data['new_area_code']
# probability of success (road in charge)
p = 0.88
# list of r values over the range [min, max) of the area codes
r_values = list(range(int(y.min()), int(y.max())))
# set the number of trials n according to the r values
n = len(r_values) - 1
# mean and variance of the distribution
mean, var = binom.stats(n, p)
# probability mass function value for each r
dist = [binom.pmf(r, n, p) for r in r_values]
The following are the results of the binomial distribution on the administrative area code. The peak of the distribution is at r = 3, which represents the North West area of Dublin.
r p(r)
1 0.038015999
2 0.278784
3 0.681472
4 0.0
4.3 Normal Distribution
A normal distribution describes the spread of data around its mean; for the standard normal the peak is at a mean of 0, and according to the empirical rule roughly 99.7% of the data lies within three standard deviations of the mean. We use the z-score to standardize three discrete columns of the data, ['new_area_code', 'street_class_code', 'surface_type_code'], each code corresponding to a named category. In the figure below, the z-score values lie within the range of -3 to +3.
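A minimal sketch of the z-score computation with pandas, assuming the three columns carry numeric codes:
# Columns whose z-scores we want; cast to float so missing values are handled consistently
codes = Data[['new_area_code', 'street_class_code', 'surface_type_code']].astype(float)
# Standardize each column: subtract its mean and divide by its standard deviation
z_scores = (codes - codes.mean()) / codes.std()
# By the empirical rule, almost all z-score values should lie between -3 and +3
print(z_scores.describe())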
5- Data preparation and Visualization
Data preparation and visualization is an important phase of building a machine learning model: it provides quality data and useful information for learning the hidden patterns and trends in the data, through visualization of features and variables with tables, graphs and charts. Since we are working on the roads and streets dataset of Dublin City Council, we have multiple variables/features with a mix of discrete and continuous values. Some features are of no use for predicting the outcome, and some contain missing values which must be filled carefully so that they do not harm the performance of the model trained on these variables.
5.1 Descriptive statistical visualization
The table below shows the statistical description of each feature: the minimum, maximum, mean, spread in the form of the standard deviation and, most importantly, the count for each feature, which indicates how many records are missing. The total number of street records is 4,771, and many features have counts well below this number.
5.2 Data Preparation and Analysis
From the descriptive analysis above we can see that many features have missing records, represented as NaN in the dataset, and these missing values will cause trouble when we pass the data through any machine learning pipeline. We therefore need to prepare these columns by filling their missing records appropriately.
5.2.1 Prepare Target column
In the preparation process we first fill, drop and update records of the target column, in_charge. We looked at the unique values in the target column expecting to find only "Y" and "N", but surprisingly found 5 unique values. The table below shows the unique values of the target column and their corresponding counts.
Unique Value    Count
' Y'            1
'1'             2
'Y'             4226
'N'             538
'y'             2
None            2
Total           4771
We need to convert the target column to the standard binary values Y and N only, so we map ' Y', 'y' and '1' to 'Y'.
# Map the non-standard positive labels to 'Y' so the target is strictly binary (Y/N)
Data['in_charge'] = Data['in_charge'].replace({' Y': 'Y', 'y': 'Y', '1': 'Y'})
The dataset has 4,771 rows in total, while the target column has 4,769 values, which tells us that 2 records are missing the target information. We drop those two records, since they have no outcome value.
After preparing the target column, it looks like this.
Unique Values    Counts
Y                4231
N                538
Total            4769
5.2.2 Prepare Discrete Features / Columns
We prepared the target column by finding its unique values, comparing them with the required outcomes and transforming the feature accordingly. The target column is also a discrete column, but we did not fill its missing records, because imputing such sensitive information could mislead the machine learning model into predicting the wrong outcome. The same reasoning does not apply to the remaining discrete features: they cannot be passed to the model with missing values, and because of the number of missing records in them we cannot simply drop every affected row, which would leave us with a much smaller dataset. We therefore fill the missing records of these features, and one of the most effective techniques for discrete features is to fill them with the most frequent value of the column, i.e. the mode.
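A minimal sketch of this mode imputation; the column names here are assumptions for illustration, and the same pattern applies to each discrete feature discussed below:
# Fill the missing values of each discrete column with its most frequent value (the mode)
discrete_cols = ['sub_area', 'sub_area_ea', 'street_class_code', 'new_area_code', 'surface_type_code']  # assumed names
for col in discrete_cols:
    Data[col] = Data[col].fillna(Data[col].mode()[0])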
The Subarea column has 349 missing values and its most frequent value is 10, so we replace all of its missing records with this mode.
The Subarea EA column has 363 missing values and its most frequent value is 05, so we replace all of its missing records with this mode.
The street class code column has 3 missing values and its most frequent value is 6.0, representing Local Roads, so we replace all of its missing records with this mode. We also looked at the unique values and their counts in this column; the table below gives the details.
Unique Values Count
‘1.0’ 84
‘1.5’ 15
‘2.0’ 262
‘3.0’ 296
‘4.0’ 456
‘5.0’ 260
‘6.0’ 3013
‘7.0’ 382
None 3
The new area code column has no missing values; its most frequent value is 5, representing the South East area. We also looked at the unique values and their counts in this column; the table below gives the details.
Unique Values Count
‘1’ 1060
‘2’ 500
‘3’ 1025
‘4’ 928
‘5’ 1248
The surface type code column has 2 missing values and its most frequent value is 1.0, representing asphalt roads, so we replace all of its missing records with this mode. We also looked at the unique values and their counts in this column; the table below gives the details.
Unique Values Count
‘1.0’ 1926
‘2.0’ 1923
‘3.0’ 817
‘4.0’ 99
‘5.0’ 4
None 2
The table below summarizes the process and the outcomes of the preparation of the discrete columns.
Column/Feature Name     # Of Missing Records    Most frequent value    Representation
Subarea                 349                     10                     -
Subarea EA              363                     05                     -
Street Class            3                       6.0                    Local Road
Admin Area              0                       5                      South-East area
Surface type of road    2                       1.0                    Asphalt
Having filled the missing records of the five discrete features, all of them can now be passed through the pipeline of any machine learning model without error. These discrete features all have the same number of records as the target column, i.e. 4,769 records, each with its corresponding target/outcome value.
5.2.3 Prepare Continuous Features / Columns
Initially our dataset had fourteen (14) different columns, representing features with corresponding binary outcomes, and most of these features had discrete values, so we filled the missing values of the relevant discrete columns with the mode of the corresponding feature. In the case of a continuous variable, however, a most frequent value is not meaningful because the feature can take values over an effectively unbounded range. To handle missing records in continuous variables we therefore compute the mean value of each feature and fill its missing records with that mean. We have two main continuous features in our dataset: line_length, representing the length of the street or road in each record, and os_sheets, representing the Ordnance Survey sheet number associated with each road and street.
The line_length column has 174 missing values and a mean of 248.6089, so we replace all of its missing records with this mean; there are now no missing records in line_length.
The os_sheets column has 606 missing values and a mean of 322220.357090204, so we replace all of its missing records with this mean; there are now no missing records in os_sheets.
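A minimal sketch of the mean imputation for the two continuous columns:
# Fill the missing values of each continuous column with the column mean
for col in ['line_length', 'os_sheets']:
    Data[col] = Data[col].fillna(Data[col].mean())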
The table below summarizes the process and the outcomes of the preparation of the continuous columns.
Column/Feature Name # Of Missing Records Mean value
line_length 174 248.6089
os_sheets 606 322220.357090204
Having filled the missing records of the two continuous features, they too can now be passed through the pipeline of any machine learning model without error. These continuous features now have the same number of records as the target column, i.e. 4,769 records, each with its corresponding target/outcome value.
5.2.4 Feature Engineering
After preparing the necessary columns by filling their missing records with appropriate values, we now identify the variables/features that will not help the machine learning model predict the target outcome. The columns listed below are those we have not yet explored in terms of preparation or filling of missing records.
• Street Name
• Year_built
• Road_start
• Road_finish
• Irish
• Route no (if applicable )
All of these features take an almost unique value for each record, and some of them, e.g. the year_built and route_no columns, have more than 70% of their records missing, which clearly cannot help us predict the target outcome. The other columns, such as the street name, road start, road finish and the Irish translation, hold a unique value for each street and road and have no correlation with our binary target outcome. We therefore drop these features from the dataset, after which the final shape of the data is (4769, 7).
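A minimal sketch of this step; the column names are assumptions based on the listing above:
# Drop the mostly-unique or mostly-missing columns that carry no signal for the target
cols_to_drop = ['street_name', 'year_built', 'road_start', 'road_finish', 'irish', 'route_no']  # assumed names
Data = Data.drop(columns=cols_to_drop, errors='ignore')
print('Shape of Data', Data.shape)  # expected (4769, 7)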
5.3 Exploratory Data Analysis
The table below shows the updated statistical description of the dataset after preparation of all the columns, both discrete and continuous.
5.3.1 Visualize percentage of instances of each feature per Target outcome
We visualize, for each discrete feature, the percentage of its instances falling under each target outcome, i.e. whether the street/road is in charge or not.
Area Administration: The pie chart below shows the ratio of instances of the administrative area code per target outcome (Y/N). We can see that 88% of these records give the outcome Y.
Street Classes: The pie chart below shows the ratio of instances of the street class code per target outcome (Y/N). We can see that 87% of these records give the outcome Y.
Surface Types: The pie chart below shows the ratio of instances of the surface type code per target outcome (Y/N). We can see that 92% of these records give the outcome Y.
5.3.2 Prepare data for ML model
Before passing the features into the pipeline of a machine learning model we need to take care of the following things to obtain the required results from the model.
• Remove/Fill none records of features
• Convert discrete features to numeric representation
We have already prepared the data by filling the missing records of features and dropping the features that give us no information about the target outcome. We still need to convert the discrete features into a numeric representation; for this we use a label (integer) encoding strategy, which assigns a numeric value to each unique value of a discrete feature.
We use the LabelEncoder module from the preprocessing package of scikit-learn in Python.
# Apply Encoding scheme on Categorical Variables to convert into Integer
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X2.apply(le.fit_transform)
After converting the discrete variables to a numeric representation, we separate the target column from the remaining features so that we can pass these instances to our ML models.
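A minimal sketch of this separation, assuming the prepared and encoded DataFrame is still called Data and the target column is in_charge:
# Separate the binary target from the feature columns
Y = Data['in_charge']
X = Data.drop(columns=['in_charge'])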
6- Machine learning for Data Analytics
So far we have visualized and analyzed the data, which now has seven (7) features and 4,769 records. From the target outcome and the data description we know that this is a binary classification problem, and from the visualization of the target outcome we know that almost 88% of records in the dataset have "Y" as the target, so we are in fact dealing with an imbalanced binary classification problem.
Since this is a classification problem, there are many algorithms and techniques that can be useful, for example:
• K- nearest neighbors
• Decision Tree
• Random Forest
• Neural Networks
• Support Vector Machines
• Logistic Regression
• Naive Bayes
6.1 Justification of choosing ML model
Each of these classifiers works well on a specific type of dataset, but only a few of them are particularly suited to binary classification, and more specifically to imbalanced binary classification. Because logistic regression optimizes a loss designed for binary outcomes, it is a natural candidate for a binary classification problem. Similarly, a support vector classifier, with its regularization and its ability to draw a decision boundary while penalizing wrong predictions, also fits our imbalanced classification setting. Finally, if the dataset is sufficiently informative and has enough records to learn from, a decision tree can also give very good results on binary classification problems.
Selected machine learning models
On the basis of the justification above we choose the following machine learning classifiers for our imbalanced binary classification problem.
• Decision Tree Classifier
• Support Vector Classifier
• Logistic Regression
6.2 Data Splitting
Before starting the training phase of any of the above machine learning models we first need to split our data into a training set and a test set. We train on the training set and then validate on the test set to verify the results of the models. We have a total of 4,769 records in the dataset and use the train_test_split module of scikit-learn to split the data 75%/25% into train and test respectively, with the following command.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
After splitting the data into train and test sets we obtained the following shapes.
Shape of Train X: (3576, 7)
Shape of Train Y: (3576,)
Shape of Test X: (1193, 7)
Shape of Test Y: (1193,)
6.3 Evaluation methods and Justification
Since we are working on a classification problem, one of the most important and common evaluation metrics is accuracy. However, because this is an imbalanced binary classification problem, accuracy alone can be a very misleading evaluation criterion for any machine learning model. So, along with accuracy, we evaluate our models on the following standard metrics for imbalanced classification problems.
• F-measure (F1-Score)
• Precision
• Recall
While building any machine learning model, the first thing that comes to mind is how to build an accurate, well-fitting model and what challenges will arise along the way. Precision and recall are two of the most important, and most often confused, concepts in machine learning. They are performance metrics used for pattern recognition and classification, and they are essential for building a model that gives precise and accurate results. Some models require higher precision and some require higher recall, so it is important to understand the balance between the two, i.e. the precision-recall trade-off.
Precision: Precision is the number of true positive predictions divided by the number of all positive predictions, including the incorrect ones: Precision = TP / (TP + FP).
Recall: Recall is the fraction of all actual positive examples in the dataset that the model correctly predicts as positive: Recall = TP / (TP + FN).
F1-Score: In the statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy; it is the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall).
6.4 Training a Logistic Regression
We train the logistic regression classifier on the training set of 3,576 records using the built-in module of scikit-learn. We import LogisticRegression from sklearn and, after applying cross-validation to this classifier, find that the following hyper-parameters give the best results on the given dataset.
• random_state : 0
• penalty : 'l2'
• C : 1.5
• max_iter : 200
# Fit a logistic regression with the tuned hyper-parameters
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, penalty='l2', C=1.5, max_iter=200).fit(X_train, y_train)
We then validate the trained classifier by predicting the target outcome of the test set we split off earlier. As mentioned above, for an imbalanced classification problem we evaluate the classifier not only on accuracy, which can be misleading, but also on the f-measure, precision and recall of the model. All of these are computed with the evaluation metric functions of the sklearn module, shown below.
• accuracy_score(y_test, pred)
• f1_score(y_test, pred, pos_label='Y', average='binary')
• precision_score(y_test, pred, pos_label='Y', average='binary')
• recall_score(y_test, pred, pos_label='Y', average='binary')
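Putting these together, a minimal evaluation sketch, assuming clf is the fitted classifier from above and 'Y' is treated as the positive label:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# Predict on the held-out test set and compute the four evaluation metrics
pred = clf.predict(X_test)
print('Accuracy :', accuracy_score(y_test, pred))
print('F1-Score :', f1_score(y_test, pred, pos_label='Y', average='binary'))
print('Precision:', precision_score(y_test, pred, pos_label='Y', average='binary'))
print('Recall   :', recall_score(y_test, pred, pos_label='Y', average='binary'))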
After validating the trained classifier we obtained the following results.
Methods Score
Accuracy 0.874
F1- Score 0.932
Precision 0.874
Recall 1.0
6.5 Training a Support Vector Classifier
We train the SVM classifier on the training set of 3,576 records using the built-in module of scikit-learn. We import the support vector classifier SVC from sklearn and, after applying cross-validation to this classifier, find that the following hyper-parameters give the best results on the given dataset.
• gamma : 'auto'
• kernel : 'rbf'
• C : 1.0
• degree : 3
We first create a pipeline for the support vector classifier that standardizes the features and sets the kernel coefficient gamma to 'auto', and then fit the model on the training data.
# Standard scaling of the data followed by an SVC with gamma='auto'
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
We then validate the trained classifier by predicting the target outcome of the test set we split off earlier. As mentioned above, for an imbalanced classification problem we evaluate the classifier not only on accuracy, which can be misleading, but also on the f-measure, precision and recall of the model. All of these are computed with the evaluation metric functions of the sklearn module, shown below.
• accuracy_score(y_test, pred)
• f1_score(y_test, pred, pos_label='Y', average='binary')
• precision_score(y_test, pred, pos_label='Y', average='binary')
• recall_score(y_test, pred, pos_label='Y', average='binary')
After validating the trained classifier we obtained the following results.
Methods Score
Accuracy 0.927913
F1- Score 0.960222
Precision 0.927614
Recall 0.995206
6.6 Training a Decision Tree Classifier
We train the decision tree classifier on the training set of 3,576 records using the built-in module of scikit-learn. We import the tree module from sklearn and choose DecisionTreeClassifier and, after applying cross-validation to this classifier, find that the following hyper-parameters give the best results on the given dataset.
• criterion : 'entropy' / 'gini'
• min_impurity_decrease : 0.1
• max_depth : 3
We then create the decision tree classifier with these hyper-parameters and fit the model on the training data.
# Decision tree with the Gini criterion, limited depth and a minimum impurity decrease
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=3, min_impurity_decrease=0.1)
clf.fit(X_train, y_train)
We then validate the trained classifier by predicting the target outcome of the test set we split off earlier. As mentioned above, for an imbalanced classification problem we evaluate the classifier not only on accuracy, which can be misleading, but also on the f-measure, precision and recall of the model. All of these are computed with the evaluation metric functions of the sklearn module, shown below.
• accuracy_score(y_test, pred)
• f1_score(y_test, pred, pos_label='Y', average='binary')
• precision_score(y_test, pred, pos_label='Y', average='binary')
• recall_score(y_test, pred, pos_label='Y', average='binary')
After validating the trained classifier we obtained the following results.
Methods Score
Accuracy 0.874
F1- Score 0.932
Precision 0.874
Recall 1.0
6.7 Comparison and results of Models
We have now trained three different classifiers on our DCC roads and streets dataset; for an initial comparison, the table below compares the three models on the evaluation metrics described above.
Method        Decision Tree    SVM         Logistic Regression
Accuracy 0.874 0.927913 0.874
F1- Score 0.932 0.960222 0.932
Precision 0.874 0.927614 0.874
Recall 1.0 0.995206 1.0
From the results in the table above we can clearly see that the support vector classifier outperforms both of the other classifiers (decision tree and logistic regression). They also performed well, each reaching an f1-score of 93%, but the support vector classifier achieves a 96% f1-score on the given dataset.
6.5.1 Similarities in Trained models
From the results table we can see that the logistic regression and decision tree classifiers give identical results on every evaluation metric, even with tuned hyper-parameters, and both reach a perfect recall of 1.0, which is valuable for our problem of imbalanced binary classification.
6.5.2 Differences in Trained Models
As we can see, logistic regression and decision tree give identical results on each evaluation metric, whereas the support vector classifier performs better than the other two on every metric, and especially on the f1-score, which is one of the most informative metrics for this kind of imbalanced binary classification dataset. The support vector classifier therefore shows a clear difference from the other classifiers, thanks to its regularization and margin penalty.